Estimating Optimal Transformations for Multiple
Regression and Correlation
LEO BREIMAN and JEROME H. FRIEDMAN*
In regression analysis the response variable Y and the predictor variables X_1, . . . , X_p are often replaced by functions θ(Y) and φ_1(X_1), . . . , φ_p(X_p). We discuss a procedure for estimating those functions θ* and φ_1*, . . . , φ_p* that minimize

  e² = E{[θ(Y) − Σ_{j=1}^p φ_j(X_j)]²}/var[θ(Y)],

given only a sample {(y_k, x_{k1}, . . . , x_{kp}), 1 ≤ k ≤ N} and making minimal assumptions concerning the data distribution or the form of the solution functions. For the bivariate case, p = 1, θ* and φ* satisfy ρ* = ρ(θ*, φ*) = max_{θ,φ} ρ[θ(Y), φ(X)], where ρ is the product-moment correlation coefficient and ρ* is the maximal correlation between X and Y. Our procedure thus also provides a method for estimating the maximal correlation between two variables.

KEY WORDS: Smoothing; ACE.

1. INTRODUCTION

The procedure can be applied in situations where the response or the predictors involve arbitrary mixtures of continuous ordered variables and categorical variables (ordered or unordered). The functions θ, φ_1, . . . , φ_p are real-valued. If the original variable is categorical, the application of θ or φ_j assigns a real-valued score to each of its categorical values.

The procedure is nonparametric. The optimal transformation estimates are based solely on the data sample {(y_k, x_{k1}, . . . , x_{kp}), 1 ≤ k ≤ N}, with minimal assumptions concerning the data distribution and the form of the optimal transformations. In particular, we do not require the transformation functions to be from a particular parameterized family or even monotone. (Later we illustrate situations in which the optimal transformations are not monotone.)

It is applicable to at least three situations:
where ρ is the product-moment correlation coefficient. The quantity ρ*(X, Y) is known as the maximal correlation between X and Y, and it is used as a general measure of dependence (Gebelein 1947; also see Renyi 1959; Sarmanov 1958a,b; and Lancaster 1958). The maximal correlation has the following properties (Renyi 1959):

1. 0 ≤ ρ*(X, Y) ≤ 1.
2. ρ*(X, Y) = 0 if and only if X and Y are independent.
3. If there exists a relation of the form u(X) = v(Y), where u and v are Borel-measurable functions with var[u(X)] > 0, then ρ*(X, Y) = 1.

Therefore, in the bivariate case our procedure can also be regarded as a method for estimating the maximal correlation between two variables, providing as a by-product estimates of the functions θ*, φ* that achieve the maximum.

In the next section, we describe our procedure for finding optimal transformations using algorithmic notation, deferring mathematical justifications to Section 5 and Appendix A. We next illustrate the procedure in Section 3 by applying it to a simulated data set in which the optimal transformations are known. The estimates are surprisingly good. Our algorithm is also applied to the Boston housing data of Harrison and Rubinfeld (1978) as listed in Belsley et al. (1980). The transformations found by the algorithm generally differ from those applied in the original analysis. Finally, we apply the procedure to a multiple time series arising from an air pollution study. A FORTRAN implementation of our algorithm is available from either author. Section 4 presents a general discussion and relates this procedure to other empirical methods for finding transformations. Section 5 and Appendix A provide some theoretical framework.

There are no analogous results, however, for stationary ergodic series or controlled designs. To remedy this we show that there are sequences of data smooths that have the requisite properties in all three cases.

This article is presented in two distinct parts. Sections 1-4 give a fairly nontechnical overview of the method and discuss its application to data. Section 5 and Appendix A are, of necessity, more technical, presenting the theoretical foundation for the procedure.

There is relevant previous work. Closest in spirit to the ACE algorithm we develop is the MORALS algorithm of Young et al. (1976) (also see de Leeuw et al. 1976). It uses an alternating least squares fit, but it restricts transformations on discrete ordered variables to be monotonic and transformations on continuous variables to be linear or polynomial. No theoretical framework for MORALS is given.

Renyi (1959) gave a proof of the existence of optimal transformations in the bivariate case under conditions similar to ours in the general case. He also derived integral equations satisfied by θ* and φ* with kernels depending on the bivariate density of X and Y, and concentrated on finding solutions assuming this density known. The equations seem generally intractable, with only a few known solutions. He did not consider the problem of estimating θ*, φ* from data.

Kolmogorov (see Sarmanov and Zaharov 1960 and Lancaster 1969) proved that if Y_1, . . . , Y_q, X_1, . . . , X_p have a joint normal distribution, then the functions θ(Y_1, . . . , Y_q), φ(X_1, . . . , X_p) having maximum correlation are linear. It follows from this that in the regression model

  Y = Σ_{i=1}^p b_i X_i + ε
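Two classical special cases make these properties concrete. If (X, Y) is bivariate normal with correlation ρ, the Kolmogorov result just cited implies that the supremum defining ρ* is attained by linear functions, so that

  ρ*(X, Y) = |ρ|.

In the other direction, maximal correlation detects purely nonlinear dependence: if X is symmetric about zero and Y = X², then ρ(X, Y) = 0, yet property 3 (with u(x) = x² and v(y) = y) gives ρ*(X, Y) = 1.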
with ‖·‖ = [E(·)²]^{1/2}. Next, consider the unrestricted minimization of (2.1) with respect to φ(X) for a given θ(Y). The solution is

  φ_1(X) = E[θ(Y) | X].  (2.3)

Equations (2.2) and (2.3) form the basis of an iterative optimization procedure involving alternating conditional expectations (ACE).

Basic ACE Algorithm

Set θ(Y) = Y/‖Y‖;
Iterate until e²(θ, φ) fails to decrease:
    φ_1(X) = E[θ(Y) | X];
    replace φ(X) with φ_1(X);
    θ_1(Y) = E[φ(X) | Y]/‖E[φ(X) | Y]‖;
    replace θ(Y) with θ_1(Y);
End Iteration Loop;
θ and φ are the solutions θ* and φ*;
End Algorithm.

This algorithm decreases (2.1) at each step by alternately minimizing with respect to one function while holding the other fixed at its previous evaluation. Each iteration (execution of the iteration loop) performs one pair of these single-function minimizations. The process begins with an initial guess for one of the functions (θ = Y/‖Y‖) and ends when a complete iteration pass fails to decrease e². In Section 5, we prove that the algorithm converges to optimal transformations θ*, φ*.

Now consider the more general case of multiple predictors X_1, . . . , X_p. We proceed in direct analogy with the basic ACE algorithm. We minimize

  e²(θ, φ_1, . . . , φ_p) = E[θ(Y) − Σ_{j=1}^p φ_j(X_j)]²,  (2.4)

holding Eθ² = 1 and Eθ = Eφ_1 = · · · = Eφ_p = 0, through a series of single-function minimizations involving bivariate conditional expectations. For a given set of functions φ_1(X_1), . . . , φ_p(X_p), minimization of (2.4) with respect to θ(Y) yields

  θ_1(Y) = E[Σ_j φ_j(X_j) | Y] / ‖E[Σ_j φ_j(X_j) | Y]‖.  (2.5)

The next step is to minimize (2.4) with respect to φ_1(X_1), . . . , φ_p(X_p), given θ(Y). This is obtained through another iterative algorithm. Consider the minimization of (2.4) with respect to a single function φ_k(X_k) for given θ(Y) and a given set φ_1, . . . , φ_{k−1}, φ_{k+1}, . . . , φ_p. The solution is

  φ_{k,1}(X_k) = E[θ(Y) − Σ_{i≠k} φ_i(X_i) | X_k].

The corresponding iterative algorithm is as follows:

Set φ_1(X_1), . . . , φ_p(X_p) = 0;
Iterate until e²(θ, φ_1, . . . , φ_p) fails to decrease;
    For k = 1 to p Do:
        φ_{k,1}(X_k) = E[θ(Y) − Σ_{i≠k} φ_i(X_i) | X_k];
        replace φ_k(X_k) with φ_{k,1}(X_k);
    End For Loop;
End Iteration Loop;
φ_1, . . . , φ_p are the solution functions.

Each iteration of the inner For loop minimizes e² (2.4) with respect to the function φ_k(X_k), k = 1, . . . , p, with all other functions fixed at their previous evaluations. The loop is iterated until one complete pass over the predictor variables fails to decrease e² (2.4).

Substituting this procedure for the corresponding single-function optimization in the bivariate ACE algorithm gives rise to the full ACE algorithm for minimizing the e² of (2.4).

ACE Algorithm

Set θ(Y) = Y/‖Y‖ and φ_1(X_1), . . . , φ_p(X_p) = 0;
Iterate until e²(θ, φ_1, . . . , φ_p) fails to decrease;
    Iterate until e²(θ, φ_1, . . . , φ_p) fails to decrease;
        For k = 1 to p Do:
            φ_{k,1}(X_k) = E[θ(Y) − Σ_{i≠k} φ_i(X_i) | X_k];
            replace φ_k(X_k) with φ_{k,1}(X_k);
        End For Loop;
    End Inner Iteration Loop;
    θ_1(Y) = E[Σ_i φ_i(X_i) | Y]/‖E[Σ_i φ_i(X_i) | Y]‖;
    replace θ(Y) with θ_1(Y);
End Outer Iteration Loop;
θ, φ_1, . . . , φ_p are the solutions θ*, φ_1*, . . . , φ_p*;
End ACE Algorithm.

In Section 5, we prove that the ACE algorithm converges to optimal transformations.

3. APPLICATIONS

In the previous section, the ACE algorithm was developed in the context of known distributions. In practice, data distributions are seldom known. Instead, one has a data set {(y_k, x_{k1}, . . . , x_{kp}), 1 ≤ k ≤ N} that is presumed to be a sample from (Y, X_1, . . . , X_p). The goal is to estimate the optimal transformation functions θ(Y), φ_1(X_1), . . . , φ_p(X_p) from the data. This can be accomplished by applying the ACE algorithm to the data with the quantities e², ‖·‖, and the conditional expectations replaced by suitable estimates. The resulting functions θ*, φ_1*, . . . , φ_p* are then taken as estimates of the corresponding optimal transformations.

The estimate for e² is the usual mean squared error for regression:

  e²(θ, φ_1, . . . , φ_p) = (1/N) Σ_{k=1}^N [θ(y_k) − Σ_{j=1}^p φ_j(x_{kj})]².

If g(y, x_1, . . . , x_p) is a function defined for all data values, then ‖g‖² is replaced by the sample mean square

  ‖g‖²_N = (1/N) Σ_{k=1}^N g²(y_k, x_{k1}, . . . , x_{kp}).

For the case of categorical variables, the conditional expectation estimates are straightforward: If the data are {(X_k, Z_k)}, k = 1, . . . , N, and Z is categorical, then

  E[X | Z = z] = Σ_{z_k = z} X_k / Σ_{z_k = z} 1,

where X is real-valued and the sums are over the subset of observations having (categorical) value Z = z. For variables that can assume many ordered values, the estimation is based on smoothing techniques.
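The data version of the algorithm is compact enough to sketch in full. The following Python fragment is an illustration under stated simplifications, not the FORTRAN implementation mentioned above: conditional expectations are estimated with a fixed-span running-mean smoother rather than the supersmoother, the inner loop runs a fixed number of passes rather than testing e² for decrease, and the names `smooth` and `ace` are ours.

```python
import numpy as np

def smooth(x, t, span=0.3):
    # Running-mean (nearest-neighbor) estimate of E[t | x] at each x_k;
    # a crude stand-in for the variable-span supersmoother.
    n = len(x)
    m = max(2, int(span * n / 2))          # half-window, in ranks of x
    order = np.argsort(x)
    ts = t[order]
    out = np.empty(n)
    for r in range(n):
        lo, hi = max(0, r - m), min(n, r + m + 1)
        out[order[r]] = ts[lo:hi].mean()
    return out

def ace(y, X, n_inner=10, max_outer=30, tol=1e-4):
    # Data version of the ACE algorithm of Section 2: conditional
    # expectations become smooths, norms become sample norms.
    N, p = X.shape
    theta = (y - y.mean()) / y.std()       # initial theta = y / ||y||_N
    phi = np.zeros((N, p))
    e2_prev, e2 = np.inf, np.inf
    for _ in range(max_outer):
        for _ in range(n_inner):           # inner loop: update each phi_k
            for k in range(p):
                resid = theta - phi.sum(axis=1) + phi[:, k]
                phi[:, k] = smooth(X[:, k], resid)
                phi[:, k] -= phi[:, k].mean()
        theta = smooth(y, phi.sum(axis=1))     # outer loop: E[sum phi | y]
        theta -= theta.mean()
        theta /= np.sqrt((theta ** 2).mean())  # ||theta||_N = 1
        e2 = ((theta - phi.sum(axis=1)) ** 2).mean()
        if e2_prev - e2 < tol:             # full pass fails to decrease e2
            break
        e2_prev = e2
    return theta, phi, e2
```

For a categorical predictor, the call to `smooth` would simply be replaced by the category-mean formula displayed above.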
Smoothing procedures have been the subject of considerable study (e.g., see Gasser and Rosenblatt 1979, Cleveland 1979, and Craven and Wahba 1979). Since the smoother is repeatedly applied in the algorithm, high speed is desirable, as well as adaptability to local curvature. We use a smoother employing local linear fits with varying window width determined by local cross-validation (the "supersmoother"; see Friedman and Stuetzle 1982).

The algorithm evaluates θ*, φ_1*, . . . , φ_p* at all the corresponding data values; that is, θ*(y) is evaluated at the set of data values {y_k}, k = 1, . . . , N. The simplest way to understand the shape of the transformations is by means of a plot of each function versus the corresponding data values; that is, through the plots of θ*(y_k) versus y_k and of φ_1*, . . . , φ_p* versus the data values of x_1, . . . , x_p, respectively.

In this section, we illustrate the ACE procedure by applying it to various data sets. In order to evaluate performance on finite samples, the procedure is first applied to simulated data for which the optimal transformations are known. We next apply it to the Boston housing data of Harrison and Rubinfeld (1978) as listed in Belsley et al. (1980), contrasting the ACE transformations with those used in the original analysis. For our last example, we apply the ACE procedure to a multiple time series to study the relation between air pollution (ozone) and various meteorological quantities.

Our first example consists of 200 bivariate observations {(y_k, x_k), 1 ≤ k ≤ 200} generated from the model

  y_k = exp(x_k³ + ε_k),

with the x_k and the ε_k drawn independently from a standard normal distribution N(0, 1). Figure 1(a) shows a scatterplot of these data. Figures 1(b)-1(d) show the results of applying the ACE algorithm to the data. The estimated optimal transformation θ*(y) is shown in Figure 1(b)'s plot of θ*(y_k) versus y_k, 1 ≤ k ≤ 200. Figure 1(c) is a plot of φ*(x_k) versus x_k. These plots suggest the transformations θ(y) = log(y) and φ(x) = x³, which are optimal for the parent distribution. Figure 1(d) is a plot of θ*(y_k) versus φ*(x_k). This plot indicates a more linear relation between the transformed variables than that between the untransformed ones.

The next issue we address is how much the algorithm overfits the data due to the repeated smoothings, resulting in inflated estimates of the maximal correlation ρ* and of R*² = 1 − e*². The answer, on the simulated data sets we have generated, is surprisingly little.
Figure 1. First Example: (a) Original Data; (b) Transform on y; (c) Transform on x; (d) Transformed Data.
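This first example is easy to regenerate with the sketch from Section 3; the fragment below (our construction, reusing the `ace` function defined earlier) also computes the two estimates of ρ* that are contrasted in the next paragraphs.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.standard_normal(N)
eps = rng.standard_normal(N)
y = np.exp(x ** 3 + eps)

theta, phi, e2 = ace(y, x.reshape(-1, 1))

# ACE estimate of the maximal correlation (both transforms standardized)
# against the "direct" estimate built from the known optimal transforms;
# the two estimates should agree closely.
phi1 = phi[:, 0] / np.sqrt(np.mean(phi[:, 0] ** 2))
rho_ace = np.mean(theta * phi1)
rho_direct = np.corrcoef(np.log(y), x ** 3)[0, 1]
```

Plotting `theta` against y and `phi1` against x reproduces the shapes of Figures 1(b) and 1(c).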
To illustrate this, we contrast two estimates of ρ* and R*² using the above model. The known optimal transformations are θ(Y) = log Y, φ(X) = X³. Therefore, we define the direct estimate ρ̂ for ρ*, given any data set generated as above, by the sample correlation between log y_k and x_k³, and set R̂² = ρ̂². The ACE algorithm produces the estimates

  ρ̂* = (1/N) Σ_{k=1}^N θ̂*(y_k) φ̂*(x_k)

and R̂*² = 1 − ê*². In this model ρ* = .707 and R*² = .5.

For 100 data sets, each of size 200, generated from the above model, the means and standard deviations of the ρ* estimates are in Table 1. The means and standard deviations of the R*² estimates are in Table 2.

Table 1. Comparison of ρ* Estimates

  Estimate   Mean   Standard Deviation
  direct     .700   .034
  ACE        .709   .036

Table 2. Comparison of R*² Estimates

  Estimate   Mean   Standard Deviation
  direct     .492   .047
  ACE        .503   .050

We also computed the differences ρ̂* − ρ̂ and R̂*² − R̂² for the 100 data sets. The means and standard deviations are in Table 3.

Table 3. Estimate Differences

  Estimate     Mean   Standard Deviation
  ρ̂* − ρ̂     .009   .012
  R̂*² − R̂²   .015   .022

The preceding experiment was duplicated for the smaller sample size N = 100. In this case we obtained the differences in Table 4.

Table 4. Estimate Differences, Sample Size 100

  Estimate     Mean   Standard Deviation
  ρ̂* − ρ̂     .029   .034
  R̂*² − R̂²   .042   .051

We next show an application of the procedure to simulated data generated from the model

  y_k = exp[sin(2πx_k) + ε_k/2],  1 ≤ k ≤ 200,

with the x_k sampled from a uniform distribution U(0, 1) and the ε_k drawn independently of the x_k from a standard normal distribution N(0, 1). Figure 2(a) shows a scatterplot of these data. Figures 2(b) and 2(c) show the optimal transformation estimates θ*(y) and φ*(x). Although log(y) and sin(2πx) are not the optimal transformations for this model [owing to the non-normal distribution of sin(2πx)], these transformations are still clearly suggested by the resulting estimates.

Our next example consists of a sample of 200 triples {(y_k, x_{k1}, x_{k2}), 1 ≤ k ≤ 200} drawn from the model Y = X_1X_2, with X_1 and X_2 generated independently from a uniform distribution U(−1, 1). Note that θ(Y) = log(Y) and φ_j(X_j) = log X_j (j = 1, 2) cannot be solutions here, since Y, X_1, and X_2 all assume negative values. Figure 3(a) shows a plot of θ*(y_k) versus y_k, and Figures 3(b) and 3(c) show corresponding plots of φ_1*(x_{k1}) and φ_2*(x_{k2}) (1 ≤ k ≤ 200). All three solution transformation functions are seen to be double-valued. The optimal transformations for this problem are θ*(Y) = log|Y| and φ_j*(X_j) = log|X_j| (j = 1, 2). The estimates clearly reflect this structure except near the origin, where the smoother cannot reproduce the infinite discontinuity in the derivative. This example illustrates that the ACE algorithm is able to produce nonmonotonic estimates for both response and predictor transformations.

For our next example, we apply the ACE algorithm to the Boston housing market data of Harrison and Rubinfeld (1978). A complete listing of these data appears in Belsley et al. (1980). Harrison and Rubinfeld used these data to estimate marginal air pollution damages as revealed in the housing market. Central to their analysis was a housing value equation that relates the median value of owner-occupied homes in each of the 506 census tracts in the Boston Standard Metropolitan Statistical Area to air pollution (as reflected in concentration of nitrogen oxides) and to 12 other variables that are thought to affect housing prices. This equation was estimated by trying to determine the best-fitting functional form of housing price on these 13 variables. By experimenting with a number of possible transformations of the 14 variables (response and 13 predictors), Harrison and Rubinfeld settled on an equation of the form

  log(MV) = a_1 + a_2(RM)² + a_3·AGE + a_4·log(DIS) + a_5·log(RAD) + a_6·TAX + a_7·PTRATIO + a_8(B − .63)² + a_9·log(LSTAT) + a_10·CRIM + a_11·ZN + a_12·INDUS + a_13·CHAS + a_14(NOX)^p + ε.

A brief description of each variable is given in Appendix B. (For a more complete description, see Harrison and Rubinfeld 1978, table 4.) The coefficients a_1, . . . , a_14 were determined by a least squares fit to measurements of the 14 variables for the 506 census tracts. The best value for the exponent p was found to be 2.0, by a numerical optimization (grid search). This "basic equation" was used to generate estimates for the willingness to pay for and the marginal benefits of clean air. Harrison and Rubinfeld (1978) noted that the results are highly sensitive to the particular specification of the form of the housing price equation.

We applied the ACE algorithm to the transformed measurements (y', x_1', . . . , x_13') (using p = 2 for NOX) appearing in the basic equation. To the extent that these transformations are close to the optimal ones, the algorithm will produce almost linear functions. Departures from linearity indicate transformations that can improve the quality of the fit.

In this (and the following) example we apply the procedure in a forward stepwise manner.
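For readers who want the basic equation as a baseline, its least squares fit is a few lines of Python. The sketch below assumes the Belsley et al. (1980) listing has been loaded into a (506, 14) NumPy array whose column order matches the equation; that loading step and the column ordering are our assumptions, not part of the original analysis.

```python
import numpy as np

# assumed column order:
# MV, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, CRIM, ZN, INDUS, CHAS, NOX
def basic_equation_design(data):
    (MV, RM, AGE, DIS, RAD, TAX, PTRATIO,
     B, LSTAT, CRIM, ZN, INDUS, CHAS, NOX) = data.T
    y = np.log(MV)
    X = np.column_stack([
        np.ones(len(y)), RM ** 2, AGE, np.log(DIS), np.log(RAD), TAX,
        PTRATIO, (B - .63) ** 2, np.log(LSTAT), CRIM, ZN, INDUS, CHAS,
        NOX ** 2,                    # exponent p = 2 from the grid search
    ])
    return y, X

# y, X = basic_equation_design(data)
# a, *_ = np.linalg.lstsq(X, y, rcond=None)   # coefficients a_1, ..., a_14
```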
Figure 2. Second Example: (a) the data, y versus x; (b) θ*(y); (c) φ*(x).

Figure 3. Third Example: (a) θ*(y); (b) φ_1*(x_1); (c) φ_2*(x_2).
Figure 4. Boston Housing Data: (a) Transformed log(MV); (b) Transformed MV; (c) Transformed RM² (σ = .492); (d) Transformed log(LSTAT) (σ = .417); (e) Transformed PTRATIO (σ = .147); (f) Transformed TAX (σ = .122); (g) Transformed NOX² (σ = .09); (h) Transformed y Versus Predictor of Transformed y.
The solution transformation θ̂(y') is shown in Figure 4(a). This function is seen to have a positive curvature for central values of y', connecting two straight line segments of different slope on either side. This suggests that the logarithmic transformation may be too severe. Figure 4(b) shows the transformation θ̂(y) resulting when the (forward stepwise) ACE algorithm is applied to the original untransformed census measurements. (The same predictor variable set appears in this model.) This analysis indicates that, if anything, a mild transformation, involving positive curvature, is most appropriate for the response variable.

Figures 4(c)-4(f) show the ACE transformations φ̂_1*(x_1'), . . . , φ̂_4*(x_4') for the (transformed) predictor variables x_j' appearing in the final model. The standard deviation σ(φ̂_j*) is indicated in each graph. This provides a measure of how strongly each φ̂_j*(x_j') enters into the model for θ̂*(y'). [Note that σ(θ̂) = 1.] The two terms that enter most strongly involve the number of rooms squared [Figure 4(c)] and the logarithm of the fraction of population that is of lower status [Figure 4(d)]. The nearly linear shape of the latter transformation suggests that the original logarithmic transformation was appropriate for this variable. The transformation on the number of rooms squared variable is far from linear, however, indicating that a simple quadratic does not adequately capture its relationship to housing value. For fewer than six rooms, housing value is roughly independent of room number, whereas for larger values there is a strong increasing linear dependence. The remaining two variables that enter into this model are pupil-teacher ratio and property tax rate. The solution transformation for the former, Figure 4(e), is seen to be approximately linear, whereas that for the latter, Figure 4(f), has considerable nonlinear structure. For tax rates of up to $320, housing price seems to fall rapidly with increasing tax, whereas for larger rates the association is roughly constant.

Although the variable (NOX)² was not selected by our stepwise procedure, we can try to estimate its marginal effect on median home value by including it with the four selected variables and running ACE with the resulting five predictor variables. The increase in R² over the four-predictor model was .006. The solution transformations on the response and original four predictors changed very little. The solution transformation for (NOX)² is shown in Figure 4(g). This curve is a nonmonotonic function of NOX², not well approximated by a linear (or monotone) function. This makes it difficult to formulate a simple interpretation of the willingness to pay for clean air from these data. For low concentration values, housing prices seem to increase with increasing (NOX)², whereas for higher values this trend is substantially reversed.

Figure 4(h) shows a scatterplot of θ̂*(y_k) versus Σ_j φ̂_j*(x_{kj}) for the four-predictor model. This plot shows no evidence of additional structure not captured in the model

  θ̂(y) = Σ_{j=1}^4 φ̂_j*(x_j) + e.

The ê*² resulting from the use of the ACE transformations was .11, as compared to the e² value of .20 produced by the Harrison and Rubinfeld (1978) transformations involving all 14 variables.

For our final example, we use the ACE algorithm to study the relationship between atmospheric ozone concentration and meteorology in the Los Angeles basin. The data consist of daily measurements of ozone concentration (maximum one-hour average) and eight meteorological quantities for 330 days of 1976. Appendix C lists the variables used in the study. The ACE algorithm was applied here in the same forward stepwise manner as in the previous (housing data) example. Four variables were selected. These are the first four listed in Appendix C. The resulting R² was .78. Running the ACE algorithm with all eight predictor variables produces an R² of .79.

In order to assess the extent to which these meteorological variables capture the daily variation of the ozone level, the variable day-of-the-year was added and the ACE algorithm was run with it and the four selected meteorological variables. This can detect possible seasonal effects not captured by the meteorological variables. The resulting R² was .82. Figures 5(a)-5(f) show the optimal transformation estimates.

The solution for the response transformation, Figure 5(a), shows that, at most, a very mild transformation with negative curvature is indicated. Similarly, Figure 5(b) indicates that there is no compelling necessity to consider a transformation on the most influential predictor variable, Sandburg Air Force Base Temperature. The solution transformation estimates for the remaining variables, however, are all highly nonlinear (and nonmonotonic). For example, Figure 5(d) suggests that the ozone concentration is much more influenced by the magnitude than the sign of the pressure gradient.

The solution for the day-of-the-year variable, Figure 5(f), indicates a substantial seasonal effect after accounting for the meteorological variables. This effect is minimum at the year boundaries and has a broad maximum peaking at about May 1. This can be compared with the dependence of ozone pollution on day-of-the-year alone, without taking into account the meteorological variables. Figure 5(g) shows a smooth of ozone concentration on day-of-the-year. This smooth has an R² of .38 and is seen to peak about three months later (August 3).

The fact that the day-of-the-year transformation peaked at the beginning of May was initially puzzling to us, since the highest pollution days occur from July to September. This latter fact is confirmed by the day-of-the-year transformation with the meteorological variables removed. Our current belief is that with the meteorological variables entered, day-of-the-year becomes a partial surrogate for hours of daylight before and during the morning commuter rush. The decline past May 1 may then be explained by the fact that daylight saving time goes into effect in Los Angeles on the last Sunday in April.

These data illustrate that ACE is useful in uncovering interesting and suggestive relationships. The form of the dependence on the Daggett pressure gradient and on the day-of-the-year would be extremely difficult to find by any previous methodology.
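The forward stepwise manner used in both examples can be sketched as a wrapper around the `ace` function from earlier; the greedy loop below is our construction (the paper does not spell out the selection rule), adding at each step the predictor that most increases R̂*² = 1 − ê*² and stopping when the gain falls below a threshold.

```python
def forward_stepwise_ace(y, X, min_gain=0.01):
    # Greedy forward selection using the ace() sketch: at each step, add
    # the predictor whose inclusion most increases R*^2 = 1 - e*^2.
    N, p = X.shape
    selected, r2 = [], 0.0
    while len(selected) < p:
        best_j, best_r2 = None, r2
        for j in range(p):
            if j in selected:
                continue
            _, _, e2 = ace(y, X[:, selected + [j]])
            if 1 - e2 > best_r2:
                best_j, best_r2 = j, 1 - e2
        if best_j is None or best_r2 - r2 < min_gain:
            break
        selected.append(best_j)
        r2 = best_r2
    return selected, r2
```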
Figure 5. Ozone Data: estimated optimal transformations, including θ̂(UPO3) for the ozone response and φ̂*(VSTY), φ̂*(IBHT), and φ̂*(DGPG) among the predictors.
4. DISCUSSION

The ACE algorithm provides a fully automated method for estimating optimal transformations in multiple regression. It also provides a method for estimating maximal correlation between random variables. It differs from other empirical methods for finding transformations (Box and Tidwell 1962; Anscombe and Tukey 1963; Box and Cox 1964; Kruskal 1964, 1965; Fraser 1967; Box and Hill 1974; Linsey 1972, 1974; Wood 1974; Mosteller and Tukey 1977; and Tukey 1982) in that the "best" transformations of the response and predictor variables are unambiguously defined and estimated without use of ad hoc heuristics, restrictive distributional assumptions, or restriction of the transformation to a particular parametric family.

The algorithm is reasonably computer efficient. On the Boston housing data set, comprising 506 data points with 14 variables each, the run took 12 seconds of central processing unit (CPU) time on an IBM 3081 computer. Our guess is that this translates into 2.5 minutes on a VAX 11/750 computer. To extrapolate to other problems, use the estimate that running time is proportional to (number of variables) × (sample size).

A strong advantage of the ACE procedure is the ability to incorporate variables of quite different type in terms of the set of values they can assume. The transformation functions θ(y), φ_1(x_1), . . . , φ_p(x_p) assume values on the real line. Their arguments can, however, assume values on any set. For example, ordered real, periodic (circularly valued) real, ordered categorical, and unordered categorical variables can be incorporated in the same regression equation. For periodic variables, the smoother window need only wrap around the boundaries. For categorical variables, the procedure can be regarded as estimating optimal scores for each of their values. (The special case of a categorical response and a single categorical predictor variable is known as canonical analysis; see Kendall and Stuart 1967, p. 568. The optimal scores can, in this case, also be obtained by solution of a matrix eigenvector problem.)

The ACE procedure can also handle variables of mixed type. For example, a variable indicating present marital status might take on an integer value (number of years married) or one of several categorical values (N = never, D = divorced, W = widowed, etc.). This presents no additional complication in estimating conditional expectations. This ability provides a straightforward way to handle missing data values (Young et al. 1976). In addition to the regular sets of values realized by a variable, it can also take on the value "missing."

In some situations the analyst, after running ACE, may want to estimate values of y rather than θ*(y), given a specific value of x. One method for doing this is to attempt to compute θ*⁻¹(Σ_j φ_j*(x_j)). Letting Z = Σ_{j=1}^p φ_j*(X_j), however, we know that the best least squares predictor of Y of the form z(Z) is given by E(Y | Z). This is implemented in the current ACE program by predicting y as the function of Σ_{j=1}^p φ_j*(x_j) obtained by smoothing the data values of y on the data values of Σ_j φ_j*(x_j). We are grateful to Arthur Owens for suggesting this simple and elegant prediction procedure.
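In the running Python sketch, this prediction step is one more smooth; the helper below (our naming) reuses `smooth` to estimate E(Y | Z) at each observation.

```python
def predict_y(y, phi):
    # Predict y from the transformed predictors: smooth the data values
    # of y on z_k = sum_j phi_j(x_kj), estimating E(Y | Z).
    z = phi.sum(axis=1)
    return smooth(z, y)
```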
The solution functions θ*(y) and φ_1*(x_1), . . . , φ_p*(x_p) can be stored as a set of values associated with each observation (y_k, x_{k1}, . . . , x_{kp}), 1 ≤ k ≤ N. Since θ(y) and φ(x), however, are usually smooth (for continuous y, x), they can be easily approximated and stored as cubic spline functions (deBoor 1978) with a few knots.

As a tool for data analysis, the ACE procedure provides graphical output to indicate a need for transformations as well as to guide in their choice. If a particular plot suggests a familiar functional form for a transformation, then the data can be pre-transformed using this functional form and the ACE algorithm can be rerun. The linearity (or nonlinearity) of the resulting ACE transformation on the variable in question gives an indication of how good the analyst's guess is. We have found that the plots themselves often give surprising new insights into the relationship between the response and predictor variables.

As with any regression procedure, a high degree of association between predictor variables can sometimes cause the individual transformation estimates to be highly variable, even though the complete model is reasonably stable. When this is suspected, running the algorithm on randomly selected subsets of the data, or on bootstrap samples (Efron 1979), can assist in assessing the variability.

The ACE method has generality beyond that exploited here. An immediate generalization would involve multiple response variables Y_1, . . . , Y_q. The generalized algorithm would estimate optimal transformations θ_1*, . . . , θ_q*, φ_1*, . . . , φ_p* that minimize

  E[Σ_{i=1}^q θ_i(Y_i) − Σ_{j=1}^p φ_j(X_j)]²

subject to Eθ_i = 0, i = 1, . . . , q, Eφ_j = 0, j = 1, . . . , p, and ‖Σ_{i=1}^q θ_i(Y_i)‖² = 1. This extension generalizes the ACE procedure in a sense similar to that in which canonical correlation generalizes linear regression.

The ACE algorithm (Section 2) is easily modified to incorporate this extension. An inner loop over the response variables, analogous to that for the predictor variables, replaces the single-function minimization.

5. OPTIMAL TRANSFORMATIONS IN FUNCTION SPACE

5.1 Introduction

In this section, we first prove the existence of optimal transformations (Theorem 5.2). Then we show that the ACE algorithm converges to an optimal transformation (Theorems 5.4 and 5.5).

Define random variables to take values either in the reals or in a finite or countable unordered set. Given a set of random variables Y, X_1, . . . , X_p, a transformation is defined by a set of real-valued measurable functions (θ, φ_1, . . . , φ_p) = (θ, φ), each function defined on the range of the corresponding random variable, such that

  Eθ(Y) = 0,  Eφ_j(X_j) = 0,  j = 1, . . . , p,
  Eθ²(Y) < ∞,  Eφ_j²(X_j) < ∞,  j = 1, . . . , p.  (5.1)

Use the notation

  φ(X) = Σ_{i=1}^p φ_i(X_i).  (5.2)

Denote the set of all transformations by W.

Definition 5.1. A transformation (θ*, φ*) is optimal for regression if E(θ*)² = 1 and

  e*² = E[θ*(Y) − φ*(X)]² = inf{E[θ(Y) − φ(X)]²; Eθ² = 1}.
Definition 5.2. A transformation (θ**, φ**) is optimal for correlation if E(θ**)² = 1, E(φ**)² = 1, and

  ρ* = E[θ**(Y)φ**(X)] = sup{E[θ(Y)φ(X)]; Eθ² = 1, E(φ)² = 1}.

Theorem 5.1. If (θ**, φ**) is optimal for correlation, then θ* = θ**, φ* = ρ*φ** is optimal for regression, and the converse. Furthermore, e*² = 1 − ρ*².

To formulate the second assumption, we use Definition 5.3.

Definition 5.3. Define the Hilbert spaces H2(Y), H2(X_1), . . . , H2(X_p) as the sets of functions satisfying (5.1) with the usual inner product; that is, H2(X_j) is the set of all measurable φ_j such that Eφ_j(X_j) = 0, Eφ_j²(X_j) < ∞, with (φ_j', φ_j) = E[φ_j'(X_j)φ_j(X_j)].

Assumption 5.2 is satisfied in most cases of interest. A sufficient condition is given by the following. Let X, Y be random variables with joint density f_{X,Y} and marginals f_X, f_Y. Then the conditional expectation operator H2(Y) → H2(X) is compact if

  ∫∫ [f²_{X,Y}/(f_X f_Y)] dx dy < ∞.

Theorem 5.2. Under Assumptions 5.1 and 5.2, optimal transformations exist.

Some machinery is needed.

Proposition 5.1. The set of all functions f of the form

  f(Y, X) = θ(Y) + Σ_j φ_j(X_j),  θ ∈ H2(Y), φ_j ∈ H2(X_j),

with the inner product and norm

  (g, f) = E[gf],  ‖f‖² = Ef²,

is a Hilbert space, denoted by H2. The subspace of all functions φ of the form φ(X) = Σ_j φ_j(X_j) is denoted by H2(X).

Write

  E[φ_{n'j}(X_j)φ_{n'i}(X_i)] = E[φ_{n'j}(X_j) E(φ_{n'i}(X_i) | X_j)]

to see that Assumption 5.2 implies Eφ_{n'j}φ_{n'i} → Eφ_jφ_i (i ≠ j), and similarly for Eθ_{n'}φ_{n'j}. Furthermore, ‖φ_j‖ ≤ lim inf ‖φ_{n'j}‖ and ‖θ‖ ≤ lim inf ‖θ_{n'}‖. Thus, defining f = θ + Σ_j φ_j,

  ‖f‖² = ‖θ + Σ_j φ_j‖² ≤ lim inf ‖f_{n'}‖² = 0,

so that θ and the φ_j are individually a.s. zero. On the other hand, if f ≠ 0, then lim inf ‖f_n‖² ≥ 1.

Corollary 5.1. If f_n →w f in H2, then θ_n →w θ in H2(Y), φ_{nj} →w φ_j in H2(X_j), j = 1, . . . , p, and the converse.

Proof. If f_n = θ_n + Σ_j φ_{nj} →w θ + Σ_j φ_j, then by Proposition 5.2, lim sup ‖θ_n‖ < ∞ and lim sup ‖φ_{nj}‖ < ∞. Take n' such that θ_{n'} →w θ', φ_{n'j} →w φ_j', and let f' = θ' + Σ_j φ_j'. Then for any g ∈ H2, (g, f_{n'}) → (g, f'), so (g, f) = (g, f') for all g. The converse is easier.

Definition 5.4. In H2, let P_Y, P_j, and P_X denote the projection operators onto H2(Y), H2(X_j), and H2(X), respectively.
On H2(X_i), P_j (j ≠ i) is the conditional expectation operator, and similarly for P_Y.

Proposition 5.3. P_Y is compact on H2(X) → H2(Y), and P_X is compact on H2(Y) → H2(X).

Proof. Take φ_n ∈ H2(X), φ_n →w φ. This implies, by Corollary 5.1, that φ_{nj} →w φ_j. By Assumption 5.2, P_Yφ_{nj} → P_Yφ_j, so that P_Yφ_n → P_Yφ. Now take θ ∈ H2(Y), φ ∈ H2(X); then (θ, P_Yφ) = (θ, φ) = (P_Xθ, φ). Thus P_X: H2(Y) → H2(X) is the adjoint of P_Y and hence compact.

Now to complete the proof of Theorem 5.2, consider the functional ‖θ − φ‖² on the set of all (θ, φ) with ‖θ‖² = 1. For any θ, φ,

  ‖θ − φ‖² ≥ ‖θ − P_Xθ‖².

If there is a θ* that achieves the minimum of ‖θ − P_Xθ‖² over ‖θ‖² = 1, then an optimal transformation is (θ*, P_Xθ*). On ‖θ‖² = 1,

  ‖θ − P_Xθ‖² = 1 − ‖P_Xθ‖².

Let s = sup{‖P_Xθ‖; ‖θ‖ = 1}. Take θ_n such that ‖θ_n‖² = 1, θ_n →w θ, and ‖P_Xθ_n‖ → s. By the compactness of P_X, ‖P_Xθ_n‖ → ‖P_Xθ‖ = s. Furthermore, ‖θ‖ ≤ 1. If ‖θ‖ < 1, then for θ' = θ/‖θ‖, we get the contradiction ‖P_Xθ'‖ > s. Hence ‖θ‖ = 1 and (θ, P_Xθ) is an optimal transformation. This argument assumes that s > 0. If s = 0, then ‖θ − P_Xθ‖ = 1 for all θ with ‖θ‖ = 1, and any (θ, 0) is optimal.

Write

  ‖θ* − φ*‖² = 1 − 2(θ*, φ*) + ‖φ*‖².

Note that (θ*, φ*) = (θ*, P_Yφ*) ≤ ‖P_Yφ*‖, with equality only if θ* = cP_Yφ*, c constant. Therefore, θ* = P_Yφ*/‖P_Yφ*‖. This implies

  ‖P_Yφ*‖θ* = Uθ*,  ‖P_Xθ*‖φ* = Vφ*,

so that ‖P_Yφ*‖ is an eigenvalue λ* of U, V. Computing gives ‖θ* − φ*‖² = 1 − λ*. Now take θ any eigenfunction of U corresponding to λ, with ‖θ‖ = 1. Let φ = P_Xθ; then ‖θ − φ‖² = 1 − λ. This shows that θ*, φ* are not optimal unless λ* is the largest eigenvalue λ of U. The rest of the theorem is straightforward verification.

Corollary 5.2. If λ has multiplicity one, then the optimal transformation is unique up to a sign change. In any case, the set of optimal transformations is finite dimensional.

5.4 Alternating Conditional Methods

Direct solution of the equations λθ = Uθ or λφ = Vφ is formidable. Attempting to use data to directly estimate the solutions is just as difficult. In the bivariate case, if X, Y are categorical, then λθ = Uθ becomes a matrix eigenvalue problem and is tractable. This is the case treated in Kendall and Stuart (1967).

The ACE algorithm is founded on the observation that there is an iterative method for finding optimal transformations. We illustrate this in the bivariate case. The goal is to minimize ‖θ(Y) − φ(X)‖² with ‖θ‖² = 1. Denote P_Xθ = E(θ | X), P_Yφ = E(φ | Y). Start with any first-guess function θ_0(Y) having a nonzero projection on the eigenspace E of the largest eigenvalue λ of U. Then define a sequence of functions by

  φ_n = P_Xθ_n,  θ_{n+1} = P_Yφ_n/‖P_Yφ_n‖.

Theorem 5.4. If ‖P_Eθ_0‖ ≠ 0, define an optimal transformation by θ* = P_Eθ_0/‖P_Eθ_0‖, φ* = P_Xθ*. Then ‖θ_n − θ*‖ → 0, ‖φ_n − φ*‖ → 0.
Proof. Notice that θ_{n+1} = Uθ_n/‖Uθ_n‖. For any n, θ_n = a_nθ* + g_n, where g_n ⊥ E, because, if it is true for n, then

  θ_{n+1} = (a_nλθ* + Ug_n)/‖a_nλθ* + Ug_n‖

and Ug_n is ⊥ to E. For any g ⊥ E, ‖Ug‖ ≤ r‖g‖, where r < λ. Since a_{n+1} = λa_n/‖Uθ_n‖ and g_{n+1} = Ug_n/‖Uθ_n‖, then

  ‖g_{n+1}‖/a_{n+1} = ‖Ug_n‖/λa_n ≤ (r/λ)‖g_n‖/a_n.

Thus ‖g_n‖/a_n ≤ c(r/λ)ⁿ. But ‖θ_n‖ = 1 and a_n² + ‖g_n‖² = 1, implying a_n² → 1. Since a_0 > 0, then a_n > 0; so a_n → 1. Now use ‖θ_n − θ*‖² = (1 − a_n)² + ‖g_n‖² to reach the conclusion. Since ‖φ_{n+1} − φ*‖ = ‖P_Xθ_n − P_Xθ*‖ ≤ ‖θ_n − θ*‖, the theorem follows.

The Inner Loop. (a) Start with functions θ, φ^(0). (b) If, after m stages of iteration, the functions are φ^(m), then define, for j = 1, 2, . . . , p,

  φ_j^(m+1) = E[θ − Σ_{i<j} φ_i^(m+1) − Σ_{i>j} φ_i^(m) | X_j].

Theorem 5.5. Let φ_m = Σ_j φ_j^(m). Then ‖P_Xθ − φ_m‖ → 0.

Proof. Define the operator T by

  T = (I − P_p)(I − P_{p−1}) · · · (I − P_1).

Then the iteration in the inner loop is expressed as

  θ − φ_{m+1} = T(θ − φ_m) = T^{m+1}(θ − φ_0).  (5.5)

Write θ − φ_0 = θ − P_Xθ + P_Xθ − φ_0. Noting that T(θ − P_Xθ) = θ − P_Xθ, (5.5) becomes

  θ − φ_{m+1} = θ − P_Xθ + T^{m+1}(P_Xθ − φ_0),

and for any φ ∈ H2(X) fixed by T,

  (φ, φ) = Σ_j (P_jφ, φ_j) = 0.

The operator T can be decomposed as I + W, where W is compact. Now we claim that ‖T^m W‖ → 0 on H2(X). To prove this, let γ > 0 and define

  G(γ) = sup{‖TWφ‖/‖Wφ‖; ‖φ‖ ≤ 1, ‖Wφ‖ ≥ γ}.

Take φ_n →w φ, ‖φ_n‖ ≤ 1, ‖Wφ_n‖ ≥ γ so that ‖TWφ_n‖/‖Wφ_n‖ → G(γ). Then ‖φ‖ ≤ 1, ‖Wφ‖ ≥ γ, and G(γ) = ‖TWφ‖/‖Wφ‖. Thus G(γ) < 1 for all γ > 0 and is clearly nonincreasing in γ. Then

  ‖T^m Wφ‖ = ‖TWT^{m−1}φ‖ ≤ G(‖T^{m−1}Wφ‖)‖T^{m−1}Wφ‖.

Put γ_0 = ‖W‖, γ_m = G^m(γ_0)γ_0; then ‖T^m W‖ ≤ γ_m.

The range of W is dense in H2(X). Otherwise, there is a φ' ≠ 0 such that (φ', Wφ) = 0, all φ. This implies (W*φ', φ) = 0 or W*φ' = 0. Then ‖T*φ'‖ = ‖φ'‖, and a repetition of the argument given before leads to φ' = 0. For any φ and ε > 0, take Wφ_1 so that ‖φ − Wφ_1‖ ≤ ε. Then ‖T^m φ‖ ≤ ε + ‖T^m Wφ_1‖, which completes the proof.

There are two versions of the double loop. In the first, the initial functions φ_0 are the limiting functions produced by the preceding inner loop. This is called the restart version. In the second, the initial functions are φ_0 ≡ 0. This is the fresh start version. The main theoretical difference is that a stronger consistency result holds for the fresh start. Restart is a faster-running algorithm, and it is embodied in the ACE code.

The Single-Loop Algorithm

The original implementation of ACE combined a single iteration of the inner loop with an iteration of the outer loop. Thus it is summarized by the following.

1. Start with θ_0, φ_0 = 0.
2. If the current functions are θ_n, φ_n, define φ_{n+1} by one pass of the inner loop.
3. Let θ_{n+1} = P_Yφ_{n+1}/‖P_Yφ_{n+1}‖. Run to convergence.

This is a cleaner algorithm than the double loop, and its implementation on data runs at least twice as fast as the double loop and requires only a single convergence test. Unfortunately, we have been unable to prove that it converges in function space. Assuming convergence, it can be shown that the limiting θ is an eigenfunction of U. But giving conditions for θ to correspond to λ, or even showing that θ will correspond to λ "almost always," seems difficult. For this reason, we adopted the double-loop algorithm instead.

APPENDIX A

A.1 Introduction

The convergence and consistency results for ACE on data depend on the properties of the data smooth used. The results are fragmentary. Convergence of the algorithm is proven only for a restricted class of smooths. In practice, in more than 1,000 runs of ACE on a wide variety of data sets and using three different types of smooths, we have seen only one instance of failure to converge. A fairly general, but weak, consistency proof is given. We conjecture the form of a stronger consistency result.

A.2 Data Smooths

Define a data set D to be a set {x_1, . . . , x_N} of N points in p-dimensional space; that is, x_k = (x_{k1}, . . . , x_{kp}). Let D_N be the collection of all such data sets. For fixed D, define F(x) as the space of all real-valued functions φ defined on D; that is, φ ∈ F(x) is defined by the N real numbers {φ(x_1), . . . , φ(x_N)}. Define F(x_j), j = 1, . . . , p, as the space of all real-valued functions defined on the set {x_{1j}, x_{2j}, . . . , x_{Nj}}.

Definition A.1. A data smooth S of x on x_j is a mapping S: F(x) → F(x_j) defined for every D in D_N. If φ ∈ F(x), denote the corresponding element in F(x_j) by S(φ | x_j) and its values by S(φ | x_{kj}).

Let x be any one of x_1, . . . , x_p. Some examples of data smooths are the following.
1. Histogram. Divide the real axis into disjoint intervals {I_ν}. If x_k ∈ I_ν, define

  S(φ | x_k) = (1/n_ν) Σ_{x_m ∈ I_ν} φ(x_m),

where n_ν is the number of data points in I_ν.

2. Nearest Neighbor (Running Mean). Order the x_k so that x_1 ≤ x_2 ≤ · · · ≤ x_N. Fix M and define S(φ | x_k) as the average of φ over x_{k−M}, . . . , x_{k+M}, making up any deficiency at the ends on the other side.

3. Kernel. Take K(x) defined on the reals with maximum at x = 0. Then

  S(φ | x_k) = Σ_m φ(x_m)K(x_m − x_k) / Σ_m K(x_m − x_k).

4. Regression. Fix M and order the x_k as in example 2. At x_k, regress the values φ(x_{k−M}), . . . , φ(x_{k+M}), excluding φ(x_k), on x_{k−M}, . . . , x_{k+M}, excluding x_k, getting a regression line L(x). Put S(φ | x_k) = L(x_k). If M points are not available on each side of x_k, make up the deficiency on the other side.

5. Supersmoother. See Friedman and Stuetzle (1982).

Some properties that are relevant to the behavior of smoothers are given next. These properties hold only if they are true for all D ∈ D_N.

1. Linearity. A smooth is linear if

  S(αφ_1 + βφ_2) = αSφ_1 + βSφ_2

for all φ_1, φ_2 ∈ F(x) and all constants α, β.

2. Constant Preserving. If φ ∈ F(x) is constant (φ ≡ c), then Sφ = c.

To give a further property, introduce the inner product (·, ·)_N on F(x) defined by

  (φ, φ')_N = (1/N) Σ_k φ(x_k)φ'(x_k),

and the corresponding norm ‖·‖_N.

3. Boundedness. S is bounded by M if ‖Sφ‖_N ≤ M‖φ‖_N.

We require the smooth to be constant preserving so that the modified smooths take constants into zero. The ACE algorithm is defined by the following.

1. θ^(0)(y_k) = y_k, φ_j^(0)(x_{kj}) = 0.
2. Carry out the inner loop: for j = 1, . . . , p, replace φ_j(x_{kj}) by S_j(θ − Σ_{i≠j} φ_i | x_{kj}), repeating until ê² fails to decrease.
3. Set θ^(n+1) = S_Y(Σ_j φ_j)/‖S_Y(Σ_j φ_j)‖_N. Go back to the inner loop with φ_j^(0) = φ_j (restart) or φ_j^(0) = 0 (fresh start). Continue until convergence.

To formalize this algorithm, introduce the space H2(θ, φ) with elements (θ, φ_1, . . . , φ_p), θ ∈ H2(y), φ_j ∈ H2(x_j), and subspaces H2(θ) with elements (θ, 0, 0, . . . , 0) = θ and H2(φ) with elements (0, φ_1, . . . , φ_p) = φ.

For f = (f_0, f_1, . . . , f_p) in H2(θ, φ), define S_i: H2(θ, φ) → H2(θ, φ) by

  (S_i f)_j = 0, j ≠ i;  (S_i f)_i = f_i + S_i(f_0 − Σ_l f_l | x_i).

Starting with θ = (θ, 0, 0, . . . , 0) and φ^(m) = (0, φ^(m)), one complete cycle in the inner loop is described by

  θ − φ^(m+1) = (I − S_p)(I − S_{p−1}) · · · (I − S_1)(θ − φ^(m)).  (A.2)

Define T on H2(θ, φ) → H2(θ, φ) as the product operator in (A.2). Then

  φ^(m) = θ − T^m(θ − φ^(0)).  (A.3)

If, for a given θ, the inner loop converges, then the limiting φ satisfies

  S_j(θ − φ) = 0,  j = 1, . . . , p.  (A.4)

That is, the smooth of the residuals on any predictor variable is zero.
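For concreteness, the histogram and kernel smooths (examples 1 and 3) are a few lines each in the running Python notation; the function names, bin width, and Gaussian kernel choice are ours.

```python
import numpy as np

def histogram_smooth(x, phi, width=0.5):
    # Example 1: average phi over the bin I_v containing each x_k.
    bins = np.floor(x / width).astype(int)
    out = np.empty_like(phi, dtype=float)
    for b in np.unique(bins):
        mask = bins == b
        out[mask] = phi[mask].mean()
    return out

def kernel_smooth(x, phi, h=0.5):
    # Example 3: kernel-weighted average, with K peaked at 0 (Gaussian).
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (K @ phi) / K.sum(axis=1)
```

Both are linear and constant preserving in the sense just defined.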
Theorem A.1. A sufficient condition that ‖T^m f‖ → 0 for all f ∈ H2(φ) is that

  det[λI − (I − S_p)(I − S_{p−1}) · · · (I − S_1)]  (A.9)

has no zeros in |λ| ≥ 1 except λ = 1.

Proof. For T^m f → 0, all f ∈ H2(φ), it is necessary and sufficient that the spectral radius of T be less than one. The equation Tf = λf in component form is

  λf_j = −S_j(λ Σ_{i<j} f_i + Σ_{i>j} f_i),  j = 1, . . . , p.  (A.10)

Let s = Σ_i f_i and rewrite (A.10) as

  (λI − S_j)f_j = S_j((1 − λ) Σ_{i<j} f_i − s).  (A.11)

If λ = 1, (A.11) becomes (I − S_j)f_j = −S_j s. By assumption, this implies that s = 0, and hence f_j = 0, for all j. This rules out λ = 1 as an eigenvalue of T. For λ ≠ 1, but |λ| greater than the maximum of the spectral radii of the S_j (j = 1, . . . , p), define g_j = (1 − λ) Σ_{i<j} f_i − s. Then f_j = (g_{j+1} − g_j)/(1 − λ), so

  (λI − S_j)(g_{j+1} − g_j) = (1 − λ)S_j g_j.

One then verifies that the resulting system has no solutions λ with |λ| > 1, and finishes by ruling out solutions with |λ| = 1.

Assuming that the inner loop converges to Pθ, the outer loop iteration is given by

  θ^(n+1) = S_Y(Pθ^(n))/‖S_Y(Pθ^(n))‖_N.

Put the matrix S_Y P = U, so that

  θ^(n+1) = Uθ^(n)/‖Uθ^(n)‖_N.  (A.14)

If the eigenvalue λ of U having largest absolute value is real and positive, then θ^(n+1) converges to the projection of θ^(0) on the eigenspace of λ. The limiting θ, Pθ is a solution of (A.4) and (A.5). If λ is not real and positive, then θ^(n) oscillates and does not converge. If the smooths are self-adjoint and non-negative definite, then S_Y P is the product of two self-adjoint non-negative definite matrices; hence it has only real non-negative eigenvalues. We are unable to find conditions guaranteeing this for more general smooths.

It can be easily shown that with modifications near the endpoints, the nearest neighbor smooth satisfies the preceding conditions. Our current research indicates a possibility that other types of common smooths can also be modified into self-adjoint, non-negative definite smooths with non-negative matrix elements. For these, ACE convergence is guaranteed by the preceding arguments.

ACE, however, has invariably converged using a variety of non-self-adjoint smooths (with one exception found using an odd type of kernel smooth). We conjecture that for most data sets, reasonable smooths are "close" enough to being self-adjoint so that their largest eigenvalue is real, positive, and less than one.

A.4 Consistency of ACE

For θ_0, φ_1, . . . , φ_p any functions in H2(Y), H2(X_1), . . . , H2(X_p), and any data set D ∈ D_N, define functions P_j(φ_i | x_j) by

  P_j(φ_i | x_{kj}) = E(φ_i(X_i) | X_j = x_{kj}).  (A.15)

Let φ̃_j in H2(x_j) be defined as the restriction of φ_j to the set of data values {x_{1j}, . . . , x_{Nj}} minus its mean value over the data values.

Assume that the N data vectors (y_k, x_k) are samples from the distribution of (Y, X_1, . . . , X_p), not necessarily independent or even random (see Section A.5).

Definition A.2. Let S_j^(N), S_Y^(N) be any sequence of data smooths. They are mean squared consistent if

  E‖S_j^(N)(φ_i | x_j) − P_j(φ_i | x_j)‖²_N → 0

for all φ_1, . . . , φ_p as above, with the analogous definition for S_Y^(N).

Whether or not the algorithm converges, a weak consistency result can be given under general conditions for the fresh-start algorithm. Start with θ_0 ∈ H2(Y). On each data set, run the inner-loop iteration m times and the outer loop l times, obtaining functions θ(·; m, l), φ_j(·; m, l).

Theorem A.2. If θ* is the optimal transformation P_Eθ_0/‖P_Eθ_0‖, φ* = P_Xθ*, then as m, l → ∞ in any way,

  ‖θ(·; m, l) − θ*‖ → 0,  ‖φ_j(·; m, l) − φ_j*‖ → 0.

Proof. First note that for any product of smooths S_{i1}^(N) · · · S_{ik}^(N),

  E‖S_{i1}^(N) · · · S_{ik}^(N)θ_0 − P_{i1} · · · P_{ik}θ_0‖²_N → 0.

This is illustrated with S_i^(N)S_j^(N)θ_0 (i ≠ j). Since E‖S_j^(N)θ_0 − P_jθ_0‖²_N → 0, then S_j^(N)θ_0 = P_jθ_0 + φ_{j,N}, where E‖φ_{j,N}‖²_N → 0. Therefore

  S_i^(N)(S_j^(N)θ_0) = S_i^(N)P_jθ_0 + S_i^(N)φ_{j,N}.

By assumption, ‖S_i^(N)φ_{j,N}‖_N ≤ M‖φ_{j,N}‖_N, where M does not depend on N. Therefore E‖S_i^(N)φ_{j,N}‖²_N → 0. By assumption, E‖S_i^(N)P_jθ_0 − P_iP_jθ_0‖²_N → 0, so that E‖S_i^(N)S_j^(N)θ_0 − P_iP_jθ_0‖²_N → 0.

Proposition A.1. If θ_N is defined in H2(y) for all data sets D, and θ ∈ H2(Y) is such that

  E‖θ_N(y) − θ(y)‖²_N → 0,

then

  E‖θ_N/‖θ_N‖_N − θ/‖θ‖‖²_N → 0.

Proof. Write θ/‖θ‖ = θ/‖θ‖_N + θ(1/‖θ‖ − 1/‖θ‖_N). Then two parts are needed: first, to show that

  E‖θ_N/‖θ_N‖_N − θ/‖θ‖_N‖²_N → 0,
and second, to show that

  E‖θ(1/‖θ‖ − 1/‖θ‖_N)‖²_N → 0.

For the first part,

  ‖θ_N − θ‖²_N = ‖θ_N‖²_N + ‖θ‖²_N − 2(θ_N, θ)_N = (‖θ_N‖_N − ‖θ‖_N)² + 2(‖θ‖_N‖θ_N‖_N − (θ_N, θ)_N).

Both terms are positive, and since EV_N → 0, E(‖θ_N‖_N − ‖θ‖_N)² → 0 and E(‖θ‖_N‖θ_N‖_N − (θ_N, θ)_N) → 0. By assumption, ‖θ‖²_N → ‖θ‖², resulting in S_N → 0. Now look at

  W_N = (1/N) Σ_k θ²(y_k)[1/‖θ‖_N − 1/‖θ‖]².

The last step in the proof is showing that this also goes to zero in expectation.

Proposition A.2. As m → ∞, U_m → U in the uniform operator norm.

Proof. ‖U_mθ − Uθ‖ = ‖P_Y T^m P_X θ‖ ≤ ‖T^m P_X θ‖. Now on H2(Y), ‖T^m P_X‖ → 0. If not, take θ_m, ‖θ_m‖ = 1, such that ‖T^m P_X θ_m‖ ≥ δ, all m. Let θ_{m'} →w θ; then P_Xθ_{m'} →s P_Xθ, which gives a contradiction.

The operator U_m is not necessarily self-adjoint, but it is compact. By Proposition (A.2), if O(sp(U)) is any open set containing sp(U), then for m sufficiently large, sp(U_m) ⊂ O(sp(U)). Suppose, for simplicity, that the eigenspace E_λ corresponding to the largest eigenvalue λ of U is one-dimensional. (The proof goes through if E_λ is higher-dimensional, but it is more complicated.) Then for any open neighborhood O of λ, and m sufficiently large, there is only one eigenvalue λ_m of U_m in O, λ_m → λ, and the projection P^(m) of U_m corresponding to λ_m converges to P_{E_λ} in the uniform operator topology. Moreover, λ_m can be taken as the eigenvalue of U_m having largest absolute value. If λ' is the second largest eigenvalue of U and μ_m is the eigenvalue of U_m having the second highest absolute value, then (assuming E_λ is one-dimensional) |μ_m| → λ'.

Write

  W_m = U_m − λ_mP^(m),  W = U − λP_E;

so ‖W_m − W‖ → 0 again. Now,

  W_m^l = (1/2πi) ∮_{|λ|=r} λ^l R(λ, W_m) dλ

and

  ‖W_m^l‖ ≤ (1/2π) ∮_{|λ|=r} |λ|^l ‖R(λ, W_m)‖ d|λ|,

where d|λ| is arc length along |λ| = r. On |λ| = r, for m ≥ m_0, ‖R(λ, W_m)‖ is continuous and bounded. Furthermore, ‖R(λ, W_m)‖ → ‖R(λ, W)‖ uniformly. If M(r) = max_{|λ|=r} ‖R(λ, W)‖, then

  ‖W_m^l‖ ≤ r^{l+1}M(r)(1 + δ_m).

Then

  U_m^l θ_0 = λ_m^l P^(m)θ_0 + W_m^l θ_0,  θ(·; m, l) = U_m^l θ_0/‖U_m^l θ_0‖,

where δ_{m,l} → 0 as m, l → ∞. Thus θ(·; m, l) differs from P_Eθ_0/‖P_Eθ_0‖ by at most δ_{m,l}, and the right side goes to zero as m, l → ∞.

The term weak consistency is used above because we have in mind a desirable stronger result. We conjecture that for reasonable smooths, the set C_N = {(y_1, x_1), . . . , (y_N, x_N); the algorithm converges} satisfies P(C_N) → 1 and that for θ_N, the limit on C_N starting from a fixed θ_0, we know that the algorithm converges to some θ_N, and we conjecture that E[‖θ_N − θ*‖²_N] → 0.

A.5 Mean Squared Consistency of Nearest Neighbor Smooths

To show that the ACE algorithm is applicable in a situation, we need to verify that the assumptions of Theorem (A.2) can be satisfied. We do this, first assuming that the data (y_1, x_1), . . . , (y_N, x_N) are samples from a two-dimensional stationary, ergodic process. Then the ergodic theorem implies that for any θ ∈ L2(Y), ‖θ‖²_N → ‖θ‖² and, trivially, E‖θ‖²_N → ‖θ‖².

To show that we can get a bounded, linear sequence of smooths that are mean squared consistent, we use the nearest neighbor smooths.
Theorem A.3. Let $(Y_1, X_1), \ldots, (Y_N, X_N)$ be samples from a stationary ergodic process such that the distribution of $X$ has no atoms. Then there exists a mean squared consistent sequence of nearest-neighbor smooths of $Y$ on $X$.

The proof begins with Lemma A.1.

Lemma A.1. Suppose that $P(dx)$ has no atoms, and let $P_N(dx) \Rightarrow P(dx)$. Take $\delta_N > 0$, $\delta_N \to \delta > 0$; define $J(x, \varepsilon) = [x - \varepsilon, x + \varepsilon]$,
\[ \varepsilon_N(x) = \min\{\varepsilon;\ P_N(J(x, \varepsilon)) \ge \delta_N\}, \]
and
\[ \varepsilon(x) = \min\{\varepsilon;\ P(J(x, \varepsilon)) \ge \delta\}. \]
Then, using $\Delta$ to denote symmetric difference,
\[ P_N(J(x, \varepsilon_N(x)) \,\Delta\, J(x, \varepsilon(x))) \to 0 \quad \text{uniformly in } x, \quad (A.18) \]
and
\[ \limsup_N \sup_{\{(x, y);\, |x - y| \le h\}} P_N(J(x, \varepsilon(x)) \,\Delta\, J(y, \varepsilon(y))) \le s_1(h), \quad (A.19) \]
where $s_1(h) \to 0$ as $h \to 0$.

Proof. Let $F_N(x)$, $F(x)$ be the cumulative df's corresponding to $P_N$, $P$. Since $F_N \Rightarrow F$ and $F$ is continuous, it follows that $\sup_x |F_N(x) - F(x)| \to 0$.

To prove (A.18), note that
\[ P_N(J(x, \varepsilon_N) \,\Delta\, J(x, \varepsilon)) \le |P_N(J(x, \varepsilon_N)) - P_N(J(x, \varepsilon))| \]
\[ \le |\delta_N - P_N(J(x, \varepsilon_N))| + |\delta_N - \delta| + |F_N(x + \varepsilon(x)) - F(x + \varepsilon(x))| + |F_N(x - \varepsilon(x)) - F(x - \varepsilon(x))|, \]
which does it. To prove (A.19), it is sufficient to show that
\[ \sup_{x, y;\, |x - y| \le h} P(J(x, \varepsilon(x)) \,\Delta\, J(y, \varepsilon(y))) \le s_1(h). \]
First, note that $|\varepsilon(x) - \varepsilon(y)| \le |x - y|$. If $J(x, \varepsilon(x))$ and $J(y, \varepsilon(y))$ overlap, then their symmetric difference consists of two intervals $I_1$, $I_2$ such that $|I_1| \le 2|x - y|$, $|I_2| \le 2|x - y|$. There is an $h_0 > 0$ such that if $|x - y| \le h_0$, the two neighborhoods always overlap. Otherwise there is a sequence $\{x_n\}$ with $\varepsilon(x_n) \to 0$ and $P(J(x_n, \varepsilon(x_n))) = \delta$, which is impossible, since $P$ has no atoms. Then for $h \le h_0$,
\[ \sup_{x, y;\, |x - y| \le h} P(J(x, \varepsilon(x)) \,\Delta\, J(y, \varepsilon(y))) \le 2 \sup_{|I| \le 2h} P(I), \]
and the right-hand side goes to zero as $h \to 0$.

The lemma is applied as follows: Let $g(y)$ be any bounded function in $L_2(Y)$. Define $P_\delta(g \mid x)$, using $I(\cdot)$ to denote the indicator function, as
\[ P_\delta(g \mid x) = \frac{1}{\delta} \int g(y)\, I(x' \in J(x, \varepsilon(x)))\, P(dy, dx') = \frac{1}{\delta} \int P_X(g \mid x')\, I(x' \in J(x, \varepsilon(x)))\, P(dx'). \]
Note that $P_\delta$ is bounded and continuous in $x$. Denote by $S_N^\delta$ the smooths with $M = [N\delta]$.
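To make the construction concrete, here is a minimal Python sketch of this window smooth; the function name, toy data, and tie handling are our own illustration, not the paper's. It takes $\varepsilon_N(x)$ to be the smallest radius whose window $J(x, \varepsilon_N(x))$ captures $M = [N\delta]$ sample points and averages $g(y_j)$ over that window.

```python
import numpy as np

def nn_window_smooth(x_data, g_y, x0, delta):
    """Nearest-neighbor window smooth of g(y) on x, evaluated at x0.

    eps_N(x0) is the smallest radius whose window [x0 - eps, x0 + eps]
    holds at least M = floor(N * delta) sample points; the smooth is
    the average of g(y_j) over that window.
    """
    N = len(x_data)
    M = max(1, int(N * delta))
    dist = np.sort(np.abs(x_data - x0))
    eps_N = dist[M - 1]                   # M-th smallest distance = eps_N(x0)
    in_window = np.abs(x_data - x0) <= eps_N
    return g_y[in_window].sum() / M

# Toy illustration (assumed data, not from the paper):
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)                        # atomless X distribution
y = np.sin(3 * x) + 0.3 * rng.normal(size=500)
print(nn_window_smooth(x, y, x0=0.2, delta=0.05))  # roughly sin(0.6)
```

Since $P$ has no atoms, ties among the distances occur with probability zero, so the window almost surely contains exactly $M$ points, matching the $1/[N\delta]$ normalization used in the proof below.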
Proposition A.3 follows.

Proposition A.3. $E\|S_N^\delta g - P_\delta g\|_N^2 \to 0$ for fixed $\delta$.

Proof. By (A.18), with probability one,
\[ S_N^\delta(g \mid x) = \frac{1}{[N\delta]} \sum_j g(y_j)\, I(x_j \in J(x, \varepsilon_N(x))) \]
can be replaced for all $x$ by
\[ g_N(x, \omega) = \frac{1}{[N\delta]} \sum_j g(y_j)\, I(x_j \in J(x, \varepsilon(x))), \]
where $\omega$ is a sample sequence. By the ergodic theorem, for a countable $\{x_n\}$ dense on the real line and $\omega \in W'$, $P(W') = 1$,
\[ \zeta_N(x_n, \omega) = g_N(x_n, \omega) - P_\delta(g \mid x_n) \to 0. \]
Use (A.19) to establish that for any bounded interval $J$ and any $\omega \in W'$, $\zeta_N(x, \omega) \to 0$ uniformly for $x \in J$. Then write
\[ \|\zeta_N(X, \omega)\|_N^2 = \frac{1}{N} \sum_{k=1}^N \zeta_N^2(X_k, \omega)\, I(X_k \in J) + \frac{1}{N} \sum_{k=1}^N \zeta_N^2(X_k, \omega)\, I(X_k \notin J). \]
The first term is bounded and goes to zero for $\omega \in W'$; hence its expectation goes to zero. The expectation of the second term is bounded by $cP(X \notin J)$. Since $J$ can be taken arbitrarily large, this completes the proof.

Using the inequality
\[ E\|S_N^\delta g - P_X g\|^2 \le 2E\|S_N^\delta g - P_\delta g\|^2 + 2\|P_\delta g - P_X g\|^2 \]
gives
\[ \limsup_N E\|S_N^\delta g - P_X g\|^2 \le 2\|P_\delta g - P_X g\|^2. \]

Proposition A.4. For any $\phi(x) \in L_2(X)$, $\lim_{\delta \to 0} \|P_\delta \phi - \phi\| = 0$.

Proof. For $\phi$ bounded and continuous,
\[ \frac{1}{\delta} \int \phi(x')\, I(x' \in J(x, \varepsilon(x)))\, P(dx') \to \phi(x) \]
as $\delta \to 0$ for every $x$. Since $\sup_x |P_\delta \phi| \le c$ for all $\delta$, then $\|P_\delta \phi - \phi\| \to 0$. The proposition follows if it can be shown that for every $\phi \in L_2(X)$, $\limsup_\delta \|P_\delta \phi\| < \infty$. But
\[ \|P_\delta \phi\|^2 = \int \left[ \frac{1}{\delta} \int \phi(x')\, I(x' \in J(x, \varepsilon(x)))\, P(dx') \right]^2 P(dx) \le \int \phi(x')^2 \left[ \frac{1}{\delta} \int I(x' \in J(x, \varepsilon(x)))\, P(dx) \right] P(dx'). \]
Suppose that $x'$ is such that there are numbers $\varepsilon^+$, $\varepsilon^-$ with $P([x', x' + \varepsilon^+]) = \delta$ and $P([x' - \varepsilon^-, x']) = \delta$. Then $x' \in J(x, \varepsilon(x))$ implies $x' - \varepsilon^- \le x \le x' + \varepsilon^+$, and
\[ \frac{1}{\delta} \int I(x' \in J(x, \varepsilon(x)))\, P(dx) \le 2. \quad (A.20) \]
If, say, $P([x', \infty)) < \delta$, then $x \ge x' - \varepsilon^-$ and (A.20) still holds, and similarly if $P((-\infty, x']) < \delta$.

Take $\{\theta_n\}$ to be a countable set of functions dense in $L_2(Y)$. By Propositions A.3 and A.4, for any $c > 0$ we can select $\delta(c, n)$ and $N(\delta, n)$ so that for all $n$,
\[ E\|S_N^\delta \theta_n - P_X \theta_n\|^2 \le c \quad \text{for } \delta \le \delta(c, n),\ N \ge N(\delta, n). \]
Let $c_M \downarrow 0$ as $M \to \infty$; define $\delta_M = \min_{n \le M} \delta(c_M, n)$ and $N(M) = \max_{n \le M} N(\delta_M, n)$. Then
\[ E\|S_N^{\delta_M} \theta_n - P_X \theta_n\|^2 \le c_M \quad \text{for } n \le M,\ N \ge N(M). \]
Put $M(N) = \max\{M;\ N \ge \max(M, N(M))\}$. Then $M(N) \to \infty$ as $N \to \infty$, and the sequence of smooths $S_N^{\delta_{M(N)}}$ is mean squared consistent for all $\theta_n$. Noting that for $\theta \in L_2(Y)$,
\[ E\|S_N \theta - P_X \theta\|^2 \le 3E\|S_N \theta_n - P_X \theta_n\|^2 + 9\|\theta - \theta_n\|^2 \]
completes the proof of the theorem.

The fact that ACE uses modified smooths $\tilde{S}_N g = S_N g - \mathrm{Av}(S_N g)$ and functions $g$ such that $Eg = 0$ causes no problems, since $\|\mathrm{Av}(S_N g)\|^2 = (\mathrm{Av}(S_N g))^2$ and $\mathrm{Av}(S_N g)$ can be analyzed through $\mathrm{Av}(g_N(X, \omega))$, using the notation of Proposition A.3.
Assume $g$ is bounded, and write
\[ \mathrm{Av}(S_N^\delta g) = \frac{1}{N} \sum_{k=1}^N \left[ S_N^\delta(g \mid X_k) - P_\delta(g \mid X_k) \right] + \frac{1}{N} \sum_{k=1}^N P_\delta(g \mid X_k). \]
By the ergodic theorem, the second term goes a.s. to $EP_\delta(g \mid X)$, and an argument mimicking the proof of Proposition A.3 shows that the first term goes to zero a.s. Finally, write
\[ |EP_\delta(g \mid X)| = |EP_\delta(g \mid X) - EP_X g| \le \|P_\delta \phi - \phi\|, \]
where $\phi = P_X g$. Thus, Theorem A.3 can be easily changed to account for modified smooths.
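As a gloss on the modified smooths just discussed, here is a minimal sketch of the mean centering (our own illustration; `smooth_fn` is any pointwise smoother, for instance the hypothetical `nn_window_smooth` sketched earlier):

```python
import numpy as np

def modified_smooth(smooth_fn, x_data, g_y, delta):
    """Centered ("modified") smooth: S~ g = S g - Av(S g).

    Av(.) is the average over the design points, so the output has
    sample mean exactly zero, matching the normalization Eg = 0.
    """
    fitted = np.array([smooth_fn(x_data, g_y, x0, delta) for x0 in x_data])
    return fitted - fitted.mean()
```

The centering changes nothing asymptotically: as argued above, $\mathrm{Av}(S_N^\delta g) \to 0$ a.s. when $Eg = 0$.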
In the controlled experiment situation the $\{X_k\}$ are not random, but the condition $P_N(dx) \Rightarrow P(dx)$ is imposed. Additional assumptions are necessary.

Assumption A.1. For $\theta(Y)$ any bounded function in $L_2(Y)$, $E(\theta(Y) \mid X = x)$ is continuous in $x$.

Assumption A.2. For $i \ne j$ and $\phi(x)$ any bounded continuous function, $E(\phi(X_j) \mid X_i = x)$ is continuous in $x$.

A necessary result is Proposition A.5.

Proposition A.5. For $\theta(y)$ bounded in $L_2(Y)$ and $\phi(x)$ bounded and continuous,
\[ \frac{1}{N} \sum_{j=1}^N \theta(y_j)\, \phi(x_j) \;\xrightarrow{\text{a.s.}}\; E\theta(Y)\phi(X). \]

Let $T_N = \sum_j \theta(y_j)\phi(x_j)$. Then $ET_N = \sum_j g(x_j)\phi(x_j)$, $g(x) = E[\theta(Y) \mid X = x]$. By hypothesis, $ET_N/N \to E\theta(Y)\phi(X)$. Furthermore,
\[ \mathrm{var}(T_N) = \sum_{j=1}^N E[\theta(y_j) - g(x_j)]^2\, \phi^2(x_j) = \sum_{j=1}^N h(x_j)\, \phi^2(x_j), \]
with $h(x) = \mathrm{var}(\theta(Y) \mid X = x)$; since $\theta$ and $\phi$ are bounded, $\mathrm{var}(T_N) \le cN$, so that $\mathrm{var}(T_N/N) \to 0$.
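Proposition A.5 is easy to probe numerically. The following sketch is a toy check under assumptions of our own choosing (an equally spaced controlled design, a hypothetical model for $Y$ given $X$, and bounded $\theta$, $\phi$); it compares the design average $(1/N)\sum_j \theta(y_j)\phi(x_j)$ with an independent Monte Carlo stand-in for $E\theta(Y)\phi(X)$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
x = (np.arange(N) + 0.5) / N     # controlled design: P_N => Uniform(0, 1)
y = np.sin(2 * np.pi * x) + 0.5 * rng.normal(size=N)   # assumed toy model

theta = np.tanh                  # bounded theta
phi = lambda t: t ** 2           # bounded, continuous on [0, 1]

design_avg = np.mean(theta(y) * phi(x))
# A fresh noise draw gives an independent estimate of E[theta(Y) phi(X)]:
y_new = np.sin(2 * np.pi * x) + 0.5 * rng.normal(size=N)
mc_limit = np.mean(theta(y_new) * phi(x))
print(design_avg, mc_limit)      # agree up to O(1/sqrt(N)) noise
```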
For the nearest-neighbor regression smooths, take $J(x)$ to consist of the $M$ points directly below and the $M$ points directly above $x$; near an endpoint, where only $M' < M$ points are available on one side, $J(x)$ is filled out with the $(2M - M')$ points directly below (above). For a regression smooth,
\[ S(\phi \mid x) = \bar{\phi}_x + [r_x(\phi, x)/u_x^2](x - \bar{x}_x), \quad (A.21) \]
where $\bar{\phi}_x$, $\bar{x}_x$ are the averages of $\phi(y_k)$, $x_k$ over the indexes in $J(x)$, and $r_x(\phi, x)$, $u_x^2$ are the covariance between $\phi(y_k)$, $x_k$ and the variance of $x_k$ over the indexes in $J(x)$.

Write the second term in (A.21) as
\[ [r_x(\phi, x)/u_x]\,[(x - \bar{x}_x)/u_x]. \]
If there are $M$ points above and below in $J(x)$, it is not hard to show that
\[ |(x - \bar{x}_x)/u_x| \le 1. \]
This is not true near the endpoints, where $(x - \bar{x}_x)/u_x$ can become arbitrarily large as $M$ gets large. This endpoint behavior keeps the regression smooth from being uniformly bounded. To remedy this, define a function
\[ [x]_1 = x \ \text{ for } |x| \le 1, \qquad [x]_1 = \mathrm{sign}(x) \ \text{ for } |x| > 1, \]
and define the modified regression smooth by
\[ S(\phi \mid x) = \bar{\phi}_x + [r_x(\phi, x)/u_x]\,[(x - \bar{x}_x)/u_x]_1. \quad (A.22) \]
This modified smooth is bounded by $2 \sup_y |\phi(y)|$.

Theorem A.4. If, as $N \to \infty$, $M \to \infty$ and $M/N \to 0$, and $P(dx)$ has no atoms, then the modified regression smooths are mean squared consistent.

The proof is in Breiman and Friedman (1982). We are almost certain that the modified regression smooths are also mean squared consistent for stationary ergodic time series, and in the weaker sense for controlled experiments, but under less definitive conditions on the rates at which $M \to \infty$.
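To pin down (A.21) and (A.22) in code, here is a minimal sketch. It is our own rendering under simplifying assumptions: the design is pre-sorted, the neighborhood is simply shifted inward at the endpoints rather than following the exact $(2M - M')$ bookkeeping, and `np.clip` plays the role of $[\cdot]_1$.

```python
import numpy as np

def modified_regression_smooth(x_sorted, phi_y, i, M):
    """Modified regression smooth (A.22) at the i-th design point.

    x_sorted must be sorted. J holds 2M + 1 indexes around i, shifted
    inward near the endpoints. Clipping the standardized deviation
    keeps the smooth bounded by 2 * sup|phi|.
    """
    N = len(x_sorted)
    lo = min(max(i - M, 0), N - (2 * M + 1))   # keep the window in range
    J = slice(lo, lo + 2 * M + 1)
    xJ, pJ = x_sorted[J], phi_y[J]
    xbar, pbar = xJ.mean(), pJ.mean()
    u = xJ.std()                               # sd of x_k over J(x)
    r = np.mean((xJ - xbar) * (pJ - pbar))     # cov(phi(y_k), x_k) over J(x)
    z = np.clip((x_sorted[i] - xbar) / u, -1.0, 1.0)  # the [.]_1 truncation
    return pbar + (r / u) * z

# Toy usage (assumed data); M ~ sqrt(N) gives M -> inf with M/N -> 0,
# consistent with Theorem A.4.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-2, 2, 400))
phi = np.cos(x) + 0.2 * rng.normal(size=400)
M = int(np.sqrt(len(x)))
print(modified_regression_smooth(x, phi, i=0, M=M))  # bounded at the endpoint
```

The usage line takes $M = \lfloor \sqrt{N} \rfloor$, one simple choice satisfying Theorem A.4's requirement that $M \to \infty$ while $M/N \to 0$.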
IBTP-inversion base temperature (°F)
WDSP-wind speed (mph)

Dependent Variable:
UPO3-Upland ozone concentration (ppm)

[Received August 1982. Revised July 1984.]

REFERENCES

Anscombe, F. J., and Tukey, J. W. (1963), "The Examination and Analysis of Residuals," Technometrics, 5, 141-160.
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980), Regression Diagnostics, New York: John Wiley.
Box, G. E. P., and Cox, D. R. (1964), "An Analysis of Transformations," Journal of the Royal Statistical Society, Ser. B, 26, 211-252.
Box, G. E. P., and Hill, W. J. (1974), "Correcting Inhomogeneity of Variance With Power Transformation Weighting," Technometrics, 16, 385-389.
Box, G. E. P., and Tidwell, P. W. (1962), "Transformations of the Independent Variables," Technometrics, 4, 531-550.
Breiman, L., and Friedman, J. (1982), "Estimating Optimal Transformations for Multiple Regression and Correlation," Technical Report 9, Dept. of Statistics, University of California, Berkeley.
Cleveland, W. S. (1979), "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74, 828-836.
Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377-403.
Csaki, P., and Fischer, J. (1963), "On the General Notion of Maximal Correlation," Magyar Tudomanyos Akademia Matematikai Kutato Intezet Kozlemenyei, 8, 27-51.
De Boor, C. (1978), A Practical Guide to Splines, New York: Springer-Verlag.
De Leeuw, J., Young, F. W., and Takane, Y. (1976), "Additive Structure in Qualitative Data: An Alternating Least Squares Method With Optimal Scaling Features," Psychometrika, 41, 471-503.
Devroye, L. (1981), "On the Almost Everywhere Convergence of Nonparametric Regression Function Estimates," The Annals of Statistics, 9, 1310-1319.
Devroye, L., and Wagner, T. J. (1980), "Distribution-Free Consistency Results in Nonparametric Discrimination and Regression Function Estimation," The Annals of Statistics, 8, 231-239.
Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," The Annals of Statistics, 7, 1-26.
Fraser, D. A. S. (1967), "Data Transformations and the Linear Model," Annals of Mathematical Statistics, 38, 1456-1465.
Friedman, J. H., and Stuetzle, W. (1982), "Smoothing of Scatterplots," Technical Report ORION006, Dept. of Statistics, Stanford University.
Gasser, T., and Rosenblatt, M. (eds.) (1979), Smoothing Techniques for Curve Estimation, Lecture Notes in Mathematics, No. 757, New York: Springer-Verlag.
Gebelein, H. (1947), "Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung," Zeitschrift für Angewandte Mathematik und Mechanik, 21, 364-379.
Harrison, D., and Rubinfeld, D. L. (1978), "Hedonic Housing Prices and the Demand for Clean Air," Journal of Environmental Economics and Management, 5, 81-102.
Kendall, M. G., and Stuart, A. (1967), The Advanced Theory of Statistics (Vol. 2), New York: Hafner Publishing.
Kimeldorf, G., May, J. H., and Sampson, A. R. (1982), "Concordant and Discordant Monotone Correlations and Their Evaluation by Nonlinear Optimization," in Studies in the Management Sciences (Vol. 19): Optimization in Statistics, eds. S. H. Zanakis and J. S. Rustagi, Amsterdam: North-Holland, pp. 117-130.
Kruskal, J. B. (1964), "Nonmetric Multidimensional Scaling: A Numerical Method," Psychometrika, 29, 115-129.
Kruskal, J. B. (1965), "Analysis of Factorial Experiments by Estimating Monotone Transformations of the Data," Journal of the Royal Statistical Society, Ser. B, 27, 251-263.
Lancaster, H. O. (1958), "The Structure of Bivariate Distributions," Annals of Mathematical Statistics, 29, 719-736.
Lancaster, H. O. (1969), The Chi-Squared Distribution, New York: John Wiley.
Lindsey, J. K. (1972), "Fitting Response Surfaces With Power Transformations," Journal of the Royal Statistical Society, Ser. C, 21, 234-237.
Lindsey, J. K. (1974), "Construction and Comparison of Statistical Models," Journal of the Royal Statistical Society, Ser. B, 36, 418-425.
Mosteller, F., and Tukey, J. W. (1977), Data Analysis and Regression, Reading, MA: Addison-Wesley.
Renyi, A. (1959), "On Measures of Dependence," Acta Mathematica Academiae Scientiarum Hungaricae, 10, 441-451.
Sarmanov, O. V. (1958a), "The Maximal Correlation Coefficient (Symmetric Case)," Doklady Akademii Nauk SSSR, 120, 715-718.
Sarmanov, O. V. (1958b), "The Maximal Correlation Coefficient (Nonsymmetric Case)," Doklady Akademii Nauk SSSR, 121, 52-55.
Sarmanov, O. V., and Zaharov, V. K. (1960), "Maximum Coefficients of Multiple Correlation," Doklady Akademii Nauk SSSR, 130, 269-271.
Spiegelman, C., and Sacks, J. (1980), "Consistent Window Estimation in Nonparametric Regression," The Annals of Statistics, 8, 240-246.
Stone, C. J. (1977), "Consistent Nonparametric Regression," The Annals of Statistics, 5, 595-645.
Tukey, J. W. (1982), "The Use of Smelting in Guiding Re-Expression," in Modern Data Analysis, eds. J. Launer and A. Siegel, New York: Academic Press.
Wood, J. T. (1974), "An Extension of the Analysis of Transformations of Box and Cox," Journal of the Royal Statistical Society, Ser. C, 23, 278-283.
Young, F. W., de Leeuw, J., and Takane, Y. (1976), "Regression With Qualitative and Quantitative Variables: An Alternating Least Squares Method With Optimal Scaling Features," Psychometrika, 41, 505-529.
Comment
DARYL PREGIBON and YEHUDA VARDI*
In data analysis, the choice of transformations is often done subjectively. ACE is a major attempt to bring objectivity to this area. As Breiman and Friedman have demonstrated with their examples, and as we have experienced with our own, ACE is a powerful tool indeed. Our comments are sometimes critical in nature and reflect our view that there is much more to be done on the subject. We consider the methodology a significant contribution to statistics, however, and would like to compliment the authors for attacking an important problem, for narrowing the gap between mathematical statistics and data analysis, and for providing the data analyst with a useful tool.

1. ACE IN THEORY: HOW MEANINGFUL IS MAXIMAL CORRELATION?

To keep our discussion simple we limit it here to the bivariate case, though the issues that we raise are equally relevant to the general case. The basis of ACE lies in the properties of maximal
* Daryl Pregibon and Yehuda Vardi are Members of Technical Staff, AT&T Bell Laboratories, Murray Hill, NJ 07974.

© 1985 American Statistical Association
Journal of the American Statistical Association
September 1985, Vol. 80, No. 391, Theory and Methods