Classification and Regression Trees (The Wadsworth Statistics/Probability Series)
Table of Contents
Dedication
Title Page
Copyright Page
PREFACE
Acknowledgements
Chapter 1 - BACKGROUND
1.1 CLASSIFIERS AS PARTITIONS
1.2 USE OF DATA IN CONSTRUCTING CLASSIFIERS
1.3 THE PURPOSES OF CLASSIFICATION ANALYSIS
1.4 ESTIMATING ACCURACY
1.5 THE BAYES RULE AND CURRENT CLASSIFICATION
PROCEDURES
Chapter 2 - INTRODUCTION TO TREE CLASSIFICATION
2.1 THE SHIP CLASSIFICATION PROBLEM
2.2 TREE STRUCTURED CLASSIFIERS
2.3 CONSTRUCTION OF THE TREE CLASSIFIER
2.4 INITIAL TREE GROWING METHODOLOGY
2.5 METHODOLOGICAL DEVELOPMENT
2.6 TWO RUNNING EXAMPLES
2.7 THE ADVANTAGES OF THE TREE STRUCTURED APPROACH
Chapter 3 - RIGHT SIZED TREES AND HONEST ESTIMATES
3.1 INTRODUCTION
3.2 GETTING READY TO PRUNE
3.3 MINIMAL COST-COMPLEXITY PRUNING
3.4 THE BEST PRUNED SUBTREE: AN ESTIMATION PROBLEM
3.5 SOME EXAMPLES
APPENDIX
Chapter 4 - SPLITTING RULES
4.1 REDUCING MISCLASSIFICATION COST
4.2 THE TWO-CLASS PROBLEM
4.3 THE MULTICLASS PROBLEM: UNIT COSTS
4.4 PRIORS AND VARIABLE MISCLASSIFICATION COSTS
4.5 TWO EXAMPLES
4.6 CLASS PROBABILITY TREES VIA GINI
APPENDIX
Chapter 5 - STRENGTHENING AND INTERPRETING
5.1 INTRODUCTION
5.2 VARIABLE COMBINATIONS
5.3 SURROGATE SPLITS AND THEIR USES
5.4 ESTIMATING WITHIN-NODE COST
5.5 INTERPRETATION AND EXPLORATION
5.6 COMPUTATIONAL EFFICIENCY
5.7 COMPARISON OF ACCURACY WITH OTHER METHODS
APPENDIX
Chapter 6 - MEDICAL DIAGNOSIS AND PROGNOSIS
6.1 PROGNOSIS AFTER HEART ATTACK
6.2 DIAGNOSING HEART ATTACKS
6.3 IMMUNOSUPPRESSION AND THE DIAGNOSIS OF CANCER
6.4 GAIT ANALYSIS AND THE DETECTION OF OUTLIERS
6.5 RELATED WORK ON COMPUTER-AIDED DIAGNOSIS
Chapter 7 - MASS SPECTRA CLASSIFICATION
7.1 INTRODUCTION
7.2 GENERALIZED TREE CONSTRUCTION
7.3 THE BROMINE TREE: A NONSTANDARD EXAMPLE
Chapter 8 - REGRESSION TREES
8.1 INTRODUCTION
8.2 AN EXAMPLE
8.3 LEAST SQUARES REGRESSION
8.4 TREE STRUCTURED REGRESSION
8.5 PRUNING AND ESTIMATING
8.6 A SIMULATED EXAMPLE
8.7 TWO CROSS-VALIDATION ISSUES
8.8 STANDARD STRUCTURE TREES
8.9 USING SURROGATE SPLITS
8.10 INTERPRETATION
8.11 LEAST ABSOLUTE DEVIATION REGRESSION
8.12 OVERALL CONCLUSIONS
Chapter 9 - BAYES RULES AND PARTITIONS
9.1 BAYES RULE
9.2 BAYES RULE FOR A PARTITION
9.3 RISK REDUCTION SPLITTING RULE
9.4 CATEGORICAL SPLITS
Chapter 10 - OPTIMAL PRUNING
10.1 TREE TERMINOLOGY
10.2 OPTIMALLY PRUNED SUBTREES
10.3 AN EXPLICIT OPTIMAL PRUNING ALGORITHM
Chapter 11 - CONSTRUCTION OF TREES FROM A LEARNING SAMPLE
11.1 ESTIMATED BAYES RULE FOR A PARTITION
11.2 EMPIRICAL RISK REDUCTION SPLITTING RULE
11.3 OPTIMAL PRUNING
11.4 TEST SAMPLES
11.5 CROSS-VALIDATION
11.6 FINAL TREE SELECTION
11.7 BOOTSTRAP ESTIMATE OF OVERALL RISK
11.8 END-CUT PREFERENCE
Chapter 12 - CONSISTENCY
12.1 EMPIRICAL DISTRIBUTIONS
12.2 REGRESSION
12.3 CLASSIFICATION
12.4 PROOFS FOR SECTION 12.1
12.5 PROOFS FOR SECTION 12.2
12.6 PROOFS FOR SECTION 12.3
BIBLIOGRAPHY
NOTATION INDEX
SUBJECT INDEX
Lovingly dedicated to our children
Jessica, Rebecca, Kymm;
Melanie;
Elyse, Adam, Rachel, Stephen;
Daniel and Kevin
Library of Congress Cataloging-in-Publication Data
Main entry under title:
Classification and regression trees.
(The Wadsworth statistics/probability series)
Bibliography: p.
Includes Index.
ISBN 0-412-04841-8
1. Discriminant analysis. 2. Regression analysis.
3. Trees (Graph theory) I. Breiman, Leo. II. Title:
Regression trees. III. Series.
QA278.65.C54 1984
519.5′36—dc20
83-19708
CIP
This book contains information obtained from authentic and highly regarded sources. Reprinted material is
quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts
have been made to publish reliable data and information, but the author and the publisher cannot assume
responsibility for the validity of all materials or for the consequences of their use.
Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or
mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval
system, without prior permission in writing from the publisher.
The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating
new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC for such
copying.
Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only
for identification and explanation, without intent to infringe.
Visit the CRC Press Web site at www.crcpress.com
First CRC Press reprint 1998
© 1984, 1993 by Chapman & Hall
No claim to original U.S. Government works
International Standard Book Number 0-412-04841-8
Library of Congress Card Number 83-19708
Printed in the United States of America 7 8 9 0
Printed on acid-free paper
PREFACE
The tree methodology discussed in this book is a child of the computer age. Unlike
many other statistical procedures which were moved from pencil and paper to
calculators and then to computers, this use of trees was unthinkable before computers.
Binary trees give an interesting and often illuminating way of looking at data in
classification or regression problems. They should not be used to the exclusion of
other methods. We do not claim that they are always better. They do add a flexible
nonparametric tool to the data analyst’s arsenal.
Both practical and theoretical sides have been developed in our study of tree
methods. The book reflects these two sides. The first eight chapters are largely
expository and cover the use of trees as a data analysis method. These were written by
Leo Breiman with the exception of Chapter 6 by Richard Olshen. Jerome Friedman
developed the software and ran the examples.
Chapters 9 through 12 place trees in a more mathematical context and prove some
of their fundamental properties. The first three of these chapters were written by
Charles Stone and the last was jointly written by Stone and Olshen.
Trees, as well as many other powerful data analytic tools (factor analysis, nonmetric
scaling, and so forth) were originated by social scientists motivated by the need to
cope with actual problems and data. Use of trees in regression dates back to the AID
(Automatic Interaction Detection) program developed at the Institute for Social
Research, University of Michigan, by Morgan and Sonquist in the early 1960s. The
ancestor classification program is THAID, developed at the institute in the early 1970s
by Morgan and Messenger. The research and developments described in this book are
aimed at strengthening and extending these original methods.
Our work on trees began in 1973 when Breiman and Friedman, independently of
each other, “reinvented the wheel” and began to use tree methods in classification.
Later, they joined forces and were joined in turn by Stone, who contributed
significantly to the methodological development. Olshen was an early user of tree
methods in medical applications and contributed to their theoretical development.
Our blossoming fascination with trees and the number of ideas passing back and
forth and being incorporated by Friedman into CART (Classification and Regression
Trees) soon gave birth to the idea of a book on the subject. In 1980 conception
occurred. While the pregnancy has been rather prolonged, we hope that the baby
appears acceptably healthy to the members of our statistical community.
The layout of the book is shown in the chapter diagram that follows.
ACKNOWLEDGMENTS
Three other people were instrumental in our research: William Meisel, who early on
saw the potential in tree structured methods and encouraged their development;
Laurence Rafsky, who participated in some of the early exchanges of ideas; and Louis
Gordon, who collaborated with Richard Olshen in theoretical work. Many helpful
comments were supplied by Peter Bickel, William Eddy, John Hartigan, and Paul
Tukey, who all reviewed an early version of the manuscript.
Part of the research, especially that of Breiman and Friedman, was supported by the
Office of Naval Research (Contract No. N00014-82-K-0054), and we appreciate our
warm relations with Edward Wegman and Douglas De Priest of that agency. Stone’s
work was supported partly by the Office of Naval Research on the same contract and
partly by the National Science Foundation (Grant No. MCS 80-02732). Olshen’s work
was supported by the National Science Foundation (Grant No. MCS 79-06228) and
the National Institutes of Health (Grant No. CA-26666).
We were fortunate in having the services of typists Ruth Suzuki, Rosaland
Englander, Joan Pappas, and Elaine Morici, who displayed the old-fashioned virtues of
patience, tolerance, and competence.
We are also grateful to our editor, John Kimmel of Wadsworth, for his abiding faith
that eventually a worthy book would emerge, and to the production editor, Andrea
Cava, for her diligence and skillful supervision.
1
BACKGROUND
At the University of California, San Diego Medical Center, when a heart attack patient
is admitted, 19 variables are measured during the first 24 hours. These include blood
pressure, age, and 17 other ordered and binary variables summarizing the medical
symptoms considered as important indicators of the patient’s condition.
The goal of a recent medical study (see Chapter 6) was the development of a
method to identify high risk patients (those who will not survive at least 30 days) on
the basis of the initial 24-hour data.
Figure 1.1 is a picture of the tree structured classification rule that was produced in
the study. The letter F means not high risk; G means high risk.
This rule classifies incoming patients as F or G depending on the yes-no answers to
at most three questions. Its simplicity raises the suspicion that standard statistical
classification methods may give classification rules that are more accurate. When
these were tried, the rules produced were considerably more intricate, but less
accurate.
The methodology used to construct tree structured rules is the major story of this
monograph.
FIGURE 1.1
1.1 CLASSIFIERS AS PARTITIONS
Suppose the cases or objects under study fall into J classes. Denote the set of classes by C = {1, ..., J}, and let X be the measurement space containing all possible measurement vectors x. A classification rule, or classifier, is a systematic way of assigning a class membership in C to every measurement vector x in X. That is, given any x ∈ X, the rule assigns one of the classes {1, ..., J} to x.
DEFINITION 1.1. A classifier or classification rule is a function d(x) defined on X so
that for every x, d(x) is equal to one of the numbers 1, 2, ..., J.
Another way of looking at a classifier is to define Aj as the subset of X on which
d(x) = j; that is,
Aj = {x; d(x) = j}.
The sets A1, ..., AJ are disjoint and X = ∪j Aj. Thus, the Aj form a partition of X. This
gives the equivalent
DEFINITION 1.2. A classifier is a partition of X into J disjoint subsets A1, ..., AJ, X = ∪j Aj, such that for every x ∈ Aj the predicted class is j.
1.2 USE OF DATA IN CONSTRUCTING CLASSIFIERS
Classifiers are not constructed whimsically. They are based on past experience.
Doctors know, for example, that elderly heart attack patients with low blood pressure
are generally high risk. Los Angelenos know that one hot, high pollution day is likely
to be followed by another.
In systematic classifier construction, past experience is summarized by a learning
sample. This consists of the measurement data on N cases observed in the past
together with their actual classification.
In the medical diagnostic project the learning sample consisted of the records of 215
heart attack patients admitted to the hospital, all of whom survived the initial 24-hour
period. The records contained the outcome of the initial 19 measurements together
with an identification of those patients that did not survive at least 30 days.
The learning sample for the ozone classification project contained 6 years (1972-1977) of daily measurements on over 400 meteorological variables and hourly air
pollution measurements at 30 locations in the Los Angeles basin.
The data for the chlorine project consisted of the mass spectra of about 30,000
compounds having known molecular structure. For each compound the mass spectra
can be expressed as a measurement vector of dimension equal to the molecular weight.
The set of 30,000 measurement vectors was of variable dimensionality, ranging from
about 50 to over 1000.
We assume throughout the remainder of this monograph that the construction of a classifier is based on a learning sample, where
DEFINITION 1.3. A learning sample consists of data (x1, j1), ..., (xN, jN) on N cases where xn ∈ X and jn ∈ {1, ..., J}, n = 1, ..., N. The learning sample is denoted by L; i.e.,
L = {(x1, j1), ..., (xN, jN)}.
We distinguish two general types of variables that can appear in the measurement
vector.
DEFINITION 1.4. A variable is called ordered or numerical if its measured values are
real numbers. A variable is categorical if it takes values in a finite set not having any
natural ordering.
A categorical variable, for instance, could take values in the set {red, blue, green}.
In the medical data, blood pressure and age are ordered variables.
Finally, define
DEFINITION 1.5. If all measurement vectors xn are of fixed dimensionality, we say
that the data have standard structure.
In the medical and ozone projects, a fixed set of variables is measured on each case (or
day); the data have standard structure. The mass spectra data have nonstandard
structure.
1.3 THE PURPOSES OF CLASSIFICATION ANALYSIS
Depending on the problem, the basic purpose of a classification study can be either to
produce an accurate classifier or to uncover the predictive structure of the problem. If
we are aiming at the latter, then we are trying to get an understanding of what
variables or interactions of variables drive the phenomenon—that is, to give simple
characterizations of the conditions (in terms of the measurement variables x ∈ X) that
determine when an object is in one class rather than another. These two are not
exclusive. Most often, in our experience, the goals will be both accurate prediction and
understanding. Sometimes one or the other will have greater emphasis.
In the mass spectra project, the emphasis was on prediction. The purpose was to
develop an efficient and accurate on-line algorithm that would accept as input the mass
spectrum of an unknown compound and classify the compound as either chlorine
containing or not.
The ozone project shared goals. The work toward understanding which
meteorological variables and interactions between them were associated with alert-
level days was an integral part of the development of a classifier.
The tree structured classification rule of Figure 1.1 gives some interesting insights
into the medical diagnostic problem. All cases with blood pressure less than or equal
to 91 are predicted high risks. For cases with blood pressure greater than 91, the
classification depends only on age and whether sinus tachycardia is present. For the
purpose of distinguishing between high and low risk cases, once age is recorded, only
two variables need to be measured.
An important criterion for a good classification procedure is that it not only produce
accurate classifiers (within the limits of the data) but that it also provide insight and
understanding into the predictive structure of the data.
Many of the presently available statistical techniques were designed for small data
sets having standard structure with all variables of the same type; the underlying
assumption was that the phenomenon is homogeneous. That is, that the same
relationship between variables held over all of the measurement space. This led to
models where only a few parameters were necessary to trace the effects of the various
factors involved.
With large data sets involving many variables, more structure can be discerned and
a variety of different approaches tried. But largeness by itself does not necessarily
imply a richness of structure.
What makes a data set interesting is not only its size but also its complexity, where
complexity can include such considerations as:
High dimensionality
A mixture of data types
Nonstandard data structure
and, perhaps most challenging, nonhomogeneity; that is, different relationships hold
between variables in different parts of the measurement space.
Along with complex data sets comes “the curse of dimensionality” (a phrase due to
Bellman, 1961). The difficulty is that the higher the dimensionality, the sparser and
more spread apart are the data points. Ten points on the unit interval are not distant
neighbors. But 10 points on a 10-dimensional unit rectangle are like oases in the
desert.
For instance, with 100 points, constructing a 10-cell histogram on the unit interval is
a reasonable procedure. In M dimensions, a histogram that uses 10 intervals in each
dimension produces 10^M cells. For even moderate M, a very large data set would be
needed to get a sensible histogram.
Another way of looking at the “curse of dimensionality” is the number of
parameters needed to specify distributions in M dimensions:
Normal: O(M^2)
Binary: O(2^M)
Unless one makes the very strong assumption that the variables are independent, the
number of parameters usually needed to specify an M-dimensional distribution goes
up much faster than O(M). To put this another way, the complexity of a data set
increases rapidly with increasing dimensionality.
With accelerating computer usage, complex, high dimensional data bases, with
variable dimensionality or mixed data types, nonhomogeneities, etc., are no longer odd
rarities.
In response to the increasing dimensionality of data sets, the most widely used
multivariate procedures all contain some sort of dimensionality reduction process.
Stepwise variable selection and variable subset selection in regression and
discriminant analysis are examples.
Although the drawbacks in some of the present multivariate reduction tools are well
known, they are a response to a clear need. To analyze and understand complex data
sets, methods are needed which in some sense select salient features of the data,
discard the background noise, and feed back to the analyst understandable summaries
of the information.
1.4 ESTIMATING ACCURACY
Given a classifier, that is, given a function d(x) defined on X taking values in C, we
denote by R*(d) its “true misclassification rate.” The question raised in this section is:
What is truth and how can it be estimated?
One way to see how accurate a classifier is (that is, to estimate R*(d)) is to test the
classifier on subsequent cases whose correct classification has been observed. For
instance, in the ozone project, the classifier was developed using the data from the
years 1972-1975. Then its accuracy was estimated by using the 1976-1977 data. That
is, R*(d) was estimated as the proportion of days in 1976-1977 that were misclassified
when d(x) was used on the previous day data.
In one part of the mass spectra project, the 30,000 spectra were randomly divided
into one set of 20,000 and another of 10,000. The 20,000 were used to construct the
classifier. The other 10,000 were then run through the classifier and the proportion
misclassified used as an estimate of R*(d).
The value of R*(d) can be conceptualized in this way: Using L, construct d. Now,
draw another very large (virtually infinite) set of cases from the same population as L
was drawn from. Observe the correct classification for each of these cases, and also
find the predicted classification using d(x). The proportion misclassified by d is the
value of R*(d).
To make the preceding concept precise, a probability model is needed. Define the
space X × C as the set of all couples (x, j), where x ∈ X and j is a class label, j ∈ C. Let
P(A, j) be a probability on X × C, A ⊂ X, j ∈ C (niceties such as Borel measurability
will be ignored). The interpretation of P(A, j) is that a case drawn at random from the
relevant population has probability P(A, j) that its measurement vector x is in A and its
class is j. Assume that the learning sample L consists of N cases (x1, j1), ..., (xN , jN)
independently drawn at random from the distribution P(A, j). Construct d(x) using L.
Then define R*(d) as the probability that d will misclassify a new sample drawn from
the same distribution as L.
DEFINITION 1.6. Take (X, Y), X ∈ X, Y ∈ C, to be a new sample from the probability distribution P(A, j); i.e.,
P(X ∈ A, Y = j) = P(A, j),
with (X, Y) independent of L. Then define
R*(d) = P(d(X) ≠ Y).
This model must be applied cautiously. Successive pairs of days in the ozone data
are certainly not independent. Its usefulness is that it gives a beginning conceptual
framework for the definition of “truth.”
How can R*(d) be estimated? There is no difficulty in the examples of simulated
data given in this monograph. The data in L are sampled independently from a desired
distribution using a pseudorandom number generator. After d(x) is constructed, 5000
additional cases are drawn from the same distribution independently of L and
classified by d. The proportion misclassified among those 5000 is the estimate of R*
(d).
In actual problems, only the data in L are available with little prospect of getting an
additional large sample of classified cases. Then L must be used both to construct d(x)
and to estimate R*(d). We refer to such estimates of R*(d) as internal estimates. A
summary and large bibliography concerning such estimates is in Toussaint (1974).
Three types of internal estimates will be of interest to us. The first, least accurate,
and most commonly used is the resubstitution estimate.
After the classifier d is constructed, the cases in L are run through the classifier. The
proportion of cases misclassified is the resubstitution estimate. To put this in equation
form:
DEFINITION 1.7. Define the indicator function X(·) to be 1 if the statement inside the
parentheses is true, otherwise zero.
The resubstitution estimate, denoted R(d), is
R(d) = (1/N) Σn X(d(xn) ≠ jn).  (1.8)
The problem with the resubstitution estimate is that it is computed using the same
data used to construct d, instead of an independent sample. All classification
procedures, either directly or indirectly, attempt to minimize R(d). Using the
subsequent value of R(d) as an estimate of R*(d) can give an overly optimistic picture
of the accuracy of d.
As an exaggerated example, take d(x) to be defined by a partition A1, ..., AJ such that Aj contains all measurement vectors xn in L with jn = j, and the vectors x ∈ X not equal to some xn are assigned in an arbitrary random fashion to one or the other of the Aj. Then
R(d) = 0, but it is hard to believe that R*(d) is anywhere near zero.
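As a concrete illustration, here is a minimal sketch in Python (not taken from the CART program; the classifier, measurement array, and label array are hypothetical inputs) of the resubstitution estimate (1.8):

import numpy as np

def resubstitution_estimate(classifier, X, y):
    """Proportion of learning-sample cases misclassified by classifier.

    classifier: a function d(x) returning a class label.
    X: array of shape (N, M) holding the measurement vectors x_n.
    y: array of shape (N,) holding the true class labels j_n.
    """
    predictions = np.array([classifier(x) for x in X])
    return float(np.mean(predictions != y))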
The second method is test sample estimation. Here the cases in L are divided into
two sets L1 and L2. Only the cases in L1 are used to construct d. Then the cases in L2
are used to estimate R*(d). If N2 is the number of cases in L2, then the test sample estimate, Rts(d), is given by
Rts(d) = (1/N2) Σ(xn, jn)∈L2 X(d(xn) ≠ jn).  (1.9)
In this method, care needs to be taken so that the cases in L2 can be considered as
independent of the cases in L1 and drawn from the same distribution. The most
common procedure used to help ensure these properties is to draw L2 at random from
L. Frequently, L2 is taken as 1/3 of the cases in L, but we do not know of any
theoretical justification for this 2/3, 1/3 split.
The test sample approach has the drawback that it reduces effective sample size. In
a 2/3, 1/3 split, only 2/3 of the data are used to construct d, and only 1/3 to estimate R*
(d). If the sample size is large, as in the mass spectra problem, this is a minor
difficulty, and test sample estimation is honest and efficient.
For smaller sample sizes, another method, called V-fold cross-validation, is preferred (see the review by M. Stone, 1977). The cases in L are randomly divided into V subsets of as nearly equal size as possible. Denote these subsets by L1, ..., LV. Assume that the procedure for constructing a classifier can be applied to any learning sample. For every v, v = 1, ..., V, apply the procedure using as learning sample L − Lv, i.e., the cases in L not in Lv, and let d(v)(x) be the resulting classifier. Since none of the cases in Lv has been used in the construction of d(v), a test sample estimate for R*(d(v)) is
Rts(d(v)) = (1/Nv) Σ(xn, jn)∈Lv X(d(v)(xn) ≠ jn),  (1.10)
where Nv ≃ N/V is the number of cases in Lv. Now using the same procedure again,
construct the classifier d using all of L.
For V large, each of the V classifiers is constructed using a learning sample of size N(1 − 1/V), nearly as large as L. The basic assumption of cross-validation is that the procedure is “stable.” That is, the classifiers d(v), v = 1, ..., V, each constructed using almost all of L, have misclassification rates R*(d(v)) nearly equal to R*(d).
Guided by this heuristic, define the V-fold cross-validation estimate RCV(d) as
RCV(d) = (1/V) Σv Rts(d(v)).  (1.11)
N-fold cross-validation is the “leave-one-out” estimate. For each n, n = 1, ..., N, the
nth case is set aside and the classifier constructed using the other N - 1 cases. Then the
nth case is used as a single-case test sample and R*(d) estimated by (1.11).
Cross-validation is parsimonious with data. Every case in L is used to construct d,
and every case is used exactly once in a test sample. In tree structured classifiers
tenfold cross-validation has been used, and the resulting estimators have been
satisfactorily close to R*(d) on simulated data.
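The following Python outline sketches the V-fold estimate (1.11); it is illustrative only, and build_classifier stands for whatever construction procedure is applied to a learning sample (here a hypothetical function returning a classifier d(x)).

import numpy as np

def cross_validation_estimate(build_classifier, X, y, V=10, seed=None):
    """V-fold cross-validation estimate of the misclassification rate.

    build_classifier(X, y) must return a function d(x) giving a class label.
    """
    rng = np.random.default_rng(seed)
    N = len(y)
    folds = np.array_split(rng.permutation(N), V)   # V subsets of nearly equal size
    errors = 0
    for test_idx in folds:
        train = np.ones(N, dtype=bool)
        train[test_idx] = False
        d_v = build_classifier(X[train], y[train])              # built on L - L_v
        predictions = np.array([d_v(x) for x in X[test_idx]])   # tested on L_v
        errors += int(np.sum(predictions != y[test_idx]))
    # Pooled error count over all folds; with folds of equal size this
    # coincides with the average of the V test sample estimates in (1.11).
    return errors / N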
The bootstrap method can also be used to estimate R*(d), but may not work well
when applied to tree structured classifiers (see Section 11.7).
1.5 THE BAYES RULE AND CURRENT CLASSIFICATION
PROCEDURES
The major guide that has been used in the construction of classifiers is the concept of
the Bayes rule. If the data are drawn from a probability distribution P(A, j), then the
form of the most accurate rule can be given in terms of P(A, j). This rule is called the
Bayes rule and is denoted by dB(x).
To be more precise, suppose that (X, Y), X ∈ X, Y ∈ C, is a random sample from the probability distribution P(A, j) on X × C; i.e., P(X ∈ A, Y = j) = P(A, j).
DEFINITION 1.12. dB(x) is a Bayes rule if for any other classifier d(x),
P(dB(X) ≠ Y) ≤ P(d(X) ≠ Y).
The Bayes misclassification rate is
RB = P(dB(X) ≠ Y).
To illustrate how dB(x) can be derived from P(A, j), we give its form in an important
special case.
DEFINITION 1.13. Define the prior class probabilities π(j), j = 1, ..., J, as
π(j) = P(Y = j)
and the probability distribution of the jth class measurement vectors by
P(A|j) = P(X ∈ A | Y = j).
Then,
ASSUMPTION 1.14. For every j, the distribution P(·|j) has a density fj(x), so that
P(A|j) = ∫A fj(x)dx.
THEOREM 1.15. Under Assumption 1.14 the Bayes rule is given by
dB(x) = j  if  fj(x)π(j) = maxi fi(x)π(i),  (1.16)
and the Bayes misclassification rate is
RB = 1 − ∫ maxj [fj(x)π(j)] dx.  (1.17)
Although dB is called the Bayes rule, it is also recognizable as a maximum
likelihood rule: Classify x as that j for which fj(x)π(j) is maximum. As a minor point,
note that (1.16) does not uniquely define dB(x) on points x such that max fj(x)π(j) is
achieved by two or more different j’s. In this situation, define dB(x) arbitrarily to be
any one of the maximizing j’s.
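For a known distribution, the rule (1.16) can be written down directly. The sketch below assumes the class densities and priors are supplied by the user as Python callables and a sequence; it is an illustration of the maximum likelihood form of the rule, not part of the book's software.

import numpy as np

def bayes_rule(x, densities, priors):
    """Return the class j (1, ..., J) maximizing f_j(x) * pi(j).

    densities: list of J callables, densities[j-1](x) = f_j(x).
    priors:    sequence of J prior probabilities pi(j).
    Ties are broken in favor of the smallest class label.
    """
    scores = np.array([f(x) * p for f, p in zip(densities, priors)])
    return int(np.argmax(scores)) + 1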
The proof of Theorem 1.15 is simple. For any classifier d with corresponding partition A1, ..., AJ, under Assumption 1.14,
P(d(X) = Y) = Σj P(X ∈ Aj, Y = j) = Σj ∫Aj fj(x)π(j)dx ≤ ∫ maxj [fj(x)π(j)]dx,
and equality is achieved if d(x) equals that j for which fj(x)π(j) is a maximum. Therefore, the rule dB given in (1.16) has the property that for any other classifier d,
P(dB(X) ≠ Y) ≤ P(d(X) ≠ Y).
This shows that dB is a Bayes rule and establishes (1.17) as the correct expression for the Bayes misclassification rate.
In the simulated examples we use later on, the data are generated from a known
probability distribution. For these examples, dB was derived and then the values of RB
computed. Since RB is the minimum misclassification rate attainable, knowing RB and
comparing it with the accuracy of the tree structured classifiers give some idea of how
effective they are.
In practice, neither the π(j) nor the fj(x) are known. The π(j) can either be estimated
as the proportion of class j cases in L or their values supplied through other knowledge
about the problem. The thorny issue is getting at the fj(x). The three most commonly
used classification procedures
Discriminant analysis
Kernel density estimation
Kth nearest neighbor
attempt, in different ways, to approximate the Bayes rule by using the learning
samples L to get estimates of fj(x).
Discriminant analysis assumes that all fj(x) are multivariate normal densities with
common covariance matrix Γ and different mean vectors {μj}. Estimating Γ and the μj in the usual way gives estimates of the fj(x). These are substituted into the
Bayes optimal rule to give the classification partition.
In kernel density estimation, a kernel function K(x) ≥ 0 is chosen, normalized so that
∫ K(∥x∥)dx = 1.
Then fj(x) is estimated by a sum of the kernels centered at the class j data points, where Nj is the number of cases in the jth class and the sum is over the Nj measurement vectors xn corresponding to cases in the jth class.
The kth nearest neighbor rule, due to Fix and Hodges (1951), has this simple form:
Define a metric ∥x∥ on X and fix an integer k > 0. At any point x, find the k nearest
neighbors to x in L. Classify x as class j if more of the k nearest neighbors are in class j
than in any other class. (This is equivalent to using density estimates for fj based on
the number of class j points among the k nearest neighbors.)
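A bare-bones version of the kth nearest neighbor rule, written here with the Euclidean metric (a choice of convenience; the sensitivity to the metric is taken up below):

import numpy as np
from collections import Counter

def knn_classify(x, X_learn, y_learn, k=5):
    """Classify x by majority vote among its k nearest neighbors in the learning sample."""
    distances = np.linalg.norm(X_learn - x, axis=1)   # Euclidean distances to every case in L
    nearest = np.argsort(distances)[:k]               # indices of the k closest cases
    votes = Counter(y_learn[nearest].tolist())
    return votes.most_common(1)[0][0]                 # most frequent class among the neighbors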
The kernel density estimation and kth nearest neighbor methods make minimal
assumptions about the form of the underlying distribution. But there are serious
limitations common to both methods.
1. They are sensitive to the choice of the metric ∥x∥, and there is usually no
intrinsically preferred definition.
2. There is no natural or simple way to handle categorical variables and missing
data.
3. They are computationally expensive as classifiers; L must be stored, the
interpoint distances and d(x) recomputed for each new point x.
4. Most serious, they give very little usable information regarding the structure of
the data.
Surveys of the literature on these and other methods of classification are given in
Kanal (1974) and Hand (1981).
The use of classification trees did not come about as an abstract exercise. Problems
arose that could not be handled in an easy or natural way by any of the methods
discussed above. The next chapter begins with a description of one of these problems.
2
INTRODUCTION TO TREE CLASSIFICATION
2.1 THE SHIP CLASSIFICATION PROBLEM
The ship classification project (Hooper and Lucero, 1976) involved recognition of six
ship classes through their radar range profiles. The data were gathered by an airplane
flying in large circles around ships of six distinct structural types. The electronics in
the airborne radar gave the intensity of the radar return as a function of distance (or
range) at 2-foot intervals from the airplane to the objects reflecting the radar pulses.
Over each small time period, then, the airplane took a profile of the ship, where the
profile consisted of the intensity of the radar returns from various parts of the ship
versus the distance of these parts from the airplane.
The intensity of the radar returns from the ocean was small. When the profile was
smoothed, it was not difficult to detect the ranges where the ship returns began and
ended. The data were normalized so that one end of the smoothed profile corresponded
to x = 0 and the other end to x = 1. The resulting radar range profiles were then
continuous curves on the interval 0 ≤ x ≤ 1, going to zero at the endpoints and
otherwise positive (see Figure 2.1).
FIGURE 2.1 Typical range profile.
The peaks on each profile correspond to major structural elements on the ship that act
as reflectors.
Unfortunately, the shape of the profile changes with the angle θ between the
centerline of the ship and the airplane (see Figure 2.2).
FIGURE 2.2
At the broadside angles (θ = 90°, θ = 270°), the points on the ship closest and farthest
from the airplane may differ only by a few dozen feet. The profile contains very little
information and may have only one peak. At bow and stern (θ = 0°, θ = 180°) the
profile has the most detail.
The data consisted of a number of profiles for each of the different ship classes
taken at angles about 20° apart around the compass. The goal was to construct a
classifier which could take as input a profile at an unknown angle from one of the six
classes and produce reliable predictions of class membership.
After some initial inspection, it was noticed that while the profiles changed with
angle, the positions of the peaks stayed relatively constant. That is, in the profiles
belonging to a given ship class, as long as a peak did not disappear, its x-coordinate
stayed about the same (if bow and stern were appropriately labeled).
One of the initial difficulties in the project was reduction of dimensionality. Much
of the information in any given profile was redundant. Profile heights corresponding to
neighboring x values were highly correlated. In view of the initial look at the profiles,
the decision was made to extract from each profile the vector of locations of the local
maxima. Thus, to each profile was associated a vector of the form (x1, x2, ...), where x1
was the position of the first local maximum, etc.
This brought a new difficulty. The data had variable dimensionality, ranging from a
low of 1 to a high of 15. None of the available classification methods seemed
appropriate to this data structure. The most satisfactory solution was the tree structured
approach outlined in the following sections.
2.2 TREE STRUCTURED CLASSIFIERS
Tree structured classifiers, or, more correctly, binary tree structured classifiers, are
constructed by repeated splits of subsets of X into two descendant subsets, beginning
with X itself. This process, for a hypothetical six-class tree, is pictured in Figure 2.3. In
Figure 2.3, X2 and X3 are disjoint, with
FIGURE 2.3
X = X2 ∪ X3. Similarly, X4 and X5 are disjoint with X4 ∪ X5 = X2, and X6 ∪ X7 = X3.
Those subsets which are not split, in this case X6, X8, X10, X11, X12, X14, X15, X16, and
X17, are called terminal subsets. This is indicated, here and in subsequent figures, by a
rectangular box; nonterminal subsets are indicated by circles.
The terminal subsets form a partition of X. Each terminal subset is designated by a
class label. There may be two or more terminal subsets with the same class label. The
partition corresponding to the classifier is gotten by putting together all terminal
subsets corresponding to the same class. Thus,
A1 = X15
A2 = X11 ∪ X14
A3 = X10 ∪ X16
A4 = X6 ∪ X17
A5 = X8
A6 = X12.
The splits are formed by conditions on the coordinates of x = (x1, x2, ...). For
example, split 1 of X into X2 and X3 could be of the form
X2 = {x ∈ X; x4 ≤ 7}
X3 = {x ∈ X; x4 > 7}.  (2.1)
Split 3 of X3 into X6 and X7 could be of the form
X6 = {x ∈ X3; x3 + x5 ≤ -2}
X7 = {x ∈ X3; x3 + x5 > -2}.
The tree classifier predicts a class for the measurement vector x in this way: From
the definition of the first split, it is determined whether x goes into X2 or X3. For
example, if (2.1) is used, x goes into X2 if x4 ≤ 7, and into X3 if x4 > 7. If x goes into
X3, then from the definition of split 3, it is determined whether x goes into X6 or X7.
When x finally moves into a terminal subset, its predicted class is given by the class
label attached to that terminal subset.
At this point we change terminology to that of tree theory. From now on,
A node t = a subset of X and
The root node t1 = X.
Terminal subsets become terminal nodes, and nonterminal subsets are nonterminal
nodes. Figure 2.3 becomes relabeled as shown in Figure 2.4.
The entire construction of a tree, then, revolves around three elements:
1. The selection of the splits;
2. The decisions when to declare a node terminal or to continue splitting it;
3. The assignment of each terminal node to a class.
FIGURE 2.4
The crux of the problem is how to use the data L to determine the splits, the terminal
nodes, and their assignments. It turns out that the class assignment problem is simple.
The whole story is in finding good splits and in knowing when to stop splitting.
2.3 CONSTRUCTION OF THE TREE CLASSIFIER
The first problem in tree construction is how to use L to determine the binary splits of
X into smaller and smaller pieces. The fundamental idea is to select each split of a
subset so that the data in each of the descendant subsets are “purer” than the data in
the parent subset.
For instance, in the six-class ship problem, denote by p1, ..., p6 the proportions of
class 1, ..., 6 profiles in any node. For the root node t1, (p1, ..., p6) ≃ (1/6, ..., 1/6). A
good split of t1 would be one that separates the profiles in L so that all profiles in
classes 1, 2, 3 go to the left node and the profiles in 4, 5, 6 go to the right node (Figure
2.5).
FIGURE 2.5
Once a good split of t1 is found, then a search is made for good splits of each of the
two descendant nodes t2, t3.
This idea of finding splits of nodes so as to give “purer” descendant nodes was
implemented in this way:
1. Define the node proportions p(j|t), j = 1, ..., 6, to be the proportion of the cases
xn ∈ t belonging to class j, so that p(1|t) + ... + p(6|t) = 1.
2. Define a measure i(t) of the impurity of t as a nonnegative function φ of the p(1|t), ..., p(6|t) such that φ(1/6, 1/6, 1/6, 1/6, 1/6, 1/6) = maximum and
φ(1, 0, 0, 0, 0, 0) = 0, φ(0, 1, 0, 0, 0, 0) = 0, ..., φ(0, 0, 0, 0, 0, 1) = 0.
That is, the node impurity is largest when all classes are equally mixed together in it,
and smallest when the node contains only one class.
For any node t, suppose that there is a candidate split s of the node which divides it
into tL and tR such that a proportion pL of the cases in t go into tL and a proportion pR
go into tR (Figure 2.6).
FIGURE 2.6
Then the goodness of the split is defined to be the decrease in impurity
Δi(s, t) = i(t) − pL i(tL) − pR i(tR).
3. Define a candidate set S of binary splits s at each node. Generally, it is simpler
to conceive of the set S of splits as being generated by a set of questions Q,
where each question in Q is of the form
Is x ∈ A?, A ⊂ X.
Then the associated split s sends all xn in t that answer “yes” to tL and all xn in t
that answer “no” to tR.
In the ship project the node impurity was defined as
i(t) = − Σj p(j|t) log p(j|t).
because it was a familiar function having the properties required by step 2. In later
work, other definitions of i(t) became preferable.
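In code, the goodness-of-split computation amounts to a few lines. The sketch below uses the entropy impurity above; labels is the array of class labels of the cases in the node, and go_left is a hypothetical boolean array recording which side a candidate split sends each case (an illustration, not the CART implementation).

import numpy as np

def impurity(labels, classes):
    """Entropy impurity i(t) = -sum_j p(j|t) log p(j|t) for the node holding labels."""
    p = np.array([np.mean(labels == j) for j in classes])
    p = p[p > 0]                      # the convention 0 log 0 = 0
    return float(-np.sum(p * np.log(p)))

def goodness_of_split(labels, go_left, classes):
    """Decrease in impurity  delta i(s, t) = i(t) - p_L i(t_L) - p_R i(t_R)."""
    p_left = float(np.mean(go_left))
    if p_left in (0.0, 1.0):          # degenerate split: one side empty, no decrease
        return 0.0
    return (impurity(labels, classes)
            - p_left * impurity(labels[go_left], classes)
            - (1.0 - p_left) * impurity(labels[~go_left], classes))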
The set of questions Q was of the form:
Do you have a local maximum in the interval [a, b], where a ≤ b and a, b range
from 0 to 1 in steps of .01?
Thus, a typical question in Q was: Do you have a local max in [.31, .53]? Put slightly
differently, since the reduced data vectors consisted of the local max positions, the
question was: Do you have any coordinate in the range [.31, .53]?
This gave a total of several thousand questions in Q. The best split s* of t1 was taken to be the one maximizing the decrease in impurity Δi(s, t1). Then t1 was split into t2 and t3 using the split s*, and the same search procedure for the best s ∈ S was repeated on both t2 and t3 separately.
best s ∈ S repeated on both t2 and t3 separately.
To terminate the tree growing, a heuristic rule was designed. When a node t was
reached such that no significant decrease in impurity was possible, then t was not split
and became a terminal node.
The class character of a terminal node was determined by the plurality rule: if p(j0|t) = maxj p(j|t), then the terminal node t was designated as a class j0 terminal node.
FIGURE 2.7
2.4 INITIAL TREE GROWING METHODOLOGY
Denote by Nj the number of class j cases in the learning sample L, by N(t) the number of cases falling into node t, and by Nj(t) the number of class j cases in t. Given prior class probabilities π(j), define
p(j, t) = π(j)Nj(t)/Nj  (2.2)
as the resubstitution estimate for the probability that a case will both be in class j and fall into node t.
The resubstitution estimate p(t) of the probability that any case falls into node t is defined by
p(t) = Σj p(j, t).  (2.3)
The resubstitution estimate of the probability that a case is in class j given that it falls into node t is given by
p(j|t) = p(j, t)/p(t)  (2.4)
and satisfies
Σj p(j|t) = 1.
When {π(j)} = {Nj/N}, then p(j|t) = Nj(t)/N(t), so the {p(j|t)} are the relative
proportions of class j cases in node t.
Note that throughout this book lower case p will denote an estimated probability
and upper case P a theoretical probability.
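A small sketch of the estimates (2.2)-(2.4) in Python (illustrative only; node_labels and all_labels are hypothetical label arrays, and every class is assumed to appear in L):

import numpy as np

def node_probabilities(node_labels, all_labels, priors, classes):
    """Resubstitution estimates p(j, t), p(t), and p(j|t) for one node t.

    priors: dict mapping each class j to its prior pi(j).
    """
    p_jt = {}
    for j in classes:
        N_j = int(np.sum(all_labels == j))    # class j cases in L
        N_jt = int(np.sum(node_labels == j))  # class j cases falling into t
        p_jt[j] = priors[j] * N_jt / N_j      # estimate of P(class j and node t)
    p_t = sum(p_jt.values())                  # estimate of P(node t)
    p_j_given_t = {j: p_jt[j] / p_t for j in classes}
    return p_jt, p_t, p_j_given_t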
The four elements needed in the initial tree growing procedure were
1. A set Q of binary questions of the form {Is x ∈ A?}, A ⊂ X;
2. A goodness of split criterion that can be evaluated for any split s of any node t;
3. A stop-splitting rule;
4. A rule for assigning every terminal node to a class.
2.4.1 The Standard Set of Questions
If the data have standard structure, the set Q of questions can be standardized.
Assume that the measurement vectors have the form
x = (x1, ..., xM),
where M is the fixed dimensionality and the variables x1, ..., xM can be a mixture of
ordered and categorical types. The standardized set of questions Q is defined as
follows:
Is x1 ≤ 3.2?
Is x3 ≤ -6.8?
Is x4 ∈ {b1, b3} ?
and so on. There are not an infinite number of distinct splits of the data. For
example, if x1 is ordered, then the data points in L contain at most N distinct values
x1,1, x1,2, ..., x1,N of x1.
There are at most N different splits generated by the set of questions {Is x1 ≤ c?}.
These are given by {Is x1 ≤ cn?}, n = 1, ..., N′ ≤ N, where the cn are taken halfway
between consecutive distinct data values of x1.
For a categorical variable xm, since {xm ∈ s} and {xm ∉ s} generate the same split
with tL and tR reversed, if xm takes on L distinct values, then 2^(L−1) − 1 splits are defined
on the values of xm.
At each node the tree algorithm searches through the variables one by one,
beginning with x1 and continuing up to xM. For each variable it finds the best split.
Then it compares the M best single variable splits and selects the best of the best.
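A sketch of this search, restricted to ordered variables (the 2^(L−1) − 1 subset splits on a categorical variable are omitted for brevity); the impurity argument is any node impurity function, for example the entropy impurity sketched earlier. This is an illustration of the search, not the CART code.

import numpy as np

def best_split(X, y, impurity):
    """Search all single-variable splits 'Is x_m <= c?' over ordered variables.

    X: (N, M) array of ordered measurements; y: (N,) array of class labels.
    impurity: function mapping an array of labels to the node impurity i(t).
    Returns (variable index, cutpoint, decrease in impurity) of the best split.
    """
    N, M = X.shape
    parent = impurity(y)
    best = (None, None, 0.0)
    for m in range(M):
        values = np.unique(X[:, m])
        cutpoints = (values[:-1] + values[1:]) / 2.0   # halfway between distinct values
        for c in cutpoints:
            left = X[:, m] <= c
            p_left = float(left.mean())
            decrease = (parent
                        - p_left * impurity(y[left])
                        - (1.0 - p_left) * impurity(y[~left]))
            if decrease > best[2]:
                best = (m, c, decrease)
    return best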
The computer program CART (Classification and Regression Trees) incorporates
this standardized set of splits. Since most of the problems ordinarily encountered have
a standard data structure, it has become a flexible and widely used tool.
When fixed-dimensional data have only ordered variables, another way of looking
at the tree structured procedure is as a recursive partitioning of the data space into
rectangles.
Consider a two-class tree using data consisting of two ordered variables x1, x2 with
0 ≤ xi ≤ 1, i = 1, 2. Suppose that the tree diagram looks like Figure 2.8.
An equivalent way of looking at this tree is that it divides the unit square as shown
in Figure 2.9.
From this geometric viewpoint, the tree procedure recursively partitions X into
rectangles such that the populations within each rectangle become more and more
class homogeneous.
FIGURE 2.8
FIGURE 2.9
2.4.2 The Splitting and Stop-Splitting Rule
The goodness of split criterion was originally derived from an impurity function.
DEFINITION 2.5. An impurity function is a function φ defined on the set of all J-tuples of numbers (p1, ..., pJ) satisfying pj ≥ 0, j = 1, ..., J, Σj pj = 1, with the properties
(i) φ is a maximum only at the point (1/J, 1/J, ..., 1/J),
(ii) φ achieves its minimum only at the points (1, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, 0, ..., 0, 1),
(iii) φ is a symmetric function of p1, ..., pJ.
Given an impurity function φ, define the impurity measure i(t) of a node t as i(t) = φ(p(1|t), ..., p(J|t)) and the overall tree impurity I(T) as the sum of I(t) = i(t)p(t) over the terminal nodes t of T.
It is easy to see that selecting the splits that maximize Δi(s, t) is equivalent to selecting those splits that minimize the overall tree impurity I(T). Take any terminal node t of T and, using a split s, split the node into tL and tR. The new tree T′ has impurity
I(T′) = I(T) − I(t) + I(tL) + I(tR).
The decrease in tree impurity is
ΔI(s, t) = I(T) − I(T′) = I(t) − I(tL) − I(tR).  (2.7)
Define the proportions pL, pR of the node t population that go to tL and tR,
respectively, by
pL = p(tL)/p(t),  pR = p(tR)/p(t).
Then
pL + pR = 1 and ΔI(s, t) = p(t)[i(t) − pL i(tL) − pR i(tR)] = p(t)Δi(s, t).
Since ΔI(s, t) differs from Δi(s, t) by the factor p(t), the same split s* maximizes both
expressions. Thus, the split selection procedure can be thought of as a repeated attempt
to minimize overall tree impurity.
The initial stop-splitting rule was simple (and unsatisfactory). Set a threshold β > 0, and declare a node t terminal if
maxs∈S ΔI(s, t) < β.  (2.8)
2.4.3 The Class Assignment Rule and Resubstitution Estimates
DEFINITION 2.13. The class assignment rule assigns to each terminal node t the class j*(t) for which p(j|t) is largest; that is, j*(t) = j0 if p(j0|t) = maxj p(j|t) (ties broken arbitrarily).
The resubstitution estimate of the probability of misclassification, given that a case falls into node t, is then
r(t) = 1 − maxj p(j|t).
Denote
R(t) = r(t)p(t).
Then the resubstitution estimate for the overall misclassification rate R*(T) of the tree classifier T is
R(T) = Σt R(t),
where the sum is over the terminal nodes t of T.
Up to now, the assumption has been tacitly made that the cost or loss in
misclassifying a class j object as a class i object was the same for all i ≠ j. In some
classification problems this is not a realistic setting. Therefore, we introduce a set of
misclassification costs c(i|j), where
DEFINITION 2.12. c(i|j) is the cost of misclassifying a class j object as a class i object
and satisfies
(i) c(i|j) ≥ 0, i ≠ j,
(ii) c(i|j) = 0, i = j.
Given a node t with estimated node probabilities p(j|t), j = 1, ..., J, if a randomly
selected object of unknown class falls into t and is classified as class i, then the
estimated expected misclassification cost is
Σj C(i|j)p(j|t).
A natural node assignment rule is to select i to minimize this expression. Therefore, take j*(t) to be the class i that minimizes Σj C(i|j)p(j|t), and set r(t) = mini Σj C(i|j)p(j|t). Define the resubstitution estimate R(T) of the misclassification cost of the tree T by
R(T) = Σt R(t),
where R(t) = r(t)p(t) and the sum is over the terminal nodes of T.
Note that in the unit misclassification cost case, C(i|j) = 1, i ≠ j,
Σj C(i|j)p(j|t) = 1 − p(i|t),
and the minimum cost rule reduces to the rule given in Definition 2.13.
Henceforth, we take j*(t) as the class assignment rule with no further worry.
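A sketch of the minimum-cost assignment for one node, with the classes indexed 0, ..., J−1 and the within-node probabilities p(j|t) and cost matrix C(i|j) supplied as arrays (illustrative only):

import numpy as np

def assign_class(p_given_t, cost):
    """Choose the class i minimizing the expected cost sum_j C(i|j) p(j|t).

    p_given_t: length-J array of the within-node probabilities p(j|t).
    cost:      (J, J) array with cost[i, j] = C(i|j) and zeros on the diagonal.
    Returns (assigned class index i*, the resulting r(t)).
    """
    expected_cost = cost @ p_given_t       # entry i is sum_j C(i|j) p(j|t)
    i_star = int(np.argmin(expected_cost))
    return i_star, float(expected_cost[i_star])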
An important property of R(T) is that the more one splits in any way, the smaller
R(T) becomes. More precisely, if T′ is gotten from T by splitting in any way a terminal node of T, then
R(T′) ≤ R(T).
Putting this another way:
PROPOSITION 2.14. For any split of a node t into tL and tR,
R(t) ≥ R(tL) + R(tR).
2.5 METHODOLOGICAL DEVELOPMENT
In spite of the attractiveness of tree structured classifiers, it soon became apparent that
there were serious deficiencies in the tree growing method used in the ship
classification project.
The methodology developed to deal with the deficiencies and to make tree
structured classification more flexible and accurate is covered in the next three
chapters and, in a more theoretical setting, in Chapters 9 to 12.
As a roadmap, a brief outline follows of the issues and the methods developed to
deal with them.
2.5.1 Growing Right Sized Trees: A Primary Issue
The most significant difficulty was that the trees often gave dishonest results. For
instance, suppose that the stopping rule (2.8) is used, with the threshold set so low that
every terminal node has only one data point. Then the p(j|t) are all zero or 1. R(t) = 0,
and the resubstitution estimate R(T) for the misclassification rate is zero. In general,
R(T) decreases as the number of terminal nodes increases. The more you split, the
better you think you are doing.
Very few of the trees initially grown held up when a test sample was run through
them. The resubstitution estimates R(T) were unrealistically low and the trees larger
than the information in the data warranted. This results from the nature of the tree
growing process as it continuously optimizes and squeezes boundaries between classes
in the learning set.
Using more complicated stopping rules did not help. Depending on the
thresholding, the splitting was either stopped too soon at some terminal nodes or
continued too far in other parts of the tree. A satisfactory resolution came only after a
fundamental shift in focus. Instead of attempting to stop the splitting at the right set of
terminal nodes, continue the splitting until all terminal nodes are very small, resulting
in a large tree. Selectively prune (recombine) this large tree upward, getting a
decreasing sequence of subtrees. Then use cross-validation or test sample estimates to
pick out that subtree having the lowest estimated misclassification rate.
This process is a central element in the methodology and is covered in the next
chapter.
2.5.2 Splitting Rules
Many different criteria can be defined for selecting the best split at each node. As
noted, in the ship classification project, the split selected was the split that most reduced the node impurity defined by
i(t) = − Σj p(j|t) log p(j|t).
In later work, two other criteria came to be used most often. One is the Gini index of diversity,
i(t) = Σi≠j p(i|t)p(j|t).
The other is the twoing rule: at a node t, with s splitting t into tL and tR, choose the split s that maximizes
(pL pR / 4) [Σj |p(j|tL) − p(j|tR)|]^2.
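The twoing criterion for a single candidate split can be computed directly from the two sets of node proportions; the sketch below takes the node's label array and a boolean go-left array as hypothetical inputs (an illustration, not the CART code).

import numpy as np

def twoing(labels, go_left, classes):
    """Twoing value  (p_L p_R / 4) [ sum_j |p(j|t_L) - p(j|t_R)| ]^2  of one split."""
    p_left = float(np.mean(go_left))
    if p_left in (0.0, 1.0):              # degenerate split: one side empty
        return 0.0
    left, right = labels[go_left], labels[~go_left]
    diff = sum(abs(float(np.mean(left == j)) - float(np.mean(right == j))) for j in classes)
    return (p_left * (1.0 - p_left) / 4.0) * diff ** 2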
2.5.3 Variable Combinations
Another deficiency in trees using a standard structure is that all splits are on single
variables. Put another way, all splits are perpendicular to the coordinate axes. In
situations where the class structure depends on combinations of variables, the standard
tree program will do poorly at uncovering the structure. For instance, consider the
two-class two-variable data illustrated in Figure 2.10.
FIGURE 2.10
Discriminant analysis would do very well on this data set. The standard tree program
would split many times in an attempt to approximate the separating hyperplane by
rectangles. It would be difficult to see the linear structures of the data by examining
the tree output.
In problems where linear structure is suspected, the set of allowable splits is
extended to include all linear combination splits of the form
Is ∑m am xm ≤ c?
An algorithm was developed to search through such splits in an effort to find the one
that maximizes the goodness of split criterion.
With linear combination splits incorporated, tree structured classification gives
results competitive with or better than linear discriminant analysis on problems such as
that shown in Figure 2.10.
In another type of classification problem, variables tend to appear in certain
Boolean combinations. For instance, in medical diagnosis, frequently the diagnostic
questions are of the type: Does the patient have (symptom A and symptom B) or
symptom C?
In problems where a Boolean combination structure is suspected and the
measurement space is high dimensional, it is useful to extend the allowable splits to
include Boolean combinations of splits. However, since the number of such splits can
increase rapidly with the length of the Boolean expressions allowed, a stepwise
procedure is devised to make the search computationally feasible.
Both of these variable combination methods are covered in Chapter 5.
2.5.4 Missing Values
There are two aspects to the missing value problem. First, some cases in L may have
some missing measurement values (we assume the class label is always present).
Second, we may want the completed tree to predict a class label for a measurement
vector some of whose values are missing.
The missing value algorithm discussed in Chapter 5 handles both of these problems
through the use of surrogate splits. The idea is this: Define a measure of similarity
between any two splits s, s′ of a node t. If the best split of t is the split s on the variable xm, find the split s′ on the variables other than xm that is most similar to s. Call s′ the best surrogate for s. Similarly, define the second best surrogate, third best, and so on.
If a case has xm missing in its measurement, decide whether it goes to tL or tR by
using the best surrogate split. If it is missing the variable containing the best surrogate
split, use the second best, and so on.
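One simple notion of similarity between two splits of a node is the fraction of the node's cases that they send to the same side; the measure actually developed in Chapter 5 is a refinement of this idea. The sketch below ranks candidate surrogate splits by this simple agreement (illustrative Python, not the book's algorithm).

import numpy as np

def split_agreement(go_left_s, go_left_s2):
    """Fraction of the node's cases that splits s and s' send to the same side."""
    return float(np.mean(go_left_s == go_left_s2))

def rank_surrogates(go_left_best, candidate_splits):
    """Order candidate splits {name: go-left boolean array} by agreement with the best split."""
    return sorted(candidate_splits.items(),
                  key=lambda item: split_agreement(go_left_best, item[1]),
                  reverse=True)   # first entry is the best surrogate, then the second best, ...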
2.5.5 Tree Interpretation
Another difficulty in tree structured classification is that the simple structure of the
final classification tree can be deceptive, leading to misinterpretation. For instance, if a
variable is never split on in the final tree, one interpretation might be that the variable
has very little association with class membership. The truth may be that its effect was
masked by other variables.
Tree structures may be unstable. If one variable narrowly masks another, then a
small change in the priors or learning sample may shift the split from one variable to
another. Two splits of a node may be dissimilar but have almost the same goodness of
split. Small changes may sometimes favor one, sometimes the other.
Various procedures are suggested in Chapter 5 to assist in tree interpretation. In
particular, a method is given for ranking variables in terms of their potential effect on
the classification. Even though a variable may not appear in a split on the final tree, its
ranking may be high, giving an indication of masking.
Often, the analyst may want to explore a range of parameters and/or the effects of
adding or deleting variables. Since a number of trees will be grown and used for
comparison, the accuracy of full cross-validation may not be necessary and an
alternative rapid method for growing exploratory trees is suggested.
2.5.6 Within-Node Misclassification Rate Estimates
2.5.7 Computational Efficiency
2.5.8 Other Issues
There are some other problems inherent in the tree procedure. In brief, “end cuts” or
splits that tend to separate a node into one small and one large subset are, all other
things being equal, favored over “middle cuts.” This is discussed in Section 11.8.
Also, variable selection is biased in favor of those variables having more values and
thus offering more splits.
Finally, another problem frequently mentioned (by others, not by us) is that the tree
procedure is only one-step optimal and not overall optimal. That is, suppose the tree
growing procedure produced, say, 11 rectangular terminal nodes. If one could search
all possible partitions of X into 11 rectangles for that partition that minimizes the sum
of the node impurities, the two results might be quite different.
This issue is analogous to the familiar question in linear regression of how well the
stepwise procedures do as compared with “best subsets” procedures. We do not
address this problem. At this stage of computer technology, an overall optimal tree
growing procedure does not appear feasible for any reasonably sized data set.
The issue of “honesty” is more critical than “optimality.” Constructing a good
classifier whose performance will stand up under test samples and that is useful and
practical is our first priority.
2.6 TWO RUNNING EXAMPLES
To illustrate various parts of the methodology, two models have been constructed for
generating data. The first is a digit recognition model. It has a simple structure and can
be analyzed analytically. It is complemented by a more complex waveform recognition
model.
2.6.1 Digit Recognition Example
Digits are ordinarily displayed on electronic watches and calculators using seven
horizontal and vertical lights in on-off combinations (see Figure 2.11).
FIGURE 2.11
Number the lights as shown in Figure 2.12. Let i denote the ith digit, i = 1, 2, ..., 9,
0, and take (xi1, ...,xi7 ) to be a seven-dimensional vector of zeros and ones with xim =
1 if the light in the mth position is on for the ith digit, and xim = 0 otherwise. The
values of xim are given in Table 2.1. Set C = {1, ..., 10} and let X be the set of all possible 7-tuples x = (x1, ..., x7) of zeros and ones.
FIGURE 2.12
TABLE 2.1
The data for the example are generated from a faulty calculator. Each of the seven
lights has probability .1 of not doing what it is supposed to do. More precisely, the
data consist of outcomes from the random vector (X1, ..., X7, Y), where Y is the class label and assumes the values 1, ..., 10 with equal probability and X1, ..., X7 are zero-one variables. Given the value of Y, the X1, ..., X7 are each independently equal to the
value corresponding to Y in Table 2.1 with probability .9 and are in error with
probability .1.
For example, for each m, P(Xm = xim | Y = i) = .9 and P(Xm = 1 − xim | Y = i) = .1.
Two hundred samples from this distribution constitute the learning sample L. That is, every case in L is of the form (x1, ..., x7, j), where j ∈ C is a class label and the measurement vector (x1, ..., x7) consists of zeros and ones. The data in L are displayed in Table 2.2.
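The faulty-calculator model is easy to simulate. The sketch below generates a learning sample of the same form; light_table must be filled in with the 0-1 values of Table 2.1, which is not reproduced here (illustrative Python, not the program used in the book).

import numpy as np

def generate_digit_sample(N, light_table, flip_prob=0.1, seed=None):
    """Generate N cases (X, y) from the faulty-calculator model.

    light_table: (10, 7) 0-1 integer array; row i-1 gives the intended lights for digit i.
    Each light independently shows the wrong state with probability flip_prob.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(1, 11, size=N)          # class labels 1, ..., 10, equally likely
    ideal = light_table[y - 1]               # intended light pattern for each case
    flips = rng.random((N, 7)) < flip_prob   # which lights malfunction
    X = np.where(flips, 1 - ideal, ideal)    # flip the faulty lights
    return X, y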
To use a tree structured classification construction on L, it is necessary to specify:
First: the set Q of questions.
(In this example, Q consisted of the seven questions: Is xm = 0?, m = 1, ..., 7.)
Second: a rule for selecting the best split.
(The twoing criterion mentioned in Section 2.5.2 was used.)
Third: a criterion for choosing the right sized tree.
(The pruning cross-validation method covered in Chapter 3 was used.)
The resulting tree is shown in Figure 2.13. There are 10 terminal nodes, each
corresponding to one class. This is accidental. In general, classification trees may have
any number of terminal nodes corresponding to a single class, and occasionally, some
class may have no corresponding terminal nodes.
Underneath each intermediate node is the question leading to the split. Those
answering yes go left, no to the right.
It is interesting to compare the tree with Table 2.1. For instance, terminal node 8
corresponding to class 1 consists of all measurement vectors such that x5 = 0, x4 = 0,
and x1 = 0. If the calculator were not faulty, these three conditions would completely
distinguish the digit 1 from the others. Similarly, node 9 corresponding to the digit 7
consists of all vectors such that x5 = 0, x4 = 0, and x1 = 1. Again, for a perfect
calculator, the digit 7 is distinguished by these conditions.
TABLE 2.2
FIGURE 2.13
For this particular tree, R*(T) was estimated by using a test sample of size 5000 as
.30. The cross-validation estimate, using only the learning sample, is .30. The
resubstitution estimate is .29. The Bayes rule can be solved for in this example. The
resulting Bayes misclassification rate is .26. In this example, there is not much room
for improvement over the tree classifier.
A variant of the digit recognition problem is gotten by adding pure noise variables.
Let the grid in Figure 2.12 be replaced by the larger grid in Figure 2.14 with 24 line
segments. Take the additional 17 lights to be on or off with probability .5
independently of each other and of the original 7 lights. The 17 lights generate 17
variables x8, ..., x24 with xm = 0 or 1 as the mth light is on or off. These variables are
pure noise in the sense that they are useless for the purpose of digit recognition.
FIGURE 2.14
A learning sample for this problem was constructed by taking each case in the
learning sample for the digit recognition problem, sampling 17 measurements from the
pure noise distribution, and adding these to the 7 measurements already in the case.
The new learning sample, then, consists of 200 cases, each having 24 measurement
values.
A classification tree was grown using this learning set, with the same questions,
split selection criterion, and pruning cross-validation method as in the previous tree.
The tree selected was identical to the tree of Figure 2.13. A test set of size 5000
estimated R*(T) as .30. The cross-validation estimate is .31 and the resubstitution
estimate is .29.
The fact that even with 17 added noise variables the tree selected is identical to the
original 7-variable digit recognition tree illustrates an important advantage of tree
structured classification over nearest neighbor or other nonparametric methods. Its
inherent structure is based on a procedure for distinguishing between those variables
useful for classification and those which are not. Of course, the distinction between
useful and useless variables is rarely as clearly defined as in the preceding example,
and the tree structured approach rarely so obviously triumphant.
2.6.2 Waveform Recognition Problem
Because of the elementary structure of the digit recognition problem, another, more
complex, example was constructed. It is a three-class problem based on the waveforms
h1(t), h2(t), h3(t) graphed in Figure 2.15.
Each class consists of a random convex combination of two of these waveforms
sampled at the integers with noise added. More specifically, the measurement vectors
are 21 dimensional: x = (x1, ..., x21). To generate a class 1 vector x, independently
generate a uniform random number u and 21 random numbers ε1, ..., ε21 normally
distributed with mean zero and variance 1. Then set
xm = uh1(m) + (1 - u)h2(m) + εm,  m = 1, ..., 21.
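To make the construction concrete, here is a small simulation sketch (ours, not from the text). The triangular shapes and peak locations assumed for h1, h2, h3, and the waveform pairs used for classes 2 and 3, are assumptions based on Figure 2.15 and the commonly used version of this example; only class 1 is fully specified by the formula above.

import random

def h(i, peak):
    # Assumed triangular waveform of half-width 6 peaking at `peak`.
    return max(6 - abs(i - peak), 0)

def h1(i): return h(i, 7)      # assumed peak locations for the three
def h2(i): return h(i, 15)     # waveforms of Figure 2.15
def h3(i): return h(i, 11)

WAVEFORM_PAIRS = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}   # assumed class mixtures

def draw_waveform(cls, rng):
    u = rng.random()                       # uniform random number u
    f, g = WAVEFORM_PAIRS[cls]
    # Random convex combination of two waveforms, sampled at the integers
    # 1, ..., 21, with N(0, 1) noise added.
    return [u * f(m) + (1 - u) * g(m) + rng.gauss(0, 1) for m in range(1, 22)]

rng = random.Random(0)
learning_sample = []
for _ in range(300):                       # N = 300, the three classes equally likely
    cls = rng.randint(1, 3)
    learning_sample.append((draw_waveform(cls, rng), cls))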
At the eleventh coordinate the third waveform has its peak, so low values of x11 would
tend to characterize class 1. The split sends 27 out of 36 class 1s left, with most of
classes 2 and 3 going right.
The right node resulting from the original split has almost equal proportions of
classes 1 and 2 (64, 68). The mechanism attempts to separate these two classes by the
split x10 ≤ 2.6 near the peak of the third waveform. We would expect class 1 to be low
at x10 and class 2 to be high. The resulting split confirms our expectations with most
of class 1 going left and most of class 2 going right.
The numbers to the right of each terminal node are the normalized counts, by class,
of the 5000 test sample vectors that were run down the tree and ended up in that
terminal node. For comparability, the counts are normalized so that the total in the
node equals the learning sample total.
FIGURE 2.15 Waveforms.
FIGURE 2.16
FIGURE 2.17 Waveform tree.
FIGURE 2.18
Comparing these two lists for the various terminal nodes, we can see that the
learning sample node proportions can occasionally give quite misleading estimates.
This is particularly true for the smaller nodes.
An analytic expression can be derived for the Bayes rule in this problem. Using this
rule on a test sample of size 5000 gave an error rate of .14.
This example was selected to illustrate features of the tree construction procedure.
That the tree classifier error rate is about twice the Bayes rate is an indication that
single coordinate splits are not well suited to extracting the available information from
this particular data set. The use of combinations of variables will be looked at in
Section 5.2.
2.7 THE ADVANTAGES OF THE TREE STRUCTURED APPROACH
5. It gives, with no additional effort, not only a classification, but also an estimate
of the misclassification probability for the object.
For example, the chlorine classification algorithm outputs not only the predicted
classification but also the value r(t) from the terminal node containing the mass spectra
of the unknown compound. This is useful in that if r(t), for example, has the value .40
(its maximum is .50 in a two class problem), this indicates that the classification is
highly uncertain and that an opinion from a trained mass spectra analyst may be
desirable. However, an r(t) value of .03 is quite reassuring. (Actually, the r(t) values
given are not the resubstitution estimates, but were estimated using an independent test
set. See Chapter 7.)
FIGURE 2.19
The tree structured method again weighs each point only as one among N and is not
appreciably affected by a few mislabeled points.
8. The tree procedure output gives easily understood and interpreted information
regarding the predictive structure of the data.
The tree structured methods have been used in a variety of applications in
collaboration with nonstatistically oriented chemists, doctors, meteorologists,
physicists, etc. Their reaction, almost universally, has been that the tree classifier
provides an illuminating and natural way of understanding the structure of the
problem.
3
RIGHT SIZED TREES AND HONEST ESTIMATES
3.1 INTRODUCTION
This chapter is concerned with two main issues: getting the right sized tree T and
getting more accurate estimates of the true probability of misclassification or of the
true expected misclassification cost R*(T).
The stepwise tree structure does an optimization at each step over a large number of
possible splits of the data. If only resubstitution estimates are used, the usual results
are too much splitting, trees that are much larger than the data warrant, and a
resubstitution estimate R(T) that is biased downward.
For instance, if the splitting is carried out to the point where each terminal node
contains only one data case, then each node is classified by the case it contains, and
the resubstitution estimate gives zero misclassification rate.
In general, more splits result in lower values of the resubstitution estimate R(T). In
this respect, the tree procedure is similar to stepwise linear regression, in which the
estimated R2 increases with each variable entered, encouraging the entry of variables
that have no predictive power when tested on independent samples drawn from the
same distribution.
In fact, stepwise regression simulations have shown that past a certain point, entry
of additional variables will cause the true R2 to decrease. The situation is similar with
trees. Too large a tree will have a higher true misclassification rate than the right sized
tree.
On the other hand, too small a tree will not use some of the classification
information available in L, again resulting in a higher true misclassification rate than
the right sized tree.
These problems are illustrated in Table 3.1 with some output from the digit
recognition with noise variables example. Different sized trees were grown. R(T) was
computed for each. An independent sample of size 5000 was run down each tree to
give a more accurate estimate of R*(T). This estimate is denoted Rts(T).
TABLE 3.1
Notice two important features in Table 3.1:
1. The estimate R(T) becomes increasingly less accurate as the trees grow larger.
2. The estimates Rts indicate that as the trees initially decrease in size, the true
misclassification rate decreases. Then it hits a minimum at the tree with 10
terminal nodes and begins to climb again as the trees get too small.
(The standard error of Rts was estimated as around .007.)
The selection of overly large trees and the use of the inaccurate resubstitution
estimate have led to much of the past criticism of tree structured procedures.
Trees were produced that seemed to indicate apparent data structure, but the extent
to which the splits were “informative” was questionable. Work was centered on
finding appropriate stopping rules, that is, on finding a criterion for declaring a node
terminal. For example, recall that an early stopping rule consisted of setting a
threshold β and deciding not to split a node if the maximum decrease in impurity was
less than β, that is, if
max_{s∈S} ΔI(s, t) < β.    (3.1)
As it became clear that this rule produced generally unsatisfactory results, other
variants were invented and tested. None were generally acceptable. The problem has
two aspects, which can be illustrated by rule (3.1). If β is set too low, then there is too
much splitting and the tree is too large. Increasing β leads to the following difficulty:
There may be nodes t such that max ΔI(s, t) is small. But the descendant nodes tL, tR of
t may have splits with large decreases in impurity. By declaring t terminal, one loses
the good splits on tL or tR.
Finally, the conclusion was reached that looking for the right stopping rule was the
wrong way of looking at the problem. A more satisfactory procedure was found that
consisted of two key elements:
1. Prune instead of stopping. Grow a tree that is much too large and prune it
upward in the “right way” until you finally cut back to the root node.
2. Use more accurate estimates of R*(T) to select the right sized tree from among
the pruned subtrees.
This new framework leads to two immediate questions: How does one prune
upward in the “right way,” and how can better estimates of R*(T) be gotten?
Sections 3.2 and 3.3 discuss a method of pruning upward. The selection of a method
for honestly estimating R*(T) depends on the sample size available. With large sample
size, use of an independent test sample is most economical (Section 3.4.1). With small
sample sizes, the preferred method, cross-validation (Section 3.4.2), necessitates the
growing of auxiliary trees. (See Mabbett, Stone, and Washbrook, 1980, for another
way of using cross-validation to select a classification tree.)
It is important to gauge the accuracy of the estimates of R*(T). Section 3.4.3 gives a
brief discussion of approximate standard error formulas for the estimates. Chapter 11
contains a more complete discussion and the derivations.
Combining pruning and honest estimation produces trees that, in our simulated
examples, have always been close to the optimal size and produces estimates with
satisfactory accuracy. Section 3.4 contains some illustrations and also discusses the
effect on accuracy of increasing or decreasing the number of cross-validation trees.
Characteristically, as a tree is pruned upward, the estimated misclassification rate
first decreases slowly, reaches a gradual minimum, and then increases rapidly as the
number of terminal nodes becomes small. This behavior can be attributed to a trade-
off between bias and variance and is explored in the appendix to this chapter.
3.2 GETTING READY TO PRUNE
To fix the notation, recall that the resubstitution estimate for the overall
misclassification cost R*(T) is given by
R(T) = Σ_{t∈T̃} R(t), where R(t) = r(t)p(t) and T̃ is the set of terminal nodes of T.
We refer to R(T) and R(t) as the tree and node misclassification costs.
The first step is to grow a very large tree Tmax by letting the splitting procedure
continue until all terminal nodes are either small or pure or contain only identical
measurement vectors. Here, pure means that the node cases are all in one class. With
unlimited computer time, the best way of growing this initial tree would be to continue
splitting until each terminal node contained exactly one sample case.
The size of the initial tree is not critical as long as it is large enough. Whether one
starts with the largest possible tree T′max or with a smaller, but still sufficiently large
tree Tmax , the pruning process will produce the same subtrees in the following sense:
If the pruning process starting with T′max produces a subtree contained in Tmax , then
the pruning process starting with Tmax will produce exactly the same subtree.
The compromise method adopted for growing a sufficiently large initial tree Tmax
specifies a number Nmin and continues splitting until each terminal node either is pure,
or satisfies N(t) ≤ Nmin, or contains only identical measurement vectors. Generally,
Nmin has been set at 5, occasionally at 1.
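In code form this compromise stopping condition amounts to a simple test such as the following sketch (ours; the function and argument names are not CART's).

def is_terminal_for_tmax(cases, n_min=5):
    # cases: list of (measurement_vector, class_label) pairs falling in the node.
    labels = {j for _, j in cases}
    vectors = {tuple(x) for x, _ in cases}
    return (len(labels) == 1          # pure: all cases in one class
            or len(cases) <= n_min    # small: N(t) <= Nmin
            or len(vectors) == 1)     # only identical measurement vectors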
Starting with the large tree Tmax and selectively pruning upward produces a
sequence of subtrees of Tmax eventually collapsing to the tree {t1} consisting of the
root node.
To define the pruning process more precisely, call a node t′ lower down on the tree a
descendant of a higher node t if there is a connected path down the tree leading from t
to t′. Then also t is called an ancestor of t′. Thus, in Figure 3.1a, t4, t5, t8, t9, t10, and t11
are all descendants of t2, but not t6 and t7. Similarly, t4, t2, and t1 are ancestors of t9,
but t3 is not an ancestor of t9.
FIGURE 3.1
DEFINITION 3.2. A branch Tt of T with root node t ∈ T consists of the node t and all
descendants of t in T.
The branch Tt2 is illustrated in Figure 3.1b.
DEFINITION 3.3. Pruning a branch Tt from a tree T consists of deleting from T all
descendants of t, that is, cutting off all of Tt except its root node. The tree pruned this
way will be denoted by T - Tt.
The pruned tree T - Tt2 is shown in Figure 3.1c.
DEFINITION 3.4. If T′ is gotten from T by successively pruning off branches, then T′
is called a pruned subtree of T and denoted by T′ < T. (Note that T′ and T have the
same root node.)
Even for a moderate sized Tmax containing, say, 30 to 40 nodes, there is an
extremely large number of subtrees and an even larger number of distinct ways of
pruning up to {t1}. A “selective” pruning procedure is necessary, that is, a selection
of a reasonable number of subtrees, decreasing in size, such that roughly speaking,
each subtree selected is the “best” subtree in its size range.
The word best indicates the use of some criterion for judging how good a subtree T
is. Even though R(T) lacks accuracy as an estimate of R*(T), it is the most natural
criterion to use in comparing different subtrees of the same size.
Regardless of how Tmax was constructed, what splitting criterion was used, and so
on, the selective pruning process starts with the given initial tree Tmax, computes R(t)
for each node t ∈ Tmax, and progressively prunes Tmax upward to its root node such
that at each stage of pruning, R(T) is as small as possible.
Here is a simple example of such a selective pruning process. Suppose that Tmax has
L terminal nodes. Then construct a sequence of smaller and smaller trees
as follows: For every value of H, 1 ≤ H ≤ L - 1, consider the class of all subtrees of
Tmax having L - H terminal nodes, and select TH as that subtree in this class which
minimizes R(T); that is,
R(TH) = min R(T), the minimum taken over all subtrees of Tmax having L - H terminal nodes.
Put another way, TH is the minimal cost subtree having L - H terminal nodes.
This is an intuitively appealing procedure and can be efficiently implemented by a
backward dynamic programming algorithm. However, it has some drawbacks. Perhaps
the most important is that the sequence of subtrees is not nested, that is, TH+1 is not
necessarily a subtree of TH. As we go through the sequence, nodes may reappear that
were previously cut off. In short, the sequence of subtrees is not formed by a
progressive upward pruning.
Instead, we have adopted another selection method discussed in the next section. A
preliminary version of this method was described in Breiman and Stone (1978).
3.3 MINIMAL COST-COMPLEXITY PRUNING
For any subtree T of Tmax, let |T̃| denote its complexity, that is, the number of its
terminal nodes, and for α ≥ 0 define the cost-complexity measure
Rα(T) = R(T) + α|T̃|.
Thus, Rα(T) is a linear combination of the cost of the tree and its complexity. If we
think of α as the complexity cost per terminal node, Rα(T) is formed by adding to the
misclassification cost of the tree a cost penalty for complexity.
Now, for each value of α, find that subtree T(α) ≤ Tmax which minimizes Rα(T), i.e.,
Rα(T(α)) = min_{T ≤ Tmax} Rα(T).
If α is small, the penalty for having a large number of terminal nodes is small and T(α)
will be large. For instance, if Tmax is so large that each terminal node contains only
one case, then every case is classified correctly; R(Tmax) = 0, so that Tmax minimizes
R0(T). As the penalty α per terminal node increases, the minimizing subtrees T(α) will
have fewer terminal nodes. Finally, for α sufficiently large, the minimizing subtree
T(α) will consist of the root node only, and the tree Tmax will have been completely
pruned.
Although α runs through a continuum of values, there are at most a finite number of
subtrees of Tmax . Thus, the pruning process produces a finite sequence of subtrees T1,
T2, T3, ... with progressively fewer terminal nodes. Because of the finiteness, what
happens is that if T(α) is the minimizing tree for a given value of α, then it continues to
be minimizing as α increases until a jump point α′ is reached, and a new tree T(α′)
becomes minimizing and continues to be the minimizer until the next jump point α″.
Though the pruning process is not difficult to describe, certain critical questions
have been left open. For instance:
Is there a unique subtree T ≤ Tmax which minimizes Rα(T)?
In the minimizing sequence of trees T1, T2, ..., is each subtree gotten by pruning
upward from the previous subtree, i.e., does the nesting T1 > T2 > ··· > {t1} hold?
More practically, perhaps the most important problem is that of finding an effective
algorithm for implementing the pruning process. Clearly, a direct search through all
possible subtrees to find the minimizer of Rα(T) is computationally expensive.
This section outlines the resolution of these problems. The uniqueness problem
centers around an appropriate definition and a proof that the object defined really
exists. The inclusion and effective implementation then both follow from a closer
examination of the mechanism of minimal cost-complexity pruning.
Begin with:
DEFINITION 3.6. The smallest minimizing subtree T(α) for complexity parameter α is
defined by the conditions
(i) Rα(T(α)) = min_{T ≤ Tmax} Rα(T);
(ii) if Rα(T) = Rα(T(α)), then T(α) ≤ T.
Denote by T1 the smallest minimizing subtree T(0) for α = 0, and for any branch Tt put
R(Tt) = Σ_{t′∈T̃t} R(t′), where T̃t is the set of terminal nodes of Tt. In Section 10.2 (Theorem 10.11) we
show that the tree T1 has the following property:
PROPOSITION 3.8. For t any nonterminal node of T1,
R(t) > R(Tt) .
Starting with T1, the heart of minimal cost-complexity pruning lies in understanding
that it works by weakest-link cutting. For any node t ∈ T1, denote by {t} the subbranch
of Tt consisting of the single node {t}.
Set
Rα({t}) = R(t) + α.
For any branch Tt, define
Rα(Tt) = R(Tt) + α|T̃t|.
As long as
Rα(Tt) < Rα({t}),
the branch Tt has a smaller cost-complexity than the single node {t}. But at some
critical value of α, the two cost-complexities become equal. At this point the
subbranch {t} is smaller than Tt, has the same cost-complexity, and is therefore
preferable. To find this critical value of α, solve the inequality
R(Tt) + α|T̃t| < R(t) + α,
getting
α < (R(t) - R(Tt))/(|T̃t| - 1).    (3.9)
By (3.8) the critical value on the right of (3.9) is positive.
Define a function g1(t), t ∈ T1, by
g1(t) = (R(t) - R(Tt))/(|T̃t| - 1),  t ∉ T̃1,
g1(t) = +∞,  t ∈ T̃1.
Then define the weakest link t̄1 in T1 as the node such that
g1(t̄1) = min_{t∈T1} g1(t)
and put
α2 = g1(t̄1).
The node t̄1 is the weakest link in the sense that as the parameter α increases, it is the
first node such that Rα({t}) becomes equal to Rα(Tt). Then {t̄1} becomes preferable to
Tt̄1, and α2 is the value of α at which equality occurs.
Define a new tree T2 < T1 by pruning away the branch Tt̄1, that is,
T2 = T1 - Tt̄1.
Now, using T2 instead of T1, find the weakest link in T2. More precisely, letting T2t be
that part of the branch Tt which is contained in T2, define
g2(t) = (R(t) - R(T2t))/(|T̃2t| - 1),  t ∈ T2, t ∉ T̃2,
g2(t) = +∞,  t ∈ T̃2,
and the weakest link t̄2 ∈ T2 and α3 by
g2(t̄2) = min_{t∈T2} g2(t),  α3 = g2(t̄2).
Form T3 = T2 - Tt̄2, then find the weakest link in T3 and the corresponding parameter
value α4, form T4, and repeat again.
If at any stage there is a multiplicity of weakest links, for instance, if
gk(t̄k) = gk(t̄′k),
then define
Tk+1 = Tk - Tt̄k - Tt̄′k;
that is, prune away both weakest-link branches.
TABLE 3.2
Finally, we remark that the sequence of minimal cost-complexity trees is a
subsequence of the sequence of subtrees constructed by finding the minimum cost
subtree for a given number of terminal nodes. For instance, if T(α) has seven terminal
nodes, there is no other subtree T having seven terminal nodes with smaller R(T). If so, then
Rα(T) = R(T) + 7α < R(T(α)) + 7α = Rα(T(α)),
which, by definition, is impossible. (See Stone, 1981, for a brief survey of the
statistical literature involving complexity costs.)
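The weakest-link computation can be summarized in the following sketch (ours, not the CART implementation). It assumes a binary tree in which every node already carries its own resubstitution cost R(t), and it prunes one weakest link per pass; a tie is handled automatically, since a tied node remains the minimizer at the same value of α on the next pass.

class Node:
    def __init__(self, R, left=None, right=None):
        self.R = R                            # resubstitution cost R(t) of the node
        self.left, self.right = left, right

    def is_terminal(self):
        return self.left is None and self.right is None

def branch_stats(t):
    # Return (R(T_t), |T~_t|): branch cost and number of terminal nodes.
    if t.is_terminal():
        return t.R, 1
    r_left, n_left = branch_stats(t.left)
    r_right, n_right = branch_stats(t.right)
    return r_left + r_right, n_left + n_right

def weakest_link_prune(root):
    # Yield (alpha, tree size) for the pruned sequence T_2 > T_3 > ... > {t1}.
    while not root.is_terminal():
        candidates = []
        stack = [root]
        while stack:
            t = stack.pop()
            if not t.is_terminal():
                r_branch, n_term = branch_stats(t)
                g = (t.R - r_branch) / (n_term - 1)   # g_k(t)
                candidates.append((g, t))
                stack.extend([t.left, t.right])
        alpha, weakest = min(candidates, key=lambda c: c[0])
        weakest.left = weakest.right = None           # cut off the branch at t
        yield alpha, branch_stats(root)[1]

Running the generator on T1 produces the jump points α2 < α3 < ··· and the sizes of the corresponding subtrees; note that, for brevity, the sketch prunes the tree in place.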
3.4 THE BEST PRUNED SUBTREE: AN ESTIMATION PROBLEM
Minimal cost-complexity pruning applied to Tmax produces the decreasing sequence of subtrees
T1 > T2 > ··· > {t1}.    (3.11)
The issue discussed in this section is the construction of relatively unbiased
estimates of the true misclassification cost R*(Tk). Two methods of estimation are
discussed: Use of an independent test sample and cross-validation. Of the two, use of
an independent test sample is computationally more efficient and is preferred when the
learning sample contains a large number of cases. As a useful by-product it gives
relatively unbiased estimates of the node misclassification costs. Cross-validation is
computationally more expensive, but makes more effective use of all cases and gives
useful information regarding the stability of the tree structure.
To study the bias or standard error of an estimate, a probability model is necessary.
Assume in this section the model used previously: The cases in L are N independent
draws from the probability distribution P(A, j) on X × C, and (X, Y) is random with
distribution P(A, j), independent of L. If there are no variable misclassification costs,
recall that R*(d) is simply the misclassification probability P(d(X) ≠ Y). More generally:
DEFINITION 3.12. Define
(i) Q*(i|j) = P(d(X) = i | Y = j), so that Q*(i|j) is the probability that a case in class j is
classified into class i by d. Define
(ii) R*(j) = Σi c(i|j)Q*(i|j), so that R*(j) is the expected cost of misclassification
for class j items. Define
(iii) R*(d) = Σj R*(j)π(j) as the expected misclassification cost for the classifier d.
Both test sample and cross-validation provide estimates of Q*(i|j) and R*(j), as well
as R*(d). These are useful outputs of the tree program. The basic idea in both
procedures is that Q*(i|j) can be estimated using simple counts of cases misclassified.
Then R*(j), R*(Tk) are estimated through Definitions 3.12(ii) and (iii). Furthermore,
standard errors can be computed by assuming a simple binomial model for the
estimate of Q*(i|j).
3.4.1 Test Sample Estimates
Select a fixed number N(2) of cases at random from L to form the test sample L2. The
remainder L1 form the new learning sample.
The tree Tmax is grown using only L1 and pruned upward to give the sequence T1 >
T2 > ··· > {t1}. That is, the {Tk} sequence of trees is constructed and the terminal
nodes assigned a classification without ever seeing any of the cases in L2.
Now take the cases in L2 and drop them through T1. Each tree Tk assigns a predicted
classification to each case in L2. Since the true class of each case in L2 is known, the
misclassification cost of Tk operating on L2 can be computed. This produces the
estimate Rts(Tk).
In more detail, denote by N2j the number of class j cases in L2. For T any one of
the trees T1, T2, ..., take N2ij to be the number of class j cases in L2 whose predicted
classification by T is class i.
The basic estimate is gotten by setting
Qts(i|j) = N2ij/N2j.
That is, Qts(i|j) is the proportion of test sample class j cases that the tree T classifies as
class i. For the priors {π(j)} either given or estimated, Definition 3.12(iii) indicates the
estimates
Rts(j) = Σi c(i|j)Qts(i|j),  Rts(T) = Σj Rts(j)π(j).    (3.13)
If the priors are data estimated, use L2 to estimate them as π(j) = N2j/N2. In this
case, (3.13) simplifies to
Rts(T) = (1/N2) Σi,j c(i|j)N2ij.    (3.14)
This last expression (3.14) has a simple interpretation. Compute the misclassification
cost for every case in L2 dropped through T and then take the average.
In the unit cost case, Rts(j) is the proportion of class j test cases misclassified, and
with estimated priors Rts(T) is the total proportion of test cases misclassified by T.
Using the assumed probability model, it is easy to show that the estimates Qts(i|j)
are biased only if N2j = 0. For any reasonable distribution of test sample sizes, this bias is negligible.
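In code, the test sample estimates amount to simple counting, as in this sketch (ours; classify, cost, and priors are hypothetical argument names, with cost[i][j] playing the role of c(i|j) and cost[j][j] = 0, and with test cases given as (measurement vector, class) pairs).

from collections import defaultdict

def test_sample_estimates(classify, test_sample, classes, cost, priors=None):
    # classify: maps a measurement vector to a predicted class.
    # test_sample: list of (x, j) pairs drawn independently of the learning sample.
    n2j = defaultdict(int)                   # N2j: number of class j test cases
    n2ij = defaultdict(int)                  # N2ij: class j cases predicted as i
    for x, j in test_sample:
        n2j[j] += 1
        n2ij[(classify(x), j)] += 1
    if priors is None:                       # data-estimated priors: pi(j) = N2j/N2
        priors = {j: n2j[j] / len(test_sample) for j in classes}
    q_ts = {(i, j): (n2ij[(i, j)] / n2j[j] if n2j[j] else 0.0)
            for i in classes for j in classes}
    r_ts_j = {j: sum(cost[i][j] * q_ts[(i, j)] for i in classes) for j in classes}
    r_ts = sum(r_ts_j[j] * priors[j] for j in classes)       # estimate (3.13)
    return q_ts, r_ts_j, r_ts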
3.4.2 Cross-Validation Estimates
Unless the sample size in L is quite large, cross-validation is the preferred estimation
method. In fact, the only time we have used test sample estimation recently is in the
various mass spectra projects where the minimum sample size in any class was about
900. (However, see Section 6.2, where a different kind of test sample procedure is
used in a heart attack diagnosis project.)
In V-fold cross-validation, the original learning sample L is divided by random
selection into V subsets, Lv, v = 1, ..., V, each containing the same number of cases (as
nearly as possible). The vth learning sample is
L(v) = L - Lv,  v = 1, ..., V,
so that L(v) contains the fraction (V - 1)/V of the total data cases. Think of V as being
reasonably large. Usually V is taken as 10, so each learning sample L(v) contains 9/10
of the cases.
In V-fold cross-validation, V auxiliary trees are grown together with the main tree
grown on L. The vth auxiliary tree is grown using the learning sample L(v). Start by
growing V overly large trees Tmax(v), v = 1, ..., V, as well as Tmax, using the criterion that
the splitting continues until nodes are pure or have fewer cases than Nmin.
For each value of the complexity parameter α, let T(α), T(v)(α), v = 1, ..., V, be the
corresponding minimal cost-complexity subtrees of Tmax, Tmax(v). For each v, the trees
T(v)(α) have been constructed without ever seeing the cases in Lv. Thus, the
cases in Lv can serve as an independent test sample for the tree T(v)(α).
Put Lv down the trees T(v)(α), v = 1, ..., V, fixing the value of the complexity
parameter α. For every value of v, i, j, define
Nij(v) = number of class j cases in Lv whose predicted classification by T(v)(α) is class i,
and set
Nij = Σv Nij(v),
so Nij is the total number of class j test cases classified as i. Each case in L appears
in one and only one test sample Lv. Therefore, the total number of class j cases in all
test samples is Nj, the number of class j cases in L.
The idea now is that for V large, T(v)(α) should have about the same classification
accuracy as T(α). Hence, we make the fundamental step of estimating Q*(i|j) for T(α)
as
Qcv(i|j) = Nij/Nj.    (3.15)
For the {π(j)} given or estimated, set
Rcv(j) = Σi c(i|j)Qcv(i|j)
and put
Rcv(T(α)) = Σj Rcv(j)π(j).    (3.16)
If the priors are data estimated, set π(j) = Nj/N. Then (3.16) becomes
Rcv(T(α)) = (1/N) Σi,j c(i|j)Nij.    (3.17)
In the unit cost case, (3.17) is the proportion of test set cases misclassified.
The implementation is simplified by the fact that although α may vary continuously,
the minimal cost-complexity trees grown on L are equal to Tk for αk ≤ α < αk+1. Put
α′k = (αk αk+1)^1/2,
so that α′k is the geometric midpoint of the interval such that T(α) = Tk. Then put
Rcv(Tk) = Rcv(T(α′k)),
where the right-hand side is defined by (3.16). That is, Rcv(Tk) is the estimate gotten
by putting the test samples Lv through the trees T(v)(α′k). For the root node tree {t1},
Rcv({t1}) is set equal to the resubstitution cost R({t1}).
Now the rule for selecting the right sized tree (modified in Section 3.4.3) is: Select
the tree Tk0 such that
Rcv(Tk0) = min_k Rcv(Tk).
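The bookkeeping for the cross-validation estimate can be sketched as follows (ours, with unit costs and data-estimated priors; cases are (measurement vector, class) pairs). The helpers grow_tree, prune_sequence, and classify_at_alpha are hypothetical stand-ins for growing Tmax(v), producing the (αk, Tk) sequence, and classifying with the minimal cost-complexity subtree T(α).

import math
import random

def cross_validation_estimates(L, V, grow_tree, prune_sequence, classify_at_alpha):
    # Randomly divide L into V test samples L_v of (nearly) equal size.
    cases = list(L)
    random.shuffle(cases)
    folds = [cases[v::V] for v in range(V)]
    # Grow an auxiliary tree on each L(v) = L - L_v, and the main tree on L.
    aux = [grow_tree([c for w, fold in enumerate(folds) if w != v for c in fold])
           for v in range(V)]
    seq = prune_sequence(grow_tree(cases))     # [(alpha_1, T_1), (alpha_2, T_2), ...]
    r_cv = []
    for k, (alpha_k, _) in enumerate(seq):
        if k + 1 < len(seq):
            alpha_prime = math.sqrt(alpha_k * seq[k + 1][0])  # geometric midpoint alpha'_k
        else:
            alpha_prime = float("inf")   # root-node tree; the text uses R({t1}) instead
        misclassified = sum(classify_at_alpha(aux[v], alpha_prime, x) != j
                            for v in range(V) for x, j in folds[v])
        r_cv.append(misclassified / len(cases))               # estimate (3.17), unit costs
    return r_cv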
3.4.3 Standard Errors and the 1 SE Rule
It is a good statistical practice to gauge the uncertainties in the estimates Rts(T) and
Rcv(T) by estimating their standard errors. Chapter 11 derives expressions for these
standard errors as well as for the standard errors in the estimates of Q*(i|j) and R*(j).
The derivations are standard statistics for the test sample estimates, but are heuristic
when cross-validation is used.
As an illustration, we derive the expression for the standard error in Rts(T) when the
priors are estimated from the data in the unit cost case.
The learning sample L1 is used to construct T. Take the test sample L2 to be
independently drawn from the same underlying distribution as L1 but independent of
L1. The estimate Rts(T) is the proportion of cases in L2 misclassified by T (see Section
3.4.1).
Now drop the N2 cases in L2 through T. The probability p* that any single case is
misclassified is R*(T). Thus, we have the binomial situation of N2 independent trials
with probability p* of success at each trial, where p* is estimated as the proportion p
of successes. Clearly, Ep = p*, so p is unbiased. Further, the variance of p is p*(1 - p*)/N2, so that the standard error of Rts(T) = p is estimated by [Rts(T)(1 - Rts(T))/N2]^1/2.
FIGURE 3.2
The position of the minimum of Rcv(Tk) within this valley may be unstable. Small
changes in parameter values, or even in the seed of the random number generator used
to separate L into V test sets, may cause large changes in |T̃k| for the tree that
minimizes Rcv(Tk).
The 1 SE rule for selecting the right sized tree was created to reduce this instability:
instead of the tree that minimizes Rcv(Tk), select the smallest tree Tk1 such that
Rcv(Tk1) ≤ Rcv(Tk0) + SE(Rcv(Tk0)),
where Tk0 is the minimizing tree.
In some problems, though, knowledge of the variables involved may suggest an alternative selection. This is
illustrated in Chapter 6, and a different selection rule is mentioned in Section 11.6.
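As a small illustration of the 1 SE rule (ours; the candidate values below are made up, and the binomial standard error formula is the heuristic unit-cost form discussed above):

import math

def one_se_rule(candidates, n):
    # candidates: list of (number_of_terminal_nodes, r_cv) pairs for T_1, T_2, ...
    # n: number of cases used to form the cross-validation estimates.
    min_size, min_r = min(candidates, key=lambda c: c[1])
    se = math.sqrt(min_r * (1.0 - min_r) / n)          # heuristic SE of the minimum
    within = [c for c in candidates if c[1] <= min_r + se]
    return min(within, key=lambda c: c[0])             # smallest tree within 1 SE

# Illustrative (made-up) values only:
candidates = [(31, 0.38), (23, 0.34), (17, 0.35), (10, 0.36), (5, 0.44), (1, 0.62)]
print(one_se_rule(candidates, n=200))                  # -> (10, 0.36)

Here the minimizing tree has 23 terminal nodes, but the 10-node tree is within one standard error of the minimum and is therefore selected.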
3.4.4 Other Estimation Issues
There are two minor issues concerning the estimation of R*(T) that are more or less
unresolved.
The first is this: Recall that in the selection of the test sample a fraction f = N2/N
was selected at random from L. The exact proportions of the classes in L2 are left to a
random device. An alternative strategy is to select a fixed fraction fj from the Nj cases
in class j in L. The question is: When is the latter strategy preferable to the former?
Similarly, in V-fold cross-validation, L is randomly divided into V nearly equal sets
Lv of cases without regard for class membership. When is it better to constrain the
random selection such that each class is equally spread among the Lv, v = 1, ..., V?
Our most recent and tentative thinking, based on a simplified analytic model and
simulation results, is that constructing the test sample (or samples) to contain fixed
fractions of the classes may produce more accurate estimates. Chapter 8 contains
further discussion of this problem in the regression context.
Another problem is that when the tree is selected by the rule that minimizes Rcv(Tk), the minimized value Rcv(Tk0) tends to be somewhat biased downward as an estimate of R*(Tk0).
3.5 SOME EXAMPLES
3.5.1 Digit Recognition
Table 3.3 gives, for each tree Tk in the pruned sequence, the number of terminal nodes |T̃k|, the cross-validation estimate Rcv(Tk), and the estimate Rts(Tk) gotten from
the 5000-case test sample. The plus-minus values on the cross-validation estimates are
±SE.
TABLE 3.3
Notice that
1. The minimum Rcv tree is T2 with 23 terminal nodes (indicated by **). The 1
SE rule selects T6 with 10 terminal nodes (indicated by *).
2. In 9 of the 12 trees, Rts is in the range Rcv ± SE. In 11 trees, Rts is in the Rcv ± 2 SE
range.
3. The estimates Rcv are higher than Rts for the larger trees (except for T2), are
about equal for T9 and T10, and are low for the two smallest trees T11 and T12.
In the digit recognition problem with 17 noise variables added and the priors
estimated from data, cost-complexity pruning and tenfold cross-validation produced
the results shown in Table 3.4. Recall that N = 200. To grow Tmax , Nmin = 1 was used.
TABLE 3.4
The results differ slightly from the first example.
1. Tree T7 is selected by both the minimum Rcv and the 1 SE rule. It also
minimizes Rts.
2. Only 5 of the 13 trees have Rts in the Rcv ± SE range. But all 13 are in the Rcv ±
2 SE range.
3. The Rcv estimates are consistently high for the larger trees, about equal to Rts in
the midrange (T6-T11), and low for the two smallest trees.
As another check on the cross-validation estimates, four data sets were generated
replicating this example but each with a different random number seed. The 1 SE trees
were selected for each and a test sample of 5000 used to get the Rts estimates. The
results appear in Table 3.5.
TABLE 3.5
3.5.2 Waveform Classification
For the waveform problem, with N = 300 and Nmin = 1, the results are given in Table
3.6.
Tree T4 is the minimizing tree, and tree T6 is the 1 SE tree. These two trees also
have the lowest Rts. The cross-validation estimates are consistently above the test
sample estimates. In 8 out of 12 trees, Rts is in the range Rcv ± SE. In 3 of the
remaining cases, it is in the ±2 SE range. Note that in all three examples, the most
marked lack of accuracy is in the very small trees. This phenomenon is discussed in
the regression context in Section 8.7.
Again four replicate data sets were generated using different seeds, the 1 SE trees
grown, and 5000 cases used to get Rts. The results are given in Table 3.7.
TABLE 3.6
TABLE 3.7
3.5.3 How Many Folds in the Cross-Validation?
In all the simulation examples we have run, taking V = 10 gave adequate accuracy. In
some examples, smaller values of V also gave sufficient accuracy. But we have not
come across any situations where taking V larger than 10 gave a significant
improvement in accuracy for the tree selected.
This is illustrated in Tables 3.8 and 3.9. The waveform recognition example and the
digit recognition example were run using V = 2, 5, 10, 25. A test sample of size 5000
was used to give the estimates Rts.
APPENDIX
FIGURE 3.3
The graph of the estimated misclassification rate starts high for |T̃k| = 1, decreases as
|T̃k| increases, reaches a shallow minimum region, and then increases slowly to the
misclassification rate corresponding to the largest tree T1. Another feature is that the
rate for T1 is invariably less than twice the minimum rate over the Tk.
The remainder of this appendix is a heuristic attempt to understand the mechanism
leading to the preceding characteristics. The discussion is not a part of the
methodological development, and readers can, without loss of continuity, skip to the
next chapter.
The tree structured procedure attempts to fit the classification surface by a surface
that is constant over multidimensional rectangles. When these rectangles are too large,
that is, when there is too little splitting and |T̃| is small, the fit is poor. We refer to this
lack of fit to the surface as bias. When |T̃| is large and the space X is split into many
small rectangles, the bias is small. On the other hand, these small rectangles are more
likely to have a plurality of the wrong class. This latter type of error is referred to as
variance.
Since the trade-off between bias and variance is an important characteristic of the
tree structure, we illustrate it by a simple model. Take a two-class situation with priors
π(1), π(2) and M-dimensional data with class j sampled from the density fj(x), j = 1, 2.
The Bayes optimal misclassification rate is
R* = ∫ min(π(1)f1(x), π(2)f2(x)) dx.
With the space X divided up into L rectangles, S1, ..., SL, let the classification
assigned to Sℓ by the learning sample be denoted by yℓ. Take (X, Y) to be independent of
the learning sample with the same distribution. By definition, the true misclassification rate R*(L) for the partition S1, ...,
SL is
(A.1)
Using χ(·) as the indicator function, (A.1) becomes
(A.2)
Define
(A.3)
where P(X ∈ Sℓ, j) = P(X ∈ Sℓ, Y = j).
The first two terms in (A.3), namely,
form an approximation to the Bayes rate constructed by averaging the densities over
the rectangles in the partition. The bias B(L) is defined by
1. The bias term B(L) decreases rapidly for small L, more slowly as L increases,
and eventually decreases to zero.
2. The variance term increases slowly as L increases and is bounded by a slow
growth factor in L.
3. For L = N with each Sℓ containing one case of the learning sample, the variance term is bounded
by the Bayes rate R*.
Thus, we reach the somewhat surprising conclusion that the largest tree possible has
a misclassification rate not larger than twice the Bayes rate. This is similar to Cover
and Hart’s result (1967) that the first nearest neighbor classification algorithm has,
asymptotically, a misclassification rate bounded by twice the Bayes rate. When the
partition is so small that every rectangle contains only one case in L, then the
classification rule becomes similar to first nearest neighbor classification, and Cover
and Hart’s result becomes relevant.
To illustrate the behavior of the bias term, write
Then
(A.4)
If f1, f2 are continuous and π(1)f1(x) ≠ π(2)f2(x) for x ∈ Sℓ, then the corresponding term
in the sum (A.4) is zero. For f1, f2 smooth and nonconstant, the hypersurface H ⊂ X
defined by
π(1)f1(x) = π(2)f2(x)
(A.5)
where the sum is over all rectangles Sℓ containing points in H. Clearly, then, B(L) →
0 as we make the rectangles in the partition smaller.
A better bound on B(L) can be gotten using the fact that if Sℓ is small and π(1)f1=
π(2)f2 somewhere in Sℓ, then even the non-zero terms in (A.4) are small. In fact, if
P(1|X = x) - P(2|X = x) is assumed to have bounded first partial derivatives, if f(x) is
zero outside a sufficiently large rectangle, and if the partition is regular enough, then it
can be shown that
(A.6)
where M is the dimension of X.
The inequality (A.6) is indicative of the rapid decrease in B(L) for small values of L
and slower decrease for larger L. But it also shows the strong effect of dimensionality
on bias. The number of nodes needed to reduce bias by 50 percent, say, goes up
exponentially with the dimension.
Another interesting facet coming out of this argument is that as L gets larger,
virtually all of the bias is contributed by the region near the hypersurface where
π(1)f1(x) = π(2)f2(x). If the classes are well separated in the sense that f(x) is small
near H, the bias will be correspondingly small.
In the second term of (A.3), assume that there are nℓ cases of the learning sample in Sℓ. Put
For L large, we can approximate the second term by its expectation over the learning sample, that is,
put
Compute the probability, given nℓ, that yℓ is not the better class in Sℓ by assuming that
the distribution of class 1 and class 2 cases in Sℓ is given by nℓ independent trials with
probabilities (pℓ, qℓ). Call pℓ the probability of heads, and let H be the random number of heads. Then
(A.7)
Our conclusions can be reached using (A.7). For instance, some elementary
inequalities lead to
Since nℓ ≈ NP(X ∈ Sℓ), we get
(A.8)
This is the slow growth bound referred to in point 2 earlier.
Finally, for nℓ = 1, the conditional probability that yℓ is the wrong class is qℓ. Then if nℓ = 1 for all ℓ,
Thus,
assuming, of course, that the partition is fairly fine. This last equation summarizes our
argument that the variance term is bounded by the Bayes rate for L large, that is, for L ≈ N.
A transition occurs between the situation in which L is large but N/L ≫ 1 and the
limiting case L = N. If nℓ is moderately large, then the probability that yℓ is the wrong
class is small unless pℓ and qℓ are nearly equal.
Therefore, the major contribution to the variance term is from those rectangles Sℓ near
the hypersurface on which π(1)f1 = π(2)f2.
As L becomes a sizable fraction of N, the variance contributions become spread out
among all rectangles Sℓ for which nℓ is small.
Since detailed proofs of parts of a heuristic argument are an overembellishment,
they have been omitted here. Although the lack of rigor is obvious, the preceding
derivations and inequalities may help explain the balance involved when the optimal
tree is chosen near the bottom of the R curve.
4
SPLITTING RULES
In the previous chapter, a method was given for selecting the right sized tree assuming
that a large tree Tmax had already been grown. Assuming that a set of questions Q or,
equivalently, a set S of splits at every node t has been specified, then the fundamental
ingredient in growing Tmax is a splitting rule. Splitting rules are defined by specifying
a goodness of split function Φ(s, t) defined for every s ∈ S and node t. At every t, the
split adopted is the split s* which maximizes Φ(s, t).
A natural goodness of split criterion is to take that split at any node which most
reduces the resubstitution estimate of tree misclassification cost. Unfortunately, this
criterion has serious deficiencies, which are discussed in Section 4.1. In the two-class
problem, a class of splitting criteria is introduced which remedies the deficiencies, and
the simplest of these is adopted for use (Section 4.2). In Section 4.3 this criterion is
generalized in two directions for the multiclass problem resulting in the Gini and
twoing criteria.
Section 4.4 deals with the introduction of variable misclassification costs into the
splitting criterion in two ways: through a generalization of the Gini criterion and by
alteration of the priors. Examples are given in Section 4.5.
After looking at the outputs of a number of simulated examples, we have come to
these tentative conclusions:
1. The overall misclassification rate of the tree constructed is not sensitive to the
choice of a splitting rule, as long as it is within a reasonable class of rules.
2. The method of incorporation of variable misclassification costs into the
splitting rule has less effect than their incorporation into the pruning criterion,
that is, into R(T).
In Section 4.6, we look at the problem of growing and pruning trees in terms of a
different accuracy criterion. Given a measurement vector of unknown class, suppose
that instead of assigning it to a single class, we are more interested in estimating the
probabilities that it belongs to class 1, ..., class J. What splitting and pruning criteria
should be used to maximize the accuracy of these class probability estimates? In the
formulation used, the answer is intimately connected with the Gini criterion.
For other discussions of splitting rules in classification, see Belson (1959),
Anderson (1966), Hills (1967), Henrichon and Fu (1969), Messenger and Mandell
(1972), Meisel (1972), Meisel and Michalpoulos (1973), Morgan and Messenger
(1973), Friedman (1977), Gordon and Olshen (1978), and Rounds (1980).
4.1 REDUCING MISCLASSIFICATION COST
In Section 2.4.2 a framework was given for generating splitting rules. The idea was to
define an impurity function φ(p1, ..., pJ) having certain desirable properties (Definition
2.5). Define the node impurity function i(t) as φ(p(1|t), ..., p(J|t)), set I(t) = i(t)p(t), and
define the tree impurity I(T) as
I(T) = Σ_{t∈T̃} I(t) = Σ_{t∈T̃} i(t)p(t).
Then at any current terminal node, choose that split which most reduces I(T) or,
equivalently, maximizes
ΔI(s, t) = I(t) - I(tL) - I(tR)
or
Δi(s, t) = i(t) - pL i(tL) - pR i(tR).
To reduce the resubstitution misclassification cost, take I(t) = R(t). Then the best split of t maximizes
R(t) - R(tL) - R(tR)    (4.1a)
or, equivalently, maximizes
r(t) - pL r(tL) - pR r(tR).    (4.1b)
The corresponding node impurity function (unit cost) is
i(t) = 1 - max_j p(j|t).
This function has all the desirable properties listed in Definition 2.5.
Still, in spite of its natural attractiveness, selecting splits to maximize the reduction
in R(T) has two serious defects. The first is that the criterion (4.1) may be zero for all
splits in S.
PROPOSITION 4.2. For any split of t into tL and tR,
R(t) ≥ R(tL) + R(tR),
or
R(t) - R(tL) - R(tR) = p(t)[pL max_j p(j|tL) + pR max_j p(j|tR) - max_j p(j|t)].
The right hand side is certainly nonnegative and equals zero under the conditions j*(t)
= j*(tL) = j*(tR).
Now suppose we have a two-class problem with equal priors and are at a node t
which has a preponderance of class 1 cases. It is conceivable that every split of t
produces nodes tL, tR where both have class 1 majorities. Then R(t) - R(tL) - R(tR) = 0
for all splits in S and there is no single or small number of best splits.
The second defect is more difficult to quantify. In summary, it is that reducing the
misclassification rate does not appear to be a good criterion for the overall multistep
tree growing procedure.
FIGURE 4.1
Look at the example in Figure 4.1. Suppose that the top node is the root node and
assume equal priors. The first split leads to a tree in which 200 cases are misclassified
and R(T) = 200/800 = .25. In the second split, 200 cases are also misclassified and
R(T) =.25.
Even though both splits are given equal ratings by the R(T) criterion, the second
split is probably more desirable in terms of the future growth of the tree. The first split
gives two nodes, both having r(t) = .25. Both of these nodes will probably need more
splitting to get a tree with a lower value of R(T). The second tree has one node with
r(t) = .33, which will have to be split again on the same grounds. But it also has one
node with 1/4 of the cases in it for which r(t) = 0. This node is terminal, no more work
has to be done on it and it gives, at least on L, perfect classification accuracy.
Besides the degeneracy problem, then, the R(T) criterion does not seem to
appropriately reward splits that are more desirable in the context of the continued
growth of the tree. Other examples show that this behavior can be even more
pronounced as the number of classes gets larger.
This problem is largely caused by the fact that our tree growing structure is based on
a one-step optimization procedure. For example, in bridge, the team goal in every deal
is to take as many tricks as possible. But a team that attempts to take the trick every
time a card is led will almost invariably lose against experienced players.
A better single-play criterion would take into account overall improvement in
strategic position as well as the possibility of taking the current trick. In other words, a
good single-play criterion would incorporate the fact that any play currently selected
has implications for the future course of the game.
Similarly, while the tree T finally constructed should have as small as possible
“true” probability of misclassification, the one-step minimization of R(T) is not
desirable either from a practical point of view (degeneracy) or from the overall
strategic viewpoint.
4.2 THE TWO-CLASS PROBLEM
4.2.1 A Class of Splitting Criteria for the Two-Class Problem
In the two-class unit cost situation, the node impurity function corresponding to the
node misclassification rate is
i(t) = r(t) = min(p(1|t), p(2|t)).
FIGURE 4.2
This section defines a class of node impurity functions i(t) = φ(p1) which do not
have the undesirable properties of the min (p1, p2) function. We retain the intuitively
reasonable requirements that φ(0) = φ(1) = 0 and that φ(p1) have its maximum at p1 = 1/2.
The difficulty with min(p1, p2) is that for p1 ≥ 1/2 it equals 1 - p1, and so decreases only
linearly in p1. To construct a class of criteria that select the second
split of the example in Figure 4.1 as being more desirable, we will require that φ(p1)
decrease faster than linearly as p1 increases.
FIGURE 4.3
The class of node impurity functions φ(p1) which seems natural in this context is
therefore defined as the class F of functions φ(p1), 0 ≤ p1 ≤ 1, with continuous second
derivatives satisfying
(i) φ(0) = φ(1) = 0,
(ii) φ″(p1) < 0, 0 < p1 < 1.    (4.3)
FIGURE 4.4
The functions of class F generally look like Figure 4.4. There are some general
properties common to all functions in F. To begin with, let i(t) = φ(p(1|t)), φ ∈ F. Then
PROPOSITION 4.4. For any node t and split s,
Δi(s, t) ≥ 0,
with equality if, and only if, p(j|tL) = p(j|tR) = p(j|t), j = 1, 2.
The proof is in the appendix to this chapter.
The impurity is never increased by splitting. It stays the same only under rare
circumstances. The condition for equality in Proposition 4.4 is generally much more
difficult to satisfy than the corresponding condition using R(t) in Proposition 4.2. The
requirement that φ(p1) be strictly concave largely removes the degeneracy. It can still
happen that a few splits simultaneously achieve the maximum values of Δi(s, t), but in
practice, we have found that multiplicity is exceptional.
4.2.2 The Criteria Class and Categorical Variables
Perhaps the most interesting evidence that the class F is a natural class of node
impurity functions comes from a different direction. Suppose that a measurement
variable x is a categorical variable taking values in the set {b1, ..., bL}. At node t, for
each category bℓ estimate p(j|x = bℓ) from the cases in t having x = bℓ.
That is, p(j|x = bℓ) can be interpreted as the estimated probability of being in class j
given that the object is in node t and its x category is bℓ. Then this result holds.
THEOREM 4.5. Order the p(1|x = bℓ), that is,
p(1|x = bℓ1) ≤ p(1|x = bℓ2) ≤ ··· ≤ p(1|x = bℓL).
Then one of the L subsets
{bℓ1, ..., bℓh},  h = 1, ..., L,
is maximizing.
For a categorical variable with L large, this result provides considerable
improvement in computational efficiency. The search is reduced from looking at 2^(L-1)
subsets to L subsets. The intuitive content is clear. The best split should put all those
categories leading to high probabilities of being in class 1 into one node and the
categories leading to lower class 1 probabilities into the other.
The proof is a generalization of a result due to Fisher (1958) and is given in Section
9.4.
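The reduced search implied by Theorem 4.5 can be sketched as follows for the two-class criterion p(1|t)p(2|t) (ours; it orders the categories by p(1|x = bℓ) and scores only the splits along that ordering).

from collections import Counter

def best_categorical_split(cases):
    # cases: list of (category, class_label) pairs in the node, with labels 1 and 2.
    counts = {}
    for b, j in cases:
        counts.setdefault(b, Counter())[j] += 1
    # Order the categories by p(1 | x = b).
    order = sorted(counts, key=lambda b: counts[b][1] / sum(counts[b].values()))
    n = len(cases)
    n1 = sum(1 for _, j in cases if j == 1)

    def impurity(c1, c_total):                    # p1 * p2 within a (sub)node
        return 0.0 if c_total == 0 else (c1 / c_total) * (1 - c1 / c_total)

    i_t = impurity(n1, n)
    best_subset, best_delta = None, 0.0
    left1 = left_total = 0
    for h, b in enumerate(order[:-1], start=1):   # the L - 1 nontrivial ordered splits
        left1 += counts[b][1]
        left_total += sum(counts[b].values())
        delta = (i_t
                 - (left_total / n) * impurity(left1, left_total)
                 - ((n - left_total) / n) * impurity(n1 - left1, n - left_total))
        if delta > best_delta:
            best_subset, best_delta = set(order[:h]), delta
    return best_subset, best_delta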
4.2.3 Selection of a Single Criterion
The simplest candidates in the class F are the quadratic polynomials
φ(x) = a + bx + cx².
Condition (4.3)(i) gives a = 0, b + c = 0, so
φ(x) = bx(1 - x),
and (4.3)(ii) implies that b > 0. Without loss of generality we take b = 1, giving the
criterion
φ(p1) = p1(1 - p1) = p1p2.    (4.6)
The criterion
φ(p1) = -p1 log p1 - p2 log p2    (4.7)
also belongs to F.
The function p(1|t)p(2|t) is simple and quickly computed. It has a familiar
interpretation. Suppose all class 1 objects in a node t are given the numerical value 1
and class 2 objects the value 0. Then if p(1|t) and p(2|t) are the proportions of the two
classes in the node, the sample variance of the numerical values in the node is equal to
p(1|t)p(2|t).
Since we could not think of any intrinsic reason why one function in the class F
should be preferred to any other, and since preliminary tests indicated that both (4.6)
and (4.7) gave very similar results, the principle of simplicity was appealed to and
p(1|t)p(2|t) selected as the node impurity function in the two-class problem.
4.3 THE MULTICLASS PROBLEM: UNIT COSTS
Two different criteria have been adopted for use in the multiclass problem with unit
costs. These come from two different approaches toward the generalization of the two-
class criterion and are called the
Gini criterion
Twoing criterion
4.3.1 The Gini Criterion
The concept of a criterion depending on a node impurity measure has already been
introduced. Given a node t with estimated class probabilities p(j|t), j = 1, ..., J, a
measure of node impurity given t,
i(t) = φ(p(1|t), ..., p(J|t)),
is defined and a search made for the split that most reduces node, or equivalently tree,
impurity. As remarked earlier, the original function selected was the entropy
-Σj p(j|t) log p(j|t). In later work the Gini diversity index was adopted. This has the form
i(t) = Σ_{i≠j} p(i|t)p(j|t)    (4.8)
and can also be written as
i(t) = 1 - Σj p²(j|t).    (4.9)
In the two-class problem, the index reduces to
i(t) = 2p(1|t)p(2|t),
twice the two-class criterion adopted in Section 4.2.3.
The Gini index has the following interpretation. Instead of classifying by the plurality
rule, use the rule that assigns an object selected at random from the node to class i with
probability p(i|t). The estimated probability of misclassification under this rule is the
Gini index Σ_{i≠j} p(i|t)p(j|t).
Finally, note that the Gini index considered as a function φ(p1, ..., pJ) of p1, ...,
pJ is a quadratic polynomial with nonnegative coefficients. Hence, it is concave in the
sense that for r + s = 1, r ≥ 0, s ≥ 0,
φ(rp′ + sp″) ≥ rφ(p′) + sφ(p″),
which guarantees that for any split s,
Δi(s, t) ≥ 0.
Actually, it is strictly concave, so that Δi(s, t) = 0 only if p(j|tR) = p(j|tL) = p(j|t), j = 1,
..., J.
The Gini index is simple and quickly computed. It can also incorporate symmetric
variable misclassification costs in a natural way (see Section 4.4.2).
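For completeness, here is a sketch (ours) of how the Gini index (4.9) scores splits of the form "Is xm ≤ c?" at a node; the decrease in impurity is Δi(s, t) = i(t) - pL i(tL) - pR i(tR).

from collections import Counter

def gini(labels):
    # Gini diversity index 1 - sum_j p(j|t)^2 for a list of class labels.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_gini_split(cases, m):
    # cases: list of (x, j) pairs; candidate splits are "is x[m] <= c?".
    cases = sorted(cases, key=lambda cj: cj[0][m])
    labels = [j for _, j in cases]
    n = len(cases)
    i_t = gini(labels)
    best_c, best_delta = None, 0.0
    for k in range(1, n):
        if cases[k - 1][0][m] == cases[k][0][m]:
            continue                                   # no cut point between tied values
        delta = (i_t
                 - (k / n) * gini(labels[:k])          # p_L * i(t_L)
                 - ((n - k) / n) * gini(labels[k:]))   # p_R * i(t_R)
        if delta > best_delta:
            best_c = (cases[k - 1][0][m] + cases[k][0][m]) / 2.0
            best_delta = delta
    return best_c, best_delta

Recomputing the index on each side is quadratic in the node size; an actual implementation would update the class counts incrementally as the cut point moves.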
4.3.2 The Twoing Criterion
The second approach to the multiclass problem adopts a different strategy. Denote the
class of classes by C, i.e.,
C = {1, ..., J}.
At each node, separate the classes into two superclasses,
C1 = {j1, ..., jh},  C2 = C - C1.
For any given C1, treat all cases whose class is in C1 as a single class 1 and all cases
whose class is in C2 as a single class 2, and compute the two-class decrease in impurity
Δi(s, t, C1).
Now find the split s*(C1) which maximizes Δi(s, t, C1). Then, finally, find the
superclass C1* which maximizes
Δi(s*(C1), t, C1).
The split used on the node is s*(C1*).
The idea is then, at every node, to select that conglomeration of classes into two
superclasses so that considered as a two-class problem, the greatest decrease in node
impurity is realized.
This approach to the problem has one significant advantage: It gives “strategic”
splits and informs the user of class similarities. At each node, it sorts the classes into
those two groups which in some sense are most dissimilar and outputs to the user the
optimal grouping as well as the best split s*.
The word strategic is used in the sense that near the top of the tree, this criterion
attempts to group together large numbers of classes that are similar in some
characteristic. Near the bottom of the tree it attempts to isolate single classes. To
illustrate, suppose that in a four-class problem, originally classes 1 and 2 were grouped
together and split off from classes 3 and 4, resulting in a node with membership
Then on the next split of this node, the largest potential for decrease in impurity would
be in separating class 1 from class 2.
Spoken word recognition is an example of a problem in which twoing might
function effectively. Given, say, 100 words (classes), the first split might separate
monosyllabic words from multisyllabic words. Future splits might isolate those word
groups having other characteristics in common.
As a more concrete example, Figure 4.5 shows the first few splits in the digit
recognition example. The 10 numbers within each node are the class memberships in
the node. In each split the numbers in brackets by the split arrows are the superclasses
C1, C2 for the split. In parentheses in the brackets are the classes whose populations
are already so small in the parent node that their effect in the split is negligible. Zero
populations have been ignored.
FIGURE 4.5
Recall that the lights are numbered as in Figure 2.12.
The first split, on the fifth light, groups together classes 1, 3, 4, 5, 7, 9 and 2, 6, 8, 10.
Clearly, the fifth light should be off for 1, 3, 4, 5, 7, 9 and on for the remaining digits.
The next split on the left is on light 4 and separates classes 1, 7 from classes 3, 4, 5, 9.
On the right, the split on light 2 separates class 2 from 6, 8, 10.
Although twoing seems most desirable with a large number of classes, it is in such
situations that it has an apparent disadvantage in computational efficiency. For
example, with J classes, there are 2J-1 distinct divisions of C into two superclasses. For
125
J = 10, 2J- 1- = 1000. However, the following result shows, rather surprisingly, that
twoing can be reduced to an overall criterion, running at about the same efficiency as
the Gini criterion.
THEOREM 4.10. Under the two-class criterion p(1|t)p(2|t), for a given split s, a
superclass C1(s) that maximizes
Δi(s, t, C1)
is
C1(s) = {j : p(j|tL) ≥ p(j|tR)}.
COROLLARY 4.11. For any node t and split s of t into tL and tR, define the twoing
criterion function Φ(s, t) by
Φ(s, t) = (pLpR/4)[Σj |p(j|tL) - p(j|tR)|]².
Then the best twoing split s* is the split which maximizes Φ(s, t), and the
corresponding superclass C1(s*) is given by Theorem 4.10.
In such applications, it is natural to consider the ordered twoing criterion, in which
the maximization is over superclasses C1, C2 partitioning C = {1, ..., J} and restricted to
be of the form
C1 = {1, ..., j1},  C2 = {j1 + 1, ..., J}.
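In code, Theorem 4.10 and the twoing criterion of Corollary 4.11 (as restored above, so the exact form should be read as a reconstruction) reduce, for a candidate split, to the following sketch (ours).

from collections import Counter

def twoing(left_labels, right_labels, classes):
    # left_labels, right_labels: class labels of the cases sent to t_L and t_R
    # by a candidate split; both lists are assumed nonempty.
    n = len(left_labels) + len(right_labels)
    p_left, p_right = len(left_labels) / n, len(right_labels) / n
    cl, cr = Counter(left_labels), Counter(right_labels)
    prob_l = {j: cl[j] / len(left_labels) for j in classes}
    prob_r = {j: cr[j] / len(right_labels) for j in classes}
    # Twoing criterion Phi(s, t) = (pL * pR / 4) * (sum_j |p(j|tL) - p(j|tR)|)^2
    phi = (p_left * p_right / 4.0) * sum(abs(prob_l[j] - prob_r[j]) for j in classes) ** 2
    # Superclass of Theorem 4.10: the classes favored by the left node.
    c1 = {j for j in classes if prob_l[j] >= prob_r[j]}
    return phi, c1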
4.3.3 Choice of Criterion: An Example
Both the Gini and twoing criteria have been implemented in CART. Each method has
its own advantages. In either case, we have not succeeded in finding an extension of
Theorem 4.5 on handling categorical data. If the Gini index is used, the number of
categories in any variable should be kept moderate to prevent exhaustive subset
searches. If twoing is used and there are a small number of classes, then for each fixed
superclass selection, Theorem 4.5 can be used on the categorical variables, and then a
direct search can be made for the best superclass.
Choice of a criterion depends on the problem and on what information is desired.
The final classification rule generated seems to be quite insensitive to the choice. To
illustrate, both Gini and twoing were used on a replicate data set in the digit
recognition problem, with final tree selection using the 1 SE rule.
In both cases, trees with 10 terminal nodes were selected. Both trees have the same
test sample accuracy (.33). Figure 4.6 shows the two trees. The numbers underneath
the nodes are the coordinates split on. The numbers in the terminal nodes indicate
node class assignment.
The two trees are very similar. At the node indicated by the arrow, the Gini criterion
prefers a split on coordinate 6, while twoing selects the second coordinate. For the
Gini criterion, the split on the second coordinate was the second best split, and for
twoing, the split on the sixth coordinate was the second best.
The class membership of this node is
FIGURE 4.6
Class 2: x2 = 0, x6 = 0
3: x2 = 0, x5 = 0
8: no zeros
9: x4 = 0
10: x4 = 0
The split on x6 = 0 preferred by Gini separates out class 2 and sends it left. The twoing
split on x2 = 0 groups together classes 2 and 3 and sends them left with 9 and 10 going
right.
Different data sets were generated from the digit recognition model by changing the
random number generator seed. Trees were grown on the data sets using both the Gini
and twoing criteria. The preceding example was chosen as best illustrating the
difference. In general, the first few splits of the two trees are the same, and the two
trees selected have comparable accuracy. Where they differ, Gini tends to favor a split
into one small, pure node and a large, impure node. Twoing favors splits that tend to
equalize populations in the two descendant nodes.
In the waveform recognition example, the trees constructed by Gini and twoing and
pruned back by the 1 SE rule had (to our surprise) identical splits. When a new seed
was used to generate another waveform data set, the two trees differed slightly. Both
had nine terminal nodes. The Gini tree had slightly better test sample accuracy (.28)
than the twoing tree (.30). The right branch leading from the root node had identical
splits in both trees. Where they differed, in the left branch, the same phenomenon was
observed as in the digit data. The twoing splits tended to produce descendant nodes of
more equal size than the Gini splits.
There are usually only slight differences between the Gini and twoing trees. On
balance, comparing the two on many data sets, where they differ, the Gini splits
generally appear to be better. In fact, one can give examples of two candidate splits of
a node, one of which is clearly superior to another in terms of producing pure
descendant nodes, in which twoing (but not Gini) selects the poorer split. For these
reasons, we usually prefer the use of the Gini criterion.
4.4 PRIORS AND VARIABLE MISCLASSIFICATION COSTS
The parameters that can be set in tree structured classification include the priors {π(j)}
and variable misclassification costs {c(i|j)}. These are interrelated to the extent
discussed in Section 4.4.3.
4.4.1 Choice of Priors
The priors are a useful set of parameters, and intelligent selection and adjustment of
them can assist in constructing a desirable classification tree.
In some studies, the data set may be very unbalanced between classes. For example,
in the mass spectra data base nonchlorine compounds outnumbered chlorine
compounds by 10 to 1. If the priors are taken proportional to the occurrence of
compounds in the data base, then we start with a misclassification rate of 10 percent:
Everything is classified as not containing chlorine. Growing a classification tree using
such priors decreases the misclassification rate to about 5 percent. But the result is that
nonchlorines have a 3 percent misclassification rate, while chlorines have a 30 percent
misclassification rate.
The mechanism producing this disparity is that if equal numbers of chlorines and
nonchlorines are misclassified, the effect on the chlorine classification rate will be
much larger than on the nonchlorine rate.
The priors can be used to adjust the individual class misclassification rates in any
desired direction. For example, taking equal priors tends to equalize the
misclassification rates. In the chlorine example, equal priors resulted in a 9 percent
misclassification rate for chlorine and a 7 percent rate for nonchlorines. Putting a
larger prior on a class will tend to decrease its misclassification rate, and vice versa.
If the initial choice of priors gives questionable results, we suggest the growing of
some exploratory trees using different priors as outlined in Chapter 5.
132
4.4.2 Variable Misclassification Costs via Gini
In Section 4.3 the assumption was made that the cost of misclassifying a class j case as
a class i case was equal to 1 for all i ≠ j. In general, if variable misclassification costs
{c(i|j)} are specified, then the question arises of how to incorporate these costs into the
splitting rule. For the Gini index there is a simple extension. Again, consider the
suboptimal classification rule which, at a node t, assigns an unknown object into class
j with estimated probability p(j|t). Note that the estimated expected cost using this rule
is
Σ(i≠j) c(i|j)p(i|t)p(j|t).  (4.12)
This expression is used as the Gini measure of node impurity i(t) for variable
misclassification costs.
In the two-class problem, (4.12) reduces to
(c(2|1) + c(1|2))p(1|t)p(2|t),
giving the same splitting criterion (essentially) as in the unit cost case. This points
up a difficulty, noted by Bridges (1980), in the way in which the Gini index deals with
variable costs. The coefficient of p(i|t)p(j|t) in (4.12) is c(i|j) + c(j|i). The index
therefore depends only on the symmetrized cost matrix and does not appropriately
adjust to highly nonsymmetric costs.
Another, more theoretical, problem is that i(t) defined by (4.12) is not necessarily a
concave function of the {p(j|t)}, and so Δi(s, t) could conceivably be negative for some
or all splits in S.
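To make the computation concrete, the following sketch (in Python; the function and variable names are ours) evaluates the cost-weighted Gini impurity (4.12) at a node from its class proportions and a cost matrix. It also illustrates the symmetrization noted above: swapping c(i|j) and c(j|i) leaves the value unchanged.

```python
import numpy as np

def cost_weighted_gini(p, cost):
    """Cost-weighted Gini impurity (4.12): sum over i != j of c(i|j) p(i|t) p(j|t).

    p    : estimated class proportions p(j|t) at the node
    cost : cost[i, j] = c(i|j), the cost of classifying a class j case as class i
    """
    p = np.asarray(p, dtype=float)
    c = np.asarray(cost, dtype=float).copy()
    np.fill_diagonal(c, 0.0)          # correct classifications carry no cost
    return float(p @ c @ p)           # sum_i sum_j c(i|j) p(i|t) p(j|t)

# Only the symmetrized costs matter: interchanging c(1|2) and c(2|1) gives the same value.
p = [0.7, 0.3]
print(cost_weighted_gini(p, [[0, 1], [4, 0]]))   # 0.7*0.3*(4 + 1) = 1.05
print(cost_weighted_gini(p, [[0, 4], [1, 0]]))   # same value
```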
133
4.4.3 Variable Misclassification Costs via Altered Priors
Let {π′(j)} and {c′(i|j)} be altered forms of {π(j)} and {c(i|j)} such that
π′(j)c′(i|j) = π(j)c(i|j) for all i, j.  (4.13)
Then R(T) remains the same when computed using {π′(j)} and {c′(i|j)}.
Take {c′(i|j)} to be the unit cost matrix and suppose that altered priors {π′(j)} can be
found satisfying (4.13). Then the cost structure of T is equivalent, in the above sense,
to a unit cost problem with the {π(j)} replaced by the {π′(j)}.
If the costs are such that for each class j, there is a constant misclassification cost
C(j) regardless of how it is misclassified, that is, if
c(i|j) = C(j), i ≠ j,
then the {c′(i|j)} can be taken to be unit costs with the altered priors
π′(j) = C(j)π(j) / Σ(i) C(i)π(i).  (4.14)
134
This suggests that a natural way to deal with a problem having a constant cost
structure C(j) for the jth class is to redefine priors by (4.14) and proceed as though it
were a unit cost problem.
In general, the {π′(j)} should be chosen so that the {C′(i|j)} are as close as possible
to unit cost. This has been implemented in CART by defining the {π′(j)} through
(4.14) using
C(j) = Σ(i) c(i|j).  (4.15)
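As a small illustration (a sketch only; the function name is ours), the altered priors of (4.14), with C(j) obtained by summing class j's column of the cost matrix as in (4.15), can be computed as follows. The two-class example below assumes priors of .9 and .1 and a cost of 3 for misclassifying the rare class, purely for illustration.

```python
import numpy as np

def altered_priors(priors, cost):
    """Altered priors (4.14), with C(j) = sum_i c(i|j) as in (4.15).

    priors : original priors pi(j)
    cost   : cost[i, j] = c(i|j); diagonal entries are ignored
    """
    priors = np.asarray(priors, dtype=float)
    c = np.asarray(cost, dtype=float).copy()
    np.fill_diagonal(c, 0.0)
    C = c.sum(axis=0)              # C(j) = sum_i c(i|j)
    w = C * priors
    return w / w.sum()             # pi'(j) = C(j)pi(j) / sum_i C(i)pi(i)

# Priors .9/.1; misclassifying the rare class costs 3, the common class 1.
print(altered_priors([0.9, 0.1], [[0, 3], [1, 0]]))   # -> [0.75, 0.25]
```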
135
4.5 TWO EXAMPLES
In the waveform recognition problem, recall that the classes are superpositions of two
waveforms as sketched in Figure 4.7. Classes 2 and 3 are mirror images of each other.
Since class 1 is the odd class, we decided to track the results of varying
misclassification costs by making it more costly to misclassify class 1 as a class 2 or 3.
The misclassification cost matrix C(i|j) given in (4.16) was used.
FIGURE 4.7
136
c(2|1) = c(3|1) = 5, with all other off-diagonal costs c(i|j) = 1.  (4.16)
In this case, there are constant class misclassification costs; C(1) = 5, C(2) = 1, C(3) =
1, and altering the priors is the preferred procedure. The example was run using altered
priors and the unit cost Gini criterion and rerun using the original priors and the Gini
criterion incorporating the symmetrized cost matrix.
In the second example, the cost matrix was taken as the symmetrized version of
(4.16):
In this example, use of the Gini criterion with varying costs seems preferred. This is
contrasted with running the example using the Gini criterion with unit costs and
altered priors defined by (4.14) and (4.15).
The results of the first example are summarized in Table 4.1 and Figure 4.8. Except
for the last split in the Gini case, which only skims seven cases off to the left, the trees
are very similar, and their test set costs are the same. The Gini RCV is high, but we
suspected that this was a random fluctuation. To check, we replicated with a different
seed. In the replication, the Gini RCV and Rts differed by less than 1 SE.
The results of the second example are shown in Figure 4.9.
In the cost structure of this example, the mistake 2 ↔ 3 is only 1/3 as costly as the
mistakes 1 ↔ 2 and 1 ↔ 3. The priors are altered in the proportions 3:2:2. This does
not produce much difference in the priors, and the two trees are decidedly different.
For instance, in the symmetric Gini, the first split selected is in the midregion, where
class 1 can most easily be separated from 2 and 3.
139
There are some interesting facets in these examples. First, since the Gini criterion
symmetrizes the loss matrix, the same criterion was used to grow the symmetric Gini
trees in both examples. The only difference is that in the first example the pruning up
uses the nonsymmetric losses. In the second example, the pruning uses the symmetric
losses. The difference in the tree structures is substantial.
Second, take the tree grown on the waveform data using unit costs and the original
priors (Figure 2.17). Use the 5000-case test sample together with the symmetric
variable cost matrix to estimate the cost of this tree. The answer is .72, less than either
of the two trees illustrated in Figure 4.9. Yet this tree was grown without regard for the
variable misclassification costs.
The major reason for this apparently odd result is fairly simple. In the first example,
class 1 could be accurately classified as long as the misclassification of classes 2 and 3
as class 1's could be more or less ignored. However, using univariate splits, there is a
lower limit to the mutual separation between class 1 and classes 2 and 3. Regardless of
how high 1 ↔ {2, 3} is weighted, matters will not improve much.
Third, in both examples, the higher losses increase the SE’s of the cross-validation
estimates. Using the 1 SE rule allows a considerable increase in the cost estimate. A
smaller increase may be desirable. For instance, in the altered prior tree (second
example), the minimum cross-validated cost tree had a test sample cost of .71, as
against the .81 cost of the tree selected by the 1 SE rule.
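Estimating a tree's cost on a test sample under a variable cost matrix, as done here, amounts to averaging c(d(x)|j) over the test cases with the appropriate prior weighting. A minimal sketch (names are ours; for simplicity the priors are assumed data estimated from the test sample, so the estimate is just the mean cost per case), using the 3-to-1 cost ratio of the second example:

```python
def test_sample_cost(pred, true, cost):
    """Average misclassification cost of predictions `pred` against labels `true`.

    cost[i][j] = c(i|j), the cost of classifying a class j case as class i.
    """
    return sum(cost[i][j] for i, j in zip(pred, true)) / len(true)

# Symmetric waveform-style costs: confusing class 1 with 2 or 3 costs three times
# as much as confusing 2 with 3 (classes indexed 0, 1, 2 here).
C = [[0, 3, 3],
     [3, 0, 1],
     [3, 1, 0]]
print(test_sample_cost(pred=[0, 1, 2, 1], true=[0, 2, 2, 0], cost=C))   # (0+1+0+3)/4 = 1.0
```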
140
4.6 CLASS PROBABILITY TREES VIA GINI
141
4.6.1 Background and Framework
142
However, this criterion poses an awkward problem, since its value depends on the
unknown P(j|x) that we are trying to estimate. Fortunately, the problem can be put into
a different setting that resolves the difficulty. Let X, Y on X × C have the distribution
P(A, j) and define new variables zj, j = 1, ..., J, by zj = 1 if Y = j and zj = 0 otherwise.
Then
Thus, the MSE of d is simply the sum of its mean square errors as a predictor of the
variables zj, j = 1, ..., J.
The key identity is
PROPOSITION 4.19. For any class probability estimator d,
(4.20)
The proof is a standard and simple exercise in conditional probabilities, which we
omit. There are two interesting pieces of information in Proposition 4.19. The first is
that among all class probability estimators, dB has minimum MSE.
The second, and more important, is that the accuracy of d as defined in Definition
4.17, differs from R*(d) only by the constant term R*(dB). Therefore, to compare the
accuracy of two estimators d1 and d2, we can compare the values of R*(d1) and R*
(d2).
The significant advantage gained here is that R*(d) can be estimated from data,
while accuracies cannot. We focus, then, on the problem of using trees to produce
class probability estimates with minimal values of R*.
144
4.6.2 Growing and Pruning Class Probability Trees
Assume that a tree T has been grown on a learning sample (xn, jn), n = 1, ..., N, using
an unspecified splitting rule and has the set of terminal nodes T.
Associated with each terminal node t are the resubstitution estimates p(j|t), j = 1, ...,
J, for the conditional probability of being in class j given node t.
The natural way to use T as a class probability estimator is by defining: If x ∈ t,
then
Then the resubstitution estimate R(T) of R*(T) can be formed by this reasoning: For all
(xn , jn ) with xn ∈ t, jn = j,
where
Then put
(4.21)
Evaluating the sum over j in the last expression gives
R(T) = Σ(t∈T̃) p(t)[1 − Σ(j) p²(j|t)].  (4.22)
The surprising thing is that
1 − Σ(j) p²(j|t)
is exactly the Gini diversity index (4.9). So growing a tree by using the Gini
splitting rule continually minimizes the resubstitution estimate R(T) for the MSE. In
consequence, use of the Gini splitting rule is adopted as the best strategy for growing a
class probability tree.
The major difference between classification trees grown using the Gini rule and
class probability trees is in the pruning and selection process. Classification trees are
pruned using the criterion R(T) + α|T̃|, where
R(T) = Σ(t∈T̃) r(t)p(t)  (4.23)
and r(t) is the within-node misclassification cost.
Class probability trees are pruned upward using R(T) + α|T̃|, but with r(t) the within-node Gini diversity index.
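A minimal sketch of the two pruning criteria (our own representation; a terminal node is described only by its probability p(t) and its class proportions): the only change for class probability trees is replacing the within-node misclassification cost by the within-node Gini index.

```python
def gini(p):
    """Within-node Gini diversity index 1 - sum_j p(j|t)^2."""
    return 1.0 - sum(q * q for q in p)

def misclass(p):
    """Within-node resubstitution misclassification cost r(t) under unit costs."""
    return 1.0 - max(p)

def cost_complexity(leaves, alpha, node_cost):
    """R(T) + alpha*|T~| for a list of terminal nodes given as (p(t), class proportions)."""
    R = sum(p_t * node_cost(props) for p_t, props in leaves)
    return R + alpha * len(leaves)

leaves = [(0.4, [0.9, 0.1]), (0.6, [0.3, 0.7])]
print(cost_complexity(leaves, alpha=0.01, node_cost=misclass))  # classification tree pruning
print(cost_complexity(leaves, alpha=0.01, node_cost=gini))      # class probability tree pruning
```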
Take Tmax to be grown as before, and prune upward, getting the sequence T1 > T2 >
··· > {t1}. To get test sample estimates Rts(T) of R*(T) for T any of the Tk, run all the
where the sum is over the test sample cases. Then we put
(4.24)
If the priors are data estimated, the test sample estimates of them are used in (4.24).
If T1, ..., TV are the V cross-validation trees associated with T, let d(v), v = 1, ..., V,
denote the corresponding class probability estimators. Define
(4.25)
where the inner sum is over all class j cases in the vth test sample. Now put
(4.26)
If the priors are data estimated, the entire learning sample estimates are used to
estimate the π(j) in (4.26).
Standard errors for the Rts and RCV estimates are derived in Chapter 11.
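Since the displays (4.24) and (4.25) are not reproduced above, the following is only our reconstruction of the test sample computation: each test case contributes the squared error between its class indicator vector z and the tree's estimate at the terminal node it falls into, the errors are averaged within each class, and the class averages are weighted by the priors. All names are ours.

```python
import numpy as np

def Rts_class_prob(test_x, test_y, predict_probs, priors):
    """Assumed test sample estimate of R*(d) for a class probability estimator d.

    predict_probs(x) returns the vector (d_1(x), ..., d_J(x)), e.g. p(.|t) for the
    terminal node t that x falls into; priors are the pi(j).
    """
    J = len(priors)
    total = 0.0
    for j in range(J):
        cases = [x for x, y in zip(test_x, test_y) if y == j]
        if not cases:
            continue
        errs = []
        for x in cases:
            z = np.zeros(J)
            z[j] = 1.0                                   # class indicator variables z_i
            errs.append(float(np.sum((z - np.asarray(predict_probs(x))) ** 2)))
        total += priors[j] * np.mean(errs)               # class average weighted by pi(j)
    return total
```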
147
4.6.3 Examples and Comments
Class probability trees were constructed for both the digit and waveform data using the
1 SE rule. The results are summarized in Table 4.3.
TABLE 4.3
We calculated the value of the Gini index for the tree grown on the digit data using
twoing to split and the misclassification rate to prune. The result, using the test sample
data, was .553. This was also done for the waveform tree grown using the Gini index
and pruned using the misclassification rate. The result, on test sample data, was .459.
The improvement, as measured by the Rts values, is less than spectacular. Still, this
procedure has not been extensively tested. There may be situations where the
improvement is more significant.
148
APPENDIX
Δi(s, t) ≥ 0,
with equality if, and only if, p(j|tL) = p(j|tR) = p(j|t), j = 1, ..., J.
By the strict concavity of φ,
with equality holding if, and only if, p(j|tL) = p(j|tR), j = 1, ..., J. Now
(A.2)
This implies that
with equality if, and only if, p(j|tL) = p(j|tR ), j = 1, ..., J. If this latter holds, using (A.2)
again gives p(j|tL) = p(j|tR) = p(j|t), j = 1, ..., J.
Proof of Theorem (4.10) and Its Corollaries
Recall that for a given split s and a division of the classes {1, ..., J} into the two
superclasses C1 and C2, Δi(s, t, C1) denotes the decrease in node impurity computed as
in a two-class problem using the p(1|t)p(2|t) criterion. The problem is to find, for a
given s, a superclass C1(s) so that
(A.3)
Now, define zj = p(j|tL) − p(j|tR), so Σ(j) zj = 0. For any real z, let z+ and z− be its positive
and negative parts: z+ = z and z− = 0 if z ≥ 0; z+ = 0 and z− = −z if z ≤ 0. Then z = z+ − z−, and
|z| = z+ + z−.
From (A.3), since pL, pR depend only on s and not on C1, the superclass C1(s) either
maximizes or minimizes Σ(j∈C1) zj. Since Σ(j) zj = 0, the maximum value of this sum
equals the absolute value of the minimum value, and both equal
(1/2) Σ(j) |p(j|tL) − p(j|tR)|,
and a maximizing superclass is C1(s) = {j; p(j|tL) ≥ p(j|tR)}. This proves Theorem
4.10 once (A.3) is derived.
To get (A.3), write
(A.4)
Replacing p(1|t) in (A.4) by pLp(1|tL) + pRp(1|tR) leads to
This last is equivalent to (A.3).
151
5
152
5.1 INTRODUCTION
The methodological development produced some features that were added to the basic
tree structure to make it more flexible, powerful, and efficient.
The tree growing procedure described in the previous chapters uses splits on one
variable at a time. Some problems have a structure that suggests treatment through
combinations of variables. Three methods for using combinations are given in Section
5.2.
Section 5.3 deals with predictive association between splits and the definition of
surrogate splits. This is a useful device which is analogous to correlation in linear
models. It is used to handle data with missing variables and give a ranking of variable
importance.
Although cross-validation gives accurate estimates of overall tree cost, it is not
capable of improving the resubstitution estimates of the individual terminal node costs.
Two heuristic methods for improved within-node cost estimates are discussed in
Section 5.4.
In Section 5.5 the important issues are interpreting and exploring the data structure
through trees. The instability of the tree topology is illustrated. Two avenues for tree
interpretation are discussed: first, a close examination of the tree output; second, the
growing of exploratory trees, both before and after the main tree procedure. A method
is given for rapid computation of exploratory trees.
The question of computational efficiency is covered in Section 5.6. Some
benchmarks are given for tree construction time. With large data sets, a method of
subsampling can be used which significantly decreases the time requirement while
having only a minor effect on the tree structure.
Finally, Section 5.7 gives a comparison of tree structured classification with nearest
neighbor and discriminant function methods as applied to the digit and wave
recognition examples.
The appendix gives a description of the search algorithm for finding best linear
combination splits.
153
5.2 VARIABLE COMBINATIONS
154
5.2.1 Introduction
In Chapter 2 we noted that at times the data structure may be such that it makes more
sense to split on combinations of variables than on the individual original variables.
We have found three useful combination procedures. The first is a search for a best
linear combination split; the second uses Boolean combinations; and the third is
through the addition of features—ad hoc combinations of variables suggested by
examination of the data.
155
5.2.2 Linear Combinations
In some data, the classes are naturally separated by hyperplanes not perpendicular to
the coordinate axes. These problems are difficult for the unmodified tree structured
procedure and result in large trees as the algorithm attempts to approximate the
hyperplanes by multidimensional rectangular regions. To cope with such situations,
the basic structure has been enhanced to allow a search for best splits over linear
combinations of variables.
The linear combination algorithm works as follows. Suppose there are M1 ordered
variables (categorical variables are excluded). If there are missing data, only those
cases complete in the ordered variables are used. At every node t, take a set of
coefficients a = (a1, ..., aM1) such that ∥a∥² = Σ(m) am² = 1, and search for the best split
of the form
Σ(m) amxm ≤ c  (5.1)
as c ranges over all possible values. Denote this split by s*(a) and the corresponding
decrease in impurity by Δi(s*(a), t). The best set of coefficients a* is that a which
maximizes Δi(s*(a), t); that is, Δi(s*(a*), t) = max over a of Δi(s*(a), t).
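For a fixed coefficient vector a, the inner search over c in (5.1) is one-dimensional. A rough sketch is given below (Python, with a node impurity function passed in; all names are ours); the harder outer optimization over a is described in the appendix to this chapter.

```python
import numpy as np

def best_threshold(X, y, a, impurity):
    """Best split of the form sum_m a_m x_m <= c for a fixed coefficient vector a.

    Returns (c, decrease), where decrease plays the role of Delta i(s*(a), t) for the
    node's cases (X, y). impurity(y) is any concave node impurity of the labels y.
    """
    X = np.asarray(X, dtype=float)
    u = X @ a                                   # projected values a.x for each case
    order = np.argsort(u)
    u, y = u[order], np.asarray(y)[order]
    n = len(y)
    parent = impurity(y)
    best_c, best_dec = None, 0.0
    for k in range(1, n):                       # thresholds between consecutive projected values
        if u[k] == u[k - 1]:
            continue
        c = 0.5 * (u[k] + u[k - 1])
        dec = parent - (k / n) * impurity(y[:k]) - ((n - k) / n) * impurity(y[k:])
        if dec > best_dec:
            best_c, best_dec = c, dec
    return best_c, best_dec
```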
For high-dimensional data this gives a complicated tree structure. At each node one
has to interpret a split based on a linear combination of all ordered measurement
variables. But some of the variables in the combination may contribute very little to
the effectiveness of the split. To simplify the structure, we weed out these variables
through a backward deletion process.
For m ranging from 1 to M1, vary the threshold constant c and find the best split of
the form Σ(m′≠m) a*m′ xm′ ≤ c.
That is, we find the best split using the linear combination with coefficients a* but
deleting xm and optimizing on the threshold c. Denote the decrease in impurity using
this split by Δm and
Δ* = Δi(s*(a*), t).
The most important single variable to the split s*(a*) is the one whose deletion
causes the greatest deterioration in performance. More specifically, it is that variable
for which Δm is a minimum. Similarly, the least important single variable is the one for
which Δm is a maximum.
Measure the deterioration due to deleting the most important variable by Δ* − min(m) Δm,
and the deterioration due to deleting the least important variable by Δ* − max(m) Δm.
Set a constant β, usually .2 or .1, and if the deterioration caused by deleting the least
important variable is less than β times that caused by deleting the most important one,
delete the least important variable; the search and deletion are then repeated on the
reduced combination.
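A sketch of one backward-deletion pass consistent with the description above (the stopping rule built around β is our reading of the procedure; the names are ours):

```python
def backward_delete_once(coefs, delta_star, delta_m, beta=0.2):
    """One step of backward deletion for a linear combination split.

    delta_star : Delta*, the impurity decrease of the full combination split
    delta_m    : dict m -> Delta_m, the decrease when x_m is deleted and c reoptimized
    Deletes the least important variable if its deterioration Delta* - max_m Delta_m
    is less than beta times that of the most important, Delta* - min_m Delta_m.
    """
    most = min(delta_m, key=delta_m.get)     # deleting it hurts the split the most
    least = max(delta_m, key=delta_m.get)    # deleting it hurts the split the least
    if delta_star - delta_m[least] < beta * (delta_star - delta_m[most]):
        coefs = {m: a for m, a in coefs.items() if m != least}
    return coefs
```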
FIGURE 5.1
The first linear combination separates off class 3 as being low in coordinates 5, 6, 7
and high on the thirteenth coordinate. Then it separates class 1 from class 2 as being
low in coordinates 9, 10, 11 and high in the sixteenth and seventeenth.
The test set error rate is improved by .08 through the use of linear combinations,
and the structure of the tree is considerably simplified.
The linear combination splits are added to the univariate splits. So even with this
option, it may happen that the split selected at a node is univariate.
For other methods of constructing piecewise linear classifiers, see Meisel (1972),
Friedman (1977), and Sklansky (1980).
159
5.2.3 Boolean Combinations
Another class of problems which are difficult for the basic tree procedure is
characterized by high dimensionality and a Boolean structure. A common class of such
problems occurs in medical diagnostic data sets consisting of a large number of binary
variables indicating the presence or absence of certain symptoms and other variables
which are medical test results. Another example is in the classification of chemical
compounds through the peaks in their mass spectra.
Both doctors and mass spectrographers look for certain combinations of variables as
being significant, that is, the presence of symptom 3 together with a positive result on
test 5, or the presence of a peak at location 30 together with another peak at location
44. The general kind of thing that is being looked for is of the form: Does the case
have property (D1 and D2) or (D3 and D5), etc.? We refer to such an expression as a
Boolean combination.
As in the linear case, the basic tree structure may eventually uncover the structure of
the data, but only at the cost of many splits and a resulting disguise of the structure.
The method outlined below has been designed to deal more effectively with a Boolean
structure.
Since the class of all Boolean combinations of splits is extremely large and can lead
to a confusing tree, the class of Boolean combinations considered is restricted to splits
of the form
(5.2)
This includes combinations of the form: Does the patient have symptom 3 and read
positive on test 5? Or, does the spectrum have a peak at 30 with intensity greater than
e1 and a peak at 44 with intensity greater than e2? When the complementary node is
considered, it also includes splits of the form
(5.3)
The class (5.2) of Boolean splits is denoted as
160
and interpreted as the set of all cases sent to tL by every split in the set {sm1, ...,
smn}. Denote the decrease in impurity of the node t given by this split as
(5.4)
Theoretically, the optimal procedure is to maximize (5.4) over all splits on variables
xm1, ..., xmn, and then to maximize over all subsets {m1, ..., mn} ⊂ {1, ..., M}. At
present, we do not know of a feasible way to implement this direct maximization
procedure. Instead, a stepwise method is used.
If a split s on an ordered variable x is of the form {Is x ≤ c?}, let s̄ be the split {Is x
> c?}. If s is a split on a categorical variable x of the form {Is x ∈ {b1, ..., bh}?},
denote by s̄ the split {Is x ∉ {b1, ..., bh}?}.
DEFINITION 5.5. If s is any split of the form {Is x ∈ B?}, then the complementary
split s̄ to s is defined as {Is x ∈ Bᶜ?}.
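A small sketch of how a Boolean split of the form (5.2) acts on a case (the dictionary keys "symptom3" and "test5" are hypothetical; the representation of elementary splits as predicates is ours): a case goes left only if every elementary split in the set sends it left.

```python
def boolean_split(case, elementary_splits):
    """Split of the form (5.2): send the case left iff every elementary split sends it left.

    elementary_splits is a list of predicates, e.g. tests of the form
    "Is x_m = 1?" or "Is x_m > threshold?". Complementing the node corresponds
    to negating this conjunction, giving splits of the form (5.3).
    """
    return all(split(case) for split in elementary_splits)

patient = {"symptom3": 1, "test5": 3.4}
print(boolean_split(patient, [lambda x: x["symptom3"] == 1,
                              lambda x: x["test5"] > 2.0]))   # True -> sent left
```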
Let s*m be the best split on the variable xm and take S* to be the set of splits
at s = s*, add s* to the current intersection. Continue adding splits to this
intersection until the condition in step 3 is satisfied.
3. Fix β > 0; if at any stage in step 2
162
5.2.4 Using Features
The term feature is used in the pattern recognition literature to denote a variable that is
manufactured as a real-valued function of the variables that were originally measured.
Features are generally constructed for one of two reasons:
1. The original data are high dimensional with little usable information in any one
of the individual coordinates. An attempt is made to “concentrate” the
information by replacing a large number of the original variables by a smaller
number of features.
2. On examination, the structure of the data appears to have certain properties that
can be more sharply seen through the values of appropriate features.
A powerful aspect of the tree structured approach is that it permits the introduction
of a large number of candidate features and then selects the best among them to split
the data on. An example of this is in the reduction of the high-dimensional chemical
spectra problem described in Chapter 7.
We use the waveform recognition data to illustrate the construction and use of
features. Our claim here is that even if we did not know the mechanism that generated
the waveform data, examination of the data would show that an individual waveform
tended to be consistently high in some areas and low in others and, furthermore, that
these regions of highs and lows were the characteristics that discriminated between
classes.
Following this intuitive appraisal, 55 new features were constructed. These were
averages over the variables from m1 to m2, for m1 and m2 odd; that is,
x̄(m1, m2) = (xm1 + ··· + xm2)/(m2 − m1 + 1).
The tree procedure was then run using the original 21 variables and the 55 added
features, for a total of 76 variables.
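A sketch of the feature construction used here (the indexing convention and names are ours): with 21 variables and window endpoints restricted to odd indices, there are C(11, 2) = 55 such averages.

```python
from itertools import combinations
import numpy as np

def window_average_features(x):
    """Append the 55 averages xbar(m1, m2) over x_m1, ..., x_m2 for odd m1 < m2 (1-based)."""
    x = np.asarray(x, dtype=float)          # the original 21 waveform variables
    odd = range(1, 22, 2)                   # 1, 3, ..., 21
    feats = [x[m1 - 1:m2].mean() for m1, m2 in combinations(odd, 2)]
    return np.concatenate([x, feats])       # 21 + 55 = 76 variables in all

print(window_average_features(np.arange(21.0)).shape)   # (76,)
```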
The tree selected by the 1 SE rule had a cross-validated error of .20 and a test
sample (5000 samples) error rate of .20. The tree had only four terminal nodes (see
Figure 5.2).
163
FIGURE 5.2
The structure in this tree is more discernible than in the univariate split tree. The first
split separates out class 2 by its low average over x15 to x19. Then, on the right, classes
1 and 3 are separated by their average values over x11 to x13. Then the left node,
mostly class 2 but with 16 class 1 cases, is further purified by splitting on the average
over x9 to x13.
Construction of features is a powerful tool for increasing both accuracy and
understanding of structure, particularly in high dimensional problems. The selection of
features is an art guided by the analyst’s intuition and preliminary exploration of the
data. However, since the tree structure allows an almost unlimited introduction of
features, inventiveness is encouraged.
On the other hand, cross-validation and pruning prevent the new features from
overfitting the data and keep the tree honest.
164
5.3 SURROGATE SPLITS AND THEIR USES
165
5.3.1 Definitions
At any given node t, let s* be the best split of the node into tL and tR. In the basic tree
structure, s* is the best univariate split. In general, s* can be the best linear or Boolean
combination split.
Assume standard structure. Take any variable xm. Let Sm be the set of all splits on
xm, and S̄m the set of splits complementary to Sm. For any split sm ∈ Sm ∪ S̄m of the
node t into t′L and t′R, let Nj(LL) be the number of class j cases in t that both s* and sm send
left, that is, that go into tL ∩ t′L. By our usual procedure, we estimate the probability
that a case falls into tL ∩ t′L as
Then we define the estimated probability pLL(s*, sm) that both s* and sm send a case in
t left as
(5.6)
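Since the display (5.6) is not reproduced above, the sketch below assumes, by analogy with the resubstitution estimates used elsewhere in the chapter, that pLL(s*, sm) = Σ(j) π(j)Nj(LL)/Nj, where Nj is the number of class j cases in the learning sample. The names are ours.

```python
def p_LL(node_cases, s_star, s_m, priors, N_class):
    """Assumed form of (5.6): estimated probability that s* and s_m both send a case in t left.

    node_cases : list of (x, j) pairs falling in node t
    s_star, s_m: predicates returning True if the split sends x left
    priors     : pi(j);  N_class[j] = number of class j cases in the learning sample
    """
    J = len(priors)
    N_LL = [0] * J
    for x, j in node_cases:
        if s_star(x) and s_m(x):
            N_LL[j] += 1                      # N_j(LL): class j cases sent left by both splits
    return sum(priors[j] * N_LL[j] / N_class[j] for j in range(J))
```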
DEFINITION 5.7. A split s̃m ∈ Sm ∪ S̄m is called a surrogate split on xm for s* if
166
167
5.3.2 Missing Data
168
classification trees.
169
5.3.3 Examples with Missing Data
To see how (and how much) missing data affects tree construction, the 7-variable digit
recognition data and the 21-variable waveform recognition data had 5 percent, 10
percent, and 25 percent of the data deleted completely at random. On the average, the
number of variables deleted per case and the percent of complete cases are given in
Tables 5.1 and 5.2.
170
171
TABLE 5.5
The difficulty in classifying measurement vectors with missing variables might have
been anticipated. In the model generating the digit data, the overall correlations
between variables are in the .1 to .2 range. If a variable is missing, there are no good
surrogate splits on it.
In contrast, when the same two experiments were run on the waveform data, the Rts
values were all in the .29 to .31 range. With the highly correlated waveform variables,
good surrogate splits were available, and the tree had no difficulty in classifying cases
even with 25 percent of the variables missing.
Caution should be used in generalizing from these examples. If the missingness
does not occur at random, the effects can be larger than just noted. For example, if,
whenever a variable is missing, the variables containing the best surrogate splits also
tend to be missing, then the effect will be magnified. Chapter 8, on regression, has an
example which shows how significant missing data can be.
172
5.3.4 Variable Ranking
A question that has been frequent among tree users is: Which variables are the most
important? The critical issue is how to rank those variables that, while not giving the
best split of a node, may give the second or third best. For instance, a variable x1 may
never occur in any split in the final tree structure. Yet, if a masking variable x2 is
removed and another tree grown, x1 may occur prominently in the splits, and the
resulting tree may be almost as accurate as the original. In such a situation, we would
require that the variable ranking method detect “the importance” of x1.
The most satisfactory answer found to date is based on the surrogate splits s̃m. Let T
be the optimal subtree selected by the cross-validation or test sample procedure. If the
Gini splitting rule has been used, then at each node t ∈ T, compute ΔI(s̃m, t). If twoing
is used, define ΔI(s̃m, t) = Δi(s̃m, t, C1).
DEFINITION 5.9. The measure of importance of variable xm is defined as
M(xm) = Σ(t∈T) ΔI(s̃m, t).
(If there is more than one surrogate split on xm at a node, use the one having larger ΔI.)
The concept behind the variable ranking is this: If the best split of node t in the
univariate case is on xm′, and if xm is being masked at t, that is, if xm can generate a
split similar to s* but not quite as good, then at t, ΔI(s̃m, t) will be nearly as large as
ΔI(s*, t).
However, the idea is somewhat more subtle than it appears. This is seen by
contrasting it with another approach that was eventually discarded. In the latter, at
each t, ΔI(s*m, t) was computed, where s*m is the best split on xm at t, and the measure of importance was defined by Σ(t∈T) ΔI(s*m, t).
That is, the measure of importance was the sum over all nodes of the decrease in
impurity produced by the best split on xm at each node.
This is unsatisfactory in the following sense. Suppose that at node t, the best split s*m
on xm has low association with s* but that ΔI(s*m, t) ranks high among the values
ΔI(s*m′, t), m′ = 1, ..., M. Because of the low association, when t is split into tL and tR
by s*, it may happen that the optimal splits on xm in either tL or tR (or both) are close
to s*m and have comparatively large ΔI values. Then, essentially the same split has
contributed to the variable importance of xm, not only at t but also at tL and/or tR.
Further, it can keep contributing at the nodes below tL and tR until the split is used or
its splitting power is dissipated. Thus, the importance given will be misleadingly high.
Definition 5.9 does not have any obvious drawbacks, and in simulation examples
where it is possible to unambiguously define variable importance from the structure of
the example, it has given results in general agreement with expectations.
Since only the relative magnitudes of the M(xm) are interesting, the actual measures
of importance we use are the normalized quantities 100M(xm)/max(m′) M(xm′). The most
important variable then has measure 100, and the others are in the range 0 to 100.
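The normalization is straightforward; a minimal sketch (names are ours):

```python
def normalized_importances(M):
    """Scale raw importances M(x_m) so that the most important variable has measure 100."""
    top = max(M.values())
    return {m: 100.0 * v / top for m, v in M.items()}

print(normalized_importances({"x1": 0.12, "x2": 0.48, "x3": 0.06}))
# {'x1': 25.0, 'x2': 100.0, 'x3': 12.5}
```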
For example, in the digit recognition data, the variable importances computed are
given in Figure 5.3. Adding 17 zero-one noise variables to the data gives the measures
of importance shown in Figure 5.4.
FIGURE 5.3
FIGURE 5.4
174
In the waveform data, the measures of importance are as shown in Figure 5.5. The
only variables that are split in the tree selected are 6, 7, 10, 11, 15, 16, 17. The
importance measure indicates that the other variables, with the exception of the first
and last few, also carry significant splitting power but are being masked.
FIGURE 5.5
When this same example is run with 19 noise variables added having a N(0, 1)
distribution, the importances for the first 21 variables are as shown in Figure 5.6. The
19 noise variable importances are shown in Figure 5.7.
FIGURE 5.6
175
FIGURE 5.7
Caution should be used in interpreting these variable importances. Importance can
be defined in many different ways. We make no claim that ours is intrinsically best.
Furthermore, they can be appreciably altered by random fluctuations in the data (see
Section 5.5.2).
See Darlington (1968) for a discussion of measures of variable importance in the
context of multiple linear regression.
176
5.4 ESTIMATING WITHIN-NODE COST
In many applications, the user not only wants a classification for any future case, but
also some information about how sure he should be that the case is classified correctly.
The true value of the probability of misclassification given that the case falls into
terminal node t provides this information. The value r(t) is the resubstitution estimate
for this probability in the unit cost case. With variable costs, r(t) is an estimate for the
expected misclassification cost given that the case falls into t.
Unfortunately, the node estimates can be even more misleading than the tree
resubstitution estimate R(T). This is particularly so if the terminal node t is the result
of a number of splits and has a relatively small population. At each split the tree
algorithm tries to slice the data so as to minimize the impurity of the descendants. The
terminal nodes look much purer in terms of the learning sample than on subsequent
test samples. For instance, Table 5.6 gives the resubstitution and test sample estimates
for the 11 terminal nodes of the waveform tree drawn in Figure 2.17.
TABLE 5.6
This table shows clearly how unreliable the estimates r(t) are. Unfortunately, we
have not found a completely satisfactory way of getting better estimates. The cross-
validation trees are usually different. There is no way to set up a correspondence
between the nodes of the cross-validation trees and the main tree.
We have developed an ad hoc method which gives estimates ȓ(t) that in every
example examined are a significant improvement over r(t). The intuitive idea is that
the bias in r(t) is roughly inversely proportional to the size of t as measured by p(t).
Therefore, set
ȓ(t) = r(t) + ε/(p(t) + λ),  (5.10)
where ε and λ are constants to be determined as follows. The resubstitution
estimates satisfy Σ(t∈T̃) r(t)p(t) = R(T).
We first of all insist that the altered estimates add up to the cross-validated estimate
Rcv(T) instead of to R(T). That is,
Σ(t∈T̃) ȓ(t)p(t) = Rcv(T)  (5.11)
must be satisfied.
Now consider a tree T′ grown so large that each terminal node contains only one
case. Then, assuming priors estimated from the data, p(t) = 1/N for every t ∈ T̃′, and
ε/(p(t) + λ) ≈ ε/λ. In the two-class problem discussed in the appendix to Chapter 3, a
heuristic argument was given to show that R(T′) ≤ 2RB, where RB is the
Bayes rate. Assuming the inequality is not too far from equality gives
(5.12)
The discussion in the appendix to Chapter 3 is limited to the two-class problem, but
it can be generalized to the multiclass problem with symmetric misclassification costs.
Thus, (5.12) is more generally applicable.
Now we further assume that RB can be adequately approximated by RCV(TK).
This leads to the equation
(5.13)
Solving (5.12) and (5.13) together gives
(5.14)
If RCV(T) ≤ R(T), the original resubstitution estimates r(t) are used. Otherwise,
Equation (5.14) has a unique solution for λ > 0, which can be easily computed by a
simple search algorithm.
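Since the displays (5.12) through (5.14) are not reproduced above, the following sketch only illustrates the kind of one-dimensional search involved (our construction, with the ratio ε/λ held at a fixed value that stands in for the role played by (5.12)): λ is adjusted until the adjusted estimates satisfy the constraint (5.11).

```python
def fit_lambda(r, p, R_cv, eps_over_lambda, tol=1e-10):
    """Bisection search for lambda in r~(t) = r(t) + eps/(p(t) + lambda), as in (5.10).

    r, p            : resubstitution costs r(t) and probabilities p(t) of the terminal nodes
    R_cv            : cross-validated cost estimate Rcv(T), imposed through (5.11)
    eps_over_lambda : assumed fixed ratio eps/lambda (a stand-in for the role of (5.12))
    """
    R = sum(rt * pt for rt, pt in zip(r, p))
    assert eps_over_lambda > R_cv - R > 0, "applies only when Rcv > R and eps/lambda is large enough"

    def excess(lam):
        eps = eps_over_lambda * lam
        return sum(pt * (rt + eps / (pt + lam)) for rt, pt in zip(r, p)) - R_cv

    lo, hi = 0.0, 1.0
    while excess(hi) < 0.0:            # widen the bracket until the root is enclosed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if excess(mid) < 0.0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return lam, eps_over_lambda * lam
```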
Table 5.7 gives ȓ(t) for the waveform tree of Figure 2.17.
TABLE 5.7
To compare overall error, note that
(5.15)
The ȓ(t) estimates do much better than r(t). Where they have large errors, it is on the
conservative side. Measured overall by (5.15), their error rate is only 39 percent as
large as that of r(t).
These results are typical of the 10 examples we have examined, including both real
and simulated data. As compared with test sample results, ȓ(t) has overall error
consistently less than half that of r(t).
Another method for estimating within-node class probabilities and costs has been
tested only on the simulated waveform data. However, the results were promising
enough to warrant a brief description.
Assume that all measurement variables x1, ..., xM are ordered. In the learning sample,
let s1, ..., sM be the sample standard deviations of the variables x1, ..., xM. Take h > 0
to be a small fixed real number. Then construct a “noisy copy” of the learning sample
as follows: For each n, n = 1, ..., N, let
x′nm = xnm + h sm Znm, m = 1, ..., M,
where the Zn1, ..., ZnM are drawn from a N(0, 1) distribution, independently for
different m and n.
Repeat this procedure using more independent N(0, 1) variables until there are
enough noisy copies to total about 5000 cases. Run these cases through the tree T
selected and compute their misclassification cost Rnc (T, h). Gradually increase h until
a value h0 is found such that
probabilities that were in surprisingly good agreement with the results gotten from the
5000-case test sample. Its use cannot be recommended until it is more extensively
tested. Furthermore, it is difficult to see how it can be extended to include categorical
measurement variables.
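A sketch of the noisy-copy construction (assuming, as reconstructed above, that each coordinate is jittered by h times its sample standard deviation; the names are ours):

```python
import numpy as np

def noisy_copies(X, y, h, total=5000, seed=0):
    """Generate about `total` noisy copies x' = x + h * s * Z, with Z ~ N(0, 1) coordinate-wise."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    s = X.std(axis=0)                              # sample standard deviations s_1, ..., s_M
    reps = max(1, total // len(X))                 # number of independent copies of the sample
    Xc = np.vstack([X + h * s * rng.standard_normal(X.shape) for _ in range(reps)])
    yc = np.tile(y, reps)                          # copies keep their original class labels
    return Xc, yc
```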
181
5.5 INTERPRETATION AND EXPLORATION
182
5.5.1 Introduction
Data analysts continually refer to the structure of the data and the search for the
structure of the data. The concept is that the data are roughly generated as structure (or
signal) plus noise. Then the analyst’s function is to find the underlying mechanism that
is generating the data, that is, to separate the signal from the noise and to characterize
the signal.
We have found this a hazardous and chancy business. Even though the tree
structured approach generally gives better insights into the structure than most
competing methods, extensive exploration and careful interpretation are necessary to
arrive at sound conclusions. (See Einhorn, 1972, and Doyle, 1973, for similar remarks
regarding AID.)
Even with caution, data analysis is a mixture of art and science, involving
considerable subjective judgment. Two reasonable analysts, given the same data, may
arrive at different conclusions. Yet both, in an appropriate sense, may be right.
Three subsections cover some of the difficulties and tools to deal with them. The
first illustrates some of the problems. The second briefly covers use of the basic output
in interpretation.
The third discusses the growing of exploratory trees.
183
5.5.2 Instability of Tree Structures
One problem in uncovering the structure is that the relation between class membership
and the measured variables is made complex by the associations between the measured
variables. For instance, the information about class membership in one group of
measured variables may be partially duplicated by the information given in a
completely different group. A variable or group of variables may appear most
important, but still contain only slightly more information than another group of
variables.
A symptom of this in the tree structure is that at any given node, there may be a
number of splits on different variables, all of which give almost the same decrease in
impurity. Since data are noisy, the choice between competing splits is almost random.
However, choosing an alternative split that is almost as good will lead to a different
evolution of the tree from that node downward.
To illustrate, four sets of digit and waveform recognition data were generated in exactly the
same way as described previously, but each with a different random number seed. For
side-by-side comparison, the trees grown on the original data and one of the trees
grown on the new data set are shown in Figures 5.8 and 5.9. In Figure 5.8 the numbers
under a split identify the splitting variables; for example, the number 5 indicates the split: Is x5 = 0? The numbers in
the terminal nodes are class identifications.
The two trees are very different. In this example the difference reflects the fact that
the information in seven digits is somewhat redundant, and there are a number of
different classification rules all achieving nearly the same accuracy. Then, depending
on chance fluctuations, one or the other of the rules will be selected.
FIGURE 5.8 Digit recognition tree.
184
TABLE 5.8 Digit Recognition Data
185
FIGURE 5.11 Waveform recognition data: Variable importance.
In practice, tree instability is not nearly as pronounced as in these two simulated
examples. With real data, at least for nodes near the top of the tree, the best split is
usually significantly better than the second best. Adding small amounts of noise to the
data does not appreciably change the tree structure, except, perhaps, for a few of the
lower nodes.
187
5.5.3 Using Output
The crux of the matter is that while a diagram of a tree grown on a data set gives an
easily interpretable picture of a structure for the data, it may not be a very complete
picture.
A number of diagnostics can be gotten from the tree output to check for masking
and alternative structure. First, the tree diagrams for the cross-validation trees can be
examined to see how closely they correspond to the “master” tree and what alternative
structures appear. Second, possible alternative paths and masking at each node may be
detected by looking at the best surrogate splits and their association with the optimal
split. Also, looking at the decreases in impurity given by the best splits on the
individual variables and noting splits which closely compete with the optimal split can
give clues in a similar direction. Third, information about overall masking may be
gotten from the variable importance values.
188
5.5.4 Growing Exploratory Trees
Growing a large tree Tmax and pruning it up using a test sample or cross-validation can
be computationally expensive. The analyst may want to grow some exploratory trees
both before and after the primary tree growing.
Some questions that might warrant preliminary study are, Should variable
combinations be used? and What selection of priors gives the best performance? A
major question following a primary tree growing procedure is: What is the effect of
deleting certain variables?
An inexpensive exploratory capability is contained in CART as an optional feature.
The user specifies a value of the complexity parameter α. Then the program produces
the optimal tree T(α). There are two parts that make this construction fast. The first is
that growing a very large tree Tmax and pruning it upward is not necessary. A condition
can be given for growing a sufficient tree Tsuff(α) which is guaranteed to contain T(α)
but is usually only slightly larger.
Another large improvement in efficiency can be made by using subsampling (see
Section 5.6.1). This procedure determines the splits in each node by using a randomly
selected fraction of the data. Comparative computer timings using sufficient trees with
and without subsampling will be discussed in Section 5.6.2.
The sufficient tree Tsuff (α) is given by
DEFINITION 5.16. For any node t, let t1, t2, t3, ..., th, t be the sequence of nodes
leading from t1 to t. Define
If S(t) > 0, split the node t using the optimal split s*. If S(t) < 0, declare t a terminal
node of Tsuff(α).
(Even if S(t) > 0, t may be declared terminal for the same reasons a node is terminal in
Tmax, i.e., N(t) ≤ Nmin.)
PROPOSITION 5.17. T(α) ⊂ Tsuff(α).
The proof is not difficult and is given in Chapter 10 (see Theorem 10.32).
As we pass down through the tree, S(t) can be computed recursively by
(5.18)
Then the stopping rule for Tsuff(α) can be checked at each node t without searching up
through the path from t1 to t.
The minimal cost-complexity tree T(α) is gotten from Tsuff(α) by pruning upward
until a weakest link is found that satisfies gk(t) ≥ α. Then T(α) equals Tk (see Section
3.3).
A substantial part of the computation in the primary tree growing procedure is put
into the lower branches of the large initial tree Tmax. For instance, typically the
number of nodes in Tmax is four to five times the number in the 1 SE tree. Use of the
sufficient tree thus produces a significant reduction in running time.
A useful and straightforward application of exploratory trees is in tracking the
effects of variable deletion once a primary tree has been grown. The primary tree
output gives the value αk corresponding to the selected tree. Then deleting one or more
variables and growing the tree T(αk) on the remaining variables gives an indication
the effects of the deletion. The bias Rcv - R is known for the original tree, so that
reasonable estimates for the true cost of the deleted variable trees can be gotten by
assuming the same bias.
The preprimary tree growing exploration is more problematic. There are two
difficulties. One is how to choose α. The second is that too much initial exploration
and data dredging may result in implicitly biasing the primary tree growing process.
This is particularly true if the entire data set is used in the initial exploratory phase.
There are some general guidelines for selecting α. For T(α) ≠ {t1},
Rα(T(α)) ≤ Rα({t1})
or, since R(T(α)) ≥ 0,
|T̃(α)| ≤ 1 + R(t1)/α.
A rough rule of thumb is that using α = R(t1)/2H will produce a tree with about H
terminal nodes.
Since the prior exploratory phase is aimed at broad features, such as the usefulness
of variable combinations and rough adjustment of priors, we suggest that α not be
taken too small. If the learning sample is large, to speed up computations, we advise
the use of the subsampling feature discussed in Section 5.6.1.
191
5.6 COMPUTATIONAL EFFICIENCY
192
5.6.1 Subsampling
The idea is this: In a J-class problem, an upper sample size limit N0 is set. As the tree
is grown, starting from the root node on down, subsampling is done at every node until
the total node population falls below N0.
If the class populations in node t are N1(t), ..., NJ(t) with N(t) > N0, subsample sizes
ñ1, ..., ñJ are chosen to satisfy
(5.19)
That is, subsample to get total sample size N0 and such that the individual class sample
sizes are as nearly equal as possible. If for every j, Nj(t) ≥ N0/J, then the optimal
subsample sizes are clearly ñj = N0/J, j = 1, ..., J.
The difficulty arises when some of the Nj(t) are less than N0/J. Order the classes so
that N1(t) ≤ N2(t) ≤ ··· ≤ NJ(t). Then
PROPOSITION 5.20. The ñj, j = 1, ..., J, satisfying (5.19) are defined recursively: if
ñ1, ..., ñj are the first j best choices, then ñj+1 is chosen in the same way for the
remaining classes, subject to the constraints 0 ≤ ñj ≤ Nj(t) and Σ(j) ñj = N0. If N1(t) ≥ N0/J,
the solution is to take ñj = N0/J, j = 1, ..., J. Otherwise, take ñ1 = N1(t); the problem is
then reduced to the J − 1 classes 2, ..., J with total subsample size N0 − N1(t). For
example, with N0 = 200 and class populations 30, 40, 90, 150, the resulting subsample
sizes are 30, 40, 65, 65.
The best split s* is found using only the subsample. But once the best split is found,
then the entire original population in the node is split using s*. Subsampling affects
only those upper nodes with total population greater than N0. As the tree splitting
continues and smaller nodes are formed with population less than N0 , all of the
sample is used to determine the best split.
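A sketch of the allocation rule of Proposition 5.20 as we have reconstructed it (names are ours): classes are considered in order of increasing size, any class too small to supply its equal share is taken in full, and the remainder is divided equally among the classes left.

```python
def subsample_sizes(N, N0):
    """Class subsample sizes: total N0, with class sizes as nearly equal as possible.

    N is the list of class populations N_j(t) in the node, assumed to satisfy sum(N) > N0.
    """
    order = sorted(range(len(N)), key=lambda j: N[j])   # classes by increasing population
    sizes, remaining, left = [0] * len(N), N0, len(N)
    for j in order:
        share = remaining / left                         # equal share among classes not yet fixed
        take = min(N[j], int(round(share)))
        sizes[j] = take
        remaining -= take
        left -= 1
    return sizes

print(subsample_sizes([30, 40, 90, 150], 200))   # [30, 40, 65, 65]
```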
The subsampling procedures used, say, in linear regression with a large data set
usually subsample once and do all the regression on the subsample. The rest of the
information is lost forever. In the tree structure, the subsampling affects only the first
few splits in the tree. But in these initial nodes, there are usually a few adjacent splits
on one variable that are markedly superior to all other splits. This clear-cut
information is reflected in the subsample. In the splits further down, the full sample
size is available.
Some care must be taken to weight the subsample appropriately. The class j
proportions in the subsample are not the same as in the original sample in the node.
Recall that
A weighting wj of each case in the jth subsample class is necessary to adjust to these
original node proportions. This is done using the weights
(5.21)
Then defining, for the subsample,
(5.22)
Denote the denominators in (5.22) as pʹ(tL), pʹ(tR) and use these definitions to
determine the best split on the subsample.
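Since the displays (5.21) and (5.22) are not reproduced above, the following sketch only shows the natural choice implied by the text: weight each class j subsample case so that the weighted class proportions match the original class proportions in the node. The names are ours.

```python
def subsample_weights(N_node, n_sub):
    """Weights w_j restoring the original node class proportions (our reading of (5.21)).

    N_node[j] = N_j(t), class populations in the node; n_sub[j] = class sizes in the subsample.
    Each class j subsample case gets weight w_j, so w_j * n_sub[j] is proportional to N_j(t).
    """
    total_node, total_sub = sum(N_node), sum(n_sub)
    return [(N_node[j] / total_node) / (n_sub[j] / total_sub) for j in range(len(N_node))]

print(subsample_weights([30, 40, 90, 150], [30, 40, 65, 65]))
```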
Unequal priors lead to the question of whether the subsample should be taken, as far
as possible, with equal sample sizes from each class or divided in another proportion.
A large-sample theoretical result indicated that the optimal proportions vary between
being equal and being proportional to . Since this result, at worst, gives sample
proportions less disparate than the priors, the lure of simplicity prevailed, and the
subsampling rule implemented is that outlined above.
Some approximate calculations can be made on the effects of subsampling. The
computational burden at each node (assuming M ordered and no categorical variables)
consists of a sort on each variable and then an evaluation of Φ(s, t) for each split on
each variable. If the population size in a node is N(t), then the quick sort used in CART
requires C1N(t) log N(t) operations on each variable. The Φ computations require N(t)
evaluations on each variable.
If the tree is grown to a uniform depth D, that is, to 2D terminal nodes, then the sort
time is proportional to
and the evaluation time to (D + 1)N.
If subsampling is used, with maximum node size N0, so that the tree is grown to
depth D0 = log2(N/N0) before subsampling stops, then the sort time is proportional to
The ratio of sort time using subsampling to total sort time is approximately
(5.23)
A similar argument for the Φ evaluations gives the ratio
(5.24)
The reduction is most significant when D0 is nearly equal to D.
In growing the large preliminary tree Tmax, take Nmin = . Using N0 = 256, the
ratios (5.23) and (5.24) are as follows:
(5.23)
(5.24)
There is substantial savings only for very large data sets. In growing exploratory trees,
the required depth D is smaller, so the computational savings using subsampling are
more significant.
The most useful application of subsampling has been in those problems where the
data base is too large to be held in fast memory. One pass through the disk file for a
node puts the subsample in core; the split is then found, and another pass through the
disk puts the rest of the data down the split.
197
5.6.2 How Fast is CART?
The current CART program is running on two machines that are extremes in
computing power. One is a powerful IBM 3081. The other is a small VAX 11/750.
Timing runs were made on both machines.
The examples used were the digit recognition data, the digit recognition data with 17
noise variables added, the waveform recognition data, and the waveform data with 19
noise variables added.
The CPU times listed in Table 5.10 are based on tenfold cross-validation. They
include the entire construction process, including the cross-validation and pruning
necessary to produce the final tree.
TABLE 5.11 CPU Seconds
198
5.7 COMPARISON OF ACCURACY WITH OTHER METHODS
Two other classification methods were compared with the tree structured results on the
simulated data sets. The first was nearest neighbor classification. The second was the
stepwise linear discriminant program in BMDP.
In nearest neighbor classification, the learning sample is used as a collection of
“templates.” A new observation is classified by assigning to it the class of its nearest
neighbor in the learning sample.
Four experiments were carried out using the following learning samples:
1. Digit recognition
2. Digit recognition plus 17 noise variables
3. Waveform
4. Waveform plus 19 noise variables
In (1) and (2) the distance between two sequences of zeros and ones was defined as the
number of places in which they are different. In case of ties, nearest neighbors were
selected at random.
In samples (3) and (4), the distance was the euclidean distance between the
observation vectors.
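The two nearest neighbor rules used here are easy to sketch (names are ours):

```python
import numpy as np

def nn_classify(x, X_train, y_train, metric="euclidean", rng=None):
    """1-nearest-neighbor classification, with ties broken at random.

    metric: "hamming" for the 0-1 digit data (number of differing coordinates),
            "euclidean" for the waveform data.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    X_train, x = np.asarray(X_train), np.asarray(x)
    if metric == "hamming":
        d = (X_train != x).sum(axis=1)           # places in which the sequences differ
    else:
        d = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance
    nearest = np.flatnonzero(d == d.min())       # indices tied for nearest
    return y_train[int(rng.choice(nearest))]
```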
A test set of 5000 was used to estimate the misclassification rates. The results are
given in Table 5.12.
TABLE 5.12 Misclassification Rates
The accuracy of nearest neighbor is significantly better than the tree algorithm only
in the waveform data. As one might expect, the addition of noise variables seriously
degrades the performance of nearest neighbor classifiers. Still, if noise variables are
weeded out and a sensible metric used, nearest neighbor methods can be reasonably
accurate. Our objections to these methods, stated in Chapter 1, are based on other
characteristics.
Our resident BMDP expert, Alan Hopkins, obligingly ran the BMDP stepwise linear
discriminant program on both the 7-variable digit data and the 21-variable waveform
data. Standard default values for F-to-enter and remove were used. Test samples of
size 5000 gave accurate misclassification rate estimates.
In the digit data, all seven variables were entered. The test sample misclassification
rate was .25. This is surprisingly low considering the nonnormal nature of the data.
The classifier has the following form. For a new measurement vector x, evaluate
each of the following linear combinations:
Class 1: -12.8 - .8x1 + 3.0x2 + 8.0x3 - 1.8x4 - 6x5 + 12.3x6 + 1.3x7
Class 2: -24.3 + 10.8x1 + .8x2 + 6.4x3 + 9.2x4 - 13.2x5 - .4x6 + 10.8x7
Class 3: -23.8 + 10.4x1 + 1.4x2 + 8.3x3 + 8.1x4 + 1.2x5 + 9.1x6 + 10.5x7
Class 4: -21.8 - 1.3x1 + 11.5x2 + 7.5x3 + 11.3x4 - .8x5 + 10.3x6 + 2.7x7
Class 5: -24.0 + 8.5x1 + 9.0x2 + .8x3 + 11.8x4 + .9x5 + 9.1x6 + 9.5x7
Class 6: -31.9 + 11.2x1 + 9.1x2 + .9x3 + 11.3x4 + 13.7x5 + 7.7x6 + 10.7x7
Class 7: -14.0 + 10.6x1 + 1.2x2 - 6.8x3 - 1.3x4 + 1.8x5 + 9.1x6 + 1.7x7
Class 8: -31.7 + 9.0x1 + 9.5x2 + 7.0x3 + 9.8x4 + 12.7x5 + 9.0x6 + 9.9x7
Class 9: -28.2 + 8.9x1 + 9.2x2 + 8.2x3 + 10.3x4 - .4x5 + 10.0x6 + 11.0x7
Class 10: -28.1 + 10.0x1 + 10.0x2 + 7.2x3 + .1x4 + 13.0x5 + 9.6x6 + 8.4x7
Then classify x as that class corresponding to the largest value of the 10 linear
combinations.
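Applying the rule is just an argmax over the ten linear scores. A sketch follows (only the first two rows of coefficients from the list above are transcribed; the remaining rows are filled in the same way):

```python
import numpy as np

# Intercept and coefficients (b0, b1, ..., b7) for each class's linear score; the first
# two rows are taken from the list above, and classes 3 through 10 follow the same pattern.
SCORES = np.array([
    [-12.8,  -0.8,  3.0,  8.0, -1.8,  -6.0, 12.3,  1.3],   # class 1
    [-24.3,  10.8,  0.8,  6.4,  9.2, -13.2, -0.4, 10.8],   # class 2
    # ... classes 3 through 10 ...
])

def classify(x):
    """Assign x = (x1, ..., x7) to the class with the largest linear combination."""
    scores = SCORES[:, 0] + SCORES[:, 1:] @ np.asarray(x, dtype=float)
    return int(np.argmax(scores)) + 1        # classes are numbered from 1
```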
Although the accuracy of this rule is better than that of the tree classifier, its form is
complex and sheds little light on the structure of the data. However, it turns out that
the Bayes rule for this problem does have the preceding form with slightly different
coefficients from those shown here. This may explain why the discriminant rate of
.253 is about 1 SE (5000 test sample) below the Bayes rate of .26. This value is not
typical of other replicate data sets. A more typical value of .31 resulted on the replicate
set used in Section 4.3.3. The tree rate for these data is .33.
With the waveform data, seven variables were entered: x5, x10, x13, x14, x15, x19, x20.
The test sample misclassification estimate was .26. This is slightly better than the
standard tree classifier (.28), but not as good as the tree grown using linear
combinations (.20) .
In both examples (using the replicate digit data), the linear discriminant
resubstitution error rate was misleadingly low, .24 in the digit problem and .20 in the
waveform data. The BMDP program gives jackknifed (N-fold cross-validation)
estimates of the rates as .26 and .23.
The two simulated examples used in the last few chapters were not designed to
make tree structured classification look good as compared to other methods. In fact,
nearest neighbor and linear discriminant methods offer competitive accuracies.
It is not hard to construct examples in which one method will do poorly while
another does well. The comparative accuracy of different methods depends on the
data. Still, in all applications to real data that we know of where various methods have
been compared, the accuracy of tree structured classifiers has generally been either
best or close to best.
200
APPENDIX
(A.1)
where δ ranges over all possible values. This is done by rewriting (A.1) as
δ(x1 + γ) ≥ v - c
so that
Order the values of un. For all un such that x1,n + γ ≥ 0, find the best split of the form u
≤ δ1 over all values of δ1. For all un such that x1,n + γ < 0, find the best split of the
form u ≥ δ2. Take the best of these two splits and let δ be the corresponding threshold.
Actually, there is a neater way of finding the best δ that combines the two searches
over un into a single search. But its description is more complicated and is therefore
omitted.
This procedure is carried out for three values of γ : γ = 0, γ = -.25, γ = .25. For each
value of γ, the best split of the form (A.1) is found. Then these three splits are
compared and δ, γ corresponding to the best of the three are used to update v as
follows. Let
where
and
c‘ = c + δγ.
Now this is repeated using variable x2; that is, a search is made for the best split of
the form
When the final linear combination split is determined, it is converted into the split Σ(m) amxm ≤ c
on the original nonnormalized variables.
This algorithm is the result of considerable experimentation. Two previous versions
were modified after we observed that they tended to get trapped at local maxima.
There is still room for improvement in two directions. First, it is still not clear that the
present version consistently gets close to the global maximum. Second, the algorithm
is CPU expensive. On the waveform example, tree building took six times as long as
using univariate splits only. We plan to explore methods for increasing efficiency.
The linear combination algorithm should not be used on nodes whose population is
small. For instance, with 21 ordered variables, finding the best linear combination split
in a node with a population not much larger than 21 will produce gross overfitting of
the data. For this reason, the program contains a user-specified minimum node
population NC. If N(t) ≤ NC, the linear combination option at t is disabled and only
univariate splits are attempted.
203
6
Tree structured classification and regression techniques have been applied by the
authors in a number of fields. Our belief is that no amount of testing on simulated data
sets can take the place of seeing how methods perform when faced with the
complexities of actual data.
With standard data structures, the most interesting applications have been to medical
data. Some of these examples are covered in this chapter. The next chapter discusses
the mass spectra projects involving nonstandard data structures.
In Section 6.1 we describe 30-day prognoses for patients who were known to have
suffered heart attacks. Section 6.2 contains a report on CART’s performance in
diagnosing heart attacks. The subjects were other patients who entered hospital
emergency rooms with chief complaints of acute chest pain. Section 6.3 gives
applications of CART to the diagnosis of cancer. The descriptive variables of that
study are measures of immunosuppression. In Section 6.4 the classification of age by
measurements of gait is used to illustrate the application of CART to the detection of
outliers. A brief final section contains references to related work on computer-aided
diagnosis. In each of these sections, the medical terminology needed is explained
briefly but in sufficient detail to render the material accessible to all who understand
the previous chapters.
The chapter contains some, but rather few, explicit comparisons with parametric
competitors. We justify this choice with the presumption that our readers are more
familiar with them than with CART; in any case, this is a book about CART. In
practice, more often than not CART has outperformed its parametric competitors on
problems of classification in medicine. Usually, the reductions in misclassification cost
have been less than 15 percent of those of the “best” parametric procedures.
Notwithstanding substantial contributions to nonparametric statistics in recent years
and the clear mathematical and simulation arguments for their use, it is our experience
that parametric procedures (such as the Fisher linear discriminant and logistic
regression) hold up rather well in medical applications. Thus, for CART to do as well
as it does is an accomplishment. It will be emphasized that the ease with which the
tree can be interpreted and applied renders it an important alternative to other
procedures even when its performance is roughly comparable to theirs.
204
6.1 PROGNOSIS AFTER HEART ATTACK
patients at relatively low risk of dying. A result of the above considerations is that a
terminal node of any CART-designed tree is a class 1 node if the number of survivors
at the node exceeds 3.2 times the number of early deaths, that is, if 37π(1)C(2|1)N1(t)
> 178π(2)C(1|2)N2(t).
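The arithmetic behind the 3.2 threshold is easy to check (a sketch; 178 and 37 are the class 1 and class 2 sample sizes, and the values .6 and .4 are the normalized π(1)C(2|1) and π(2)C(1|2) given below):

```python
def is_class1_node(n1, n2, pi1_c21=0.6, pi2_c12=0.4, N1=178, N2=37):
    """Class 1 node iff pi(1)C(2|1)*n1/N1 > pi(2)C(1|2)*n2/N2,
    i.e. 37*pi(1)C(2|1)*n1 > 178*pi(2)C(1|2)*n2."""
    return pi1_c21 * n1 / N1 > pi2_c12 * n2 / N2

threshold = (0.4 / 37) / (0.6 / 178)     # n1/n2 must exceed about this ratio
print(round(threshold, 1))               # 3.2
```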
From among about 100 variables screened, 19 were selected for inclusion in CART.
Thirteen of the variables were chosen because when class 1 and class 2 patients were
compared, lowest attained significance levels resulted from two-sample t tests on the
differences (continuous variables) or from chi square tests for independence
(dichotomous variables); 6 of the variables included were chosen because related
studies by others suggested that they have predictive power for the question at hand.
Given the implicit concern CART has with relationships among variables, it may have
been preferable to use a variable selection scheme (possibly CART itself) that looked
directly at “interactions” and not merely “main effects.” Twelve of the 19 included
variables are dichotomous; the remaining are continuous and nonnegative. Note that
for all 215 patients there were complete data on all 19 included variables. All included
variables are noninvasive; that is, they can be measured without use of a catheter. (A
catheter is a small hollow tube that is passed within the arteries or veins to reach the
heart. It is used to measure pressures inside the heart and to inject liquids used for
studies by X ray.) There is substantial evidence that classification more successful than
what is being reported here can be made with the help of invasive measurements.
However, the process of catheterization presents some risk and also possible
discomfort to the patient.
Figure 6.1 shows the tree CART produced for the problem; the Gini splitting
criterion was used with a minimum node content of 5 observations, and tenfold cross-
validation was employed. The displayed tree is that with the smallest cross-validated
risk.
TABLE 6.1 Summary of CART’s Resubstitution Performance
If priors and costs are normalized so that π(1)C(2|1) = .6 and π(2)C(1|2) = .4, then it
follows from the data in Table 6.1 that the resubstitution overall misclassification cost,
(.6) times the class 1 misclassification rate plus (.4) times the class 2 rate, is approximately
.17. Since the no-data optimal rule has misclassification cost .4, about 59 percent of the
overall cost of misclassification appears to be saved by our data and CART. (A no-data
optimal rule assigns any observation to a class j that minimizes Σi π(i)C(j|i). In case of
ties, the convention used here is to choose the lowest value of j.) Of course, this may be
somewhat optimistic, and therefore results of the tenfold cross-validation are of interest.
The minimum cross-validated misclassification cost is .19 [≈ (.6)(.14) + (.4)(.28)], as is
evident from Table 6.2, which gives results for resubstitution and cross-validation for the
trees that arose by optimal pruning. The previously estimated 59 percent reduction of overall
misclassification cost is reduced by cross-validation to 53 percent.
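The arithmetic behind these cost and saving figures can be sketched as follows; this is a small illustration, not CART output, and the small discrepancies with the rounded figures above come from rounding of the quoted rates:

def overall_cost(r1, r2, w1=0.6, w2=0.4):
    # w1 = pi(1)C(2|1) and w2 = pi(2)C(1|2); r1 and r2 are the class 1 and
    # class 2 misclassification rates read from a table such as Table 6.2.
    return w1 * r1 + w2 * r2

no_data_cost = 0.4                      # cost of always predicting class 1
cv_cost = overall_cost(0.14, 0.28)      # cross-validated rates quoted above
print(cv_cost, 1 - cv_cost / no_data_cost)   # overall cost and fraction saved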
FIGURE 6.1 Note that the systolic blood pressure is the maximum blood pressure that
occurs with each heart cycle during contraction of the left-sided pumping chamber;
this pressure is measured with a blood pressure cuff and stethoscope. By definition,
there was sinus tachycardia present if the sinus node heart rate ever exceeded 100
beats per minute during the first 24 hours following admission to the hospital; the
sinus node is the normal electrical pacemaker of the heart and is located in the right
atrium.
TABLE 6.2
Since N2 is only 37, the alert reader may ask whether cross-validation might have
been more accurate if it had been done by randomly dividing each of the two classes
into groups of about 10 percent each and randomly combining. Indeed, at least one of
the 10 percent subgroups in the cross-validation we did has no early deaths at all. Our
current thinking is that stratification can be used to produce more accurate cross-
validation estimates of the risks of tree structured classifiers. (See Section 8.7.2 for a
similar conclusion in the regression context.)
Perhaps the most striking aspect of the tree presented in Figure 6.1 is its simplicity.
Not only is it simple in appearance, but also the variables required to use it are easy to
measure. Note, however, that two of the three splits require information not
necessarily available upon admission to the hospital, and some physicians may find
the possible 24-hour period past admission an unreasonably long time to wait for the
results of CART’s classification.
Several procedures were compared with CART in their abilities to classify. The
BMDP stepwise discriminant program (see Jennrich and Sampson, 1981) began with
the same 19 variables as did CART. Its solution retained 12 variables, and
resubstitution overall classification cost was .20, which tenfold cross-validation
adjusted to .21 = (.6)(.13) + (.4) (.33). Also, logistic regression (DuMouchel, 1981)
was employed. Ten of the original 19 variables were used. (The decision on which to
include was based on considerations not reported here.) But logistic regression was
given additional help; namely, three two-factor interactions suggested by various
CART runs were included in the analysis. (Morgan and Sonquist, 1963, emphasized
that tree structured rules for classification and regression can be used as detectors of
interactions.) In view of which variables appear at adjacent nodes, the tree presented in
Figure 6.1 suggests two of those three interactions: age by minimum systolic blood
pressure and age by heart rate. Upon resubstitution, logistic regression produced the
same class contributions to overall misclassification cost as did the stepwise
discriminant (.10 from class 1 and .35 from class 2). However, the two procedures did
not classify each patient identically. A tenfold cross-validation adjusted the logistic
regression resubstitution figure (.20) to .21 = (.6)(.12) + (.4)(.36). In other areas of the
study from which this section was taken, logistic regression with main effects and
interactions suggested by CART performed well. We mention in passing that attempts
to employ nearest neighbor classification rules were unsuccessful because the
procedures performed quite poorly and were computationally expensive as well.
Particularly in a medical context, it is important to emphasize that even though the
linear classification rule given by the stepwise discriminant and the quadratic rule
given by logistic regression are easy to apply, neither is as easy as the simple displayed
tree.
In summary, for the problem of distinguishing early deaths and survivors in the
context presented, with complete data from 215 patients, it seems possible to reduce
the misclassification cost to about 50 percent its value for the no-data optimal rule.
And no procedure we tried performed quite as well as the rule given by CART.
6.2 DIAGNOSING HEART ATTACKS
This section, like the previous section, is about heart attacks, but its concern is with
diagnosis rather than prognosis. The specific issue is this. A patient enters the
emergency room of a hospital with the complaint of acute chest pain. Is the patient
suffering (or has he or she just suffered) an acute myocardial infarction? In certain
respects the problem of diagnosis may be less difficult than that of making 30-day
prognoses, since changes in treatment protocols from one hospital to another and
population bases over time probably have less impact on successful diagnoses of heart
attacks than on prognoses.
The work reported in this section is part of a study done by Lee Goldman and others
based on data gathered at the Yale-New Haven Hospital and at the Brigham and
Women’s Hospital, Boston (see Goldman et al., 1982). Goldman is with the
Department of Medicine, Brigham and Women’s Hospital and Harvard Medical
School.
The way heart attacks were diagnosed in this study involved criteria (2) and (3) of
Section 6.1: indicative electrocardiograms and characteristic elevations of levels of
enzymes that tend to be released by damaged heart muscle. The measurements of the
enzymes can take time, which is especially precious when, on the one hand, a patient
just presented may be undergoing a heart attack, and, on the other, coronary care units
for treating patients who have undergone heart attacks are heavily used and expensive.
Thus, CART was employed with noninvasive, relatively easily gathered data and the
hope of providing quick and accurate diagnoses.
The initial phase of this study involved every patient at least 30 years of age who
came to the Yale-New Haven Hospital emergency room between August and
November, 1977, with the chief complaint of chest pain that did not come from known
muscle damage or pneumonia and that might have been caused by a heart attack. Of
the 500 such patients, 482 had data sufficiently complete to be included in the learning
sample. The cited reference contains an explanation of why neither the exclusion of 18
patients from the learning sample nor the absence of some data for those patients
retained introduced any appreciable biases into the analyses. Of the 482 members of
the training sample, 422, the class 1 patients, did not suffer heart attacks, although
some were diagnosed as suffering from acute (ischemic) heart disease. In all reported
growing of trees the empirical prior probabilities π (1) = .88 and π (2) = .12 were used,
as was the Gini splitting criterion.
About 100 noninvasive variables were culled from data forms completed by
emergency room interns or residents at times when the patients’ post-emergency room
courses and enzyme levels were unknown. Univariate criteria like those described in
Section 6.1 determined the 40 variables retained for use in CART. These retained
variables were largely dichotomous or categorical.
The work which led to the tree of Figure 6.2 was done before cross-validation was a
standard component of CART. Instead, the process of finding optimally pruned trees
employed the bootstrap, which is described in Section 11.7. The pruned tree with
smallest bootstrapped misclassification cost was trimmed further in a subjective
fashion. That trimming and the choice of misclassification costs involve the notions of
sensitivity and specificity.
The sensitivity of a diagnostic test (sometimes called the true positive rate) is the
percentage of correctly diagnosed patients from among those who have suffered heart
attacks. The specificity (or true negative rate) is the percentage of correctly diagnosed
patients from among those who have not suffered heart attacks.
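A small sketch of these two definitions; the counts below are purely illustrative and are not the study's data:

def sensitivity_specificity(tp, fn, tn, fp):
    # tp: heart attacks correctly diagnosed; fn: heart attacks missed;
    # tn: non-heart-attacks correctly ruled out; fp: false alarms.
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return sensitivity, specificity

print(sensitivity_specificity(tp=45, fn=5, tn=80, fp=20))   # (0.9, 0.8)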
FIGURE 6.2 ER stands for emergency room and EKG for electrocardiogram. The ST
segment and the T wave are parts of the electrocardiogram that reflect electrical
repolarization of the ventricles. A segment of the electrocardiogram called the QRS
complex indicates damage to the muscle wall of the pumping chambers by the
appearance of a sharp depression and rise called a Q wave. [Adapted from Figure 1 of
Goldman et al. (1982) with the permission of The New England Journal of Medicine.]
The only values of C(1|2) and C(2|1) ever considered satisfy C(1|2) >C(2|1), for it
seems more important to diagnose correctly a patient who has suffered a heart attack
than one who has not. Even so, if the ultimate rule for classification is to be of real
use, it should strive to minimize the false positive rate (the complement of the
specificity). With these considerations in mind, it was determined (by Goldman) that
C(1|2)/C(2|1) should be approximately as small as possible subject to the constraint
that the resulting optimal tree and classification rule have 100 percent resubstitution
sensitivity. These considerations yielded C(1|2)/ C(2|1) slightly in excess of 14. If
costs are normalized so that π(1)C(2|1) + π(2)C(1|2) = 1, then π(1)C(2|1) = .33 and
π(2)C(1|2) = .67. Thus, the no-data optimal rule has misclassification cost .33.
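The normalization just described can be sketched numerically; the ratio 14.9 below is an illustrative value “slightly in excess of 14,” chosen so that the rounded results agree with the figures quoted:

pi1, pi2 = 0.88, 0.12
ratio = 14.9                        # assumed value of C(1|2)/C(2|1)
c21 = 1.0 / (pi1 + pi2 * ratio)     # C(2|1) after pi(1)C(2|1) + pi(2)C(1|2) = 1
c12 = ratio * c21                   # C(1|2)
print(round(pi1 * c21, 2), round(pi2 * c12, 2))   # about .33 and .67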
Trimming of the CART- and bootstrap-generated tree mentioned previously was
motivated as follows. The cited cost considerations dictated that a terminal node t is a
class 1 node if the class 1 patients at t are 15 or more times as numerous as the class 2
patients, that is, if N1(t) ≥ 15N2 (t). If N1(t) + N2(t) < 15 and N2(t) = 0, it is perhaps not
clear whether t ought to be a class 1 or class 2 node. Thus, in general, if any terminal
node t satisfied N1(t) + N2(t) < 15 and N2(t) = 0, the split that led to t was deleted, even
if CART and the bootstrap dictated it should be retained. (Node G is the exception.)
All told, the subjective trimming process eliminated two splits. The compositions of
the terminal nodes will be given in the discussion of the validation set.
Seven or eight years ago, when various ideas surrounding cross-validation
techniques and CART were not nearly so well developed as they are now, various trees
were grown, and these trees invariably were trimmed in a subjective fashion by
investigators who relied on their knowledge of subject matter and context. Not
surprisingly, knowledgeable investigators often can trim large trees to ones that then
turn out to be useful.
Table 6.3 gives CART’s resubstitution performance. The resubstitution overall
misclassification cost is .07. Since the no-data optimal rule has misclassification cost
.33, about 80 percent of the cost of misclassification appears to be saved by the data
and CART as amended.
TABLE 6.3 Resubstitution Performance of the Tree of Figure 6.2 on the Yale-New
Haven Hospital Data
In Figure 6.2, the node that is three left daughter nodes below the root node differs
from the other nodes in that its question has two parts: a “yes” to either sends a
candidate patient to terminal node D. This combination of questions was imposed by
Goldman for reasons based on his knowledge of cardiology: the first question was
suggested by CART; the second makes clinical sense in view of the ancestor nodes.
Both the imposed split and the subjective trimming that have been discussed remind us
that CART is a valuable companion to, but not a replacement for, substantive
knowledge. (Of course, the same can be said for the best of other statistical
procedures.)
One especially interesting aspect of the study reported is that the tree was used to
classify patients at another hospital.
TABLE 6.4
We now turn to the larger of the two studies at the Brigham and Women’s Hospital.
This validation study involved patients who presented to the emergency room between
October 1980 and August 1981 under the same conditions as had the Yale-New Haven
patients. The minimum age was reduced from 30 to 25. Also, those patients who were
not admitted to the hospital were precluded from the study if they did not consent to
return to the hospital in 48 to 72 hours for certain noninvasive follow-up tests. About
80 percent of the nonadmitted patients agreed to return, and 85 percent of those
actually returned for the tests. The other 15 percent were known to be alive and,
indeed, well at least one week after leaving the emergency room. In all, 357 patients
were included in the validation sample, 302 class 1 and 55 class 2.
Table 6.4 summarizes the class memberships for both training and validation
samples for each terminal node of the tree. From the data of Table 6.5 it follows that
the given classification rule had sensitivity 91 percent and specificity 70 percent. Also,
the validation sample overall misclassification cost is .16, about 48 percent of its no-
data value.
TABLE 6.5 Performance of the Classification Rule on the Brigham and Women’s
Hospital Validation Sample of Patients
The most interesting and important competitors of the classification rule we have
been discussing are the decisions of the emergency room physicians. The cited
reference has a discussion of many aspects of those decisions; however, the simplest
way of quantifying them is by identifying the decision to admit a patient to the
hospital coronary care unit with a diagnosis of heart attack. Physicians’ classifications
are slightly less accurate than the tree, with sensitivity 91 percent and specificity 67
percent. Note that the 50 (of 55) correctly diagnosed class 2 patients are not identical
to the 50 class 2 patients correctly diagnosed by the amended CART-generated
classification rule. If the two classification rules are integrated so that a patient is
classified to class 2 if the tree indicates class 2 and the emergency room physician
admits the patient to the coronary care unit (or, in fact, to any hospital bed), then the
resulting sensitivity drops to 88 percent, but the specificity rises to 77 percent. The
overall cost of misclassification is thereby reduced to .15, which is 46 percent of its
no-data value.
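A sketch of the combined rule; the function and argument names are ours:

def combined_rule(tree_says_class2, physician_admits):
    # Classify to class 2 (heart attack) only when both the tree and the
    # emergency room physician point that way; otherwise class 1.
    return 2 if (tree_says_class2 and physician_admits) else 1

# An and-combination can only lower sensitivity and raise specificity (or
# leave them unchanged), which matches the changes reported above.
print(combined_rule(True, False), combined_rule(True, True))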
To summarize, an amended CART-generated classification rule, or better, that rule
supplemented by the judgments of emergency room physicians, can reduce the cost of
misclassification to less than 50 percent of its no-data value for an extremely
important and sometimes very difficult problem in clinical medicine. A reader might
inquire whether other technologies do as well or better. Analyses that utilize logistic
regression, for example, and which are not reported in detail here, show that it does
nearly but not quite as well as the rule presented. And even though the tree in this
section is not so simple as the one in Section 6.1, it is still much easier to use
accurately (by emergency room personnel) than is a linear function of, say, nine
variables.
6.3 IMMUNOSUPPRESSION AND THE DIAGNOSIS OF CANCER
has one fewer class 2 observation than the corresponding node in the first. The missing
observation seems to have appeared in terminal node D in the second tree. In fact, the
first surrogate split at the root node of the first tree is on a variable that was not
included on the list of variables for the second tree. When CART was run on a
collection of variables that included the preselected nine and the missing surrogate
variable, that one observation again appeared at node A, but the tree was otherwise
unchanged.
CART’s resubstitution cost of misclassification was .094 for both trees, which is an
apparent 81 percent reduction in the corresponding no-data value. That the sensitivities
and specificities differ is clear from Tables 6.6 and 6.7.
Although the first tree appears to be more sensitive but less specific than the second,
cross-validation indicates that just the opposite is true, for the cross-validated
misclassification probabilities are as shown in Table 6.8. As with resubstitution, upon
cross-validation the trees had identical overall costs of misclassification. Each cost
was .22, which amounts to a saving of roughly 57 percent of the no-data
misclassification cost. The cited papers of Dillman and Koziol suggest that in
detecting cancer, neither linear discrimination nor logistic regression is more sensitive
than CART and that they are both less specific.
FIGURE 6.3 There are many classifications of white blood cells. For example, they
are dichotomized as to B or T cells and as lymphocytes or not. Lymphocytes are the
main components of the body’s immune system. By percent lymphocytes is meant the
number of lymphocytes as a percentage of the overall white blood count. T8 cells are a
type of T cell that may suppress an individual’s immune reaction to a challenge. One
challenge that is widely used in studies of immunosuppression is that presented by
pokeweed mitogen. It is well understood from other sources that immunosuppressed
individuals tend to have low lymphocyte counts, low reactivity to pokeweed mitogen,
and large numbers of T8 cells relative to other lymphocytes. [Adapted from Figure 3
of Dillman and Koziol (1983) with the permission of Cancer Research.]
TABLE 6.8
From the studies summarized in this section, we see that only the percent
lymphocytes, reactivity to pokeweed mitogen, and number of T8 cells are used as
primary splitting variables. These are relatively old tests for immunosuppression. By
and large, the newer and much more expensive tests were not helpful as supplemental
information, although two of the six variables that scored highest in the measure of
importance arose from the new tests. A final interesting fact that surfaced from the
analyses under discussion is this: the ratio of T4 cells to T8 cells, which was once
thought to be a powerful diagnostic tool for the problem at hand, turns out to be one of
the least important variables (and never appeared as even a surrogate splitting
variable).
6.4 GAIT ANALYSIS AND THE DETECTION OF OUTLIERS
This section is different in emphasis from the earlier sections in this chapter. The
context in which the data were gathered may be less familiar to readers than are heart
attacks and cancer, though this subject matter, too, has important clinical applications.
In addition, we concentrate on the detection of outliers as opposed to classification.
The data arose from a study of the development of gait, that is, walking, in 424
normal children aged 1 to 7. All studies were performed at the Motion Analysis
Laboratory at Children’s Hospital and Health Center, San Diego. The Laboratory is
directed by David H. Sutherland, Chief of Orthopaedic Surgery at Children’s Hospital
and a member of the Department of Surgery at the University of California, San
Diego.
Gait is useful as a barometer of (the top down) neurologic development of normal
children—and also of children suffering from cerebral palsy or muscular dystrophy. A
supposedly normal child whose walk is markedly different from that of others his or
her age may have neurologic abnormalities. On the other hand, a child known to be
neurologically impaired might have that impairment quantified in part by gait
measurements. So, it should be of interest to see how well the age of a normal child
can be deduced from gait measurements (which do not a priori dictate that child’s age
or size) and thereby to learn something of the relationship between chronologic age
and neurologic development. The prediction of age from gait measurements is the
classification problem of the present section. Readers desiring more information on
gait analysis are referred to work by Sutherland and colleagues (Sutherland et al.,
1980; Sutherland et al., 1981; and Sutherland and Olshen, 1984).
Children were studied at ages 1 through 4 in increments of six months, and also at
ages 5, 6, and 7. (In all but the length of step and walking velocity, the gait of a 7-year-
old resembles that of an adult so closely as to render gait studies of maturation
uninteresting in normal children over 7 years.) The classes are numbered 1 through 10
in order of increasing age. Each child was studied within 30 days of the date required
for his or her class. While some children were studied at two or even three ages, most
were studied just once. Although there were not exactly equal numbers of children in
each group, the numbers were close enough to equality that we took π(i) ≡ .1, i = 1, ...,
10. Also, we took C(i|j) = |i - j|^(1/2) so that each no-data optimal rule assigns every
observation to class 5 or class 6 independent of the covariates; and each such rule has
expected cost of misclassification approximately 1.45. The exact choice of C and
possible competitors are discussed later in this section.
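A short sketch verifying the no-data calculation under these priors and costs:

priors = [0.1] * 10    # ten equally likely age classes

def expected_cost(assign_to, power=0.5):
    # Expected cost of assigning every observation to one class when
    # C(i|j) = |i - j| ** power.
    return sum(p * abs(assign_to - j) ** power for j, p in enumerate(priors, start=1))

costs = {j: expected_cost(j) for j in range(1, 11)}
best = min(costs, key=costs.get)
print(best, round(costs[5], 2), round(costs[6], 2))   # classes 5 and 6 tie near 1.45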
During a gait study a child is attired in minimal clothing and has sticks placed
perpendicular to the sacrum (at the lower end of the spine), facing forward midway
between the hips, and facing forward on each tibia (shinbone). Also, four markers are
placed on the lower extremities. The child walks at a speed comfortable to him or her
down the level walkway. Data are gathered on the motions of the joints in three planes,
the walking velocity, cadence, time spent on each leg alone, and the ratio of the width
of the pelvis to the distance between the ankles while walking. In addition, both the
sex and the dominant side, if it is discernible, are included as variables. Many
measurements were gathered with high-speed cameras and later digitized from film.
(We mention in passing the paucity of work by statisticians on digitizing processes.)
Figure 6.4 depicts the layout of the laboratory.
FIGURE 6.4 Camera layout. [Reprinted from Figure 2 of Sutherland et al. (1980) with
the permission of The Journal of Bone and Joint Surgery.]
Each child is observed for at least three passes down the walkway. A pass includes
about six steps, that is, three full cycles. Data on seven important variables from the
three passes are averaged. The cycle for which the seven variables most closely
resemble their averages is chosen as the representative cycle for that child. Because
joint motions are approximately periodic, an overall mean and the first six sine and
cosine Fourier coefficients are used as summary statistics of the motions and as
variables in CART. In all, there are well over 100 variables that might have been used
for classification. Various CART runs on subsets of variables available at this writing,
univariate statistics like those of Sections 6.1 and 6.2, and Sutherland’s judgment led
to a much reduced list of 18 variables, the independent variables of the analysis
reported here. Trees were generated with the twoing splitting criterion. Tenfold cross-
validation showed that the tree with smallest cross-validated misclassification cost has
19 terminal nodes and that its estimated misclassification cost is .84. The tree gives
patent evidence of the outliers that are the focal point of this section. On subjective
grounds, an optimal tree with 16 terminal nodes was chosen for presentation here. To
describe the tree in detail would take us too far afield, so instead in Table 6.9 we report
the contents and class assignments of its 16 terminal nodes.
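As a sketch of the Fourier summary described above; the toy waveform and sampling grid are illustrative, and this is not the laboratory's processing code:

import math

def fourier_summary(samples, n_harmonics=6):
    # Summarize one approximately periodic gait cycle, sampled at equal time
    # steps, by its mean and the first six sine and cosine coefficients.
    N = len(samples)
    mean = sum(samples) / N
    coeffs = []
    for k in range(1, n_harmonics + 1):
        a_k = (2.0 / N) * sum(x * math.cos(2 * math.pi * k * n / N)
                              for n, x in enumerate(samples))
        b_k = (2.0 / N) * sum(x * math.sin(2 * math.pi * k * n / N)
                              for n, x in enumerate(samples))
        coeffs.append((a_k, b_k))
    return mean, coeffs

# A toy "joint angle" cycle: a constant plus one cosine harmonic.
cycle = [10 + 5 * math.cos(2 * math.pi * n / 50) for n in range(50)]
mean, coeffs = fourier_summary(cycle)
print(round(mean, 2), round(coeffs[0][0], 2))   # about 10 and 5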
CART’s resubstitution performance is summarized in Table 6.10. Notice the three
circled observations: a 5-year-old (J. W.) classified as a 1½-year-old, another 5-year-
old (J. C.) classified as a 2-year-old, and a 7-year-old (L. H.) classified as a 2-year-old.
Since the three children whose observations are circled were classified as being at most
2 years old and at least 3 years younger than their true ages, their films and written
records were reviewed to examine qualitative aspects of their walks and to verify
values of their motion variables. Furthermore, their data from subsequent years, when
available, were analyzed with a view toward detecting possible developmental
abnormalities. No such abnormalities were found, but still, three interesting facts
emerged.
J. W. was and remains a normal child. For him the original measurements were all
correct. However, possibly because of his fear of the laboratory, his walk was
deliberately such that the heel for each step followed directly on the toe of the
previous step; this is called “tandem walking.” What resulted was a large external
rotation of the hip, as is characteristic of the 1½-year-olds to which he was classified.
TABLE 6.9
TABLE 6.10
In the class of L. H., all measurements used in CART except walking velocity
agreed with those of the second analysis of the data. Her walking velocity was stated
in laboratory records to be 81, which is correct if velocity is measured in meters per
minute. For all other children, velocity was measured in centimeters per second; L.
H.’s 81 thus should have been 136. With the correction, but the same classification
scheme as before, L. H. moves from terminal node E to terminal node P. She is thereby
reclassified from age 2 to age 6, only 1 year below her true age.
Like L. H., J. C. was originally a member of terminal node E. While her other
measurements were unchanged after reanalysis, her walking velocity was found to
have been coded originally as 58 rather than the true value 109. Once the mistake is
corrected, J. C. moves to terminal node O, and she is classified as the 5-year-old she
was at the time of her study.
The 7-year-old in terminal node D might be thought of as a candidate outlier,
though we have no evidence that such is the case. Of course, the membership of that
terminal node is quite diverse. Also, from the paper of Sutherland et al. (1980), it
seems that for a 7-year-old to be classified by gait as a 3-year-old is not as surprising
as for a 5-year-old (such as J. W. and J. C.) to be classified as a 2-year-old.
The reader should note the ease with which outliers were detected and also note that
although nothing of medical significance was learned from the detection, two mistakes
were found and corrected and a third set of data were deleted from further analyses.
CART has contributed to important understanding of the gait data beyond the
detection of outliers. For example, in the forthcoming monograph of Sutherland and
Olshen (1984), it is argued from several points of view that dominant side bears no
discernible relationship to the development of mature gait. This somewhat surprising
result was suggested by preliminary CART runs in the process of variable selection,
for dominance seldom appeared as much as a surrogate splitting variable. Its
coordinate importance was always low. Because plausibility arguments suggested that
dominance should play some part in our classification problem, it was retained in the
final list of 18 variables, despite the lack of any quantitative evidence for the retention.
In the CART run reported here, dominance appeared once as a fourth surrogate
splitting variable, and its coordinate importance was about 60 percent that of the
second least important variable.
We close this section with brief comments on the choice of the cost function C. The
three mentioned outliers show up clearly when trees are grown with the default zero-
one cost function C(i|j) = 1 - δij, and it is evident why this might occur. With that choice
CART tries to have each terminal node consist of children of one age; failing that, a 6-
year-old looks no worse at a 2-year-old node than does a 2½-year-old. In view of the
natural ordering of the classes, the zero-one cost function does not seem sensible. The
first alternative we tried has C(i|j) = |i - j|. This led to larger trees than did the
previous function, and furthermore, those trees obscured two of the three outliers.
Again, the explanation is simple. With the cost six times as large for classifying a 6-year-old
as a 2-year-old as for so classifying a 2½-year-old, CART will work hard to ensure
that the variabilities of ages within terminal nodes are small. The final choice C(i|j) =
|i - j|^(1/2) seems to incorporate the best features of its two cited competitors and also to be
rather simple, as they are. Thus, it was the choice for presentation here.
6.5 RELATED WORK ON COMPUTER-AIDED DIAGNOSIS
The paper by Goldman et al. (1982) contains a nice survey of previous contributions to
uses of statistical techniques in the diagnosis of myocardial infarction. The
contribution of Pozen, D’Agostino, and Mitchell (1980), who utilized logistic
regression, is particularly noteworthy. Goldman observes that the approach to the
problem through CART was more accurate than previous attempts.
The use of statistical approaches to combining cancer markers and thereby aiding
the automatic diagnosis of cancer is quite new. The cited contributions of Dillman and
Koziol and their colleagues (1983) are at the forefront of the field. Further work is
being carried out under the auspices of the National Cancer Institute’s Biological
Response Modifier Program, wherein a number of agents are being evaluated for their
collective impact on the immune system in a variety of individuals with and without
cancer. CART will be employed in that study.
One obstacle to the successful use of CART or any other technique that utilizes an
ongoing data base in medical prognosis is that data bases change over time. Patient
mixes and hospital protocols change enough to affect both unconditional distributions
of class membership and conditional distributions of predictors given class
membership. Feinstein (1967) has been skeptical of some statistical approaches to
diagnosis and prognosis for this reason. His skepticism has been advanced by
practitioners of artificial intelligence (see Szolovits, 1982).
In the recent paper by Duda and Shortliffe (1983), explicit criticism is made of
“decision-tree or statistically based programs.” Their approach to medical diagnosis
involves modeling so-called expert opinion and claims to offer explanations of
diagnoses. The approach through CART to medical diagnosis and prognosis attempts
to supplement trees constructed from data by expert opinion, but not to do away
altogether with data as a principal tool. Goldman et al. (1982) indicate that sound
statistical methodology applied to good data sometimes provides more accurate
medical diagnoses than do experts.
7
MASS SPECTRA CLASSIFICATION
7.1 INTRODUCTION
The preceding chapters have discussed and used the standard structure built into
CART. The data structure was assumed to consist of a fixed number of variables and
the set of splits S is built into the program.
However, the general tree methodology is a flexible tool for dealing with general
data structures. It may be applied to a general measurement space X without
assumptions regarding its structure. The critical element, which must be defined by the
analyst, is a set of binary questions or, equivalently, a set S of splits s of X into two
disjoint subsets. A measure of node impurity is then selected, and at each node a
search is made over S for that split s* which most reduces the impurity.
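A minimal sketch of this generalized split search, using the Gini index as the impurity measure and representing each binary question as a function of the measurement x; all names here are ours:

def gini(counts):
    # Gini impurity of a node from its per-class counts.
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_question(cases, questions, n_classes):
    # cases: list of (x, class_label); questions: list of functions x -> bool.
    # Returns the question giving the largest decrease in impurity.
    def counts(subset):
        c = [0] * n_classes
        for _, label in subset:
            c[label] += 1
        return c

    n = len(cases)
    parent_impurity = gini(counts(cases))
    best_q, best_drop = None, 0.0
    for q in questions:
        left = [case for case in cases if q(case[0])]
        right = [case for case in cases if not q(case[0])]
        drop = (parent_impurity
                - (len(left) / n) * gini(counts(left))
                - (len(right) / n) * gini(counts(right)))
        if drop > best_drop:
            best_q, best_drop = q, drop
    return best_q, best_drop

cases = [((0,), 0), ((1,), 0), ((5,), 1), ((6,), 1)]
questions = [lambda x, t=t: x[0] <= t for t in (0, 1, 3, 5)]
print(best_question(cases, questions, 2)[1])   # 0.5: a perfect split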
The burden is on the analyst in constructing a set of questions which can effectively
extract information from the data. However, the tree structure permits considerable
“overkill.” A large number of splits can be searched over, allowing the inclusion of
many questions that may or may not turn out to be informative. The program then
selects the best of these questions. For example, in the ship classification problem a set
of 5000 splits was used. In the chemical spectra problems, about 2000 questions were
used.
Even if the data base contains a fixed number of variables, it may not be desirable to
use the standardized set as splits. For example, in word recognition, a common
procedure is to use band pass filters to decompose the amplitude into amplitudes over
M frequency ranges. A single word is then represented by M waveforms. Even if these
waveforms are discretized over time into N time samples, there are MN coordinates,
where MN is generally large.
Using univariate splits on these coordinates is not a promising approach. In speech
recognition, as in other pattern recognition problems, the common approach is to
define “features” consisting of combinations of variables suggested either by the
physical nature of the problem or by preliminary exploration of the data. To make the
problem amenable to available techniques, usually a fixed number of features are
extracted and then some standard classification technique such as discriminant
analysis or kth nearest neighbor is employed.
In contrast, a tree approach permits the use of a large and variable number of
features. This is particularly useful in problems where it is not clear which features
and what threshold values are informative.
Given only a set S of splits, there are no longer any entities such as variables, and
the variable oriented features of CART are not applicable. What remains applicable is
discussed in Section 7.2. The general structure and question formulation are illustrated
in Section 7.3 through a description of the mass spectra tree grown to recognize the
presence of the element bromine in compounds.
7.2 GENERALIZED TREE CONSTRUCTION
The only parts of CART which depend on a fixed cases by variables data matrix are
the measure of predictive association between variables and the linear combination of
variables. Boolean combinations of splits may be used in the general case. The
missing value algorithm and variable ranking are no longer applicable.
All other parts of the tree methodology are applicable, including the pruning
algorithm, the test sample and cross-validation estimates, and the standard error
estimates.
We have not set up a program which accepts a general data structure and permits the
user specification of the set of splits S. Each nonstandard tree has been programmed
from scratch incorporating pruning, estimation, etc., as subroutines. Building a general
program is feasible and may be undertaken if need develops.
7.3 THE BROMINE TREE: A NONSTANDARD EXAMPLE
7.3.1 Background
In 1977-1978 and again in 1981, the EPA funded projects for the construction of
classification trees to recognize the presence of certain elements in compounds
through the examination of their mass spectra. In the initial project, a tree algorithm
for recognition of chlorine was constructed (Breiman, 1978). During the second
project (Breiman, 1981), trees were constructed to detect the presence of bromine, an
aromatic phenyl ring, and nitrogen. In this chapter we describe the construction of the
bromine tree.
In brief, the mass spectrum of a compound is found by putting a small amount of
the compound into a vacuum and bombarding it with electrons. The molecules of the
compound are split into various fragments and the molecular weight of the fragments
are recorded together with their abundance. The largest fragment consists of molecules
of the compound itself which were not fragmented. A typical mass spectrum is given
in Table 7.1, and graphed in Figure 7.1. The relative abundances are normalized so
that the maximum abundance is 1000.
A trained chemist can interpret a mass spectrum in terms of the chemical bonds
involved and often reconstruct the original molecule. According to McLafferty (1973,
preface): “The fragment ions indicate the pieces of which the molecule is composed,
and the interpreter attempts to deduce how these pieces fit together in the original
molecular structure. In recent years such correlations have been achieved for the
spectra of a variety of complex molecules.”
7.3.2 Data Reduction and Question Formulation
The critical element of the bromine tree was the construction of a set of questions
designed to recognize bromine hallmarks. In particular, bromine has two important
footprints:
1. For every 100 atoms of bromine occurring naturally at atomic weight 79, there
are 98 isotopic atoms of weight 81.
2. Small fragments containing any halogens in the compound (chlorine and
bromine predominantly) tend to split off easily.
The first step in this process was a reduction in dimensionality of the data. For any
spectrum, denote by Hm the relative fragment abundance at molecular weight m. Then
Hm was defined to be a peak or local maximum if
If a fragment contains one bromine atom, the theoretical value for the ratios R = (R1, ...,
R7) is
However, if it contains two bromine atoms, the most frequently occurring combination
is one 79 bromine atom and one isotopic bromine atom. The theoretical ratio vector
then is
If bromine occurs in combination with chlorine, then since chlorine (weight 35) has
an isotope of weight 37 that occurs 24.5 percent of the time, there is a different
theoretical ratio vector. The data base showed significant numbers of compounds with
one, two, three, four bromine atoms and one or two bromine atoms in combination
with one or two chlorine atoms. This gives eight theoretical ratio vectors. However, we
reasoned that with any error, two nearly equal abundances might be shifted; that is,
with a small experimental error, a fragment containing a single bromine atom might
have its peak at the isotopic weight. For this reason, the theoretical shifted ratio ,
was added together with another shifted ratio where adjacent abundances were
nearly equal.
The set of theoretical ratios are listed in Table 7.2. Corresponding to every
theoretical ratio a set of weights was defined. These are given in Table 7.3.
TABLE 7.2
Then D was set equal to min Di and B was defined as the number of bromine atoms
in the theoretical ratio closest to R. Thus, D gave an overall measure of how closely
the abundances adjacent to a given peak mimicked clusters of abundances reflecting
isotopic structure of bromine, possibly combined with chlorine.
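A rough sketch of this data reduction step; the theoretical ratio vectors, the weights, and the exact form of the distance Di are placeholders, since Tables 7.2 and 7.3 are not reproduced here:

def closest_pattern(R, patterns):
    # R: observed vector of adjacent-abundance ratios around a peak.
    # patterns: list of (theoretical_ratios, weights, n_bromine) triples.
    # Di is taken here as a weighted sum of squared differences.
    best_D, best_B = float("inf"), 0
    for theoretical, weights, n_bromine in patterns:
        D_i = sum(w * (r - t) ** 2 for r, t, w in zip(R, theoretical, weights))
        if D_i < best_D:
            best_D, best_B = D_i, n_bromine
    return best_D, best_B   # D = min over i of Di; B from the closest pattern

patterns = [([1.0, 0.98, 0.0], [1.0, 1.0, 1.0], 1)]   # purely illustrative
print(closest_pattern([1.0, 0.95, 0.02], patterns))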
If a peak occurred at weight m, the loss L was defined as
and the height of the peak denoted by H. The final data reduction consisted of
associating an (L, H, D, B) vector to each peak in the spectrum. Over the data base, the
number of peaks ranged from a low of 2 to over 30.
For each reduced spectrum, associate with each L its rank RL among the losses; that
is, if there are N peaks with losses arranged so that
L1 ≤ L2 ≤ ... ≤ LN,
a. D ≤ d, d = .01, .02, .04, .08, .16, .32,
b. RL ≤ r, r = k + 1, k + 2, k + 3, k + 4, k + 8,
c. RH ≤ h, h = k + 1, k + 2, k + 3, k + 4, k + 8?
Therefore, Q1 contained 600 questions.
The second set of questions (modified slightly to simplify the presentation) is aimed
at the losses:
Q2: Is there a peak with loss L, L = 0, 1, ..., 250 and RL ≤ r, r = 1, 2, 3, 4, 5, ∞?
Q2 contains 1500 questions. The Boolean combinations of splits in Q2 were
constructed (see Section 5.2.3). OR-ing was used instead of AND-ing, so these splits were
of the form s1 or s2 or ....
A third set of 15 questions was aimed at the total number of peaks in the spectrum
but did not produce any splits in the tree construction.
7.3.3 Tree Construction and Results
Out of about 33,000 compounds, only 873 contained bromine. An independent test
sample approach was used, and the cases given in Table 7.4 were selected at random.
TABLE 7.4
The decision to make the nonbromine set 10 times as large as the bromine was
somewhat arbitrary. The reasoning was that the nonbromines represented a much
larger diversity of compounds than the bromines; therefore, the sample size selected
should be large enough to reflect this diversity. On the other hand, using all of about
32,000 nonbromine spectra seemed redundant and wasteful.
In view of the purpose of the project, we wanted to roughly equalize the bromine
and nonbromine misclassification rates. Therefore, equal priors were used, together
with the Gini splitting criterion. Tree pruning was done as described in Chapter 3.
The initial tree T1 contained 50 nodes. The graph of the misclassification rates on
pruning up is illustrated in Figure 7.2.
The minimum value of Rts = 5.8 percent occurred with 13 terminal nodes. This tree
is pictured in Figure 7.3. The test sample bromine misclassification rate was 7.2
percent and the nonbromine 4.3 percent. Standard errors of Rts ranged from 1.0 percent
for 50 nodes to a low of 0.8 percent around the 13-terminal-node range. As might be
expected, the standard error for the bromine misclassification rate alone was higher,
about 1.5 percent in the 13-node range, compared with .4 percent for the nonbromines.
FIGURE 7.2
Over 75 percent of the bromines ended in one terminal node defined by isotopic
structure questions only. About 80 percent of the nonbromines were in a terminal node
defined by an initial split on isotopic structure and then whittled down by a sequence
of questions about loss locations.
As a final verification, the remaining 23,000 nonbromines were run through the
optimal tree. The resulting misclassification rate was 4.4 percent.
We believe that the drop from 8 percent on the chlorine tree to 6 percent on the
bromine was due to experience gained in formulating an informative set of questions.
For example, Boolean combinations were not used in the chlorine tree. The procedure
is still unrefined, and further development of the question set could result in a
significant drop in misclassification.
FIGURE 7.3
Hopefully, the detail in which this example was presented does not obscure the
point: The hierarchical tree structure gives the analyst considerable flexibility in
formulating and testing a large number of questions designed to extract information
from nonstandard data structures.
There is flexibility not only in handling nonstandard data structures but also in using
information. For example, we have designed tree structured programs that make use of
a “layers of information” approach. The set of questions is divided into subsets S1,
S2, .... The tree first uses the questions in S1 until it runs out of splits that significantly
decrease impurity. Then it starts using the splits in S2, and so on.
In general, our experience has been that once one begins thinking in terms of tree
structures, many avenues open up for new variations and uses.
8
REGRESSION TREES
8.1 INTRODUCTION
The tree structured approach in regression is simpler than in classification. The same
impurity criterion used to grow the tree is also used to prune the tree. Besides this,
there are no priors to deal with. Each case has equal weight.
Use of a stepwise optimal tree structure in least squares regression dates back to the
Automatic Interaction Detection (AID) program proposed by Morgan and Sonquist
(1963). Its development is traced in Sonquist and Morgan (1964), Sonquist (1970),
Sonquist, Baker, and Morgan (1973), Fielding (1977), and Van Eck (1980). The major
difference between AID and CART lies in the pruning and estimation process, that is,
in the process of “growing an honest tree.” Less important differences are that CART
does not put any restrictions on the number of values a variable may take and contains
the same additional features as in classification—variable combinations, predictive
association measure to handle missing data and assess variable importance, and
subsampling. On the other hand, CART does not contain some of the options available
in AID, such as a limited look ahead.
In this chapter, we first start with an example of a regression tree grown on some
well-known data (Section 8.2). In Section 8.3 some standard background for the
regression problem is discussed. Then the basic splitting criterion and tree growing
methodology are described in Section 8.4. Pruning and use of a test sample or cross-
validation to get right-sized trees are covered in Section 8.5. In Section 8.6 simulated
data are generated from a known model and used to illustrate the pruning and
estimation procedure. There are two issues in cross-validation that are discussed in
Section 8.7. Many features of classification trees carry over to the regression context.
These are described in Sections 8.8, 8.9, and 8.10, using both a real data set and
simulated data for illustrations.
In place of least squares, other error criteria can be incorporated into the tree
structured framework. In particular, least absolute deviation regression is implemented
in CART. This is described and illustrated in Section 8.11.
8.2 AN EXAMPLE
For their 1978 paper, Harrison and Rubinfeld gathered data about Boston housing
values to see if there was any effect of air pollution concentration (NOX) on housing
values. The data consisted of 14 variables measured for each of 506 census tracts in
the Boston area.
These variables, by tract, are as follows:
y: median value of homes in thousands of dollars (MV)
x1: crime rate (CRIM)
x2: percent land zoned for lots (ZN)
x3: percent nonretail business (INDUS)
x4: 1 if on Charles River, 0 otherwise (CHAS)
x5: nitrogen oxide concentration, pphm (NOX)
x6: average number of rooms (RM)
x7: percent built before 1940 (AGE)
x8: weighted distance to employment centers (DIS)
x9: accessibility to radial highways (RAD)
x10: tax rate (TAX)
x11: pupil/teacher ratio (P/T)
x12: percent black (B)
x13: percent lower-status population (LSTAT)
Using various transformations of both the dependent variable MV and independent
variables, Harrison and Rubinfeld fitted the data with a least squares regression
equation of the form
(8.1)
where b is a parameter to be estimated. The resubstitution estimate for the
proportion of variance explained is .81.
These data became well known when they were extensively used in the book
Regression Diagnostics, by Belsley, Kuh, and Welsch (1980). A regression tree was
grown on these data using the methods outlined in this chapter. The tree is pictured in
Figure 8.1.
The number within each node is the average of the dependent variable MV over all
tracts (cases) in the node. For instance, over all 506 tracts, the average MV is 22.5
(thousands of dollars). Right underneath each node is the question that split the node.
The numbers in the lines are the number of cases going right and left. So, for example,
430 tracts had RM ≤ 6.9 and 76 did not.
To use this tree as a predictor, data (x1, ..., x13) for a tract is dropped down the tree
until it comes to rest in a terminal node. Then the predicted value of MV is the average
of the MV’s for that terminal node.
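A minimal sketch of this prediction step, using a toy two-node tree built from the root split and node averages quoted above; the dictionary representation is ours, not CART's:

def predict(node, x):
    # node is either {"split": (variable_index, threshold), "left": ..., "right": ...}
    # or a terminal {"mean": value}; cases with x[variable] <= threshold go left,
    # mirroring questions such as "RM <= 6.9".
    while "mean" not in node:
        variable, threshold = node["split"]
        node = node["left"] if x[variable] <= threshold else node["right"]
    return node["mean"]

tree = {"split": (0, 6.9),
        "left": {"mean": 19.9},     # average MV of the 430 tracts going left
        "right": {"mean": 37.2}}    # average MV of the 76 tracts going right
print(predict(tree, [6.2]))         # 19.9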
It is interesting to track the actions of the tree. The split RM ≤ 6.9 separates off 430
tracts with the low average MV of 19.9 from 76 tracts with a high average of 37.2.
Then the left branch is split on LSTAT ≤ 14, with 255 tracts having less than 14
percent lower-status population and 175 more than 14 percent. The 255 tracts going
left are then split on DIS ≤ 1.4. The 5 tracts with DIS ≤ 1.4 have a high average MV of
45.6. These 5 are high housing cost inner city tracts.
FIGURE 8.1
The other branches can be similarly followed down and interpreted. We think that
the tree structure, in this example, gives results more easy to interpret than equation
(8.1). The cross-validated estimate of the proportion of variance explained is .73 ± .04.
This is difficult to compare with the Harrison-Rubinfeld result, since they used
log(MV) as the dependent variable and resubstitution estimation.
Only 4 of the 13 variables appear in splits: RM, LSTAT, DIS, and CRIM. In
particular, the air pollution variable (NOX) does not appear. It is the fourth best
surrogate to an LSTAT split at a node and either the first or second best surrogate to
CRIM splits at three other nodes.
The variable importances (in order) are in Table 8.1.
TABLE 8.1
The numbers beneath the terminal nodes in Figure 8.1 are the standard deviations of
the MV values in the terminal node. The leftmost terminal node, for example, contains
five MV values with an average of 45.6 and a standard deviation of 8.8. Clearly, these
five MV values have a wide range. The standard deviations in the other nodes give
similar indications of how tightly (or loosely) the MV values are clustered in that
particular node.
This tree was grown using the techniques outlined in the rest of this chapter. It will
be used along the way to illustrate various points. Since we do not know what “truth”
is for these data, simulated data from a known model will be introduced and worked
with in Section 8.6.
8.3 LEAST SQUARES REGRESSION
In regression, a case consists of data (x, y) where x falls in a measurement space X and
y is a real-valued number. The variable y is usually called the response or dependent
variable. The variables in x are variously referred to as the independent variables, the
predictor variables, the carriers, etc. We will stay with the terminology of x as the
measurement vector consisting of measured variables and y as the response variable.
A prediction rule or predictor is a function d(x) defined on X taking real values; that
is, d(x) is a real-valued function on X.
Regression analysis is the generic term revolving around the construction of a
predictor d(x) starting from a learning sample L. Construction of a predictor can have
two purposes: (1) to predict the response variable corresponding to future
measurement vectors as accurately as possible; (2) to understand the structural
relationships between the response and the measured variables.
For instance, an EPA sponsored project had as its goal the prediction of tomorrow’s
air pollution levels in the Los Angeles basin based on today’s meteorology and
pollution levels. Accuracy was the critical element.
In the example given concerning the Boston housing data, accuracy is a secondary
and somewhat meaningless goal. It does not make sense to consider the problem as
one of predicting housing values in future Boston census tracts. In the original study,
the point of constructing a predictor was to get at the relationship between NOX
concentration and housing values.
Suppose a learning sample consisting of N cases (x1, y1), ..., (xN, yN) was used to
construct a predictor d(x). Then the question arises of how to measure the accuracy of
this predictor. If we had a very large test sample (x′1, y′1), ..., (x′N2, y′N2) of size N2, the
accuracy of d(x) could be measured as the average error
(1/N2) Σn (y′n - d(x′n))².
The methodology revolving about this measure is least squares regression.
To define accuracy in the mean squared error sense, a theoretical framework is
needed. Assume that the random vector (X, Y) and the learning sample are
independently drawn from the same underlying distribution.
DEFINITION 8.2. Define the mean squared error R*(d) of the predictor d as
R*(d) = E(Y - d(X))²,
the expectation being taken with the learning sample (and hence d) held fixed. In
particular, the constant a that minimizes E(Y - a)² is a = E(Y).
See Section 9.1 for a complete proof.
8.3.1 Error Measures and Their Estimates
Given a learning sample L consisting of (x1, y1), ..., (xN, yN), again one wants to use
L both to construct a predictor d(x) and to estimate its error R*(d). There are several
ways to estimate R*. The usual (and worst) is the resubstitution estimate
(8.5)  R(d) = (1/N) Σn (yn - d(xn))².
Test sample estimates Rts(d) are gotten by randomly dividing L into L1 and L2 and
using L1 to construct d and L2 to form
(8.6)  Rts(d) = (1/N2) Σ(xn, yn)∈L2 (yn - d(xn))²,
where N2 is the number of cases in L2. The V-fold cross-validation estimate Rcv(d)
comes from dividing L into V subsets L1, ..., LV, each containing (as nearly as possible)
the same number of cases. For each v, v = 1, ..., V, apply the same construction
procedure to the learning sample L - Lv, getting the predictor d(v)(x). Then set
(8.7)  Rcv(d) = (1/N) Σv Σ(xn, yn)∈Lv (yn - d(v)(xn))².
The rationale for Rts and Rcv is the same as that given in Section 1.4.
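A sketch of the resubstitution and V-fold cross-validation estimates just defined, illustrated with the trivial predictor that always returns the learning-sample mean of y; all names are ours:

import random

def mse(d, cases):
    # Resubstitution or test sample estimate, depending on which cases are passed.
    return sum((y - d(x)) ** 2 for x, y in cases) / len(cases)

def cross_validated_mse(construct, cases, V=10):
    # construct: a procedure that builds a predictor d(x) from a learning sample.
    cases = list(cases)
    random.shuffle(cases)
    folds = [cases[v::V] for v in range(V)]
    total = 0.0
    for v in range(V):
        held_out = folds[v]
        training = [c for i, fold in enumerate(folds) if i != v for c in fold]
        d_v = construct(training)
        total += sum((y - d_v(x)) ** 2 for x, y in held_out)
    return total / len(cases)

def construct_mean(sample):
    mu = sum(y for _, y in sample) / len(sample)
    return lambda x: mu

data = [((i,), 2.0 * i + random.gauss(0, 1)) for i in range(100)]
print(round(mse(construct_mean(data), data), 2),
      round(cross_validated_mse(construct_mean, data), 2))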
In classification, the misclassification rate has a natural and intuitive interpretation.
But the mean squared error of a predictor does not. Furthermore, the value of R*(d)
depends on the scale in which the response is measured. For these reasons, a
normalized measure of accuracy which removes the scale dependence is often used.
Let
μ = E(Y).
Then
R*(μ) = E(Y - μ)²
is the mean squared error using the constant μ as a predictor of Y, which is also the
variance of Y.
DEFINITION 8.8. The relative mean squared error RE*(d) in d(x) as a predictor of Y
is
RE*(d) = R*(d)/R*(μ).
The idea here is that μ is the baseline predictor for Y if nothing is known about X.
Then judge the performance of any predictor d based on X by comparing its mean
squared error to that of μ.
The relative error is always nonnegative. It is usually, but not always, less than 1.
Most sensible predictors d(x) are more accurate than μ, and RE*(d) < 1. But on
occasion, some construction procedure may produce a poor predictor d with RE*(d) ≥
1.
Let
ȳ = (1/N) Σn yn
and
R(ȳ) = (1/N) Σn (yn - ȳ)².
Then the resubstitution estimate RE(d) for RE*(d) is R(d)/R(ȳ). Using a test sample,
the estimate REts(d) is formed analogously from the test sample cases.
1 - RE(d)
is called the proportion of the variance explained by d. Furthermore, it can be
shown that if ρ is the sample correlation between the values yn and d(xn), n = 1, ..., N,
then
ρ2 = 1 - RE(d).
The ρ2 value of .81 was reported in the original linear regression study of the Boston
data.
In general, R(d) is not a variance and it does not make sense to refer to 1 - RE(d) as
the proportion of variance explained. Neither is 1 - RE(d) equal to the square of the
sample correlation between the yn and d(xn) values.
Therefore, we prefer to use the terminology relative error and the estimates of RE*
(d) as a measure of accuracy rather than 1 - RE*(d). The value .73 for “the proportion
of variance explained” given by the tree predictor grown on the Boston data is the
estimate 1 - REcv (d) .
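A sketch of the resubstitution relative error RE(d) = R(d)/R(ȳ); the names are ours:

def relative_error(d, cases):
    # Mean squared error of d relative to that of the constant predictor ybar.
    ybar = sum(y for _, y in cases) / len(cases)
    r_d = sum((y - d(x)) ** 2 for x, y in cases) / len(cases)
    r_ybar = sum((y - ybar) ** 2 for _, y in cases) / len(cases)
    return r_d / r_ybar

# 1 - relative_error(d, cases) is then the resubstitution "proportion of
# variance explained"; a value of RE near 1 means d does little better than ybar.
cases = [((i,), float(i)) for i in range(10)]
print(relative_error(lambda x: x[0], cases))   # 0.0 for a perfect predictor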
8.3.2 Standard Error Estimates
Standard error estimates for Rts and Rcv are covered in Sections 11.4 and 11.5.
Understanding the concept behind their derivation is important to their interpretation.
For this reason, we give the derivation in the test sample setting.
Let the learning sample consist of N1 cases independently selected from an
underlying probability distribution, and suppose that the learning sample is used to
construct a predictor d(x). The test sample consists of N2 cases drawn independently
from the same distribution. Denote these by (X1, Y1), (X2, Y2), ..., (XN2, YN2). Then an
unbiased estimate for R*(d) is
(8.9)  Rts(d) = (1/N2) Σn (Yn - d(Xn))².
Since the individual terms in (8.9) are independent (holding the learning sample
fixed), the variance of Rts is the sum of the variances of the individual terms. All cases
have the same distribution, so the variance of each term equals the variance of the first
term. Thus, the standard deviation of Rts is
(1/√N2) [E(Y1 - d(X1))⁴ - (E(Y1 - d(X1))²)²]^(1/2).
Now use the sample moment estimates
m1 = (1/N2) Σn (yn - d(xn))²
and
m2 = (1/N2) Σn (yn - d(xn))⁴
to estimate the standard error of Rts by
SE(Rts) = (1/√N2) (m2 - m1²)^(1/2).
Since sample fourth moments can be highly variable, less credence should be given
to the SE’s in regression than in classification.
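A sketch of this standard error computation, based on the second and fourth sample moments of the prediction errors; the names are ours:

def test_sample_se(d, test_cases):
    # Standard error of the test sample estimate Rts.
    n2 = len(test_cases)
    squared_errors = [(y - d(x)) ** 2 for x, y in test_cases]
    m1 = sum(squared_errors) / n2                 # this is Rts itself
    m2 = sum(e * e for e in squared_errors) / n2  # fourth moment of the errors
    return ((m2 - m1 ** 2) / n2) ** 0.5

# Illustrative use with a constant predictor on a small artificial test sample.
test = [((i,), float(i % 3)) for i in range(30)]
print(round(test_sample_se(lambda x: 1.0, test), 3))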
The measure RE* is a ratio, and its estimates REts and REcv are ratio estimators with
more complicated standard error formulas (see Sections 11.4 and 11.5).
Proportionally, the standard error of REts, say, may be larger than that of Rts. This is
because the variability of Rts(d)/Rts(ȳ) is affected by the variability of both the
numerator and the denominator and by the interaction between them. At times, the
variability of the denominator is the dominant factor.
8.3.3 Current Regression Methods
In most current regression methods, the data are assumed to follow a parametric model
y = d(x, θ) + ε,
where d has known functional form depending on x and a finite set of parameters θ =
(θ1, θ2, ...). Then θ is estimated as that parameter value which minimizes R(d(x, θ));
that is, θ̂ minimizes the resubstitution error (1/N) Σn (yn - d(xn, θ))². The most familiar
case is the linear model
d(x) = b0 + b1x1 + ··· + bMxM,
where the coefficients b0, ..., bM are to be estimated. Assuming further that the error
term is normally distributed N(0, σ²) and independent from case to case leads to an
elegant inferential theory.
But our focus is on data sets whose dimensionality requires some sort of variable
selection. In linear regression the common practice is to use either a stepwise selection
or a best subsets algorithm. Since variable selection invalidates the inferential model,
stepwise or best subsets regression methods have to be viewed as heuristic data
analysis tools.
However, as a competitor to tree structured regression, linear regression with variable selection is a more powerful and
flexible tool than discriminant analysis was in classification. The assumptions necessary
for good performance are much less stringent, its behavior has been widely explored,
diagnostic tools for checking goodness of fit are becoming available, and robustifying
programs are flourishing.
Therefore, tree structured regression as a competitor to linear regression should be
looked at in somewhat a different light than tree structured classification and used in
those problems where its distinctive characteristics are desirable.
Two nonparametric regression methods, nearest neighbor and kernel, have been
studied in the literature (see the bibliography in Collomb, 1980). Their drawbacks are
similar to the analogous methods in classification. They give results difficult to
interpret and use.
8.4 TREE STRUCTURED REGRESSION
FIGURE 8.2
Since the predictor d(x) is constant over each terminal node, the tree can be thought
of as a histogram estimate of the regression surface (see Figure 8.3).
FIGURE 8.3
Starting with a learning sample L, three elements are necessary to determine a tree predictor: a way to select a split at every intermediate node, a rule for determining when a node is terminal, and a rule for assigning a predicted value ȳ(t) to every terminal node t. Here
ȳ(t) = (1/N(t)) Σn yn,
where the sum is over all yn such that xn ∈ t and N(t) is the total number of cases in t.
The proof of Proposition 8.10 is based on seeing that the number a which minimizes
Σn (yn - a)² is a = ȳ.
Similarly, for any subset yn′ of the yn, the number which minimizes Σn′ (yn′ - a)² is the average of the yn′.
From now on, we take the predicted value in any node t to be ȳ(t). Then, using the
notation R(T) instead of R(d),
R(T) = (1/N) Σt Σxn∈t (yn - ȳ(t))²,   the outer sum being over t ∈ T̃.   (8.11)
Set
R(t) = (1/N) Σxn∈t (yn - ȳ(t))²,   (8.12)
so that R(T) = Σt∈T̃ R(t).
These expressions have simple interpretations. For every node t, Σxn∈t (yn - ȳ(t))² is
the within node sum of squares. That is, it is the total squared deviations of the yn in t
from their average. Summing over t ∈ T̃ gives the total within node sum of squares,
and dividing by N gives the average.
DEFINITION 8.13. Given any set S of splits of a current terminal node t in T̃, the
best split s* of t is that split in S which most decreases R(T).
More precisely, for any split s of t into tL and tR, let
s²(t) = (1/N(t)) Σxn∈t (yn - ȳ(t))²,   (8.14)
so that R(t) = s²(t)p(t), and
ΔR(s, t) = p(t)[s²(t) - pL s²(tL) - pR s²(tR)].   (8.15)
Note that s²(t) is the sample variance of the yn values in the node t. Then the best
split of t minimizes the weighted variance
pL s²(tL) + pR s²(tR),
where pL and pR are the proportion of cases in t that go left and right, respectively.
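The split search on a single ordered variable can be written in a few lines. The following is a minimal sketch (our own function name and data, not the CART implementation) of scanning all cut points and minimizing the weighted within-node variance.

```python
import numpy as np

def best_split(x, y):
    """Best split on one ordered variable for least squares regression.

    Scans all cut points c and returns the one minimizing the weighted
    within-node variance pL*s2(tL) + pR*s2(tR), which is equivalent to
    maximizing the decrease in R(T).
    """
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    n = len(y)
    best_c, best_val = None, np.inf
    for i in range(1, n):
        if x[i] == x[i - 1]:
            continue                       # no cut point between tied x values
        yl, yr = y[:i], y[i:]
        val = (len(yl) * yl.var() + len(yr) * yr.var()) / n
        if val < best_val:
            best_val, best_c = val, 0.5 * (x[i - 1] + x[i])
    return best_c, best_val

x = np.array([1, 2, 3, 4, 5, 6], float)
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))                    # cut point near x = 3.5
```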
We noted in Chapter 3 that in two-class problems, the impurity criterion given by
p(1|t) p(2|t) is equal to the node variance computed by assigning the value 0 to every
class 1 object, and 1 to every class 2 object. Therefore, there is a strong family
connection between two-class trees and regression trees.
8.5 PRUNING AND ESTIMATING
8.5.1 Pruning
FIGURE 8.4
Define the error-complexity measure Rα(T) as
Rα(T) = R(T) + α|T̃|.
Just as in classification, minimal error-complexity pruning gives a decreasing sequence of subtrees T1 > T2 > ··· and an increasing sequence of critical values
0 = α1 < α2 < ···
such that for αk ≤ α < αk+1, Tk is the smallest subtree of Tmax minimizing Rα(T).
8.5.2 Estimating R*(Tk) and RE*(Tk)
To select the right sized tree from the sequence T1 > T2 > ···, honest estimates of R(Tk)
are needed. To get test sample estimates, the cases in L are randomly divided into a
learning sample L1 and a test sample L2. The learning sample L1 is used to grow the
sequence {Tk} of pruned trees. Let dk(x) denote the predictor corresponding to the
tree Tk. If L2 has N2 cases, define
Rts(Tk) = (1/N2) Σ (yn - dk(xn))²,
the sum being over the cases (xn, yn) in L2.
In practice, we have generally used cross-validation except with large data sets. In
V-fold cross-validation L is randomly divided into L1, ..., LV such that each subsample
Lv, v = 1, ..., V, has the same number of cases (as nearly as possible).
Let the vth learning sample be L(v) = L- Lv, and repeat the tree growing and pruning
procedure using L(v). For each v, this produces the trees T(v)(α) which are the minimal
error-complexity trees for the parameter value α.
Grow and prune using all of L, getting the sequences {Tk} and {αk}. Define α′k =
(αkαk+1)^(1/2), the geometric mean of αk and αk+1. Denote by dk(v)(x) the predictor corresponding to the tree T(v)(α′k). The cross-validation estimates Rcv(Tk) and REcv(Tk) are given by
Rcv(Tk) = (1/N) Σv Σ (yn - dk(v)(xn))²,   the inner sum being over (xn, yn) ∈ Lv,
and
REcv(Tk) = Rcv(Tk)/R(ȳ).
For the Boston data, REcv(Tk) is plotted against |T̃k| in Figure 8.5 and compared to the
resubstitution estimates.
FIGURE 8.5
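The same recipe can be mirrored with off-the-shelf tools. The sketch below is our own illustration, using scikit-learn's minimal cost-complexity pruning and a synthetic data set rather than the original CART program and the Boston data; it grows the main pruning sequence, forms the geometric means α′k, and accumulates the V-fold estimates.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)

# Grow and prune on all of L to get the alpha sequence {alpha_k}.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas
# Geometric means alpha'_k = sqrt(alpha_k * alpha_{k+1}); the last alpha is kept as is.
a_prime = np.append(np.sqrt(alphas[:-1] * alphas[1:]), alphas[-1])

# V-fold cross-validation estimate Rcv for each alpha'_k.
V = 10
rcv = np.zeros(len(a_prime))
for train, test in KFold(n_splits=V, shuffle=True, random_state=0).split(X):
    for k, a in enumerate(a_prime):
        t = DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X[train], y[train])
        rcv[k] += np.sum((y[test] - t.predict(X[test])) ** 2)
rcv /= len(y)
re_cv = rcv / np.mean((y - y.mean()) ** 2)     # relative error estimates REcv
print(np.round(re_cv, 2))
```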
8.5.3 Tree Selection
In regression, the sequence T1 > ··· > {t1} tends to be larger than in classification. In
the Boston example, there were 75 trees in the sequence.
The pruning process in regression trees usually takes off only two terminal nodes at
a time. This contrasts with classification, where larger branches are pruned, resulting
in a smaller sequence of pruned subtrees.
The mechanism is illustrated by the following hypothetical classification example.
Figure 8.6 illustrates a current branch of the tree starting from an intermediate node.
There are two classes with equal priors and the numbers in the nodes are the class
populations.
If the two leftmost terminal nodes are pruned, the result is as illustrated in Figure
8.7. There are a total of 100 misclassified in this branch. But in the top node, there are
also 100 misclassified. Therefore, the top node, by itself, is a smaller branch having
the same misclassification rate as the three-node configuration in Figure 8.7. In
consequence, if the two leftmost nodes in Figure 8.6 were pruned, the entire branch
would be pruned.
FIGURE 8.7
In classification, a split almost always decreases the impurity I(T), but, as just
illustrated, may not decrease R(T). In regression trees, a split almost always decreases
R(T). Since we are pruning up on R(T), in general only two terminal nodes at a time
will be pruned.
Not only is the sequence of trees generally larger in regression, but the valley
containing the minimum value of Rcv(Tk) also tends to be flatter and wider.
In the Boston example, Rcv(T1) = 18.8. The minimum occurred at tree 32 with
Rcv(T32) = 18.6 and |T̃32| = 49. For 1 ≤ k ≤ 67, 18.6 ≤ Rcv(Tk) ≤ 21.6. The SE estimates
for Rcv in this range were about 2.8. Clearly, the selection of any tree from T1 to T67 on
the basis of Rcv is somewhat arbitrary.
In keeping with our philosophy of selecting the smallest tree commensurate with
accuracy, the 1 SE rule was used. That is, the Tk selected was the smallest tree such
that
Rcv(Tk) ≤ Rcv(Tk0) + SE,
where
Rcv(Tk0) = mink Rcv(Tk)
and SE is the standard error estimate for Rcv(Tk0). Tree 66 was selected by this rule,
with Rcv(T66) = 21.1 and |T̃66| = 9.
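The 1 SE rule is easy to automate. The sketch below is ours; the arrays only loosely echo the Boston figures quoted above and are illustrative, not the actual CART output.

```python
import numpy as np

def one_se_rule(r_cv, se_cv, n_leaves):
    """Pick the smallest tree whose Rcv is within one SE of the minimum.

    r_cv, se_cv, n_leaves are parallel arrays over the pruned sequence
    T1 > T2 > ... (array names are ours).
    """
    r_cv, se_cv, n_leaves = map(np.asarray, (r_cv, se_cv, n_leaves))
    k0 = int(np.argmin(r_cv))                       # tree minimizing Rcv
    threshold = r_cv[k0] + se_cv[k0]
    eligible = np.flatnonzero(r_cv <= threshold)    # trees within 1 SE
    return eligible[np.argmin(n_leaves[eligible])]  # smallest such tree

r = [18.8, 18.6, 19.0, 20.5, 21.1, 40.0]
se = [2.8, 2.8, 2.8, 2.8, 2.8, 4.0]
leaves = [75, 49, 30, 15, 9, 1]
print(one_se_rule(r, se, leaves))    # index of the 1 SE tree (here the 9-leaf tree)
```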
If the analyst wants to select the tree, Section 11.6 gives a method for restricting
attention to a small subsequence of the main sequence of pruned trees.
8.6 A SIMULATED EXAMPLE
To illustrate the pruning and estimation process outlined in Section 8.5, as well as
other features, we use some simulated data.
The data were generated from the following model. Take x1, ..., x10 independent, and let z be an independent noise term. If x1 = 1, set
y = 3 + 3x2 + 2x3 + x4 + z;
if x1 = -1, set
y = -3 + 3x5 + 2x6 + x7 + z.
Variables x8, x9, x10 are noise.
This example consists of two distinct regression equations with the choice of
equation dictated by the binary variable x1. The other variables are ordered three-
valued variables. The best predictor is therefore 3 + 3x2 + 2x3 + x4 when x1 = 1 and
-3 + 3x5 + 2x6 + x7 when x1 = -1.
TABLE 8.2
The tree grown on these data is shown in Figure 8.8. The numbers in the lines connecting the nodes are the populations
going to the left and right nodes. The numbers in the nodes are the learning sample
node averages. In the terminal nodes, the number above the line is the learning sample
node average; the number below the line is the test set node average.
FIGURE 8.8
With one exception, the tree follows the structure of the model generating the data.
The first split is on the binary variable x1 and separates the 96 cases generated from
the equation
y = -3 + 3x5 + 2x6 + x7 + z
from the rest of the cases. Four additional data sets were generated from the same model but with different random number seeds. A summary
of the 1 SE trees is given in Table 8.3.
TABLE 8.3
8.7 TWO CROSS-VALIDATION ISSUES
8.7.1 The Small Tree Problem
The data generated in the example were also run with 2, 5, 25, and 50 cross-validations.
Partial results are given in Table 8.4.
TABLE 8.4
Using two cross-validations produces a serious loss of accuracy. With five cross-
validations, the reduction in accuracy is still apparent. But 10, 25, and 50 cross-
validations give comparable results, with 10 being slightly better.
The lack of accuracy in the smaller trees was noted in classification, but to a lesser
extent. In regression trees the effect can be more severe. Analysis of the problem leads
to better understanding of how cross-validation works in tree structures. There are
some aspects of the problem that are specific to this particular simulated data set.
Instead of discussing these, we focus on two aspects that are more universal.
Table 8.5 lists the estimates of RE* and the corresponding complexity parameter αk
for the 13 smallest trees.
The cross-validated estimate of RE* for the three terminal node tree uses the cross-validation trees corresponding to the geometric mean value α′ = 483. Of these 10 cross-
validation trees,
TABLE 8.5
7 have two terminal nodes, and 3 have three terminal nodes. Thus, the majority of
the cross-validation trees have accuracies comparable to the two terminal node tree.
Largely because of this, the tenfold cross-validated estimate of RE* for the three
terminal node tree is the same as that for the two terminal node tree.
There are two reasons for the disparity in the number of terminal nodes between the
main tree and the cross-validation trees. First, the three terminal node main tree is
optimal only over the comparatively narrow range (467,499). For α = 483, the two
terminal node main tree has cost-complexity almost as small as the three terminal node
tree. Second, since the cross-validation trees are grown on a subset of the data, for the
same number of terminal nodes they will tend to have lower resubstitution error rates
than the main tree. The combination of these two factors is enough to swing the
balance from three terminal nodes in the main tree to two terminal nodes in some of
the cross-validation trees.
In general, whenever there is a tree Tk in the main tree sequence that is optimal over
a comparatively narrow α-range (αk, αk+1), we can expect that some of the cross-
validation trees have fewer terminal nodes than the main tree. The cross-validated RE*
estimate will then be biased upward toward the RE* values corresponding to Tk+1. If
the RE* value for Tk+1 is considerably larger than that for Tk, the bias may be large.
Thus, the effect is more pronounced in the smaller trees where RE* is rapidly
increasing.
Another potential source of bias is unbalanced test samples. Suppose that in tenfold
cross-validation on 200 cases, a test sample contains, say, 10 of the 20 highest y
values, together with 10 “typical” values. Suppose that the tree with two terminal
nodes grown on the remaining 180 cases generally sends high y values right and lower
ones left. Because of the absence of 10 high y values, the mean of the right node will
be smaller than if all cases were available. Then when the test sample is run through
the tree, the sum of squares will be inflated.
This bias is also reduced as the tree grows larger. Suppose the tree is grown large
enough so that the remaining 10 of the 20 highest y values are mostly split off into a
separate node t. At this point the absence of the 10 high response values, assuming
they would also fall into t, does not affect the mean of any node except t.
Furthermore, assuming that the 10 cases in the test sample are randomly selected
from the 20 original high y value cases, the average within-node sum of squares
resulting when these cases drop into t is an unbiased estimate of the average within-
node sum of squares for t.
To summarize, sources of bias in small trees are cross-validation trees that have
fewer terminal nodes than the corresponding main tree and unbalanced test samples.
The former might be remedied by selecting the cross-validation trees to have, as nearly
as possible, the same number of terminal nodes as the main tree; the latter by
stratifying the cases by their y values and selecting the test samples by combining
separate samples from each stratum. Both of these have been tried. The results are
summarized in Table 8.6. The 3rd column of this table gives the original cross-
validation results. The 4th and 5th columns give cross-validation estimates using trees
with the same number of terminal nodes as the main tree. The 5th column estimate
uses stratified test samples, the 4th column estimates do not.
TABLE 8.6
Using cross-validation trees with the same number of nodes as the main tree and
stratifying the test sets reduce the bias in the estimates. Even so, a marked upward bias
remains. To some extent, this seems to be data-set dependent. When other seeds were
used to generate data from the same model, the small tree effect was usually present
but not so pronounced.
At any rate, in the examples we have examined, the 1 SE tree has always been in the
range where the cross-validation estimates are reasonably accurate. (Glick, 1978, has
some interesting comments on the “leave-one-out” estimate as used in linear
discrimination, which may be relevant to the preceding discussion.)
8.7.2 Stratification and Bias
Since there was some indication that stratification gave more accurate estimates in the
preceding example, it was implemented in CART and tested on 10 more replicates
generated from our simulation model.
The idea is this: In tenfold cross-validation, the cases are ordered by their y values
and then put into bins corresponding to this ordering.
Thus, for example, the first bin consists of the cases having the 10 lowest y values.
The second bin contains the 10 next lowest y values and so on. Then each of the 10
test samples is constructed by drawing one case at random (without replacement) from
each bin.
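The binning scheme is simple to implement; the following is a minimal sketch of the idea just described (our own code and names, not the CART source).

```python
import numpy as np

def stratified_folds(y, v=10, seed=0):
    """Build V cross-validation test samples stratified on the response.

    Cases are ordered by y, grouped into bins of size V along that ordering,
    and each test sample takes one case at random (without replacement)
    from every bin.
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(y)                  # indices ordered by y value
    folds = [[] for _ in range(v)]
    for start in range(0, len(y), v):
        bin_idx = order[start:start + v]
        rng.shuffle(bin_idx)               # spread the bin's cases over the folds
        for fold, case in zip(folds, bin_idx):
            fold.append(case)
    return [np.array(f) for f in folds]

y = np.random.default_rng(1).normal(size=200)
folds = stratified_folds(y, v=10)
print([len(f) for f in folds])             # each test sample has about 20 cases
```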
The comparison of the unstratified and stratified cross-validation estimates is given
in Table 8.7.
TABLE 8.7
As Table 8.7 shows, the stratified estimates never do worse than the unstratified. On
the 8th data set stratification is considerably more accurate. It is difficult to generalize
from this single example, but our current thinking is that stratification is the preferred
method.
Note that on the 9th data set, the cross-validation estimate is about 3 SE’s lower
than the test sample estimate. Both in regression and classification, we have noted that
in SE units, the cross-validation estimates tend to differ more from the 5000 test
sample estimates than predicted on the basis of classical normal theory. This problem
is currently being studied and appears complex.
8.8 STANDARD STRUCTURE TREES
As in classification, we say that the data have standard structure if the measurement
space X is of fixed dimensionality M, x = (x1, ..., xM), where the variables may be
either ordered or categorical.
The standard set of splits then consists of all splits of the form {is xm < c?} on
ordered variables and {is xm ∈ S?} for categorical variables, where S is any subset of the
categories.
Categorical variables in standard structure regression can be handled using a result
similar to Theorem 4.5. If xm ∈ {b1, ..., bL} is categorical, then for any node t, define
ȳ(bℓ) as the average over all yn in the node such that the mth coordinate of xn is bℓ.
Order these so that
ȳ(bℓ1) ≤ ȳ(bℓ2) ≤ ··· ≤ ȳ(bℓL).
Then one of the L - 1 subsets {bℓ1, ..., bℓh}, h = 1, ..., L - 1, gives the best split.
This reduces the search for the best subset of categories from 2^(L-1) - 1 to L - 1
subsets. The proof is in Section 9.4.
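A small sketch of this ordering device follows; it is our own illustration (hypothetical function name and toy data), not the CART routine, and it simply scans the L - 1 subsets along the ordering by node means.

```python
import numpy as np

def best_categorical_split(x_cat, y):
    """Best binary split on a categorical variable in a regression node.

    Categories are ordered by their mean response; only the L - 1 splits
    along that ordering are examined.  Returns the left-going category
    subset and its total within-node sum of squares.
    """
    x_cat, y = np.asarray(x_cat), np.asarray(y, float)
    cats = np.unique(x_cat)
    means = {c: y[x_cat == c].mean() for c in cats}
    ordered = sorted(cats, key=lambda c: means[c])      # order by ybar(b_l)
    best_subset, best_rss = None, np.inf
    for h in range(1, len(ordered)):                    # L - 1 candidate subsets
        left = set(ordered[:h])
        mask = np.isin(x_cat, list(left))
        yl, yr = y[mask], y[~mask]
        rss = ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
        if rss < best_rss:
            best_rss, best_subset = rss, left
    return best_subset, best_rss

x = np.array(list("aabbccdd"))
y = np.array([1.0, 1.2, 5.0, 5.1, 1.1, 0.9, 5.2, 4.9])
print(best_categorical_split(x, y))    # the low-mean categories {'a', 'c'} go left
```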
Procedures that carry over without any modification are
Variable combinations
Surrogate splits
Missing value algorithm
Variable importance
In addition, exploratory trees and subsampling carry over for any data structures.
Actually, subsampling is simpler in regression because there are no classes. If the node
population N(t) is greater than the threshold population N0, select a random sample of
size N0 from the N(t) cases and use the sample to split the node.
In regression, use of linear combination splits does not have the appeal that it does
in classification. In fact, the linear combination algorithm run on both the Boston and
the simulated data did not produce any significant decrease in either the cross-
validated or the test sample estimates of the relative error. The reason appears to be
that whether using linear combinations or univariate splits, what is produced is a flat-
topped histogram-type approximation to the regression surface.
A promising alternative for improving accuracy is to grow a small tree using only a
few of the most significant splits. Then do multiple linear regression in each of the
terminal nodes. Obviously, this would be well tailored to the simulated data set. But
the trade-off, in general, is a more complicated tree structure. See Friedman (1979) for
a related method.
8.9 USING SURROGATE SPLITS
The definition of surrogate splits and their use in missing values and variable
importance carries over, in entirety, to regression trees. They are just as useful here as
in classification.
8.9.1 Missing Value Examples
In the Boston data, 5 percent, 10 percent, and 25 percent of the measurement variables
were deleted at random. The expected numbers of deletions per case are .7, 1.3, and
3.3, respectively, and the expected percentages of complete cases are 51 percent, 25
percent, and 2 percent. The corresponding trees were grown and cross-validated and
the 1 SE trees selected. The results are given in Table 8.8.
TABLE 8.8
A similar experiment was carried out with the simulated data. The test samples used
had the corresponding proportion of data randomly deleted. Table 8.9 has the results
for the 1 SE trees.
TABLE 8.9
As in classification, the question arises as to whether the loss in accuracy is due to
the tree constructed or to the increased difficulty in classifying cases with variables
missing. Two experiments were carried out. In the first, complete test samples were
dropped down the trees constructed with incomplete data. In the second, incomplete
test samples were dropped down the 1 SE tree constructed on complete data (see Table
8.10).
TABLE 8.10
The situation is similar to the digit recognition problem. The measurement variables
are independent. There are no meaningful surrogate splits. Tree construction holds up
fairly well. But the deletion of a few variables in the test set, in particular, one or more
of x1, x2, x3, x5, x6, can throw the predicted response value far off.
8.9.2 Variable Importance
The variable importances are defined as in classification, using ΔR instead of ΔI. The
results for the Boston data are given in Section 8.2. Only four variables appear in splits
in the 1 SE tree, but a number of other variables are ranked as high as, or higher than,
two of the variables appearing in splits (DIS, CRIM).
In particular, P/T is the third-ranking variable, after LSTAT and RM. The variables
NOX, INDUS, and TAX have variable importances about equal to those of DIS and
CRIM. The importances for the other five variables AGE, RAD, B, ZN, and CHAS
taper off slowly.
From the structure of the model generating the simulated data, one would clearly
rank x1 first in importance, followed by x2 and x5, then x3 and x6, then x4 and x7, with the noise variables x8, x9, x10 last. The graph of the outputted variable importances is
shown in Figure 8.9. The values track our expectations.
FIGURE 8.9
8.10 INTERPRETATION
The same facilities for tree interpretation exist for regression trees as for classification
trees; exploratory trees can be rapidly grown and the cross-validation output can be
examined.
8.10.1 Tree Stability
The results of repeated runs with new seeds on the regression model are in contrast to
the two classification examples. In the latter, the tree structures were unstable due to
redundancy or high correlations in the data. The regression model has much more
clear cut structure. The variables are independent, with an obvious order of
importance. The first few splits in the original data set and in the four replicated data
sets were similar. Figure 8.10 shows the variables split on in the five data sets.
The variable importances are also fairly stable. Table 8.11 gives the rankings of the
nonnoise variables in the five runs.
FIGURE 8.10
8.10.2 Robustness and Outliers
Tree structured regression is quite robust with respect to the measurement variables,
but less so with respect to the response variable. As usual, with least squares
regression, a few unusually high or low y values may have a large influence on the
residual sum of squares.
However, the tree structure may treat these outliers in a way that both minimizes
their effect and signals their presence. It does this by isolating the outliers in small
nodes. If there are one or a few cases in a node whose y values differ significantly
from the node mean, then a significant reduction in residual sum of squares can be
derived from splitting these cases off.
As an illustration, look at the rightmost terminal node in Figure 8.1. It contains only
one case, with a y value of 21.9. The parent node had a node mean of 45.1 with a
standard deviation of 6.1. The residual sum of squares was decreased about 50 percent
by splitting the one case to the right. A similar occurrence can be seen in the third
terminal node from the right.
To the extent that the tree structure has available splits that can isolate outliers, it is
less subject to distortion from them than linear regression.
8.10.3 Within-Node Standard Deviations
To illustrate a point, assume that the data are generated from a model of the form
Y = g(X) + ε,
where ε is a random error term independent of X with mean zero. Denote the variance
of ε at X = x by σ²(x). If σ²(x) is constant, then the model is called homoscedastic. In
general, the error variance is different over different parts of the measurement space.
Lack of homoscedasticity may have an unfortunate effect on tree structures. For
instance, in a given node t, the within-node variance s²(t) may be large compared to
other nodes, even though ȳ(t) is a good approximation to the regression surface over t.
Then the search will be on for a noisy split on some variable, resulting in a pLs2(tL) +
pRs2(tR) value considerably lower than s2(t). If such a split can be found, then it may
be retained even when the tree is pruned upward.
Because the successive splits try to minimize the within-node variances, the
resubstitution estimates s²(t) will tend to be biased low. Table 8.12 gives the node
means and standard deviations computed using resubstitution and the 5000-case test
sample for the 13 terminal nodes of the tree (Figure 8.8) grown on the simulated data.
The nodes are ordered from left to right. The systematic downward bias is apparent.
TABLE 8.12
It would be tempting to think of ȳ(t) ± 2s(t) as a 95 percent confidence interval in the
following sense. Drop a large independent test sample down the tree. Of all the cases
falling into terminal node t, about 95 percent of the corresponding y values are in the
range ȳ(t) ± 2s(t).
This is not valid and does not even hold up as a heuristic. First of all, as already
noted, the s2(t) are generally biased low. An adjustment on the s2(t) similar to the
adjustment on r(t) in classification has been explored, but the improvement was
marginal. Second, the dominant error often is not in the downward bias of s2(t), but
instead is that ȳ(t) is a poor estimate of the true node mean. To see this, look at nodes 9
and 10 in Table 8.12. Finally, use of ȳ(t) ± 2s(t) implicitly invokes the unwarranted
assumption that the node distribution of y values is normal.
The smaller the terminal nodes, the more potentially biased are the estimates ȳ(t) and
s²(t). In situations where it is important that those individual node estimates be
accurate, it is advisable to pick smaller trees having larger terminal nodes. For
instance, selecting the tree in the sequence with 12 terminal nodes instead of 13 would
have eliminated the noisy split leading to the terminal nodes 9 and 10 in Table 8.12.
The single parent node of 9 and 10 with a population of 29 cases becomes terminal in
the smaller tree. It has node mean 2.4 and standard deviation 1.9 compared with test
sample estimates of 2.0 and 2.0.
8.11 LEAST ABSOLUTE DEVIATION REGRESSION
8.11.1 Background
Given a predictor d(x), then based on a large test sample of size N2 consisting of (x1, y1), ..., (xN2, yN2), a
natural measure of the accuracy of d(x) is
(1/N2) Σn |yn - d(xn)|.
Use of this type of measure leads to least absolute deviation (LAD) regression. (For a
recent review of linear least absolute deviation regression methods see Narula and
Wellington, 1982.)
To give a theoretical definition of accuracy in the least deviation sense, assume that
(X, Y) and the learning sample cases are independently drawn from the same
distribution.
DEFINITION 8.17. Define the absolute error R*(d) of the predictor d as
R*(d) = E|Y - d(X)|.
A median ν(Y) of the distribution of Y is any number satisfying
P(Y ≥ ν(Y)) ≥ .5
and
P(Y ≤ ν(Y)) ≥ .5.
Unlike means or expected values, the median of a distribution may not be unique.
PROPOSITION 8.18. Any predictor dB of the form
dB(x) = ν(Y|X = x)
minimizes R*(d); that is, to minimize
E|Y - a|,
take a equal to any median value of Y.
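A quick numerical check of this fact in the sample setting is given below; it is our own illustration, not part of the monograph.

```python
import numpy as np

# Among all constants a, a sample median minimizes the mean absolute
# deviation (1/N) sum |y_n - a|; the mean does not, for skewed data.
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1001)        # a skewed response

grid = np.linspace(y.min(), y.max(), 2001)
mad = np.abs(y[None, :] - grid[:, None]).mean(axis=1)
a_star = grid[np.argmin(mad)]

print(round(a_star, 3), round(np.median(y), 3))  # nearly identical
print(round(y.mean(), 3))                        # the mean is pulled to the right
```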
Let ν = ν(Y). Then
R*(ν) = E|Y - ν|
is the absolute error using the constant ν as a predictor. In analogy with Definition 8.8,
use:
DEFINITION 8.20. The relative absolute error RE*(d) in d(x) as a predictor of Y is
RE*(d) = R*(d)/R*(ν).
Based on the test sample (x1, y1), ..., (xN2, yN2), the test sample estimate is
Rts(d) = (1/N2) Σn |yn - d(xn)|.
Then the resubstitution estimate for the relative absolute error is R(d)/R(ν), where ν is now a sample median of the yn. The test sample estimate is Rts(d)/R(ν) and the cross-validated estimate is Rcv(d)/R(ν).
Standard error estimates are derived in much the same way as in least squares
regression. However, they depend only on sample second moments and first absolute
moments. Therefore, they should be much less variable and more believable than the
corresponding least squares estimates.
Since LAD regression was not implemented and tested until late in the course of
this monograph, the results in the later chapters, particularly the consistency results in
Chapter 12, do not address the least absolute deviation criterion. We are virtually
certain that with minor modifications the consistency results can be extended to the
LAD context.
8.11.2 Tree Structured LAD Regression
To specify how tree structured least absolute deviation regression works, given a
learning set L, three elements are again necessary: a way to select a split at every intermediate node, a rule for determining when a node is terminal, and a rule for assigning a predicted value to every terminal node. Take the predicted value in a node t to be ν(t), a sample median of the yn in t, so that
R(t) = (1/N) Σxn∈t |yn - ν(t)|.   (8.22)
Set
R(T) = Σt∈T̃ R(t);
then the best tree in the LAD sense is the one minimizing (8.22) summed over its terminal nodes.   (8.23)
As in Definition 8.13, given a set of splits S of a node t, the best split s* is that one
which most decreases R(T). Alternatively, the best split s* in S is any minimizer of
R(tL) + R(tR);
equivalently, it minimizes
Σxn∈tL |yn - ν(tL)| + Σxn∈tR |yn - ν(tR)|.
Thus LAD regression iteratively attempts to minimize the sum of the absolute
deviations from the node medians.
To get a form analogous to (8.15), define the average deviation d̄(t) for the node t as
d̄(t) = (1/N(t)) Σxn∈t |yn - ν(t)|.   (8.24)
Then R(t) = p(t)d̄(t), and the best split of t minimizes the weighted sum of average deviations
pL d̄(tL) + pR d̄(tR).
8.11.3 The CART Implementation on Standard Data Structures
For x1, say, an ordered variable taking many values, there are a large number of
potential splits on x1 at the larger nodes of the tree. The obvious way to evaluate any
split on x1 of a node t into tL and tR is to order the y values in tL and tR, compute ν(tL)
and ν(tR), and then compute the absolute deviations from ν(tL) and ν(tR) in each node.
The result is a very slow-running program.
Instead, CART uses a fast update algorithm devised by Hal Forsey. This algorithm
was incorporated into CART by Padraic Neville. The resulting LAD regression
program runs at about the same magnitude of speed as LS (least squares) regression. For
instance, on the Boston housing data with tenfold cross-validation, LS regression takes
707 CPU seconds on a VAX 11/750 to construct the final tree (Figure 8.1). On the
same data and machine, LAD regression requires 1100 CPU seconds, or 56 percent
more running time. The update algorithm does not work on categorical variables. On
these, CART directly sorts and evaluates the medians and deviations for all possible
splits. Therefore, the presence of categorical variables taking many possible values
will result in longer running times.
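For orientation, the "obvious" slow evaluation of a LAD split (not Forsey's update algorithm, which is not reproduced here) looks like the following sketch; the function name and data are our own.

```python
import numpy as np

def lad_split_cost(x, y, c):
    """Cost of the split {is x < c?} under least absolute deviation.

    Each side is summarized by its median, and the split cost is the sum of
    absolute deviations from the two medians.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    left, right = y[x < c], y[x >= c]
    cost = 0.0
    for side in (left, right):
        if side.size:
            cost += np.abs(side - np.median(side)).sum()
    return cost

x = np.arange(10, dtype=float)
y = np.array([1, 1, 2, 1, 50, 9, 10, 9, 11, 10], float)   # one outlier at x = 4
costs = {float(c): round(lad_split_cost(x, y, c), 1) for c in np.arange(0.5, 10.0)}
print(min(costs, key=costs.get))   # the best cut falls at the level shift (c = 3.5);
                                   # the single outlier does not dominate the choice
```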
8.11.4 Examples
FIGURE 8.11
The LAD regression program was run on the Boston housing data and on the
simulated data described in Section 8.6. The tree selected for the Boston data is shown
in Figure 8.11. The number inside each node circle or rectangle is a median of the y
values in the node. The number underneath each terminal node is the average absolute
deviation in the node. Otherwise, the notation is the same as in Figure 8.1. The cross-
validation estimates RE* and R* are .44 ± .03 and 2.9 ± .2, respectively, based on
stratified test samples. The results using unstratified samples are very similar.
Variable importance in LAD regression is computed similarly to LS regression, but
based on changes in sums of absolute deviations instead of sums of squares. The
resulting importances are given in Table 8.13 and compared with the LS importances.
TABLE 8.13
It is difficult to decide which tree is “best.” Relative error in LAD regression is not
directly comparable to relative error in LS regression. The latter is the ratio of the mean
squared error of the regression to the original mean squared error. If any comparisons
are made, they should be between the relative error in LAD regression and ratios of
root mean squared errors in LS regression; equivalently, the square root of the relative
error in LS regression. Even this is not quite cricket, since by the Schwarz inequality,
for any set of numbers a1, ..., aN,
(1/N) Σn |an| ≤ [(1/N) Σn an²]^(1/2).
It follows that if the medians are about equal to the means, the mean absolute
deviation will usually be less than the root mean squared error.
The REcv in LAD regression for the Boston data is .44. The square root of the REcv for LS regression
is .52. But, as we pointed out in the preceding paragraph, the implications of this
difference are not clear.
There are two interesting contrasts in the tree structures. There are four inner city
tracts with high MV values (all equal to 50). The LS regression puts these four together
with another tract having MV value 28 into one terminal node. The LAD regression
isolates these four into two terminal nodes, each having two of the tracts.
In the last split on the right, there is an intermediate node in both trees containing
the same 30 tracts. These tracts are characterized by RM > 7.4. The LS tree splits by
finding one outlier with low MV having a higher value of CRIM. The LAD tree, not
weighing outliers as heavily as least squares, splits on P/T with fourteen low P/T and
high MV tracts going left and sixteen lower MV tracts going right.
In terms of rankings, there is one significant change in the variable importances. In
the LAD tree, INDUS is the third most important variable by a considerable margin
over P/T, which is ranked third by a goodly margin in the LS tree. Interestingly
enough, INDUS is never split on in the LAD tree. It achieves its ranking by showing
up often as a surrogate for the splits on RM.
LAD regression was also run on data generated from the model specified in Section
8.6. On the original data set, the LAD tree selected is exactly the same as the tree
selected by LS regression and pictured in Figure 8.8. The REcv value is .41 ± .03 and
REts is .41 ± .01. For the LS tree, REts is also .41.
To check the accuracy of the cross-validation estimates, LAD regression was run on
the first five data sets used in Table 8.7. Stratified test samples were used. The results
are given in Table 8.14.
TABLE 8.14
These estimates, in SE units, are generally closer to the test sample estimates than
the LS estimates are to their corresponding test sample estimates. In particular, the
estimate for data set 9 is not unusually low in terms of SE units. But generalizing from
the results on only six data sets is folly, and we plan more extensive testing and
comparison.
8.12 OVERALL CONCLUSIONS
Regression trees have been used by the authors in fields as diverse as air pollution,
criminal justice, and the molecular structure of toxic substances. Their accuracy has been
generally competitive with linear regression. They can be much more accurate on
nonlinear problems but tend to be somewhat less accurate on problems with good
linear structure.
Our philosophy in data analysis is to look at the data from a number of different
viewpoints. Tree structured regression offers an interesting alternative for looking at
regression type problems. It has sometimes given clues to data structure not apparent
from a linear regression analysis. Like any other tool, its greatest benefit lies in its
intelligent and sensible application.
9
BAYES RULES AND PARTITIONS
In the remainder of the book, tree structured procedures will be developed and studied
in a general framework, which includes regression, classification, and class probability
estimation as special cases. The reader is presumed to be familiar with the motivation
and intuition concerning tree structured procedures. But otherwise the material in the
following chapters can be read independently of that given earlier. In Chapters 9 and
10 the joint distribution of X and Y is assumed to be known. Additional topics that
arise when tree structured procedures are applied to a learning sample will be dealt
with in Chapter 11. Chapter 12 is devoted to a mathematical study of the consistency
of tree structured procedures as the size of the learning sample tends to infinity.
9.1 BAYES RULE
Let X denote the set of possible measurement vectors, let X denote an X-valued
random variable whose distribution is denoted by P(dx), and let Y denote a real-valued
response (or classification) variable. Let A denote the set of possible “actions,”
and let L(y, a) denote the “loss” if y is the actual value of Y and a is the action taken. A
decision rule d is an A-valued function on X : action a = d(x) is prescribed whenever x
is the observed value of X. The risk R(d) of such a rule is defined to be the expected
loss when the rule is used; namely, R(d) = EL(Y, d(X)).
In the regression problem, A is the real line and L(y, a) = (y - a)2. Thus, R(d) = E[(Y
- d(X))2] is the mean square error of d(X) viewed as a predictor of Y.
In the classification problem as treated in previous chapters, the possible values of Y
are restricted to the finite set {1, ..., J} ; A = {1, ..., J} ; L(y, a) = C(a|y) is the cost of
classifying a class y object as a class a object; and R(d) is the expected cost of using
the classification rule d. Even in the classification context it is worthwhile to consider
the added generality obtained by allowing the set of actions to differ from the set of
classes. For example, {1, ..., J} might correspond to possible diseases and A to
possible treatments.
In the class probability estimation problem the values of Y are again restricted to {1,
..., J}. Now A is the set of J-tuples (a1, ..., aJ) of nonnegative numbers that sum to 1;
and L(y, a) is defined by
L(y, a) = Σj (ψj(y) - aj)²,
where ψj(y) = 1 if y = j and ψj(y) = 0 otherwise.
The quantity dj(X) can be viewed as a predictor of ψj(Y), with R(d) denoting the sum
of the corresponding mean square errors of prediction.
A Bayes rule dB is any rule d that minimizes R(d). To find such a Bayes rule,
observe that R(d) = E{E[L(Y, d(X))|X]} and hence that
R(d) = ∫ E[L(Y, d(x))|X = x] P(dx);
that is, R(d) is the integral of E[L(Y, d(x))|X = x] with respect to the distribution
of X. Thus, d is a Bayes rule if for each x ∈ X, a = d(x) minimizes E[L(Y, a)|X = x].
Also, the risk R(dB) of a Bayes rule dB can be written as
R(dB) = ∫ min[E[L(Y, a)|X = x] : a ∈ A] P(dx).
In the regression problem, E[L(Y, a)|X = x] = E[(Y - a)²|X = x]; consequently,
µ(x) = E[Y|X = x] is the unique value of a that minimizes E[L(Y, a)|X = x]. Therefore, the
Bayes rule is given by dB(x) = µ(x) = E[Y|X = x]. Observe also that for any rule d,
R(d) = R(dB) + E[(d(X) - µ(X))²].
In the classification problem, E[L(Y, i)|X = x] = Σj C(i|j)P(j|x), where P(j|x) = P(Y = j|X = x).
Thus, to obtain a Bayes rule, choose dB(x) to be, say, the smallest value of i that
minimizes Σj C(i|j)P(j|x). The risk of the Bayes rule is given by
R(dB) = ∫ min[Σj C(i|j)P(j|x) : 1 ≤ i ≤ J] P(dx).
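A toy evaluation of the classification Bayes rule follows; it is our own sketch, and the cost matrix and posterior probabilities are made up for illustration.

```python
import numpy as np

# Misclassification costs C[i, j] = C(i|j): cost of taking action i when the
# true class is j.  Hypothetical 3-class example.
C = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])

def bayes_action(posterior, C):
    """Smallest i minimizing the expected cost sum_j C(i|j) P(j|x)."""
    expected_cost = C @ posterior          # one entry per action i
    return int(np.argmin(expected_cost)), expected_cost

# At this x, class 3 is fairly likely, so its high cost under action 1 matters.
posterior = np.array([0.5, 0.1, 0.4])
action, costs = bayes_action(posterior, C)
print(action, np.round(costs, 2))          # the second action (index 1) is chosen
```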
To find the Bayes rule for the class probability estimation problem, observe that P(j|x)
= E[ψj(Y)|X = x] and hence that E[ψj(Y) - P(j|x)|X = x] = 0. It is also easily seen
that
E[(ψj(Y) - aj)²|X = x] = E[(ψj(Y) - P(j|x))²|X = x] + (P(j|x) - aj)².
Thus, it follows as in the proof of the corresponding result for the regression problem
that
E[L(Y, a)|X = x] = Σj E[(ψj(Y) - P(j|x))²|X = x] + Σj (P(j|x) - aj)²
for a = (a1, ..., aJ). Consequently, the Bayes rule is given by dB(x) = (P(1|x), ..., P(J|x)),
and it has risk
R(dB) = ∫ Σj P(j|x)(1 - P(j|x)) P(dx).
9.2 BAYES RULE FOR A PARTITION
A Bayes rule corresponding to the partition T is a rule of the preceding form having
the smallest possible risk. Thus, d is a Bayes rule corresponding to T if, and only if,
d(x) = ν(τ(x)), where for each t ∈ T, a = ν(t) minimizes E[L(Y, a)|X ∈ t]. From now
on, given a node t, let ν(t) be a value of a that minimizes E[L(Y, a)|X ∈ t]. Also, set
r(t) = E[L(Y, ν(t))|X ∈ t]  and  R(t) = P(X ∈ t)r(t).
In the regression problem, given a node t, set µ(t) = E(Y|X ∈ t) and
r(t) = E[(Y - µ(t))²|X ∈ t].
In the classification problem, given a node t, set P(j|t) = P(Y = j|X ∈ t). Here a
Bayes rule corresponding to T is obtained by choosing ν(t) to be a value (for example,
the smallest value) of i that minimizes Σj C(i|j)P(j|t) and setting d(x) = ν(τ(x)).
Moreover,
r(t) = min[Σj C(i|j)P(j|t) : 1 ≤ i ≤ J].
In the class probability estimation problem, the unique Bayes rule corresponding
to T is given by d(x) = ν(τ(x)), where ν(t) = (P(1|t), ..., P(J|t)). Also,
r(t) = Σj P(j|t)(1 - P(j|t)).
THEOREM 9.1. Let t be a node and let T be a collection of nodes that forms a
partition of t. Then R(t) ≥ Σs∈T R(s), with equality holding if, and only if,
(9.2)
and
(9.3)
Consequently,
which yields the desired result.
Let T, T′ be two partitions of X into disjoint nodes. Then T′ is a refinement of T
if for any pair of nodes t ∈ T, s ∈ T′, either s is a subset of t or s and t are
disjoint. (Recall that if T and T′ are trees having root X, then the corresponding
collections T̃ and T̃′ of terminal nodes are each partitions of X. Observe that if T′ is a pruned subtree of T,
as defined in Section 3.2, then T̃ is a refinement of T̃′.)
THEOREM 9.4. Let T and T′ be partitions of X into disjoint nodes such that T′ is a refinement of T.
Then R(T′) ≤ R(T), with equality holding if and only if r(s) = E[L(Y, ν(t))|X ∈ s] for all
pairs t ∈ T, s ∈ T′ such that s ⊂ t.
PROOF. Given t ∈ T, let Tt′ be the partition of t defined by Tt′ = {s ∈ T′ : s ⊂ t}. It
follows from (9.2) and (9.3) that
9.3 RISK REDUCTION SPLITTING RULE
Let T be a fixed partition and let t ∈ T also be fixed. Consider a split s of t into two
disjoint nodes tL and tR. Set
and
The relative risk reduction
r(tL) = E[L(Y, ν(t))|X ∈ tL] and r(tR) = E[L(Y, ν(t))|X ∈ tR].
Consider a split s of t into tL, tR. According to Theorem 9.5, the risk of the Bayes
rule for the modified partition cannot be more than the risk for the original partition; it
is strictly smaller unless every choice of the action a that is optimal for t is also
optimal for tL and tR.
In the regression problem,
Consequently,
Analogous results hold in the class probability estimation problem. That is,
and
9.4 CATEGORICAL SPLITS
Consider splits of t into tL, tR based on the mth coordinate of x, which is assumed here
to be a categorical variable whose possible values range over a finite set B. Then
Since φ is a concave function on an interval,
where B2 = B - B1. Note here that ψ(B1) = ψ(B2). By the concavity of φ,
(9.7)
Since B1 is optimal, it follows from (9.7) that A3 is optimal; also, Q3 > 0, so
and hence
(9.8)
Now.
Finally,
(9.9)
which is equivalent to (9.7).
It is easily seen that Q13 = vQ3 + (1 - v)Q123 and Q24 = (1 - v)Q4 + vQ124. Define Y
and Y by
and
(9.10)
and
(9.11)
Next it will be shown that
(9.12)
By assumption, Y1 ≥ Y2 and Y13 < Y24. If Y1 = Y2, then Y = Y13 and Y = Y24, so (9.12)
is trivially satisfied. Otherwise, Y1 > Y2 and hence
Thus, by the concavity of φ,
Consequently,
so (9.12) again holds. Since (9.9) follows from (9.10) - (9.12), the proof of the
theorem is complete.
10
OPTIMAL PRUNING
10.1 TREE TERMINOLOGY
For purposes of this chapter, a tree can be defined as consisting of a finite nonempty
set T of positive integers and two functions left(·) and right(·) from T to T ∪ {0},
which together satisfy the following two properties:
(i) For each t ∈ T, either left(t) = right(t) = 0 or left(t) > t and right(t) > t;
(ii) For each t ∈ T, other than the smallest integer in T, there is exactly one s ∈ T
such that either t = left(s) or t = right (s).
For simplicity, T itself will be referred to as a tree and, as before, each element of T
will be referred to as a node. The correspondence between nodes of T and subsets of X
is irrelevant in this chapter.
Figure 10.1 shows a tree that arose in the discussion of the digit recognition
example in Section 2.6.1. Table 10.1 determines the same tree by specifying the values
of ℓ(t) = left(t) and r(t) = right(t). The reader is encouraged to apply to this tree the
various definitions that follow.
The triple T, left(·), right(·) can always be chosen (as they were in the example in
Figure 10.1) so that for each node t either left(t) = right(t) = 0 or left(t) > 0 and right(t)
= left(t) + 1. Under this restriction, the tree is determined by specifying left(·).
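A minimal rendering of this terminology is sketched below; it is our own illustration (the dictionary representation and function names are not from the monograph), with 0 meaning "no child" and left(t) = right(t) = 0 marking a terminal node.

```python
left = {1: 2, 2: 4, 3: 0, 4: 0, 5: 0}
right = {1: 3, 2: 5, 3: 0, 4: 0, 5: 0}

def terminal_nodes(T, left, right):
    """The collection T-tilde of terminal nodes of T."""
    return {t for t in T if left[t] == 0 and right[t] == 0}

def is_tree(T, left, right):
    """Check properties (i) and (ii) of the definition above."""
    root = min(T)
    ok_children = all(
        (left[t] == 0 and right[t] == 0) or (left[t] > t and right[t] > t)
        for t in T)
    children = [c for s in T for c in (left[s], right[s]) if c != 0]
    # every node except the root has exactly one parent
    ok_parents = all(children.count(t) == 1 for t in T if t != root)
    return ok_children and ok_parents

T = {1, 2, 3, 4, 5}
print(is_tree(T, left, right), terminal_nodes(T, left, right))   # True {3, 4, 5}
```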
FIGURE 10.1
TABLE 10.1
The minimum element of a tree T is called the root of T, denoted by root(T). If s, t ∈
T and t = left(s) or t = right(s), then s is called the parent of t. The root of T has no
parent, but every other node in T has a unique parent. Let parent(·) be the function from
T to T ∪ {0} defined so that parent(root(T)) = 0 and parent(t) is the parent of t for t ≠
root(T). A node t is called a terminal node if it is not a parent, that is, if left(t) = right(t)
= 0. Let T̃ denote the collection of terminal nodes of T. The elements in T - T̃ are
called nonterminal nodes.
A node s is called an ancestor of a node t if s = parent(t) or s = parent(parent(t)) or
.... A node t is called a descendant of s if s is an ancestor of t. If s is an ancestor of t,
there is a unique “path” s0, ..., sm from s to t (that is, such that s0 = s, sk-1 = parent(sk)
for 1 ≤ k ≤ m, and sm = t). Let ℓ(s, t) = m denote the length of this path. Also set ℓ(t, t) =
0.
Given a nonempty subset T1 of T, define left1(·) and right1(·) from T1 to T1 ∪ {0}
by
left1(t) = left(t) if left(t) ∈ T1 and left1(t) = 0 otherwise,
and
right1(t) = right(t) if right(t) ∈ T1 and right1(t) = 0 otherwise.
Then T1 is called a subtree of T if the triple T1, left1(·), right1(·) forms a tree. If T1 is a
subtree of T, and T2 is a subtree of T1, then T2 is a subtree of T. Given t ∈ T, the
collection Tt consisting of t and all its descendants is called the branch of T stemming
from t. It is a subtree of T. The branch (T1)t of a subtree T1 of T is denoted by T1t.
The tree T is said to be trivial if any of the following equivalent conditions is
satisfied: |T| = 1; |T̃| = 1; T = {root(T)}; T - T̃ is empty. (Recall that |S| denotes the
number of elements in the set S.) Otherwise, T is said to be nontrivial.
Given a nontrivial tree T, set t1 = root(T), tL = left(t1), tR = right(t1), TL = TtL, and TR
= TtR. Then TL is called the left primary branch of T and TR is called the right primary
branch. Observe that {t1}, TL, TR are nonempty disjoint sets whose union is T and that
T̃L, T̃R are nonempty disjoint sets whose union is T̃; in particular,
|T| = 1 + |TL| + |TR|   (10.1)
and
|T̃| = |T̃L| + |T̃R|.   (10.2)
Properties of trees are typically proved by induction based on the observation that
the primary branches of a nontrivial tree are trees having fewer terminal nodes than the
original one. For example, it follows easily from (10.1), (10.2), and induction that
|T| = 2|T̃| - 1.   (10.3)
A subtree T1 of T is called a pruned subtree of T if root(T1) = root(T); this is denoted
by T1 ⪯ T or T ⪰ T1. The notation T1 ≺ T (or T ≻ T1) is used to indicate that T1 ⪯ T and
T1 ≠ T. Observe that ⪯ is transitive; that is, if T1 ⪯ T and T2 ⪯ T1, then T2 ⪯ T.
Similarly, ≺ is transitive. An arbitrary subset T1 of T is a subtree of T having root t if,
and only if, it is a pruned subtree of Tt.
The definitions given here differ slightly from those in Chapter 3. There T1 was
called a pruned subtree of T if T1 ⪯ T. Also, pruning was given a “bottom-up”
definition. This definition is very suggestive, but the “top-down” definition (root(T1) =
root(T)) given here will be more convenient in using induction to establish the validity
of the optimal pruning algorithm. The two definitions are obviously equivalent to each
other.
Let T be a nontrivial tree having root t and primary branches TL and TR. Let T′ be a
nontrivial pruned subtree of T and let T′L and T′R, respectively, denote its left and right
primary branches. Then T′L is a pruned subtree of TL and T′R is a pruned subtree of TR.
Moreover, every pair of pruned subtrees of TL, TR arises uniquely in this manner.
Observe that T̃′L and T̃′R are disjoint sets whose union is T̃′. In particular,
|T̃′| = |T̃′L| + |T̃′R|.   (10.4)
Let T″ be a second nontrivial pruned subtree of T. Then T″ ⪯ T′ if, and only if,
T″L ⪯ T′L and T″R ⪯ T′R. The following result is now easily proved by induction.
THEOREM 10.5. Let T1 and T2 be pruned subtrees of T. Then T2 is a pruned subtree
of T1 if, and only if, every nonterminal node of T2 is also a nonterminal node of T1.
Let #(T) temporarily denote the number of pruned subtrees of a tree T. If T is
trivial, then #(T) = 1. Otherwise,
#(T) = #(TL) · #(TR) + 1.   (10.6)
It is obvious from (10.6) that the maximum number of pruned subtrees of a tree having
m terminal nodes increases rapidly with m. In particular, let Tn be the tree all of whose
terminal nodes have exactly n ancestors, so that |T̃n| = 2^n. It follows from (10.6) that
#(Tn+1) = (#(Tn))² + 1. Consequently, #(T1) = 2, #(T2) = 5, #(T3) = 26, #(T4) = 677,
and so forth. Now (#(Tn))^(1/|T̃n|) is easily seen to be increasing in n and to converge
rapidly to a number b = 1.5028368. It is left as an interesting curiosity for the reader to
show that #(Tn) = [b^(|T̃n|)] for every n > 1, where [c] denotes the largest integer no
larger than c. P. Feigin came up with a short elegant proof of this result.
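The recursion (10.6) and the numbers quoted above can be checked in a few lines; the small script below is our own illustration.

```python
# The recursion #(T) = #(T_L) * #(T_R) + 1 from (10.6), specialized to the
# complete binary tree T_n whose terminal nodes all have exactly n ancestors.
def num_pruned_subtrees(n):
    count = 2                      # #(T_1) = 2
    for _ in range(2, n + 1):
        count = count * count + 1  # #(T_{n+1}) = (#(T_n))^2 + 1
    return count

for n in range(1, 5):
    print(n, num_pruned_subtrees(n))        # 2, 5, 26, 677

# The growth rate (#(T_n))**(1 / 2**n) converges rapidly to b = 1.5028...
print(num_pruned_subtrees(6) ** (1 / 2 ** 6))
```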
10.2 OPTIMALLY PRUNED SUBTREES
(This material should be read in conjunction with the more intuitive discussion in
Section 3.2.)
Let T0 be a fixed nontrivial tree (for example, the tree Tmax in earlier chapters) and
let R(t), t ∈ T0, be fixed real numbers. Given a real number α, set Rα(t) = R(t) + α for t
∈ T0. Given a subtree T of T0, set
R(T) = Σt∈T̃ R(t)
and
Rα(T) = Σt∈T̃ Rα(t) = R(T) + α|T̃|.
If T is a trivial tree having root t1, then R(T) = R(t1) and Rα(T) = Rα(t1).
A pruned subtree T1 of T is said to be an optimally pruned subtree of T (with respect
to α) if
Rα(T1) = min[Rα(T′) : T′ ⪯ T].
Since there are only finitely many pruned subtrees of T, there is clearly an optimal
one, but not necessarily a unique one. An optimally pruned subtree T1 of T is said to
be the smallest optimally pruned subtree of T if T1 ⪯ T′ for every optimally pruned
subtree T′ of T. There is clearly at most one smallest optimally pruned subtree of T
(with respect to α); when it exists, it is denoted by T(α).
Let T′ be a nontrivial pruned subtree of T and let T′L and T′R be its two primary
branches. Then
Rα(T′) = Rα(T′L) + Rα(T′R).
Theorem 10.7 now follows easily by induction. (In this result, TL(α) and TR(α) denote
the smallest optimally pruned subtrees of TL and TR, respectively.)
THEOREM 10.7. Every tree T has a unique smallest optimally pruned subtree T(α).
Let T be a nontrivial tree having root t1 and primary branches TL and TR. Then
PROOF. It is easily seen that if T(α1) is trivial, then T(α2) is trivial for α2 ≥ α1.
Statement (i) now follows from Theorem 10.7 by induction. If α2 > α1 and T(α2)
T(α1), then
and
PROOF. The theorem will be proved by induction. It is clearly true when |T̃| = 1.
Suppose it is true for all trees having fewer than n terminal nodes, where n ≥ 2. Let T
be a tree having n terminal nodes, root t1, and primary branches TL and TR. By
hypothesis,
and
It is easily seen that for each t ∈ T - T̃ and real number α, the following two statements
are valid:
(10.12)
(10.13)
PROOF. It follows immediately from Theorem 10.10 that T is the unique optimally
pruned subtree of itself wrt α for α < α1; that T is an optimally pruned subtree of itself
wrt α1, but not the smallest one; and that (10.12) holds. In particular,
Consequently,
and hence
(10.15)
(10.16)
and
(10.17)
By definition,
(10.18)
Formulas (10.15), (10.16), and (10.18) together yield an algorithm for determining K,
the αk’s, and the Tk’s and hence for determining T0(α), -∞<α<∞, from (10.17).
Let 1 ≤ k < K. Then Tk is an optimally pruned subtree of itself wrt αk+1, but
(10.19)
Formula (10.23) in the next result is more convenient for determining T0(α),
-∞ < α < ∞, than is (10.17).
THEOREM 10.20. Let gk(t), t ∈ T0 - T̃0 and 0 ≤ k ≤ K - 1, be defined recursively by
(10.21)
and, for 1 ≤ k ≤ K - 1,
(10.22)
Then for -∞ < α < ∞,
(10.23)
PROOF. It suffices to show that if 0 ≤ k ≤ K - 1 and α ≤ αk+1, then
(10.24)
(for then (10.23) holds for α ≤ αK; since T0(αK) is trivial and T0(α) ⪯ T0(αK) for α > αK
by Theorem 10.9(i), (10.23) holds for -∞ < α < ∞).
It follows from Theorem 10.11 that (10.24) is valid for k = 0 and α < α1. Suppose now
that 1 ≤ k ≤ K - 1 and
(10.25)
Then, in particular
(10.26)
and hence
(10.27)
Now Tk - T̃k ⊂ Tk-1 - T̃k-1, so by (10.13) and (10.22),
(10.28)
thus it follows from (10.27) that for s ∈ T0 - T̃0 and α ≤ αk, gk(s) > α if, and only if, gk-1(s)
> α. Hence (10.25) implies (10.24) for α < αk. Suppose now that αk < α < αk+1. Then
THEOREM 10.30. Let t ∈ T̃0 and let -∞ < α < ∞. If gK-1(s) > α for all ancestors s of t,
then t ∈ T̃0(α). Otherwise, let s be the first node of T0 along the path from the root of
T0 to t for which gK-1(s) ≤ α. Then s is the unique ancestor of t in T̃0(α).
Recall that if s is an ancestor of t, then ℓ(s, t) is the length of the path from s to t.
Given -∞ < α < ∞ and t ∈ T0, set
by Theorem 10.10 and hence R(s) > α(|T̃0s(α)| - 1). It is easily shown by induction that
|T̃0s(α)| ≥ ℓ(s, t) + 2 and therefore that R(s) > α(ℓ(s, t) + 1). This yields the desired
conclusion.
THEOREM 10.32. Suppose that R(t) ≥ 0 for all t ∈ T0. Given a real number α, set
10.3 AN EXPLICIT OPTIMAL PRUNING ALGORITHM
Consider an initial tree T0 = {1, ..., m}, where m is specified, as are αmin and the
quantities ℓ(t) = left(t), r(t) = right(t), and R(t) for 1 ≤ t ≤ m. Algorithm 10.1 can be
used to find T0(α) for α > αmin. In this algorithm, “k := 1” means “set k equal to 1”; ∞
is to be interpreted as a large positive number; N(t) = |T̃kt|; S(t) = R(Tkt); g(t) = gk(t);
and G(t) = min[gk(s) : s ∈ Tkt - T̃kt]. The statements following “repeat” are to be cycled
through until the condition that N(1) = 1 (that is, that Tk is trivial) is checked and
satisfied; at this point the algorithm is finished, k = K, and g(t) = gK-1(t) for t ∈ T0 - T̃0.
The write statement writes out k, |T̃k|, αk, and R(Tk) for 1 ≤ k ≤ K. The small positive
number ε is included to prevent computer round-off error from generating extraneous
trees Tk with αk = αk-1. In the intended statistical applications, ε can be chosen to be a
small, positive constant times R(1).
The algorithm was applied with T0 being the tree T7 having 10 terminal nodes that
arose in the discussion of the stochastic digit recognition problem in Section 3.5.1;
R(t) is also taken from that problem; and αmin = 0. Table 10.2 shows the output
of the write statement, while Table 10.3 shows the values of g(t) = g6(t) after the
algorithm is finished.
ALGORITHM 10.1
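The spirit of the weakest-link computation can be conveyed in a short script. The sketch below is our own compact illustration, not a transcription of the published Algorithm 10.1: it skips αmin and the round-off guard ε, stores left, right, and R in dictionaries keyed by node number, and reports the sequence of critical values and tree sizes.

```python
def prune_sequence(left, right, R, root=1):
    """Sequence of (alpha_k, |T_k tilde|) from weakest-link pruning.

    g(t) = (R(t) - R(T_t)) / (|T_t tilde| - 1) is computed for every
    nonterminal node of the current tree; the nodes attaining the minimum
    are made terminal, and that minimum is the next critical value alpha.
    """
    left, right = dict(left), dict(right)             # working copies

    def leaves(t):                                     # terminal nodes of branch T_t
        return [t] if left[t] == 0 else leaves(left[t]) + leaves(right[t])

    def internal(t):                                   # nonterminal nodes of branch T_t
        return [] if left[t] == 0 else [t] + internal(left[t]) + internal(right[t])

    seq = [(0.0, len(leaves(root)))]
    while left[root] != 0:
        g = {t: (R[t] - sum(R[s] for s in leaves(t))) / (len(leaves(t)) - 1)
             for t in internal(root)}
        alpha = min(g.values())                        # next critical value alpha_k
        for t, gt in g.items():
            if gt <= alpha:
                left[t] = right[t] = 0                 # collapse the weakest links
        seq.append((alpha, len(leaves(root))))
    return seq

# Hypothetical 5-node tree with resubstitution costs R(t).
left  = {1: 2, 2: 4, 3: 0, 4: 0, 5: 0}
right = {1: 3, 2: 5, 3: 0, 4: 0, 5: 0}
R     = {1: 10.0, 2: 6.0, 3: 2.0, 4: 2.5, 5: 2.5}
print(prune_sequence(left, right, R))   # [(0.0, 3), (1.0, 2), (2.0, 1)]
```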
TABLE 10.2
TABLE 10.3
Figure 10.2 shows the tree T0 = T1 with g6(t) written under each nonterminal node
of T0. By (10.23),
FIGURE 10.2
Thus, in going from T1 to T2, nodes 18 and 19 are removed; in going from T2 to T3,
nodes 10, 11, 14, and 15 are removed; in going from T3 to T4, nodes 16 and 17 are
removed; in going from T4 to T5, nodes 8 and 9 are removed; in going from T5 to T6,
nodes 4, 5, 6, 7, 12, and 13 are removed; and in going from T6 to T7, nodes 2 and 3 are
removed. According to (10.17), T0(α) = Tk for 1 ≤ k ≤ 6 and αk ≤ α < αk+1, while T0(α) =
T7 = {1} for α ≥ α7. Figure 10.2 can also be used to illustrate Theorem 10.30.
In this particular example, gK-1(t) ≤ gK-1(s) whenever s, t are nonterminal nodes of
T0 and t is a descendant of s. But such a result is not true in general.
11
CONSTRUCTION OF TREES FROM A LEARNING SAMPLE
In Chapter 9, and implicitly in Chapter 10 as well, the joint distribution of (X, Y) was
assumed to be known. In the first three sections of the present chapter, the methods
described in Chapters 9 and 10 are modified and combined to yield tree construction
procedures based on a learning sample. In Sections 11.4 and 11.5, methods based on
test samples and cross-validation are developed for obtaining nearly unbiased
estimates of the overall risk of a tree structured rule. Section 11.6 treats the selection
of a particular optimally pruned subtree from among the K candidates. In Section 11.7,
an alternative method based on the bootstrap is considered for eliminating the
overoptimism of the resubstitution estimate of the overall risk; it is shown, however,
that there exist situations in which this alternative method is defective. Section 11.8
treats the tendency of splitting rules based on a learning sample to prefer end-cut splits
—that is, splits in which the estimated probability of going to the left is close to zero
or one.
11.1 ESTIMATED BAYES RULE FOR A PARTITION
Recall the terminology from Sections 9.1 and 9.2. Let (Xn, Yn), n ∈ η = {1, ..., N}, be
a random (learning) sample of size N from the joint distribution of (X, Y). Let T be a
partition of X, which may depend on the learning sample. Given t ∈ T, set η(t) = {n ∈
η : Xn ∈ t}, N(t) = |η(t)|, and estimate P(t) = P(X ∈ t) by p(t) = N(t)/N. Suppose that
p(t) > 0 for all t ∈ T.
Consider the estimate d of the Bayes rule for the partition T that has the form
d(x) = ν(τ(x)), x ∈ X, where τ is the partition function corresponding to T and ν(t) is
defined separately for regression, classification, and class probability estimation.
In the regression problem, let ȳ(t) and s²(t) denote the sample mean and sample variance, respectively, of the numbers Yn, n ∈
η(t). Set ν(t) = ȳ(t).
In classification and class probability estimation, there are two models to consider:
one model when the prior probabilities π(j), 1 ≤ j ≤ J, are unknown and estimated from
the learning sample and another model when these prior probabilities take on known
(or assumed) values.
Consider first the model 1 version of classification or class probability estimation in
which the prior probabilities must be estimated from the learning sample. Here the
random variable Y ranges over {1, ..., J}. For 1 ≤ j ≤ J, set ηj(t) = {n ∈ η(t): Yn = j},
Nj(t) = |ηj(t)|, and p(j|t) = Nj(t)/N(t). In the classification problem let ν(t) be the
smallest value of i ∈ {1, ..., J} that minimizes Σj C(i|j)p(j|t). In the class probability
estimation problem let ν(t) denote the vector (p(1|t), ..., p(J|t)) of estimated conditional
probabilities, p(j|t) being an estimate of P(j|t) = P(Y = j|X ∈ t).
Recall from Section 9.1 the definition of L(y, a) in these three problems. The
resubstitution estimate R(d) of the risk R*(d) of the rule d is defined by
R(d) = Σt∈T̃ R(t),
where now
r(t) = (1/N(t)) Σn∈η(t) L(Yn, ν(t))
and R(t) = p(t)r(t). In the regression problem, r(t) = s²(t); in the model 1 version of
classification, r(t) = min[Σj C(i|j)p(j|t) : 1 ≤ i ≤ J].
In this alternative setup the estimated Bayes rule d is again of the form d (x) =
ν(τ(x)), x ∈ X, where ν(t), t ∈ T̃, is now defined as follows. In the classification
problem, ν(t) is the smallest value of i that minimizes Σj C(i|j)p(j|t); equivalently, ν(t) is the smallest value of i that minimizes Σj C(i|j)π(j)Nj(t)/Nj. In the
class probability estimation problem, ν(t) = (p(1|t), ..., p(J|t)), as before.
In either case the resubstitution estimate R or R* is defined by
11.2 EMPIRICAL RISK REDUCTION SPLITTING RULE
Let a learning sample be available for either the regression problem or the model 1 or
model 2 version of classification or class probability estimation. Let T be a partition of
X such that p(t) > 0 for t ∈ T.
Consider a split s of t ∈ T into tL, tR, where p(tL) > 0 and p(tR) > 0. Set
Let T′ be the modification of T obtained by replacing t by the pair tL, tR. The empirical
risk reduction ΔR(s, t) = R(T) - R(T′) due to the split is given by
11.3 OPTIMAL PRUNING
Consider a tree to be constructed from a learning sample for the purpose of regression,
classification, or class probability estimation. Each node t of the tree is identified with
a subset of X, and if tL = left (t) and tR = right(t), then tL, tR is a partition of t. Let a
fixed splitting rule (for example, empirical risk reduction) be employed.
Choose αmin ≥ 0. Do not consider splitting node t if Sαmin(t) ≤ 0 (see Theorem
10.32), where Sα(t) is defined as in Section 10.2. Additional conditions for not
splitting a node t could be used—for example, do not split t if p(t) < ε1 (ε1 being a
fixed positive number); or do not split a node having more than, say, 10 ancestors.
Similarly, conditions for not considering an otherwise allowable split of t into tL, tR
might be employed—for example, do not consider the split if p(tL) < ε2 or p(tR) < ε2.
Let T0 be the initial tree obtained by continuing the splitting process until it comes
to a halt by considerations such as those in the previous paragraph. (This was denoted
by Tmax in Chapter 3.) Given a pruned subtree T of T0, set R(T) = Σt∈T̃ R(t). The
optimal pruning algorithm applied to T0, αmin, and R(t) for t ∈ T0 yields the following:
a positive integer K; a strictly increasing sequence αk, 1 ≤ k ≤ K, of numbers such that
α1 = αmin; trees Tk, 1 < k < K, such that T0 T1, Tk Tk+1 for 1 ≤ k < K, and TK =
{root(T0)} = {X}. Necessarily,
Let 1 ≤ k < K and αk ≤α<αk+1 (αK ≤α<∞if k = K). Then T0(α) = Tk. That is, Tk
minimizes R(T) + α| | among all pruned subtrees of T0; and if Tʹ also minimizes this
expression, then Tʹ Tk. Let g(t) = gK-1(t) be determined by the optimal pruning
algorithm.
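The optimal pruning algorithm itself is the weakest-link cutting of Chapters 3 and 10: repeatedly compute g(t) = (R(t) − R(Tt))/(|T̃t| − 1) for each internal node t of the current tree and cut every branch attaining the minimum. A compact Python sketch follows, with node fields and bookkeeping of our own choosing rather than the book's notation.

class Node:
    def __init__(self, R, left=None, right=None):
        self.R = R              # resubstitution contribution R(t) = p(t) r(t)
        self.left = left
        self.right = right
        self.pruned = False     # True once the branch rooted at this node has been cut

def branch_stats(t):
    # Sum of R over the terminal nodes of the branch T_t, and the number of those nodes.
    if t.left is None or t.pruned:
        return t.R, 1
    RL, nL = branch_stats(t.left)
    RR, nR = branch_stats(t.right)
    return RL + RR, nL + nR

def weakest_links(t, out):
    # Collect (g(t), t) for every internal node t of the current pruned tree.
    if t.left is None or t.pruned:
        return
    Rb, n = branch_stats(t)
    out.append(((t.R - Rb) / (n - 1), t))
    weakest_links(t.left, out)
    weakest_links(t.right, out)

def prune_sequence(root):
    # Yield (alpha, number of terminal nodes) after each round of weakest-link cutting.
    while root.left is not None and not root.pruned:
        links = []
        weakest_links(root, links)
        g_min = min(g for g, _ in links)
        for g, t in links:
            if g == g_min:      # cut every weakest link
                t.pruned = True
        yield g_min, branch_stats(root)[1]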
For 1 ≤ k ≤ K, let τk be the partition function corresponding to T̃k and let dk
be the estimated Bayes rule corresponding to this partition. Then dk(x) = ν(τk(x)) for x
∈ X, where ν(t) is defined in Section 11.1, and R(dk) = R(Tk). Given x ∈ X, the
quantities dk(x), 1 ≤ k ≤ K, can easily be determined together according to Algorithm
11.1 (see Theorem 10.30). In the description of this algorithm, t1 = root(T0), ℓ(t) =
left(t), r(t) = right(t), and t is a terminal node if, and only if, ℓ(t) = 0.
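Since TK ≤ ⋯ ≤ T1 ≤ T0 are nested, the terminal node of Tk containing x is an ancestor of, or equal to, the terminal node of T0 containing x, so all of dk(x), 1 ≤ k ≤ K, can indeed be read off a single descent from t1. The sketch below conveys the idea; it is not a transcription of Algorithm 11.1, and it assumes each node carries a field prune_step(t), the smallest k for which t is a terminal node of Tk (taken to be 0 if t is already terminal in T0), which is bookkeeping of ours.

def d_all(x, root, K, nu, goes_left, prune_step):
    # Compute d_k(x) for k = 1, ..., K in one pass down T_0.
    #   nu(t)           -- the node value nu(t) of Section 11.1
    #   goes_left(x, t) -- True if the split at t sends x to left(t)
    #   prune_step(t)   -- smallest k with t terminal in T_k (0 if terminal in T_0)
    #   t.left, t.right -- the children left(t), right(t); t.left is None at a terminal node of T_0
    d = [None] * (K + 1)                  # d[k] will hold d_k(x); index 0 unused
    t, k = root, K
    while k >= 1:
        while k >= 1 and k >= prune_step(t):
            d[k] = nu(t)                  # t is the terminal node of T_k containing x
            k -= 1
        if k >= 1:
            t = t.left if goes_left(x, t) else t.right
    return d[1:]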
ALGORITHM 11.1
11.4 TEST SAMPLES
Consider next the model 2 version of classification or class probability estimation. For
1 ≤ j ≤ J, let Xn, n ∈ η′j, be a random sample of size N′j = |η′j| from the conditional
distribution of X, given that Y = j (the 2J sets ηj, 1 ≤ j ≤ J, and η′j, 1 ≤ j ≤ J, are taken
to be disjoint). The test sample is now given by (Xn, Yn), n ∈ η′, where η′ = ∪j η′j has
N′ = Σj N′j members and Yn = j for n ∈ η′j. The corresponding estimate Rts(dk) of R*(dk)
is defined by
In the model 2 version of class probability estimation, dK(x) = ν for all x ∈ X, where ν
is the vector (π(1), ..., π(J)) of probabilities, and
and that (Ȳ − µ)² is of the order 1/N′ by the central limit theorem. Therefore, Ȳ can be
replaced by µ in determining the standard error of REts(dk). To determine this standard
error, think of dk as fixed and rewrite the modified expression for REts(dk) in the form
Here U1 and U2 are consistent estimates of µ1 = R*(dk) and µ2 = E(Y − µ)²,
respectively, and H(U1, U2) is a consistent estimate of H(µ1, µ2) = µ1/µ2. The
asymptotic variance of H(U1, U2) is the same as the variance of
namely,
where σ1² = Var U1n, σ2² = Var U2n, and σ12 = Cov(U1n, U2n) for n ∈ η′. Now σ1², σ2²,
and σ12 can be estimated, respectively, by
and
11.5 CROSS-VALIDATION
The use of test samples to estimate the risk of tree structured procedures requires that
one set of sample data be used to construct the procedure and a disjoint set be used to
evaluate it. When the combined set of available data contains a thousand or more
cases, this is a reasonable approach. But if only a few hundred cases or less in total are
available, it can be inefficient in its use of the available data; cross-validation is then
preferable.
Let Tk and dk, 1 ≤ k ≤ K, be obtained by optimal pruning from the entire
learning sample (Xn, Yn), n ∈ η. Then T0(α) = Tk for αk ≤ α < αk+1, where αK+1 = ∞.
Suppose that α1 = αmin > 0. Let α′k denote the geometric mean (αkαk+1)1/2 of αk and
αk+1, with α′K = ∞.
Choose a positive integer V ≥ 2. Randomly divide η into V sets ηv, 1 ≤ v ≤ V, of
nearly equal size. Define vn ∈ {1, ..., V} for n ∈ η by vn = v if n ∈ ηv. For 1 ≤ v ≤ V and
be the rule corresponding to the tree constructed from the same data.
(Observe that d(v)k depends in a minor way—through α′k—on the entire learning
sample. Observe also that the data (Xn, Yn), n ∈ ηv, were denoted by Lv in the first part
of the book.)
Let k ∈ {1, ..., K} be fixed. In the regression problem or the model 1 version of
classification or class probability estimation, the cross-validation estimate of the risk
R*(dk) of dk is defined by Rcv(dk) = N−1 Σn L(Yn, d(vn)k(Xn)).
It is not at all clear how to obtain a rigorously valid standard error estimate for
Rcv(dk), since the random variables L(Yn, d(vn)k(Xn)) are by no means independent.
The heuristic device of simply ignoring this lack of independence yields a formula
similar to that obtained for the corresponding test sample estimate and which appears
to work reasonably well in practice; specifically,
where
where sj² is the sample variance of the numbers L(j, d(vn)k(Xn)), n ∈ ηj, so that
To evaluate these formulas efficiently, for each v ∈ {1, ..., V} and n ∈ ηv, the
numbers L(Yn, d(v)k(Xn)), 1 ≤ k ≤ K, should be calculated together. This is easily done
along the lines of Algorithm 11.1.
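A schematic of this bookkeeping in Python, with hypothetical helpers grow_tree (growing the tree on the cases outside ηv) and rule_for_alpha (returning the rule for complexity parameter α′k); both names are stand-ins of ours, not routines from the book.

import numpy as np

def cv_risk(X, Y, folds, alpha_primes, L, grow_tree, rule_for_alpha):
    # Cross-validation estimates Rcv(d_k), k = 1, ..., K.
    # folds[n] gives the fold v_n of case n; alpha_primes[k-1] = alpha'_k; L(y, a) is the loss.
    X, Y, folds = np.asarray(X), np.asarray(Y), np.asarray(folds)
    N, K = len(Y), len(alpha_primes)
    loss = np.zeros((K, N))
    for v in np.unique(folds):
        held_out = np.where(folds == v)[0]
        train = folds != v
        T0_v = grow_tree(X[train], Y[train])      # tree grown without fold v
        for k, a in enumerate(alpha_primes):
            d = rule_for_alpha(T0_v, a)           # the rule d_k^(v)
            for n in held_out:
                loss[k, n] = L(Y[n], d(X[n]))
    return loss.mean(axis=1)                      # Rcv(d_k) = N^{-1} sum_n L(Y_n, d_k^(v_n)(X_n))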
Experience in the use of the preceding formulas for SE(Rcv(dk)) in the evaluation of
tree structured procedures constructed from simulated data indicates that the expected
value of SE2(Rcv(dk)) is typically close to the expected value of [Rcv(dk) - R*(dk)]2.
But the probability that |Rcv (dk) - R*(dk)| exceeds, say, 2 SE(Rcv(dk)) can be noticeably
larger than the value .05 suggested by normal approximation. (This is especially true
when there are a few relatively large values of L(Yn, d(vn)k(Xn)) and when the tree has
very few terminal nodes.)
In the regression context, temporarily set Ȳ = N−1 Σn Yn and s² = N−1 Σn (Yn − Ȳ)². Then
REcv(dk) = Rcv(dk)/s² is an estimate of the relative mean square error RE*(dk). By analogy with the
formula derived in Section 11.4 for the standard error of REts(dk), one can derive a
heuristic formula for the standard error of REcv(dk). The result is
where
and
11.6 FINAL TREE SELECTION
applied directly to the rounded-off values of REcv in the table. Then only six of the
trees Tk, 1 ≤ k ≤ 180, emerge as being allowable, namely, Tk for k = 168, 174, 175,
177, 179, 180.
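Algorithm 11.2 itself is not reproduced here. For comparison, the one standard error rule of Chapter 3 (select the smallest tree whose cross-validated estimate is within one standard error of the minimum) can be written in a few lines; the array names are ours.

import numpy as np

def one_se_rule(re_cv, se, n_leaves):
    # re_cv[k], se[k], n_leaves[k] hold REcv(T_k), its standard error, and the number of terminal nodes.
    k_min = int(np.argmin(re_cv))
    within = [k for k in range(len(re_cv)) if re_cv[k] <= re_cv[k_min] + se[k_min]]
    return min(within, key=lambda k: n_leaves[k])   # smallest tree within one SE of the minimum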
ALGORITHM 11.2
11.7 BOOTSTRAP ESTIMATE OF OVERALL RISK
A method based on the bootstrap (see Efron 1979, 1982) can also be used to reduce the
overoptimism of the resubstitution estimate R(d) of the overall risk R*(d) of a tree
structured rule d based on a learning sample. For simplicity, the discussion will be
confined to the regression problem or to the model 1 version of the classification or
class probability estimation problem. Thus, R*(d) = ∫ L(y, d(x)) dG*(x, y),
where G* is the distribution of (X, Y). Observe that the difference R*(d) − R(d) is a
random variable, since R*(d) and R(d) both depend on the learning sample. Let B*(G*)
= E(R*(d) - R(d)) denote the dependence of the indicated bias term on the true
distribution of (X, Y) .
Clearly, R(d) + B*(G*) is a nonoveroptimistic estimate of R*(d), but this estimate
cannot be evaluated in practice, since G* is unknown. An obvious modification is the
estimate R(d) + B*(G), where G is the empirical distribution of the learning sample
(Xn, Yn), n ∈ η. As Efron pointed out, an estimate B(G) of B*(G) is easily obtained in
practice by the Monte Carlo method. The quantity R(d) + B(G) is called the bootstrap
estimate of R*(d).
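A Monte Carlo sketch of B(G) under these conventions; build_tree and the loss L are hypothetical stand-ins for whatever tree construction and loss are actually in use.

import numpy as np

def bootstrap_bias(X, Y, build_tree, L, B=25, seed=0):
    # Estimate B(G): treat the empirical distribution G as the truth, draw bootstrap learning
    # samples from G, and average the excess of the risk under G over the resubstitution risk.
    rng = np.random.default_rng(seed)
    X, Y = np.asarray(X), np.asarray(Y)
    N = len(Y)
    diffs = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)                          # bootstrap learning sample from G
        d = build_tree(X[idx], Y[idx])
        risk_G = np.mean([L(Y[n], d(X[n])) for n in range(N)])    # risk of the bootstrap rule under G
        resub = np.mean([L(Y[i], d(X[i])) for i in idx])          # its resubstitution risk
        diffs.append(risk_G - resub)
    return float(np.mean(diffs))                                  # add to R(d) for the bootstrap estimate of R*(d)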
Although this bootstrap estimate may typically work reasonably well in practice,
there exist situations in which it is clearly defective. To be specific, consider a
classification problem with the usual zero-one cost matrix. Suppose that X and Y are
independent and that the unknown prior probabilities π(j), 1 ≤ j ≤ J, all have the same
value, namely, 1/J. Then every classification rule is Bayes and has risk (J - 1)/J.
Suppose that X = R^M, that the marginal distributions of X are continuous, and that
only coordinate splits are to be considered. Let d be the tree structured classification
rule obtained by using, say, the Gini splitting rule and continuing the splitting process
down to pure nodes (that is, nodes t such that Yn is constant as n ranges over η(t) and
hence r(t) = 0). Then R(d) = 0. Consequently, B*(G*) = (J - 1)/J. On the other hand, it
is easily seen that EB*(G) equals (J − 1)/J times the probability that any particular
member n of η does not appear in a random sample of size N drawn with replacement
from η. Thus, EB*(G) = [(J − 1)/J](1 − 1/N)^N.
Recall that (1 − 1/N)^N ≤ 1/e ≐ .368.
Therefore, EB*(G) can be less than 40 percent of the true bias B*(G*). This defect has
been confirmed for large trees grown from real and simulated data.
Since estimates of risk based on cross-validation have an advantage vis-à-vis bias
over those based on the bootstrap, it is natural to suspect that the former estimates
have a somewhat larger variance (see Glick, 1978). The presumed larger variance of
cross-validation is of particular concern when the learning sample is small. In medical
applications of classification (see Chapter 6), the learning sample frequently contains
only a few dozen individuals in the class corresponding to adversity; if so, even when
the learning sample contains many individuals in the other class, its size is effectively
small.
When the learning sample is genuinely large, however, and the resubstitution
estimate of risk is highly overoptimistic, the bias effect dominates the variance effect
and cross-validation is superior to the bootstrap. Efron (1983) studied several
proposals for modifying the bootstrap to improve its bias-correcting ability; these
approaches have yet to be tried out in the context of selecting classification and
regression trees.
11.8 END-CUT PREFERENCE
It has been known for some time (see Morgan and Messenger, 1973) that the empirical
versions of the risk reduction splitting rule for regression and class probability
estimation tend to favor end-cut splits—that is, splits in which pL is close to zero or
one. To see this in a simple setting, let X be the real line and let (X, Y) be a pair of real-
valued random variables such that X has a continuous distribution function and 0 < σ2
= Var(Y) < ∞. Let (Xn, Yn), n ≥ 1, be a random sample from the distribution of
(X, Y) and let (Xn, Yn), 1 ≤ n ≤ N, be the corresponding learning sample of size N.
Consider all splits of the root node of the form tL = {x ∈ X : x ≤ c} and tR = {x ∈ X :
x > c}. In the regression problem, the empirical version of the risk reduction splitting
rule is to choose a split that maximizes
Let XN1 < ··· < XNN denote the order statistics of X1, ..., XN. (Since X has a
continuous distribution, X1, ..., XN are distinct with probability one.) Set YNn equal to
the Y value paired with XNn for 1 ≤ n ≤ N. Given the split of the root node into tL,
tR, set m = N(tL). Then pL = m/N and
Suppose now that X and Y are independent of each other. Then end-cut splits are
preferred in the sense that for each ε ∈ (0, 1/2), P(ε < pL < 1 − ε) → 0 as N → ∞. This
is a consequence of the following result, since YN1,..., YNN are independent and have
the same distribution as Y.
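A quick simulation (ours, purely illustrative) makes the preference visible: generate X and Y independently, evaluate the equivalent between-group form of the reduction, pLpR(ȳL − ȳR)², at every threshold, and record where the maximizing p̂L falls. As N grows the returned fraction should approach 1, in line with Theorem 11.1.

import numpy as np

def endcut_demo(N=200, reps=500, eps=0.1, seed=1):
    # Fraction of replications in which the maximizing split has p_L outside (eps, 1 - eps).
    rng = np.random.default_rng(seed)
    extreme = 0
    for _ in range(reps):
        y = rng.standard_normal(N)       # X independent of Y, so ordering by X is just a random order
        m = np.arange(1, N)              # split after the m-th order statistic of X
        cum = np.cumsum(y)
        ybar_L = cum[:-1] / m
        ybar_R = (cum[-1] - cum[:-1]) / (N - m)
        pL = m / N
        crit = pL * (1 - pL) * (ybar_L - ybar_R) ** 2    # = Delta R(s, t1)
        pL_best = pL[np.argmax(crit)]
        extreme += (pL_best < eps) or (pL_best > 1 - eps)
    return extreme / reps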
THEOREM 11.1. Let Yn, n ≥ 1, be independent and identically distributed random
variables each having variance σ², where 0 < σ < ∞. Set
(11.2)
and
(11.3)
PROOF. Without loss of generality, it can be assumed that the Yn's have mean zero. It
suffices to prove (11.2), (11.3) then following by symmetry.
Choose N ≥ 1 and c > 0. According to the Kolmogorov inequality (see Chung,
1974),
and hence
Therefore,
(11.4)
According to the law of the iterated logarithm (see Breiman, 1968),
Now M0 can be made large enough so that log(log m) > 2/(εc) for m ≥ M0.
Consequently,
and hence
(11.5)
Since c can be made arbitrarily small, (11.2) follows from (11.4) and (11.5). This
completes the proof of the theorem.
When J = 2, Theorem 11.1 is also directly applicable to the Gini splitting rule for
the (model 1 version of the) classification problem or, equivalently, the risk reduction
splitting rule for class probability estimation. Consider splits tL, tR of the root node,
and let ψ(Y = 1) equal 1 or 0 according as Y = 1 or Y ≠ 1. The Gini splitting rule is to
choose a split that maximizes
If X and Y are independent, then so are X and ψ(Y = 1); thus, it again follows from
Theorem 11.1 that for each ε ∈ (0, 1/2), P(ε ≤ pL ≤ 1 - ε) → 0 as N → ∞.
One way to eliminate the end-cut preference of these splitting rules is to multiply
the quantity to be maximized by a positive power of pLpR. Consider, for example, the
twoing splitting rule for the classification problem (which coincides with the Gini
splitting rule when J = 2). Instead of choosing the splitting rule to maximize
let it maximize
12
CONSISTENCY
The tree structured classification and regression procedures discussed in this book use
the learning sample to partition the measurement space. In this chapter a more general
collection of such “partition-based” procedures will be considered. It is natural to
desire that as the size of the learning sample tends to infinity, partition-based estimates
of the regression function should converge to the true function and that the risks of
partition-based predictors and classifiers should converge to the risk of the
corresponding Bayes rules. If so, the procedures are said to be consistent. In this
chapter the consistency of partition-based regression and classification procedures will
be verified under surprisingly general conditions.
In Section 12.1 a required preliminary result of independent interest on the rate of
convergence of a sequence of empirical distributions to its theoretical counterpart is
described. In Sections 12.2 and 12.3 consistency is discussed for the regression
problem and classification problem, respectively; analogous consistency results for
class probability estimation are implicitly contained in these two sections. Proofs of the
results in Sections 12.1 to 12.3 are given in Sections 12.4 to 12.6. A thorough
understanding of these proofs requires some background in real analysis and
probability theory.
For previous consistency results along the lines of those in Sections 12.2 and 12.3,
see Gordon and Olshen (1978, 1980). Necessary and sufficient conditions for the
consistency of a wide class of nonparametric procedures in regression and
classification are contained in Stone (1977).
12.1 EMPIRICAL DISTRIBUTIONS
where I1,..., IM are intervals of the real line, each of which may be open, closed, half-
open or half-closed. The more general definition of B allows for linear combination
splits, as described in Section 5.2.
Let Xn, n ≥ 1, be a random sample from a distribution P on X. For N ≥ 1, let PN
denote the empirical distribution of Xn, 1 ≤ n ≤ N, defined by PN(t) = N−1 #{n : 1 ≤ n ≤ N, Xn ∈ t}.
It follows immediately from the strong law of large numbers that for each (Borel)
subset t of X, limN PN(t) = P(t) with probability one. According to a general version of
the Glivenko-Cantelli theorem (see Vapnik and Chervonenkis, 1971; Steele, 1978;
Dudley, 1978; and Pollard, 1981),
(12.1)  limN supt∈B |PN(t) − P(t)| = 0 with probability one.
The following strengthening of (12.1) plays an essential role in the consistency proofs
for partition-based regression and classification that will be given later on.
THEOREM 12.2. Given positive numbers ε and c, there is a positive number kε,c = k
such that
(12.3)
This theorem is also valid if P(t) is replaced by pN(t) in the right side of the
inequality in (12.3); indeed, the two forms of the theorem are easily shown to be
equivalent. For an elementary application of the theorem, let kN be a sequence of
positive constants.
THEOREM 12.4. If limN kN = ∞, then for all positive constants ε and c,
(12.5)
It is natural to conjecture that in (12.1), (12.3), and (12.5), B can be replaced by the
collection of all polyhedra in X. But this conjecture is not true in general. To see this,
let M = 2 and let P be the uniform distribution on a circle. Let t be the convex hull of
Xn, 1 ≤ n ≤ N, that is, the inscribed polygon having vertices Xn, 1 ≤ n ≤ N. Then PN(t) =
1, but P(t) = 0. This example is due to Ranga Rao (1962).
Alexander (1983) has shown that Theorem 12.2 can also be derived as a
consequence of his deep generalization of a theorem of Kiefer (1961). In their two
consistency papers cited earlier, Gordon and Olshen made direct use of Kiefer’s
theorem. It will be seen shortly that Theorem 12.2 leads to improved results on
consistency.
12.2 REGRESSION
Let (X, Y) be a pair of random variables such that X ∈ X, Y is real-valued, and E|Y| <
∞. Let P denote the distribution of X and let dB denote the regression function of Y on
X, defined by dB(x) = E(Y|X = x). Let (Xn , Yn), n ≥ 1, denote a random sample from
the joint distribution of (X, Y) and suppose that this random sample is independent of
(X, Y). Given N ≥ 1 and t ⊂ X, set
where |x| = (x1² + ⋯ + xM²)1/2 for x = (x1, ..., xM). Let DN(x) = δ(τN(x)) denote the
diameter of the set τN(x) containing x.
Finally, let dN denote the estimator of the regression function dB defined by dN(x) =
ȳN(τN(x)), where
(12.6)
Formula (12.9) in the next result means that limN P(DN(X) ≥ ε) = 0 for all ε > 0.
THEOREM 12.7. Suppose that E |Y|q < ∞, where 1 ≤ q < ∞. If
(12.8)
and
(12.9)
then
(12.10)
Equation (12.10) determines the sense in which the sequence {dN} of estimators of
the regression function dB is consistent. The expectation in (12.10) involves the
randomness in both X and dN. Alternatively, this expectation can be written as
The sequence {dN} is said to be risk consistent if limN ER(dN) =
R(dB), that is, if for large N, dN(X) is nearly as good a predictor of Y as is the optimal
predictor dB(X).
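The flavor of this consistency can be checked numerically with a fixed grid partition of [0, 1] whose cells shrink slowly with N, so that cell diameters go to zero while each cell continues to capture many observations. The following sketch is ours and only illustrative; it prints the mean absolute error of the partition estimate for increasing N.

import numpy as np

def partition_regression_error(N, n_cells, seed=0):
    # Mean absolute error E|d_N(X) - d_B(X)| for a histogram-type regression estimate.
    rng = np.random.default_rng(seed)
    dB = np.sin                                   # an arbitrary smooth choice of d_B
    X = rng.uniform(0.0, 1.0, size=N)
    Y = dB(X) + rng.standard_normal(N)
    edges = np.linspace(0.0, 1.0, n_cells + 1)
    cell = np.clip(np.digitize(X, edges) - 1, 0, n_cells - 1)
    means = np.array([Y[cell == c].mean() if np.any(cell == c) else Y.mean()
                      for c in range(n_cells)])   # node value: mean of Y over cases in the cell
    Xtest = rng.uniform(0.0, 1.0, size=5000)
    ctest = np.clip(np.digitize(Xtest, edges) - 1, 0, n_cells - 1)
    return float(np.mean(np.abs(means[ctest] - dB(Xtest))))

for N in (200, 2000, 20000):
    print(N, partition_regression_error(N, n_cells=int(N ** (1 / 3))))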
It follows easily from the properties of conditional expectation that
12.3 CLASSIFICATION
Let (X, Y) be a pair of random variables such that X ∈ X and Y ∈ {1, ..., J}, where 2 ≤
J < ∞. As before, let P denote the distribution of X. Given 1 ≤ j ≤ J, set π(j) = P(Y = j)
and P(j|x) = P(Y = j|X = x) for x ∈ X.
Let pN(j|x) denote a sample-based estimate of P(j|x) such that 0 ≤ pN(j|x) ≤ 1 for x ∈
X and 1 ≤ j ≤ J and pN is independent of (X, Y). Then for 1 ≤ j ≤ J, the following three
conditions are equivalent:
(12.14)
(12.15)
(12.16)
Let C(i|j) denote the cost of classifying a class j object as class i. Given a {1, ..., J}-
valued function d on X, let
denote the risk of using d as a classifier. A Bayes rule dB , that is, a rule that
minimizes R(d), is given by choosing dB(x) to be the smallest value of i ∈ {1, ..., J}
that minimizes Σj C(i|j)P(j|x). Let dN(x) be chosen to be the smallest value of i ∈ {1,
..., J} that minimizes Σj C(i|j)pN(j|x), and note that ER(dN) ≥ R(dB). The sequence {dN}
is again said to be risk consistent if limN ER(dN) = R(dB).
THEOREM 12.17. If (12.15) holds, then {dN} is risk consistent.
This theorem will now be applied to the two sampling schemes for the classification
problem that have been considered in this book.
(Model 1). Let (Xn,Yn), n ≥ 1, denote a random sample from the joint distribution of
(X, Y) and suppose that this random sample is independent of (X, Y). Let ηN(t), , τN,
pN, and kN be defined as in the regression problem and suppose that (12.6) holds. Also
set
pN(j|x) = pN(j|τN(x)),
where
where t = τN(x). Consequently, dN(x) = iN(τN(x)), where iN(t) is the smallest value of i that minimizes Σj C(i|j)pN(j|t).
Set
ηjN(t) = {n : 1 ≤ n ≤ N, Xn ∈ t and Yn = j}
and
and
pN(j|x) = pN(j|τN(x)),
where t = τN(x). Thus, dN(x) = iN(τN(x)), where iN(t) is the smallest value of i that
minimizes Σj C(i|j)π(j)pjN(t). Instead of (12.6), suppose that
(12.18)
Theorem 12.7 will be used to obtain the next result.
THEOREM 12.19. Let model 1 and (12.6) or model 2 and (12.18) hold, and suppose
that (12.8) and (12.9) are satisfied. Then {dN} is risk consistent.
There is an alternative to model 2 for known prior probabilities that is worth
considering. Let the data consist of a random sample Xn, n ∈ ηj, of fixed (that is,
nonrandom) size Nj from Pj for j ∈ {1, ..., J}; set N = (N1, ..., NJ), and let dN be
defined as in model 2, with pjN replaced by the empirical distribution of Xn, n ∈ ηj.
Under appropriate conditions, dN is Bayes risk consistent as N1, ..., NJ all tend to
infinity. The result can be proved by assuming that the desired conclusion is false and
using a subsequence argument to obtain a contradiction to Theorem 12.19 for model 2.
The details are left to the interested reader.
Theorems 12.11 and 12.19 provide some theoretical justification for tree structured
regression and classification—risk consistency under mild regularity conditions. But
no theoretical justification has been obtained so far for any of the specific splitting
rules discussed in the book, nor for optimal pruning or cross-validation.
12.4 PROOFS FOR SECTION 12.1
The inequalities for binomial probabilities in the next result are needed for the proof of
Theorem 12.2.
LEMMA 12.20. Let Z have a binomial distribution with parameters m and p. Given ε
> 0, let δε > 0 be defined by (e^δε − 1)/δε = 1 + ε. Then for all k > 0,
(12.21)
and
(12.22)
PROOF. The moment-generating function of Z is given by Ee^δZ = (1 − p + pe^δ)^m. It is
easily seen that (e^δ − 1)/δ is continuous and strictly increasing on (0, ∞) and has limits
1 and ∞, respectively, at 0 and ∞. Thus, given ε > 0, there is a unique positive number
δε such that (e^δε − 1)/δε = 1 + ε. Observe that
so (12.21) is valid.
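For orientation, the standard Markov-Chernoff chain behind such a bound can be written out in LaTeX; this assumes that (12.21) bounds P(Z ≥ (1 + ε)mp + k) by e^{−δεk}, which is our reading of the lost display rather than a quotation of it.

\[
P\{Z \ge (1+\varepsilon)mp + k\}
  \le e^{-\delta_\varepsilon[(1+\varepsilon)mp + k]}\, E e^{\delta_\varepsilon Z}
  = e^{-\delta_\varepsilon[(1+\varepsilon)mp + k]}\,(1 - p + p e^{\delta_\varepsilon})^{m}
  \le e^{-\delta_\varepsilon[(1+\varepsilon)mp + k]}\, e^{mp(e^{\delta_\varepsilon}-1)}
  = e^{-\delta_\varepsilon k},
\]

the last equality holding because (e^δε − 1)/δε = 1 + ε makes mp(e^δε − 1) = δε(1 + ε)mp.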
In proving (12.22), it can be assumed that 0 < ε < 1 (otherwise, the result is trivially
true). It is straightforward to show that (1 − e^−δε)/δε > 1 − ε and hence that
Consequently,
so (12.22) is valid.
Recall that X1, X2, ... is a random sample from P and that PN is the empirical
distribution of X1, ..., XN. Let X′1, X′2, ... be a second such random sample and let P′N
denote the empirical distribution of X′1, ..., X′N. These two random samples are assumed
to be independent of each other.
LEMMA 12.23. Given positive numbers ε and c, there is a positive number k = kε,c such
that
PROOF.¹ Let ξ1, ξ′1, ..., ξN, ξ′N be a random sample of size 2N from P. Let S1, ..., SN
be independent and identically distributed Bernoulli (that is, zero-one) random
variables each having probability .5 of equaling 1; and let the Sn's be independent of
(ξn, ξ′n), n ≥ 1. Define Xn, X′n for 1 ≤ n ≤ N by
and
Then X1, X′1, ..., XN, X′N is a random sample of size 2N from P. Let PN denote the empirical
distribution of X1, ..., XN (as just constructed) and let P′N denote the empirical
distribution of X′1, ..., X′N. It suffices to verify the conclusion of the lemma for this
choice of PN, P′N.
To this end, let Ψt denote the indicator of t defined by Ψt(x) = 1 if x ∈ t and Ψt(x) =
0 otherwise. Then
and
so
and
constants each equal to −1, 0, or 1 and m = Σn |vn|, then Σn (2Sn − 1)vn has the same
distribution as 2Z − m.
Given ε > 0, let δε be defined as in Lemma 12.20. Choose t ∈ B. It follows from the
observation in the previous paragraph that on the event Σn |Ψt(ξn) − Ψt(ξ′n)| = m,
Consequently,
and hence
Therefore,
which yields the desired result. (Some measurability problems that have been ignored
here are straightforward to handle for the particular collection B under consideration;
see Section 5 of Pollard, 1981.)
PROOF OF THEOREM 12.2. Let ε and c be fixed positive numbers. According to
Lemma 12.23, there is a positive constant k such that
(12.24)
By Lemma 12.20, there is a positive integer N0 such that
(12.25)
Since PN and P′N are independent, it follows from (12.24) and (12.25) that
12.5 PROOFS FOR SECTION 12.2
The proof of Theorem 12.13 begins with the following elementary result.
LEMMA 12.26. Let Z1, ..., Zm be independent real-valued random variables having
mean zero and moment-generating functions Ψ1, ..., Ψm, respectively, and suppose that
|Ψℓ(u)| ≤ exp[Ku²/2] for 1 ≤ ℓ ≤ m and |u| ≤ s, where s and K are fixed positive
constants. Set Z̄ = (Z1 + ⋯ + Zm)/m. Then
Similarly,
and N(x) = N(τN(x)) for x ∈ X. Suppose dB is continuous and that, with probability
one, limN DN(•) = 0 uniformly on compacts. Then, with probability one, limN N = dB
uniformly on compacts. Thus, Theorem 12.13 follows easily from the following result.
LEMMA 12.27. Suppose that (12.8) and Assumption 12.12 hold. Then for every
compact set B in X and every ε > 0 and c > 0,
PROOF. It can be assumed without loss of generality that dB = 0. Let Ψ(·|x) denote the
moment-generating function of the conditional distribution of Y, given that X = x, and
let Ψ′(u|x) and Ψ″(u|x) refer to differentiation with respect to u. Then Ψ(0|x) = 1,
Ψ′(0|x) = 0, and Ψ″(u|x) = E[Y²e^uY|X = x]. Let B, ε, and c be as in the statement of the
lemma. It follows straightforwardly from Assumption 12.12, Taylor's theorem with
remainder, and an argument similar to that used in proving Chebyshev's inequality that
there are positive constants s and K such that Ψ(u|x) ≤ exp(Ku²/2) for x ∈ B and |u| ≤ s.
Let t ⊂ B and let 0 < ε ≤ Ks. According to Lemma 12.26,
on the event that PN(t) ≥ kN N−1 log N. It now follows from the combinatorial result of
Vapnik and Chervonenkis, as in the proof of Lemma 12.23, that
and hence
(To verify the last equality, note that YN is independent of pN-1 and has the same
distribution as Y.)
Set
(12.29)
To this end, let 0 < ε < 1 and let ΩN denote the event that
(12.30)
By (12.8) and Theorem 12.2,
(12.31)
Observe that if t ∈ BN, then
Consequently,
(12.32)
Since ε can be made arbitrarily small, (12.29) follows from (12.8), (12.31), and
(12.32). This completes the proof of the lemma.
LEMMA 12.33. Suppose that E[|Y|^q] < ∞, where 1 ≤ q < ∞, and let 0 < ε < ∞. Then
there is a bounded random variable Y′ = H(X, Y) such that (E[|Y′ − Y|^q])^1/q ≤ ε and the
regression function of Y′ on X is continuous.
PROOF. Given a positive number K1, define Y″ as a function of Y by
The number K1 can be chosen so that (E[|Y″ − Y|^q])^1/q ≤ ε/2. Let d″ denote the
regression function of Y″ on X. There is a bounded continuous function d′ on X such
that (E[|d′(X) − d″(X)|^q])^1/q ≤ ε/2. (The collection of bounded continuous functions is
dense in Lq.) Set Y′ = Y″ + d′(X) − d″(X). Then d′ is the regression function of Y′ on
X; Y′ is a bounded function of X and Y; and, by Minkowski's inequality, (E[|Y′ −
Y|^q])^1/q ≤ ε. This completes the proof of the lemma.
With this preparation, it is easy to complete the proof of Theorem 12.7. Suppose
that E[|Y|^q] < ∞, where 1 ≤ q < ∞. Choose ε > 0, let Y′ = H(X, Y) be as in Lemma
12.33, and let d′ denote the regression function of Y′ on X. Also set Y′n = H(Xn, Yn) for
all n ≥ 1. Set d′N(x) = ȳ′N(τN(x)), where
Then
Thus by Minkowski’s inequality, to complete the proof of Theorem 12.7 it suffices to
verify the following four statements:
Statement (i) follows from Lemma 12.28 with Y and Yn, n ≥ 1, replaced by Y′ − Y
and Y′n − Yn, n ≥ 1, respectively. Since Y′ is a bounded random variable, it follows from
(12.8), (12.9), and Lemma 12.27 that (X) - (X) converges to zero in probability;
so the conclusion to (ii) follows from the bounded convergence theorem. Since is
bounded and continuous, it follows from (12.9) that (X) converges to (X) in
probability; so the conclusion to (iii) follows from another application of the bounded
convergence theorem. It follows from the conditional form of Hölder’s or Jensen’s
inequality that
and hence that (iv) holds. This completes the proof of Theorem 12.7.
12.6 PROOFS FOR SECTION 12.3
where
Suppose that (12.6), (12.8), and (12.9) hold. Then (12.14) holds by Theorem 12.7 and
hence (12.15) holds. Thus, it follows from Theorem 12.17 that {dN} is risk consistent.
PROOF OF THEOREM 12.19 FOR MODEL 2. Recall that in model 2,
Thus, by Theorem 12.17, to verify that {dN} is risk consistent, it suffices to prove that
for 1 ≤ j ≤ J,
(12.34)
LEMMA 12.35. If (12.9) holds, then for 1 ≤ j ≤ J,
where dB(x) = P(j|x). Choose ε > 0, let d′ be a bounded continuous function such that
(12.36)
(the collection of bounded continuous functions is dense in L2), and set µ′(t) = E(d′(X)
|X ∈ t). It follows from (12.9) that limN µ′(τN(X)) = d′(X) in probability and hence that
(12.37)
(For bounded random variables, convergence in probability and convergence in L2 are
equivalent.) Now
(12.38)
To see this, observe that
PROOF. Set BN = {t ∈ B : Σj π(j)pjN(t) ≥ kNCN}, where
(12.40)
Choose ε > 0 and let ΩN denote the event that
(12.41)
Then limN P(ΩN) = 1 by Theorem 12.2. If t ∈ BN and (12.41) holds, then
and
Since ε can be made arbitrarily small, (12.40) holds as desired.
It follows from Lemmas 12.35 and 12.39 that if (12.18), (12.8), and (12.9) hold,
then (12.34) is satisfied and hence {dN} is risk consistent. This completes the proof of
Theorem 12.19 for model 2.
BIBLIOGRAPHY
Alexander, K. S. 1983. Rates of growth for weighted empirical processes. Proceedings
of the Neyman-Kiefer Conference. In press.
Anderson, J. A., and Philips, P. R. 1981. Regression, discrimination and measurement
models for ordered categorical variables. Appl. Statist., 30 : 22-31.
Anderson, T. W. 1966. Some nonparametric multivariate procedures based on
statistically equivalent blocks. In Multivariate analysis, ed. P. R. Krishnaiah. New
York: Academic Press, 5-27.
Bellman, R. E. 1961. Adaptive control processes. Princeton, N.J.: Princeton University
Press.
Belsley, D. A.; Kuh, E.; and Welsch, R. E. 1980. Regression diagnostics. New York:
Wiley.
Belson, W. A. 1959. Matching and prediction on the principle of biological
classification. Appl. Statist., 8 : 65-75.
Beta-Blocker Heart Attack Trial Study Group. 1981. Beta-blocker heart attack trial. J.
Amer. Med. Assoc., 246: 2073-2074.
Breiman, L. 1968. Probability. Reading, Mass.: Addison-Wesley.
Breiman, L. 1978. Description of chlorine tree development and use. Technical report.
Santa Monica, Calif.: Technology Service Corporation.
Breiman, L. 1981. Automatic identification of chemical spectra. Technical report.
Santa Monica, Calif.: Technology Service Corporation.
Breiman, L., and Stone, C. J. 1978. Parsimonious binary classification trees. Technical
report. Santa Monica, Calif.: Technology Service Corporation.
Bridges, C. R. 1980. Binary decision trees and the diagnosis of acute myocardial
infarction. Masters thesis, Massachusetts Institute of Technology, Cambridge.
Chung, K. L. 1974. A course in probability theory. 2d ed. New York: Academic Press.
Collomb, G. 1981. Estimation non paramétrique de la regression: review
bibliographique. Internat. Statist. Rev., 49: 75-93.
Cover, T. M., and Hart, P. E. 1967. Nearest neighbor pattern classification. IEEE
Trans. Information Theory, IT-13: 21-27.
Darlington, R. B. 1968. Multiple regression in psychological research and practice.
Psychological Bull., 69: 161-182.
Dillman, R. O., and Koziol, J. A. 1983. Statistical approach to immunosuppression
classification using lymphocyte surface markers and functional assays. Cancer Res.,
43: 417-421.
Dillman, R. O.; Koziol, J. A.; Zavanelli, M. I.; Beauregard, J. C.; Halliburton, B. L.;
and Royston, I. 1983. Immunocompetence in cancer patients—assessment by in vitro
stimulation tests and quantification of lymphocyte subpopulations. Cancer. In press.
Doyle, P. 1973. The use of automatic interaction detector and similar search
procedures. Operational Res. Quart., 24: 465-467.
Duda, R. O., and Shortliffe, E. H. 1983. Expert systems research. Science, 220: 261-
268.
Dudley, R. M. 1978. Central limit theorems for empirical measures. Ann. Probability,
6: 899-929.
DuMouchel, W. H. 1981. Documentation for DREG. Technical report. Cambridge:
Massachusetts Institute of Technology.
Efron, B. 1979. Bootstrap methods: Another look at the jackknife. Ann. Statist., 7: 1-
26.
Efron, B. 1982. The jackknife, the bootstrap and other resampling plans. Philadelphia:
Society for Industrial and Applied Mathematics.
Efron, B. 1983. Estimating the error rate of a prediction rule: improvements on cross-
validation. J. Amer. Statist. Assoc., 78 : 316-331.
Einhorn, H. 1972. Alchemy in the behavioral sciences. Pub. Op. Quart., 36: 367-378.
Feinstein, A. 1967. Clinical judgment. Baltimore: Williams and Wilkins.
Fielding, A. 1977. Binary segmentation: the automatic interaction detector and related
techniques for exploring data structure. In The analysis of survey data, Vol. I, ed. C. A.
O’Muircheartaigh and C. Payne. Chichester: Wiley.
Fisher, W. D. 1958. On grouping for maximum homogeneity. J. Amer. Statist. Assoc.,
53: 789-798.
Fix, E., and Hodges, J. 1951. Discriminatory analysis, nonparametric discrimination:
consistency properties. Technical report. Randolph Field, Texas: USAF School of
Aviation Medicine.
Friedman, J. H. 1977. A recursive partitioning decision rule for nonparametric
classification. IEEE Trans. Computers, C-26: 404-408.
Friedman, J. H. 1979. A tree-structured approach to nonparametric multiple
regression. In Smoothing techniques for curve estimation, eds. T. Gasser and M.
Rosenblatt. Berlin: Springer-Verlag.
Gilpin, E.; Olshen, R.; Henning, H.; and Ross, J., Jr. 1983. Risk prediction after
myocardial infarction: comparison of three multivariate methodologies. Cardiology,
70: 73-84.
Glick, N. 1978. Additive estimators for probabilities of correct classification. Pattern
Recognition, 10: 211-222.
Gnanadesikan, R. 1977. Methods for statistical data analysis of multivariate
observations. New York: Wiley.
Goldman, L.; Weinberg, M.; Weisberg, M.; Olshen, R.; Cook, F.; Sargent, R. K.;
Lamas, G. A.; Dennis, C.; Deckelbaum, L.; Fineberg, H.; Stiratelli, R.; and the
Medical Housestaffs at Yale-New Haven Hospital and Brigham and Women’s
Hospital. 1982. A computer-derived protocol to aid in the diagnosis of emergency
room patients with acute chest pain. New England J. Med., 307: 588-596.
Gordon, L., and Olshen, R. A. 1978. Asymptotically efficient solutions to the
classification problem. Ann. Statist., 6: 515- 533.
Gordon, L., and Olshen, R. A. 1980. Consistent nonparametric regression from
recursive partitioning schemes. J. Multivariate Anal., 10: 611-627.
Hand, D. J. 1981. Discrimination and classification. Chichester: Wiley.
Hand, D. J. 1982. Kernel discriminant analysis. Chichester: Research Studies Press,
Wiley.
Harrison, D., and Rubinfeld, D. L. 1978. Hedonic prices and the demand for clean air.
J. Envir. Econ. and Management, 5: 81-102.
Henning, H.; Gilpin, E. A.; Covell, J. W.; Swan, E. A.; O’Rourke, R. A.; and Ross, J.,
Jr. 1976. Prognosis after acute myocardial infarction: multivariate analysis of mortality
and survival. Circulation, 59: 1124-1136.
Henrichon, E. G., and Fu, K. -S. 1969. A nonparametric partitioning procedure for
pattern classification. IEEE Trans. Computers, C-18: 614-624.
Hills, M. 1967. Discrimination and allocation with discrete data. Appl. Statist., 16:
237-250.
Hooper, R., and Lucero, A. 1976. Radar profile classification: a feasibility study.
Technical report. Santa Monica, Calif.: Technology Service Corporation.
Jennrich, R., and Sampson, P. 1981. Stepwise discriminant analysis. In BMDP
statistical software 1981, ed. W. J. Dixon. Berkeley: University of California Press.
Kanal, L. 1974. Patterns in pattern recognition: 1968-1974. IEEE Trans. Information
Theory, IT-20: 697-722.
Kiefer, J. 1961. On large deviations of the empiric d.f. of vector chance variables and a
law of iterated logarithm. Pacific J. Math., 11: 649-660.
Light, R. J., and Margolin, B. H. 1971. An analysis of variance for categorical data. J.
Amer. Statist. Assoc., 66: 534-544.
Mabbett, A.; Stone, M.; and Washbrook, J. 1980. Cross-validatory selection of binary
variables in differential diagnosis. Appl. Statist., 29: 198-204.
McCullagh, P. 1980. Regression models for ordinal data. J. Roy. Statist. Soc. Ser. B,
42: 109-142.
McLafferty, F. W. 1973. Interpretation of mass spectra. Reading, Mass.: Benjamin.
Meisel, W. S. 1972. Computer-oriented approaches to pattern recognition. New York:
Academic Press.
Meisel, W. S., and Michalpoulos, D. A. 1973. A partitioning algorithm with
application in pattern classification and the optimization of decision trees. IEEE Trans.
Computers, C-22: 93-103.
Messenger, R. C., and Mandell, M. L. 1972. A model search technique for predictive
nominal scale multivariate analysis. J. Amer. Statist. Assoc., 67: 768-772.
Morgan, J. N., and Messenger, R. C. 1973. THAID: a sequential search program for
the analysis of nominal scale dependent variables. Ann Arbor: Institute for Social
Research, University of Michigan.
Morgan, J. N., and Sonquist, J. A. 1963. Problems in the analysis of survey data, and a
proposal. J. Amer. Statist. Assoc., 58: 415-434.
Narula, S. C., and Wellington, J. F. 1982. The minimum sum of absolute errors
regression: a state of the art survey. Internat. Statist. Rev., 50: 317-326.
Norwegian Multicenter Study Group. 1981. Timolol-induced reduction in mortality
and reinfarction in patients surviving acute myocardial infarction. New England J.
Med., 304 : 801-807.
Pollard, D. 1981. Limit theorems for empirical processes. Z. Wahrscheinlichkeitstheorie
verw. Gebiete, 57: 181-195.
Pozen, M. W.; D’Agostino, R. B.; and Mitchell, J. B. 1980. The usefulness of a
predictive instrument to reduce inappropriate admissions to the coronary care unit.
Ann. Internal Med., 92: 238-242.
Ranga Rao, R. 1962. Relations between weak and uniform convergence of measures
with applications. Ann. Math. Statist., 33: 659- 680.
Rounds, E. M. 1980. A combined nonparametric approach to feature selection and
binary decision tree design. Pattern Recognition, 12: 313-317.
Shibata, R. 1981. An optimal selection of regression variables. Biometrika, 68 : 45-54.
Sklansky, J. 1980. Locally trained piecewise linear classifiers. IEEE Trans. Pattern
Analysis Machine Intelligence, PAMI-2: 101-111.
Sonquist, J. A. 1970. Multivariate model building: the validation of a search strategy.
Ann Arbor: Institute for Social Research, University of Michigan.
Sonquist, J. A.; Baker, E. L.; and Morgan, J. N. 1973. Searching for structure. Rev. ed.
Ann Arbor: Institute for Social Research, University of Michigan.
Sonquist, J. A., and Morgan, J. N. 1964. The detection of interaction effects. Ann
Arbor: Institute for Social Research, University of Michigan.
Steele, J. M. 1978. Empirical discrepancies and subadditive processes. Ann.
Probability, 6: 118-127.
Stone, C. J. 1977. Consistent nonparametric regression (with discussion). Ann. Statist.,
5: 595-645.
Stone, C. J. 1981. Admissible selection of an accurate and parsimonious normal linear
regression model. Ann. Statist., 9: 475-485.
Stone, M. 1977. Cross-validation: a review. Math. Operationforsch. Statist. Ser.
Statist., 9: 127-139.
Sutherland, D. H., and Olshen, R. A. 1984. The development of walking. London:
Spastics International Medical Publications. To appear.
Sutherland, D. H.; Olshen, R.; Cooper, L.; and Woo, S. L.-Y. 1980. The development
of mature gait. J. Bone Joint Surgery, 62A: 336-353.
Sutherland, D. H.; Olshen, R.; Cooper, L.; Watt, M.; Leach, J.; Mubarak, S.; and
Schultz, P. 1981. The pathomechanics of gait in Duchenne muscular dystrophy.
Developmental Med. and Child Neurology, 39: 598-605.
Szolovits, P. 1982. Artificial intelligence in medicine. In Artificial intelligence in
medicine, ed. P. Szolovits. Boulder, Colo.: Westview Press.
Toussaint, G. T. 1974. Bibliography on estimation of misclassification. IEEE Trans.
Information Theory, IT-20: 472-479.
Van Eck, N. A. 1980. Statistical analysis and data management highlights of OSIRIS
IV. Amer. Statist., 34: 119-121.
Vapnik, V. N., and Chervonenkis, A. Ya. 1971. On the uniform convergence of relative
frequencies of events to their probabilities. Theor. Probability Appl., 16: 264-280.
Zeldin, M., and Cassmassi, J. 1978. Development of improved methods for predicting
air quality levels in the South Coast Air Basin. Technical report. Santa Monica, Calif.:
Technology Service Corporation.
NOTATION INDEX
SUBJECT INDEX
Absolute error R*(d),
Accuracy
AID (Automatic Interaction Detection, see THAID)
Alexander, K. S.
Altered prior probabilities
in waveform recognition
Average deviation
Average error
Bayes misclassification rate RB
Bayes optimal misclassification rate R*
Bayes rule dB
in classification
in class probability estimation
corresponding to a partition d
in regression
no data optimal rule
Bias
from unbalanced test samples
in small trees
in s2(t)
Binary splits S
Binary tree T
Binary tree structured classifiers
Binary variables
Bootstrap
Boston housing values example
accuracy in
estimation of relative risk in
from cross-validation
from resubstitution
final tree selection
linear combination splits in
as a missing value example
proportion of variance explained, cross-validated estimate of
trees
variable importance
least absolute deviation regression
least squares regression
Branch Tt
Breiman, L.
Bromine problem (see chlorine project; mass spectra problem; ozone levels,
classification of)
background
data reduction
estimation of risk
questions
tree
CART
compared to AID
computational efficiency of
least absolute deviation regression in
Chemical spectra problems (see bromine problem; chlorine project; mass spectra
problem; ozone levels, classification of)
Chlorine project (see bromine problem; mass spectra project; ozone levels,
classification of)
choice of priors in
Class assignment rule
Classes
Classification rule
consistent
Classifier d
Class probability estimation
Bayes rule in
Complexity
complexity parameter α
guidelines for selecting
cost-complexity measure, Rα(T)
Computational efficiency
of bromine tree
of CART
Consistent predictors and classifiers
Cost-complexity measure, Rα(T)
Cover, T. M.
Cross-validation, v-fold
classifier in d(v)(x)
estimate Rcv (d)
stratification in
Data structures
nonstandard
standard
Delta splitting rule
Density method
Descendant
Digit recognition problem
accuracy of CART in
accuracy of nearest neighbor in
accuracy of stepwise discriminant in
estimation of risk in
final tree selection
instability of tree structures in
missing data in
with noise variables
prior probabilities
specification
splitting criteria in
tree
variable importance in
Dillman, R. O.
Discriminant analysis
accuracy of
in diagnosing cancer
in digit recognition problem
in prognosis following heart attack
Efron, B.
End-cut preference
EPA
Estimation of risk ,
bias in
in class probability estimation
cross-validation
in class probability estimation
of mean squared error Rcv (d)
at a node Rcv (j)
of relative mean squared error REcv (T)
of a rule Rcv (d)
of a tree Rcv (T)
in digit recognition problem
estimate of standard error in
in class probability estimation
internal
at a node (t)
relative risk
mean squared error RE*(d)
resubstitution estimate
of Bayes risk at a node R(t)
of a class assignment rule at a node r(t)
in class probability estimation
of mean squared error R(d)
problems with
in regression
of relative risk
of a subtree R(Tt)
of a tree R(T)
of risk at a node r(t), R(t)
test sample Rts
in class probability estimation
of mean squared error Rts(d)
of a rule Rts (d)
of a tree Rts (T)
in waveform recognition problem
Experts
Features
construction and use of in
waveform recognition problem
Feigin, P.
Final tree selection
one standard error rule
Friedman, J. H.
Generalized tree construction
Gini index of diversity
as a splitting criterion
Gini splitting criterion
in class probability estimation
in digit recognition
ordered
preference for
symmetric
twoing
variable misclassification costs in
Goldman, L.
Goodness of split criterion Φ (s,t)
Gordon, L.
Hart, P. E.
Heart attack problem (prognosis)
diagnosis
measurement space in
tree
Impurity
of a node i(t)
reduction of Δi(s,t)
of a tree I(T)
Indicator function X(·)
Instability of tree structures (see tree stability)
Invariance of decision rule to monotonic transformations of data
kth nearest neighbor rule
Kernel estimates
Koziol, J. A.
Learning sample L
Logistic regression
in diagnosing cancer
in prognosis following heart attack
Mass spectra problem (see bromine problem; chlorine project; ozone levels,
classification of)
background
choice of priors in
Mean squared error MSE
Measurement space X
Measurement vector x
Meisel, W. S.
Misclassification cost c(i|j)
altered c′(i|j)
Misclassification rate
overall R*(T)
within node r*(t)
Missing data
Model 1
Model 2
Morgan, J. N.
Nearest neighbor rule (see also kth nearest neighbor rule)
accuracy of
Node t
ancestor
impurity i(t)
nonterminal
terminal
Olshen, R. A.
One standard error rule
Ordered twoing criterion
Outliers
Ozone levels, classification of (see bromine problem; chlorine project; mass spectra
problem)
Partition
function
refinement
Pollard, D.
Predictive measure of association λ(s*| m)
Prior probabilities π(j)
altered π′(j)
choice of
in diagnosing heart attacks
in prognosis following heart attacks
in waveform recognition problem
Probability model
Pruned subtree
optimal pruning algorithm
optimally pruned subtree
Questions Q
Radar range profile
Response (dependent variable) y,Y
Resubstitution (see estimation of risk, resubstitution estimate)
Risk identification
Risk reduction splitting rule
Robustness
Sensitivity--true positive rate
Ship classification
Smallest minimizing subtree T (α)
Sonquist, J. A.
Specificity--true negative rate
Split s
best
binary
Boolean combination
categorical
complementary
goodness of Φ(s, t)
linear combination
surrogate m
with missing data
in variable ranking
Splitting criteria (see delta splitting rule; digit recognition problem; Gini index of
diversity; Gini splitting criterion; risk reduction splitting rule; waveform recognition
problem)
Stone, C. J.
Stone, M.
Stopping rules
Subsampling
Sutherland, D. H.
Terminal subset (see node, terminal)
Test sample L2
unbalanced, as a source of bias
validation sample in diagnosing heart attacks
Test sample estimation
unbalanced test samples
THAID
Tree (see binary tree, pruned subtree)
allowable
algorithm for computing
branch of
root
subtree
Tree construction, elements of
Tree stability (see instability of tree structures)
True misclassification rate
Validation sample (see test sample)
Vapnik, V. N., and Chervonenkis, A. Ya.
Variable
categorical
combinations
dependent Y,y
importance
in Boston housing values example
in digit recognition problem
in waveform recognition problem
independent x
masked
numerical
ordered
ranking
in waveform data
selection
in diagnosing heart attack
in prognosis following heart attack
surrogate (see split, surrogate)
Waveform recognition problem
accuracy of CART in
accuracy of nearest neighbor in
accuracy of stepwise discriminant in
altered priors in
construction and use of feature in
error rate of Bayes rule
estimation of risk in
final tree
final tree selection in
instability of tree structures in
linear combination splits in
misclassification costs in
missing data in
one SE rule in
overall misclassification rate
prior probabilities in
splitting criteria in
variable importance in
Weakest link cutting
¹ We wish to thank David Pollard for a clear explanation of some of the key ideas used
in this proof.