GUIDE User Manual 26.0
Department of Statistics, University of Wisconsin-Madison

Based on work partially supported by grants from the U.S. Army Research Office, National Science Foundation, National Institutes of Health, Bureau of Labor Statistics, and Eli Lilly & Co. Work on precursors to GUIDE additionally supported by IBM and Pfizer.
Contents
1 Warranty disclaimer 4
2 Introduction 5
2.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 LaTeX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Program operation 10
3.1 Required files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Input file creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Classification 13
4.1 Univariate splits, ordinal predictors: glaucoma data . . . . . . . . . . 13
4.1.1 Input file generation . . . . . . . . . . . . . . . . . . . . . . . 14
4.1.2 Contents of glaucoma.in . . . . . . . . . . . . . . . . . . . . 15
4.1.3 Executing the program . . . . . . . . . . . . . . . . . . . . . . 16
4.1.4 Interpreting the output file . . . . . . . . . . . . . . . . . . . . 19
4.2 Linear splits: glaucoma data . . . . . . . . . . . . . . . . . . . . . . . 26
5 Regression 92
5.1 Least squares constant: birthwt data . . . . . . . . . . . . . . . . . . 93
5.1.1 Input file creation . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.1.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.1.3 Contents of cons.var . . . . . . . . . . . . . . . . . . . . . . 108
5.2 Least squares simple linear: birthwt data . . . . . . . . . . . . . . . . 109
5.2.1 Input file creation . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.2.3 Contents of lin.var . . . . . . . . . . . . . . . . . . . . . . . 117
5.2.4 Contents of lin.reg . . . . . . . . . . . . . . . . . . . . . . . 117
5.3 Multiple linear: birthwt data . . . . . . . . . . . . . . . . . . . . . . . 117
5.3.1 Input file creation . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.3.3 Contents of mul.var . . . . . . . . . . . . . . . . . . . . . . . 124
5.3.4 Contents of mul.reg . . . . . . . . . . . . . . . . . . . . . . . 125
5.4 Stepwise linear: birthwt data . . . . . . . . . . . . . . . . . . . . . . 125
5.4.1 Input file creation . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.4.2 Contents of step.reg . . . . . . . . . . . . . . . . . . . . . . 128
5.5 Best ANCOVA: birthwt data . . . . . . . . . . . . . . . . . . . . . . 128
5.5.1 Input file creation . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.5.3 Contents of ancova.reg . . . . . . . . . . . . . . . . . . . . . 135
5.6 Quantile regression: birthwt data . . . . . . . . . . . . . . . . . . . . 135
5.6.1 Piecewise constant: 1 quantile . . . . . . . . . . . . . . . . . . 135
1 Warranty disclaimer
Redistribution and use in binary forms, with or without modification, are permitted
provided that the following condition is met:
Redistributions in binary form must reproduce the above copyright notice, this
condition and the following disclaimer in the documentation and/or other materials
provided with the distribution.
THIS SOFTWARE IS PROVIDED BY WEI-YIN LOH AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL WEI-YIN LOH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The views and conclusions contained in the software and documentation are those
of the author and should not be interpreted as representing official policies, either
expressed or implied, of the University of Wisconsin.
2 Introduction
GUIDE stands for Generalized, Unbiased, Interaction Detection and Estimation. It
is the only classification and regression tree algorithm with all these features:
3. Weighted least squares, least median of squares, quantile, Poisson, and relative
risk (proportional hazards) regression models.
7. Categorical variables for splitting only, or for both splitting and fitting (via 0-1
dummy variables), in regression tree models.
Tables 1 and 2 compare the features of GUIDE with CRUISE (Kim and Loh, 2001, 2003), QUEST (Loh and Shih, 1997), C4.5 (Quinlan, 1993), RPART,[1] and M5 (Quinlan, 1992; Witten and Frank, 2000).
The GUIDE algorithm is documented in Loh (2002) for regression trees and
Loh (2009) for classification trees. Loh (2008a), Loh (2011) and Loh (2014) review
the subject. Advanced features of the algorithm are reported in Chaudhuri and Loh
(2002), Loh (2006b), Kim et al. (2007), Loh et al. (2007), and Loh (2008b). For a list
of third-party applications of GUIDE, CRUISE, QUEST, and the logistic regression
tree algorithm LOTUS (Chan and Loh, 2004; Loh, 2006a), see
http://www.stat.wisc.edu/~loh/apps.html
This manual illustrates the use of the program and interpretation of the output.
[1] RPART is an implementation of CART (Breiman et al., 1984) in R. CART is a registered trademark of California Statistical Software, Inc.
2.1 Installation
GUIDE is available free from www.stat.wisc.edu/~loh/guide.html in the form of compiled 32- and 64-bit executables for Linux, Mac OS X, and Windows on Intel and compatible processors. Data and description files used in this manual are in the zip file www.stat.wisc.edu/~loh/treeprogs/guide/datafiles.zip.
Linux: There are three 64-bit executables to choose from: Intel and NAG (for Red Hat 6.8), and Gfortran (for Ubuntu 16.04). Make the unzipped file executable
by issuing this command in a Terminal application in the folder where the file
is located: chmod a+x guide.
Mac OS X: There are three executables to choose from. Make the unzipped file
executable by issuing this command in a Terminal application in the folder
where the file is located: chmod a+x guide
Windows: There are four executables to choose from: Intel (64 or 32 bit), Absoft (64 bit), and Gfortran (64 bit). The 32-bit executable may run a bit faster, but the 64-bit versions can handle larger arrays. Download the 32- or 64-bit executable guide.zip and unzip it (right-click on the file icon and select Extract
all). The resulting file guide.exe may be placed in one of three places:
2.2 LaTeX
GUIDE uses the public-domain software LaTeX (http://www.ctan.org) to produce
tree diagrams. The specific locations are:
After LaTeX is installed, a pdf file of a LaTeX file, called diagram.tex say, produced
by GUIDE can be obtained by typing these three commands in a terminal window:
1. latex diagram
2. dvips diagram
3. ps2pdf diagram.ps
The first command produces a file called diagram.dvi which the second command uses to create a postscript file called diagram.ps. The latter can be viewed
and printed if a postscript viewer (such as Preview for the Mac) is installed. If
no postscript viewer is available, the last command can be used to convert the
postscript file into a pdf file, which can be viewed and printed with Adobe Reader.
The file diagram.tex can be edited to change colors, node sizes, etc. See, e.g.,
http://tug.org/PSTricks/main.cgi/.
Windows users: Convert the postscript figure to Enhanced Metafile (EMF) format for use in Windows applications such as Word or PowerPoint. Many conversion programs are available on the web, such as Graphic Converter
(http://www.graphic-converter.net/) and pstoedit (http://www.pstoedit.net/).
3 Program operation
3.1 Required files
The GUIDE program requires two text files for input.
Data file: This file contains the training sample. Each file record consists of observations on the response (i.e., dependent) variable, the predictor (i.e., X or independent) variables, and optional weight and time variables. Entries in each record are comma, space, or tab delimited (multiple consecutive spaces are treated as a single delimiter, but multiple consecutive commas are not). A record can occupy more than one line in the file, but each record must begin on a new line.
Values of categorical variables can contain any ascii character except single
and double quotation marks, which are used to enclose values that contain
spaces and commas. Values can be up to 60 characters long. Class labels are
truncated to 10 characters in tabular displays.
A common problem among first-time users is getting the data file in proper shape. If the data are in a spreadsheet and there are no empty cells, export them to an MS-DOS Comma Separated (csv) file (the MS-DOS CSV format takes care of carriage return and line feed characters properly). If there are empty cells, a good solution is to read the spreadsheet into R (using read.csv with proper specification of the na.strings argument), verify that the data are correctly read, and then export them to a text file using either write.table or write.csv, as in the sketch below.
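For example, a minimal R sketch (the file name mydata.csv and the extra na.strings codes are hypothetical; adjust them to whatever your spreadsheet uses for empty cells):

# Read the spreadsheet export, turning blank cells and "?" into NA
dat <- read.csv("mydata.csv", na.strings = c("", "?"))
str(dat)   # verify that every variable was read with the expected type
# Write a comma-delimited text file using NA as the missing value code
write.csv(dat, "mydata.txt", row.names = FALSE, na = "NA")

The second line of the description file should then give NA as the missing value code.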
Description file: This provides information about the name and location of the
data file, names and column positions of the variables, and their roles in the
analysis. Different models may be fitted by changing the roles of the variables.
We demonstrate with the text files glaucoma.rdata and glaucoma.dsc from www.stat.wisc.edu/~loh/treeprogs/guide/datafiles.zip or from the R package ipred (Peters and Hothorn, 2015). The data give the values of 66
variables obtained from a laser scan image of the optic nerve for 85 normal
people and 85 people with glaucoma. The response variable is Class (normal
or glaucoma). The top and bottom lines of the file glaucoma.dsc are:
glaucoma.rdata
NA
2
1 ag n
2 at n
3 as n
4 an n
5 ai n
:
63 tension n
64 clv n
65 cs n
66 lora n
67 Class d
The 1st line gives the name of the data file. If the latter is not in the current folder, give its full path (e.g., "c:\data\glaucoma.rdata") surrounded by quotes (because it contains backslashes). The 2nd line gives the missing value
code, which can be up to 80 characters long. If it contains non-alphanumeric
characters, it too must be surrounded by quotation marks. A missing value
code must appear in the second line of the file even if there are no missing
values in the data (in which case any character string not present among the
data values can be used). The 3rd line gives the line number of the first data
record in the data file. Because glaucoma.rdata has the variable names in the
first row, a 2 is placed on the third line of glaucoma.dsc. Blank lines in
the data and description files are ignored. The position, name and role of each
variable comes next (in that order), with one line for each variable.
Variable names must begin with a letter and be no more than 60 characters long. If a name contains non-alphanumeric characters, it must be enclosed
in matching single or double quotes. Spaces and the four characters #, %, {,
and } are replaced by dots (periods) if they appear in a name. Variable names
are truncated to 10 characters in tabular output. Leading and trailing spaces
are dropped.
The following roles for the variables are permitted (lower- and upper-case letters are accepted); a hypothetical example follows the list.
b Categorical variable that is used both for splitting and for node modeling in
regression. It is transformed to 0-1 dummy variables for node modeling.
It is converted to c type for classification.
c Categorical variable used for splitting only.
d Dependent variable. Except for multi-response data (see Sec. 5.10), there
can only be one such variable. In the case of relative risk models, this
is the death indicator. The variable can take character string values for
classification.
f Numerical variable used only for fitting the linear models in the nodes of the
tree. It is not used for splitting the nodes and is disallowed in classification.
n Numerical variable used both for splitting the nodes and for fitting the node
models. It is converted to type s in classification.
r Categorical treatment (Rx) variable used only for fitting the linear models
in the nodes of the tree. It is not used for splitting the nodes. If this
variable is present, all n variables are automatically changed to s.
s Numerical-valued variable only used for splitting the nodes. It is not used as
a regressor in the linear models. This role is suitable for ordinal categorical
variables if they are given numerical values that reflect the orderings.
t Survival time (for proportional hazards models) or observation time (for
longitudinal models) variable.
w Weight variable for weighted least squares regression or for excluding ob-
servations in the training sample from tree construction. See section 9.2
for the latter. Except for longitudinal models, a record with a missing
value in a d, t, or z-variable is automatically assigned zero weight.
x Excluded variable. This allows models to be fitted to different subsets of
the variables without reformatting the data file.
z Offset variable used only in Poisson regression.
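For illustration, a hypothetical description file (all file and variable names invented) for a weighted least-squares regression with one excluded variable might look like this:

mydata.txt
NA
1
1 income d
2 age n
3 sex c
4 region c
5 sampwt w
6 id x

Changing the role letters (for example, n to s, or c to x) fits a different model without reformatting the data file.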
Windows. The terminal program is started from the Start button by choosing
All Programs → Accessories → Command Prompt
After the terminal window is opened, change to the folder where the data and pro-
gram files are stored. For Windows users who do not know how to do this, read
http://www.digitalcitizen.life/command-prompt-how-use-basic-commands.
2. Convert the data file into a format suitable for importation into database, spread-
sheet, or statistics software. See Table 2 for the statistical packages supported.
Section 9.5 has an example.
4 Classification
4.1 Univariate splits, ordinal predictors: glaucoma data
We first show how to generate an input file to produce a classification tree from
the data in the file glaucoma.rdata, using the default options. Whenever you are
prompted for a selection, there is usually a range of permissible values given within
square brackets and a default choice (indicated by the symbol <cr>=). The default
may be selected by pressing the ENTER or RETURN key. Annotations are printed in
blue italics in this manual.
normal 85 0.50000000
Total #cases w/ #missing
#cases miss. D ord. vals #X-var #N-var #F-var #S-var #B-var #C-var
170 0 17 0 0 0 66 0 0
No. cases used for training: 170
No. cases excluded due to 0 weight or missing D: 0
Finished reading data file
Choose 1 for estimated priors, 2 for equal priors, 3 for priors from a file
Input 1, 2, or 3 ([1:3], <cr>=1):
See other parts of manual for examples of equal and specified priors.
Choose 1 for unit misclassification costs, 2 to input costs from a file
Input 1 or 2 ([1:2], <cr>=1):
Input 1 for LaTeX tree code, 2 to skip it ([1:2], <cr>=1):
Choose option 2 if you do not want LaTeX code.
Input file name to store LaTeX code (use .tex as suffix): glaucoma.tex
Input 2 to save individual fitted values and node IDs, 1 otherwise ([1:2], <cr>=2):
Input name of file to store node ID and fitted value of each case: glaucoma.fit
This file will contain the node number and predicted class for each observation.
Input file is created!
Run GUIDE with the command: guide < glaucoma.in
GUIDE reads only the first item in each line; the rest of the line is a comment for
human consumption. It is generally not advisable for the user to edit this file because
each question depends on the answers given to previous questions.
This produces the following output to the screen. The alternative command guide
< glaucoma.in > log.txt sends the screen output to the file log.txt.
GUIDE Classification and Regression Trees and Forests
Version 26.0 (Build date: June 1, 2017)
Compiled with GFortran 6.2.0 on Mac OS X Sierra 10.12.5
Copyright (c) 1997-2017 Wei-Yin Loh. All rights reserved.
This software is based upon work supported by the U.S. Army Research Office,
the National Science Foundation and the National Institutes of Health.
Input 0 for linear, interaction and univariate splits (in this order),
1 for univariate, linear and interaction splits (in this order),
2 to skip linear splits,
3 to skip linear and interaction splits: 1
Input 1 to prune by CV, 2 by test sample, 3 for no pruning: 1
The final pruned tree is marked with two asterisks (**); it has 4 terminal nodes.
58 mr s 5.9900E-01 1.2190E+00
59 rnf s -1.9000E-02 4.5100E-01
60 mdic s 1.2000E-02 6.6300E-01
61 emd s 4.7000E-02 7.4300E-01
62 mv s 0.0000E+00 1.8300E-01
63 tension s 1.0000E+01 2.5000E+01 4
64 clv s 0.0000E+00 1.4600E+02 12
65 cs s 3.3000E-01 1.9100E+00 1
66 lora s 0.0000E+00 9.2578E+01
67 Class d 2
This shows the type, minimum, maximum and number of missing values of each variable.
Total #cases w/ #missing
#cases miss. D ord. vals #X-var #N-var #F-var #S-var #B-var #C-var
170 0 17 0 0 0 66 0 0
This shows the number of each type of variable.
No. cases used for training: 170
No. cases excluded due to 0 weight or missing D: 0
having mean CV cost within the specified standard error (SE) bounds.
The mean CV costs and SEs are given in the 3rd and 4th columns.
The other columns are bootstrap estimates used for experimental purposes.
***************************************************************
glaucoma 7 1.00000
normal 0 0.00000
Number of training cases misclassified = 0
Predicted class is glaucoma
----------------------------
Node 3: Intermediate node
A case goes into Node 6 if clv <= 2.0000000E+00
clv mean = 3.5821E+01
Class Number ClassPrior
glaucoma 78 0.80412
normal 19 0.19588
Number of training cases misclassified = 19
Predicted class is glaucoma
----------------------------
Node 6: Terminal node
Class Number ClassPrior
glaucoma 1 0.06667
normal 14 0.93333
Number of training cases misclassified = 1
Predicted class is normal
----------------------------
Node 7: Terminal node
Class Number ClassPrior
glaucoma 77 0.93902
normal 5 0.06098
Number of training cases misclassified = 5
Predicted class is glaucoma
----------------------------
Figure 1 shows the classification tree drawn by LaTeX using the file glaucoma.tex.
The last sentence in its caption gives the second best variable for splitting the root
[Figure 1 here: classification tree splitting on lora and clv, with a scatter plot of clv versus lora showing the induced partitions and the two classes.]
Figure 1: GUIDE v.26.0 0.50-SE classification tree for predicting Class using estimated priors and unit misclassification costs. At each split, an observation goes to the left branch if and only if the condition is satisfied. The symbol ≤* stands for ≤ or missing. Predicted classes (based on estimated misclassification cost) printed below terminal nodes; sample sizes for Class = glaucoma and normal, respectively, beside nodes. Second best split variable at root node is clv.
node. The top lines of the file glaucoma.fit are shown below. Their order corre-
sponds to the order of the observations in the training sample file. The 1st column
(labeled train) indicates whether the observation is used (y) or not used (n) to
fit the model. Since we used the entire data set to fit the model here, all the entries
in the first column are y. The 2nd column gives the terminal node number that the
observation belongs to and the 3rd and 4th columns give its observed and predicted
classes.
train node observed predicted
y 4 "normal" "normal"
y 4 "normal" "normal"
y 4 "normal" "normal"
y 4 "normal" "normal"
y 6 "normal" "normal"
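The file can be read back into R for further analysis; a sketch, assuming the whitespace-delimited layout with header row shown above:

# Resubstitution confusion matrix and training error rate
fit <- read.table("glaucoma.fit", header = TRUE)
with(fit, table(observed, predicted))
mean(fit$observed != fit$predicted)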
Input file name to store LaTeX code (use .tex as suffix): lin.tex
Input 1 to include node numbers, 2 to omit them ([1:2], <cr>=1):
Choosing 2 will give a tree with no node labels.
Input 1 to number all nodes, 2 to number leaves only ([1:2], <cr>=1):
Input 1 to color terminal nodes, 2 otherwise ([1:2], <cr>=1):
Choose amount of detail in nodes of LaTeX tree diagram
Input 0 for #errors, 1 for class sizes, 2 for nothing ([0:2], <cr>=1):
Choose 2 if a large tree is expected.
You can store the variables and/or values used to split and fit in a file
Choose 1 to skip this step, 2 to store split variables and their values
Input your choice ([1:2], <cr>=1): 2
Choose 2 to output the info to another file for further processing.
Input file name: linvar.txt
Input 2 to save individual fitted values and node IDs, 1 otherwise ([1:2], <cr>=2):
Input name of file to store node ID and fitted value of each case: lin.fit
Input 2 to save terminal node IDs for importance scoring; 1 otherwise ([1:2], <cr>=1):
Input 2 to write R function for predicting new cases, 1 otherwise ([1:2], <cr>=1):2
Input file name: linpred.r
Input file is created!
Run GUIDE with the command: guide < lin.in
Running GUIDE with the input file yields the following results. The LaTeX tree diagram and partitions are shown in Figure 2.
Node 1: 4.1110165E-01 * clv + lora <= 5.9402920E+01
Node 2: normal
Node 1: 4.1110165E-01 * clv + lora > 5.9402920E+01 or NA
Node 3: glaucoma
Each row refers to a node. The 1st column gives the node number. The 2nd column
contains the letter l, n, s, c, or t, indicating a linear split on two variables, a split on an n variable, a split on an s variable, a split on a c variable, or a terminal node, respectively. The 3rd and 4th columns give
the names of the 2 variables in a bivariate split or the names of the split variable and
the interacting variable in a univariate split. If a node cannot be split, the words
NONE are printed. If a node is terminal, the predicted class is printed in the 5th
column. Otherwise, if it is a non-terminal node, the 5th column gives the number of
values to follow. In the above example, the 2 in the 5th column of each non-terminal
node indicates that it is followed by two parameter values defining the linear split. If
[Figure 2 here: two-node classification tree with a linear split on clv and lora, and a scatter plot of clv versus lora showing the linear partition.]
Figure 2: GUIDE v.26.0 0.50-SE classification tree for predicting Class using linear
split priority, estimated priors and unit misclassification costs. At each split, an
observation goes to the left branch if and only if the condition is satisfied. Predicted
classes (based on estimated misclassification cost) printed below terminal nodes;
sample sizes for Class = glaucoma and normal, respectively, beside nodes.
the split is on a categorical variable, the 5th column gives the number of categorical
values defining the split and the 6th and subsequent columns give their values.
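Putting this together, the stored file for the tree above might look roughly as follows (a hypothetical reconstruction from the column description, not verbatim program output):

1 l clv lora 2 4.1110165E-01 5.9402920E+01
2 t NONE NONE normal
3 t NONE NONE glaucoma

Node 1 has a linear (l) split on clv and lora defined by the two parameter values following the 2; nodes 2 and 3 are terminal (t) with their predicted classes in the 5th column.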
Contents of linpred.r: This file contains the following R function for predicting
future observations:
predicted <- function(){
if(!is.na(lora) & !is.na(clv) & 0.41110164757186696*clv + lora <= 59.402920297324165){
nodeid <- 2
predict <- "normal"
} else {
nodeid <- 3
predict <- "glaucoma"
}
return(c(nodeid,predict))
}
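Note that the generated function takes no arguments: it reads the predictor values lora and clv from the environment in which it is defined, here the R workspace. A usage sketch with hypothetical predictor values:

source("linpred.r")
lora <- 50.2   # hypothetical new case
clv <- 10.1
predicted()    # returns c("2", "normal"); c() coerces the node ID to character

Since 0.411 * 10.1 + 50.2 is about 54.4, below the split point 59.4, the case lands in node 2 and is predicted normal.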
Input file name to store LaTeX code (use .tex as suffix): peptide.tex
Input 2 to save individual fitted values and node IDs, 1 otherwise ([1:2], <cr>=2):
Input name of file to store node ID and fitted value of each case: peptide.fit
Input file is created!
Run GUIDE with the command: guide < peptide.in
4.3.2 Results
Results from the output file peptide.out follow.
Classification tree
Pruning by cross-validation
Data description file: segal.dsc
Training sample file: segal.dat
Missing value code: NA
Records in data file start on line 2
Dependent variable is bind
Number of records in data file: 310
Length of longest data entry: 6
Class proportions of dependent variable bind:
Number of classes: 2
Class #Cases Proportion
0 129 0.41612903
1 181 0.58387097
Classification tree:
At each categorical variable split, values not in training data go right
***************************************************************
[Figure 3 here: two-node classification tree splitting on pos5.]
Figure 3: GUIDE v.26.0 0.50-SE classification tree for predicting bind using esti-
mated priors and unit misclassification costs. At each split, an observation goes to
the left branch if and only if the condition is satisfied. For splits on categorical vari-
ables, values not present in the training sample go to the right. Set S1 = {F, M, Y}.
Predicted classes (based on estimated misclassification cost) printed below terminal
nodes; sample sizes for bind = 0 and 1, respectively, beside nodes. Second best split
variable at root node is pos1.
The results indicate that the largest tree before pruning has 10 terminal nodes.
The pruned tree (marked by **) has 2 terminal nodes. Its cross-validation estimate
of misclassification cost (or error rate here) is 0.1097. Figure 3 shows the pruned
tree. It splits on pos5, sending values A, C, D, E, G, H, I, K, L, N, P, Q, R,
S, T, V, and W to the left node. The second best variable to split the root node is
pos1.
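The two-node rule can be written out directly; a sketch in R, with the left-branch value set copied from the preceding sentence and the predicted classes read off Figure 3:

left_set <- c("A","C","D","E","G","H","I","K","L","N","P","Q","R","S","T","V","W")
predict_bind <- function(pos5) {
  # values not in the left set, including unseen ones, go right (see caption)
  if (pos5 %in% left_set) "1" else "0"
}
predict_bind("F")   # "0": F is in S1 = {F, M, Y}, which goes to the right node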
live. That is, 79% of the individuals are in the live class. The contents of
hepdsc.txt are:
hepdat.txt
"?"
1
1 CLASS d
2 AGE n
3 SEX c
4 STEROID c
5 ANTIVIRALS c
6 FATIGUE c
7 MALAISE c
8 ANOREXIA c
9 BIGLIVER c
10 FIRMLIVER c
11 SPLEEN c
12 SPIDERS c
13 ASCITES c
14 VARICES c
15 BILIRUBIN n
16 ALKPHOSPHATE n
17 SGOT n
18 ALBUMIN n
19 PROTIME n
20 HISTOLOGY c
Using the default estimated priors yields a null tree with no splits. To obtain a
nonnull tree, we choose equal priors here.
0. Read the warranty disclaimer
1. Create an input file for model fitting or importance scoring (recommended)
2. Convert data to other formats without creating input file
Input your choice: 1
Name of batch input file: hepeq.in
Input 1 for model fitting, 2 for importance or DIF scoring,
3 for data conversion ([1:3], <cr>=1):
Name of batch output file: hepeq.out
Input 1 for single tree, 2 for ensemble ([1:2], <cr>=1):
Input 1 for classification, 2 for regression, 3 for propensity score grouping
(propensity score grouping is an experimental option)
Input your choice ([1:3], <cr>=1):
Input 1 for default options, 2 otherwise ([1:2], <cr>=1): 2
Option 2 is needed for equal or specified priors.
Input 1 for simple, 2 for nearest-neighbor, 3 for kernel method ([1:3], <cr>=1):
Input 0 for linear, interaction and univariate splits (in this order),
1 for univariate, linear and interaction splits (in this order),
155 0 72 0 0 0 6 0 13
No. cases used for training: 155
No. cases excluded due to 0 weight or missing D: 0
Finished reading data file
Default number of cross-validations: 10
Input 1 to accept the default, 2 to change it ([1:2], <cr>=1):
Best tree may be chosen based on mean or median CV estimate
Input 1 for mean-based, 2 for median-based ([1:2], <cr>=1):
Input number of SEs for pruning ([0.00:1000.00], <cr>=0.50):
Choose 1 for estimated priors, 2 for equal priors, 3 for priors from a file
Input 1, 2, or 3 ([1:3], <cr>=1):2
Option 2 is for equal priors.
Choose 1 for unit misclassification costs, 2 to input costs from a file
Input 1 or 2 ([1:2], <cr>=1):
Choose a split point selection method for numerical variables:
Choose 1 to use faster method based on sample quantiles
Choose 2 to use exhaustive search
Input 1 or 2 ([1:2], <cr>=2):
Default max. number of split levels: 10
Input 1 to accept this value, 2 to change it ([1:2], <cr>=1):
Default minimum node sample size is 2
Input 1 to use the default value, 2 to change it ([1:2], <cr>=1):
Input 1 for LaTeX tree code, 2 to skip it ([1:2], <cr>=1):
Input file name to store LaTeX code (use .tex as suffix): hepeq.tex
Input 1 to include node numbers, 2 to omit them ([1:2], <cr>=1):
Input 1 to number all nodes, 2 to number leaves only ([1:2], <cr>=1):
Input 1 to color terminal nodes, 2 otherwise ([1:2], <cr>=1):
Choose amount of detail in nodes of LaTeX tree diagram
Input 0 for #errors, 1 for class sizes, 2 for nothing ([0:2], <cr>=1):
You can store the variables and/or values used to split and fit in a file
Choose 1 to skip this step, 2 to store split and fit variables,
3 to store split variables and their values
Input your choice ([1:3], <cr>=1):3
Input file name: hepvar.txt
Contents of this file are shown below.
Input 2 to save individual fitted values and node IDs, 1 otherwise ([1:2], <cr>=2):
Input name of file to store node ID and fitted value of each case: hepeq.fit
Input 2 to write R function for predicting new cases, 1 otherwise ([1:2], <cr>=1):
Input file is created!
Run GUIDE with the command: guide < hepeq.in
Node 6: live
Node 3: SPIDERS /= "no"
Node 7: die
Figure 4 shows the LaTeX trees using estimated priors (left) and equal priors
(right). Nodes that predict the same class have the same color. The tree using
equal priors has one more split (on SPIDERS). But both trees misclassify the same
number of samples. Therefore the left tree, being shorter, is preferred if priors are
estimated. On the other hand, since the ratio of die to live classes is 32:123,
equal priors makes each die observation equivalent to r = 123/32 = 3.84375 live
observations. Consequently, a terminal node is classified as die if its ratio of live
to die observations is less than r. Note that although only 21% of the data are
in the die class, most of these individuals are in nodes 2 and 6 (70% and 31%,
respectively).
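The reweighting arithmetic can be checked directly; a sketch using the counts shown beside nodes 6 and 7 of the right tree in Figure 4:

r <- 123/32   # each die case counts as r = 3.84375 live cases
node_label <- function(n_live, n_die) {
  if (n_live / n_die < r) "die" else "live"
}
node_label(88, 5)    # "live": 88/5 = 17.6 is not less than r
node_label(29, 13)   # "die":  29/13 = 2.23 is less than r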
Unequal misclassification costs may be specified in place of priors, using a cost matrix

C(i, j) =  0  1
           4  0

where C(i, j) denotes the cost of classifying an observation as class i given that it
belongs to class j. Note that GUIDE sorts the class values in alphabetical order, so
that die is treated as class 1 and live as class 2 here. This matrix is saved in the
text file cost.txt which has these two lines:
0 1
4 0

[Figure 4 here: classification trees splitting on ASCITES (left) and on ASCITES and SPIDERS (right).]
Figure 4: GUIDE v.26.0 0.50-SE classification tree for predicting CLASS using estimated (left) and equal (right) priors and unit misclassification costs. At each split, an observation goes to the left branch if and only if the condition is satisfied. Predicted classes (based on estimated misclassification cost) printed below terminal nodes; sample sizes for CLASS = die and live, respectively, beside nodes. Second best split variable at root node is SPIDERS.
The following lines in the input file generation step show where this file is used:
Choose 1 for estimated priors, 2 for equal priors, 3 to input the priors from a file
Input 1, 2, or 3 ([1:3], <cr>=1):
Choose 1 for unit misclassification costs, 2 to input costs from a file
Input 1 or 2 ([1:2], <cr>=1): 2
Input the name of a file containing the cost matrix C(i|j),
where C(i|j) is the cost of classifying class j as class i
The rows of the matrix must be in alphabetical order of the class names
Input name of file: cost.txt
The resulting tree is the same as that for equal priors in Figure 4.
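To see how the cost matrix determines a node's label, note that the predicted class is the one minimizing the total misclassification cost over the cases in the node. A sketch, using the class counts of node 7 in Figure 4:

# C[i, j] = cost of classifying a class-j case as class i; rows and
# columns in alphabetical class order (die, live), as in cost.txt
C <- matrix(c(0, 1,
              4, 0), nrow = 2, byrow = TRUE)
rownames(C) <- colnames(C) <- c("die", "live")
counts <- c(die = 13, live = 29)   # node 7 class counts
costs <- C %*% counts              # total cost of labeling the node die or live
rownames(C)[which.min(costs)]      # "die", because 1*29 < 4*13

With unit costs this rule reduces to majority-rule classification, which is why the default option requires no cost file.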
Classification tree
Pruning by cross-validation
Data description file: derm.dsc
Training sample file: derm.dat
Missing value code: ?
Records in data file start on line 1
Warning: N variables changed to S
Dependent variable is class
Number of records in data file: 358
Length of longest data entry: 2
Number of classes: 6
Class proportions of dependent variable class:
Class #Cases Proportion
1 111 0.31005587
2 60 0.16759777
3 71 0.19832402
4 48 0.13407821
5 48 0.13407821
6 20 0.05586592
Classification tree:
***************************************************************
----------------------------
Node 4: Intermediate node
A case goes into Node 8 if fibrosis <= 5.0000000E-01 or ?
fibrosis mean = 3.7979E-01
Class Number ClassPrior
1 111 0.38676
2 60 0.20906
3 0 0.00000
4 48 0.16725
5 48 0.16725
6 20 0.06969
Number of training cases misclassified = 176
Predicted class is 1
----------------------------
Node 8: Intermediate node
A case goes into Node 16 if spongiosis <= 5.0000000E-01 or ?
spongiosis mean = 1.0544E+00
Class Number ClassPrior
1 111 0.46444
2 60 0.25105
3 0 0.00000
4 48 0.20084
5 0 0.00000
6 20 0.08368
Number of training cases misclassified = 128
Predicted class is 1
----------------------------
Node 16: Intermediate node
A case goes into Node 32 if elongation <= 5.0000000E-01
elongation mean = 2.0909E+00
Class Number ClassPrior
1 111 0.91736
2 3 0.02479
3 0 0.00000
4 1 0.00826
5 0 0.00000
6 6 0.04959
Number of training cases misclassified = 10
Predicted class is 1
----------------------------
Node 32: Terminal node
Class Number ClassPrior
1 0 0.00000
2 3 0.33333
3 0 0.00000
4 1 0.11111
5 0 0.00000
6 5 0.55556
Number of training cases misclassified = 4
Predicted class is 6
----------------------------
Node 33: Terminal node
Class Number ClassPrior
1 111 0.99107
2 0 0.00000
3 0 0.00000
4 0 0.00000
5 0 0.00000
6 1 0.00893
Number of training cases misclassified = 1
Predicted class is 1
----------------------------
Node 17: Intermediate node
A case goes into Node 34 if perifoll <= 5.0000000E-01 or ?
perifoll mean = 2.6271E-01
Class Number ClassPrior
1 0 0.00000
2 57 0.48305
3 0 0.00000
4 47 0.39831
5 0 0.00000
6 14 0.11864
Number of training cases misclassified = 61
Predicted class is 2
----------------------------
Node 34: Intermediate node
A case goes into Node 68 if koebner <= 5.0000000E-01 or ?
koebner mean = 5.4369E-01
Class Number ClassPrior
1 0 0.00000
2 56 0.54369
3 0 0.00000
4 47 0.45631
5 0 0.00000
6 0 0.00000
Number of training cases misclassified = 47
Predicted class is 2
----------------------------
Node 68: Intermediate node
A case goes into Node 136 if disappea <= 5.0000000E-01 or ?
disappea mean = 9.3750E-02
Class Number ClassPrior
1 0 0.00000
2 55 0.85938
3 0 0.00000
4 9 0.14063
5 0 0.00000
6 0 0.00000
Number of training cases misclassified = 9
Predicted class is 2
----------------------------
Node 136: Terminal node
Class Number ClassPrior
1 0 0.00000
2 55 0.94828
3 0 0.00000
4 3 0.05172
5 0 0.00000
6 0 0.00000
Number of training cases misclassified = 3
Predicted class is 2
----------------------------
Node 137: Terminal node
Class Number ClassPrior
1 0 0.00000
2 0 0.00000
3 0 0.00000
4 6 1.00000
5 0 0.00000
6 0 0.00000
Number of training cases misclassified = 0
Predicted class is 4
----------------------------
Node 69: Terminal node
Class Number ClassPrior
1 0 0.00000
2 1 0.02564
3 0 0.00000
4 38 0.97436
5 0 0.00000
6 0 0.00000
Number of training cases misclassified = 1
Predicted class is 4
----------------------------
Node 35: Terminal node
Class Number ClassPrior
1 0 0.00000
2 1 0.06667
3 0 0.00000
4 0 0.00000
5 0 0.00000
6 14 0.93333
Number of training cases misclassified = 1
Predicted class is 6
----------------------------
Node 9: Terminal node
Class Number ClassPrior
1 0 0.00000
2 0 0.00000
3 0 0.00000
4 0 0.00000
5 48 1.00000
6 0 0.00000
Number of training cases misclassified = 0
Predicted class is 5
----------------------------
Node 5: Terminal node
Class Number ClassPrior
1 0 0.00000
2 0 0.00000
3 3 1.00000
4 0 0.00000
5 0 0.00000
6 0 0.00000
Number of training cases misclassified = 0
Predicted class is 3
----------------------------
Node 3: Terminal node
Class Number ClassPrior
1 0 0.00000
2 0 0.00000
3 68 1.00000
4 0 0.00000
5 0 0.00000
6 0 0.00000
Number of training cases misclassified = 0
Predicted class is 3
----------------------------
2 0 55 0 3 0 0
3 0 0 71 0 0 0
4 0 1 0 44 0 0
5 0 0 0 0 48 0
6 0 4 0 1 0 19
Total 111 60 71 48 48 20
[Tree diagram here: classification tree with splits on polypap, oralmuc, fibrosis, spongiosis, elongation, perifoll, koebner, and disappea; number misclassified/sample size beside nodes.]
Results
Classification tree
Pruning by cross-validation
Data description file: derm.dsc
Training sample file: derm.dat
Missing value code: ?
Records in data file start on line 1
Warning: N variables changed to S
Dependent variable is class
Number of records in data file: 358
Length of longest data entry: 2
Number of classes: 6
Class proportions of dependent variable class:
Class #Cases Proportion
1 111 0.31005587
2 60 0.16759777
3 71 0.19832402
4 48 0.13407821
5 48 0.13407821
6 20 0.05586592
Classification tree:
***************************************************************
Fit variable
Class Number ClassPrior fibrosis
1 111 0.38276
2 60 0.20690
3 3 0.01034
4 48 0.16552
5 48 0.16552
6 20 0.06897
Number of training cases misclassified = 131
If node model is inapplicable due to missing values, predicted class =
1
----------------------------
Node 4: Intermediate node
A case goes into Node 8 if spongiosis <= 5.0000000E-01 or ?
Nearest-neighbor K = 6
Fit variable
Class Number ClassPrior follipap
1 0 0.00000
2 57 0.47899
3 1 0.00840
4 47 0.39496
5 0 0.00000
6 14 0.11765
Number of training cases misclassified = 50
If node model is inapplicable due to missing values, predicted class =
2
----------------------------
Node 18: Terminal node
Nearest-neighbor K = 5
koebner mean = 5.3846E-01
Fit variable
Class Number ClassPrior koebner
1 0 0.00000
2 56 0.53846
3 1 0.00962
4 47 0.45192
5 0 0.00000
6 0 0.00000
----------------------------
Node 19: Terminal node
Nearest-neighbor K = 3
Class Number ClassPrior
1 0 0.00000
2 1 0.06667
3 0 0.00000
4 0 0.00000
5 0 0.00000
6 14 0.93333
----------------------------
Node 5: Terminal node
Nearest-neighbor K = 4
Class Number ClassPrior
1 0 0.00000
2 0 0.00000
3 0 0.00000
4 0 0.00000
5 48 1.00000
6 0 0.00000
----------------------------
Node 3: Terminal node
Nearest-neighbor K = 5
[Figure 6 here: classification tree with splits on polypap, fibrosis, spongiosis, and follipap.]
Figure 6: GUIDE v.26.0 0.50-SE classification tree for predicting class using univariate nearest-neighbor node models, estimated priors and unit misclassification costs. At each split, an observation goes to the left branch if and only if the condition is satisfied. The symbol ≤* stands for ≤ or missing. Predicted classes (based on estimated misclassification cost) printed below terminal nodes; #misclassified/sample size beside each node. Second best split variable at root node is bandlike.
#cases miss. D ord. vals #X-var #N-var #F-var #S-var #B-var #C-var
358 0 0 0 0 0 34 0 0
No. cases used for training: 358
Finished reading data file
Default number of cross-validations = 10
Input 1 to accept the default, 2 to change it ([1:2], <cr>=1):
Best tree may be chosen based on mean or median CV estimate
Input 1 for mean-based, 2 for median-based ([1:2], <cr>=1):
Input number of SEs for pruning ([0.00:1000.00], <cr>=0.50):
Choose 1 for estimated priors, 2 for equal priors, 3 for priors from a file
Input 1, 2, or 3 ([1:3], <cr>=1):
Choose 1 for unit misclassification costs, 2 to input costs from a file
Input 1 or 2 ([1:2], <cr>=1):
Choose a split point selection method for numerical variables:
Choose 1 to use faster method based on sample quantiles
Choose 2 to use exhaustive search
Input 1 or 2 ([1:2], <cr>=2):
Default max number of split levels = 10
Input 1 to accept this value, 2 to change it ([1:2], <cr>=1):
Default minimum node sample size is 10
Input 1 to use the default value, 2 to change it ([1:2], <cr>=1):
Input 1 for LaTeX tree code, 2 to skip it ([1:2], <cr>=1):
Input file name to store LaTeX code (use .tex as suffix): ker.tex
Input 1 to include node numbers, 2 to omit them ([1:2], <cr>=1):
Input 1 to number all nodes, 2 to number leaves only ([1:2], <cr>=1):
Input 1 to color terminal nodes, 2 otherwise ([1:2], <cr>=1):
Choose amount of detail in nodes of LaTeX tree diagram
Input 0 for #errors, 1 for class sizes, 2 for nothing ([0:2], <cr>=0):
You can store the variables and/or values used to split and fit in a file
Choose 1 to skip this step, 2 to store split and fit variables,
3 to store split variables and their values
Input your choice ([1:3], <cr>=1):
Input 2 to save individual fitted values and node IDs, 1 otherwise ([1:2], <cr>=2):
Input name of file to store node ID and fitted value of each case: ker.fit
Input 2 to save terminal node IDs for importance scoring; 1 otherwise ([1:2], <cr>=1):
Input name of file to store predicted class and probability: ker.pro
This file contains the estimated class probabilities for each observation.
Input file is created!
Run GUIDE with the command: guide < ker.in
Results
Classification tree
Pruning by cross-validation
Data description file: derm.dsc
Classification tree:
***************************************************************
6 0 0.00000 0.0000E+00
----------------------------
Node 19: Terminal node
Class Number ClassPrior
1 0 0.00000
2 1 0.06667
3 0 0.00000
4 0 0.00000
5 0 0.00000
6 14 0.93333
----------------------------
Node 5: Terminal node
Class Number ClassPrior
1 0 0.00000
2 0 0.00000
3 0 0.00000
4 0 0.00000
5 48 1.00000
6 0 0.00000
----------------------------
Node 3: Terminal node
Class Number ClassPrior
1 0 0.00000
2 0 0.00000
3 68 1.00000
4 0 0.00000
5 0 0.00000
6 0 0.00000
----------------------------
[Figure 7 here: classification tree with splits on polypap, fibrosis, spongiosis, and follipap.]
Figure 7: GUIDE v.26.0 0.50-SE classification tree for predicting class using univariate kernel discrimination node models, estimated priors and unit misclassification costs. At each split, an observation goes to the left branch if and only if the condition is satisfied. The symbol ≤* stands for ≤ or missing. Predicted classes (based on estimated misclassification cost) printed below terminal nodes; #misclassified/sample size beside each node. Second best split variable at root node is bandlike.
The tree is shown in Figure 7. Unlike the nearest-neighbor option, the kernel option can provide an estimated class probability vector for each observation. These are contained in the file ker.pro, the top few lines of which are given below. For example, the probabilities that the 1st observation belongs to classes 1-6 are (0, 0.844, 0.036, 0.119, 0, 0). The last two columns give the predicted and observed class of
the observation.
"1" "2" "3" "4" "5" "6" predicted observed
0.00000 0.84423 0.03637 0.11940 0.00000 0.00000 "2" "2"
0.99616 0.00000 0.00000 0.00000 0.00000 0.00384 "1" "1"
0.00000 0.00000 1.00000 0.00000 0.00000 0.00000 "3" "3"
0.99616 0.00000 0.00000 0.00000 0.00000 0.00384 "1" "1"
0.00000 0.00000 1.00000 0.00000 0.00000 0.00000 "3" "3"
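These probabilities can be post-processed in R; a sketch assuming the whitespace-delimited layout with header row shown above:

pro <- read.table("ker.pro", header = TRUE, check.names = FALSE)
probs <- as.matrix(pro[, 1:6])        # class probability columns "1" to "6"
range(rowSums(probs))                 # each row should sum to 1, up to rounding
mean(pro$predicted != pro$observed)   # resubstitution error rate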
[Figure 8 here: classification tree for the heart disease data with splits on rcaprox, lmt, rcadist, cxmain, ladprox, ramus, and om1.]
Figure 8: GUIDE v.26.0 0.50-SE classification tree for predicting num using estimated priors and unit misclassification costs. At each split, an observation goes to the left branch if and only if the condition is satisfied. The symbol ≤* stands for ≤ or missing. Predicted classes (based on estimated misclassification cost) printed below terminal nodes; sample sizes for num = 0, 1, 2, 3, and 4, respectively, beside nodes. Second best split variable at root node is rcaprox.
Classification tree
Pruning by cross-validation
Data description file: heartdsc.txt
Training sample file: heartdata.txt
Missing value code: NA
Records in data file start on line 2
Warning: N variables changed to S
Dependent variable is num
Number of records in data file: 617
Length of longest data entry: 9
Missing values found among categorical variables
Separate categories will be created for missing categorical variables
Number of classes: 5
Class proportions of dependent variable num:
Class #Cases Proportion
0 247 0.40032415
1 141 0.22852512
2 99 0.16045381
3 100 0.16207455
4 30 0.04862237
21 nitr c 2 63
22 pro c 2 61
23 diuretic c 2 80
24 proto c 14 112
25 thaldur s 1.0000E+00 2.4000E+01 56
26 thaltime s 0.0000E+00 2.0000E+01 384
27 met s 2.0000E+00 2.0000E+02 105
28 thalach s 6.0000E+01 1.9000E+02 55
29 thalrest s 3.7000E+01 1.3900E+02 56
30 tpeakbps s 1.0000E+02 2.4000E+02 63
31 tpeakbpd s 1.1000E+01 1.3400E+02 63
32 trestbpd s 0.0000E+00 1.2000E+02 59
33 exang c 2 55
34 xhypo c 2 58
35 oldpeak s -2.6000E+00 5.0000E+00 62
36 slope c 4 308
37 rldv5 s 2.0000E+00 3.6000E+01 143
38 rldv5e s 2.0000E+00 3.6000E+01 142
39 ca s 0.0000E+00 9.0000E+00 606
40 thal c 7 475
41 cyr s 1.0000E+00 8.7000E+01 9
42 num d 5
43 lmt s 0.0000E+00 1.6200E+02 275
44 ladprox s 1.0000E+00 2.0000E+00 236
45 laddist s 1.0000E+00 2.0000E+00 246
46 diag s 1.0000E+00 2.0000E+00 276
47 cxmain s 1.0000E+00 2.0000E+00 235
48 ramus s 1.0000E+00 2.0000E+00 285
49 om1 s 1.0000E+00 2.0000E+00 271
50 om2 s 1.0000E+00 2.0000E+00 290
51 rcaprox s 1.0000E+00 2.0000E+00 245
52 rcadist s 1.0000E+00 2.0000E+00 270
53 database c 3
Classification tree:
***************************************************************
----------------------------
Node 16: Intermediate node
A case goes into Node 32 if ladprox <= 1.5000000E+00 or NA
ladprox mean = 1.6818E+00
Class Number ClassPrior
0 188 0.89100
1 21 0.09953
2 2 0.00948
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 23
Predicted class is 0
----------------------------
Node 32: Intermediate node
A case goes into Node 64 if laddist <= 1.5000000E+00 or NA
laddist mean = 1.2857E+00
Class Number ClassPrior
0 188 0.95918
1 8 0.04082
2 0 0.00000
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 8
Predicted class is 0
----------------------------
Node 64: Terminal node
Class Number ClassPrior
0 188 0.98947
1 2 0.01053
2 0 0.00000
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 2
Predicted class is 0
----------------------------
Node 65: Terminal node
Class Number ClassPrior
0 0 0.00000
1 6 1.00000
2 0 0.00000
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 0
Predicted class is 1
----------------------------
Node 33: Terminal node
3 5 0.33333
4 0 0.00000
Number of training cases misclassified = 8
Predicted class is 2
----------------------------
Node 5: Terminal node
Class Number ClassPrior
0 0 0.00000
1 2 0.06250
2 8 0.25000
3 22 0.68750
4 0 0.00000
Number of training cases misclassified = 10
Predicted class is 3
----------------------------
Node 3: Intermediate node
A case goes into Node 6 if lmt <= 1.5000000E+00
lmt mean = 1.5601E+00
Class Number ClassPrior
0 59 0.17302
1 106 0.31085
2 73 0.21408
3 73 0.21408
4 30 0.08798
Number of training cases misclassified = 235
Predicted class is 1
----------------------------
Node 6: Intermediate node
A case goes into Node 12 if rcaprox <= 1.5000000E+00 or NA
rcaprox mean = 1.4137E+00
Class Number ClassPrior
0 58 0.18710
1 106 0.34194
2 73 0.23548
3 73 0.23548
4 0 0.00000
Number of training cases misclassified = 204
Predicted class is 1
----------------------------
Node 12: Intermediate node
A case goes into Node 24 if rcadist <= 1.5000000E+00 or NA
rcadist mean = 1.1444E+00
Class Number ClassPrior
0 58 0.31694
1 79 0.43169
2 31 0.16940
3 15 0.08197
4 0 0.00000
Number of training cases misclassified = 104
Predicted class is 1
----------------------------
Node 24: Intermediate node
A case goes into Node 48 if ladprox <= 1.5000000E+00 or NA
ladprox mean = 1.3290E+00
Class Number ClassPrior
0 58 0.36943
1 72 0.45860
2 27 0.17197
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 85
Predicted class is 1
----------------------------
Node 48: Intermediate node
A case goes into Node 96 if laddist <= 1.5000000E+00 or NA
laddist mean = 1.2212E+00
Class Number ClassPrior
0 58 0.54717
1 40 0.37736
2 8 0.07547
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 48
Predicted class is 0
----------------------------
Node 96: Intermediate node
A case goes into Node 192 if cxmain <= 1.5000000E+00 or NA
cxmain mean = 1.2439E+00
Class Number ClassPrior
0 58 0.69880
1 25 0.30120
2 0 0.00000
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 25
Predicted class is 0
----------------------------
Node 192: Terminal node
Class Number ClassPrior
0 58 0.92063
1 5 0.07937
2 0 0.00000
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 5
Predicted class is 0
----------------------------
Node 193: Terminal node
Class Number ClassPrior
0 0 0.00000
1 20 1.00000
2 0 0.00000
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 0
Predicted class is 1
----------------------------
Node 97: Intermediate node
A case goes into Node 194 if
1.0000000E+00 * ramus + cxmain <= 2.5000000E+00 or NA
Linear combination mean = 2.3043E+00
Class Number ClassPrior
0 0 0.00000
1 15 0.65217
2 8 0.34783
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 8
Predicted class is 1
----------------------------
Node 194: Terminal node
Class Number ClassPrior
0 0 0.00000
1 15 0.93750
2 1 0.06250
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 1
Predicted class is 1
----------------------------
Node 195: Terminal node
Class Number ClassPrior
0 0 0.00000
1 0 0.00000
2 7 1.00000
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 0
Predicted class is 2
----------------------------
Node 49: Intermediate node
A case goes into Node 98 if cxmain <= 1.5000000E+00 or NA
cxmain mean = 1.2000E+00
Class Number ClassPrior
0 0 0.00000
1 32 0.62745
2 19 0.37255
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 19
Predicted class is 1
----------------------------
Node 98: Intermediate node
A case goes into Node 196 if om1 <= 1.5000000E+00 or NA
om1 mean = 1.1250E+00
Class Number ClassPrior
0 0 0.00000
1 32 0.78049
2 9 0.21951
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 9
Predicted class is 1
----------------------------
Node 196: Terminal node
Class Number ClassPrior
0 0 0.00000
1 32 0.88889
2 4 0.11111
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 4
Predicted class is 1
----------------------------
Node 197: Terminal node
Class Number ClassPrior
0 0 0.00000
1 0 0.00000
2 5 1.00000
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 0
Predicted class is 2
----------------------------
0 0 0.00000
1 0 0.00000
2 2 0.33333
3 4 0.66667
4 0 0.00000
Number of training cases misclassified = 2
Predicted class is 3
----------------------------
Node 51: Terminal node
Class Number ClassPrior
0 0 0.00000
1 0 0.00000
2 0 0.00000
3 11 1.00000
4 0 0.00000
Number of training cases misclassified = 0
Predicted class is 3
----------------------------
Node 13: Intermediate node
A case goes into Node 26 if ladprox <= 1.5000000E+00 or NA
ladprox mean = 1.4882E+00
Class Number ClassPrior
0 0 0.00000
1 27 0.21260
2 42 0.33071
3 58 0.45669
4 0 0.00000
Number of training cases misclassified = 69
Predicted class is 3
----------------------------
Node 26: Intermediate node
A case goes into Node 52 if laddist <= 1.5000000E+00 or NA
laddist mean = 1.2769E+00
Class Number ClassPrior
0 0 0.00000
1 27 0.41538
2 23 0.35385
3 15 0.23077
4 0 0.00000
Number of training cases misclassified = 38
Predicted class is 1
----------------------------
Node 52: Intermediate node
A case goes into Node 104 if cxmain <= 1.5000000E+00 or NA
cxmain mean = 1.3830E+00
Class Number ClassPrior
0 0 0.00000
1 27 0.57447
2 20 0.42553
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 20
Predicted class is 1
----------------------------
Node 104: Terminal node
Class Number ClassPrior
0 0 0.00000
1 27 0.93103
2 2 0.06897
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 2
Predicted class is 1
----------------------------
Node 105: Terminal node
Class Number ClassPrior
0 0 0.00000
1 0 0.00000
2 18 1.00000
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 0
Predicted class is 2
----------------------------
Node 53: Terminal node
Class Number ClassPrior
0 0 0.00000
1 0 0.00000
2 3 0.16667
3 15 0.83333
4 0 0.00000
Number of training cases misclassified = 3
Predicted class is 3
----------------------------
Node 27: Intermediate node
A case goes into Node 54 if cxmain <= 1.5000000E+00 or NA
cxmain mean = 1.4355E+00
Class Number ClassPrior
0 0 0.00000
1 0 0.00000
2 19 0.30645
3 43 0.69355
4 0 0.00000
Number of training cases misclassified = 19
Predicted class is 3
----------------------------
Node 54: Intermediate node
A case goes into Node 108 if ramus <= 1.5000000E+00 or NA
ramus mean = 1.2571E+00
Class Number ClassPrior
0 0 0.00000
1 0 0.00000
2 19 0.54286
3 16 0.45714
4 0 0.00000
Number of training cases misclassified = 16
Predicted class is 2
----------------------------
Node 108: Intermediate node
A case goes into Node 216 if om1 <= 1.5000000E+00 or NA
om1 mean = 1.2692E+00
Class Number ClassPrior
0 0 0.00000
1 0 0.00000
2 19 0.73077
3 7 0.26923
4 0 0.00000
Number of training cases misclassified = 7
Predicted class is 2
----------------------------
Node 216: Terminal node
Class Number ClassPrior
0 0 0.00000
1 0 0.00000
2 19 1.00000
3 0 0.00000
4 0 0.00000
Number of training cases misclassified = 0
Predicted class is 2
----------------------------
Node 217: Terminal node
Class Number ClassPrior
0 0 0.00000
1 0 0.00000
2 0 0.00000
3 7 1.00000
4 0 0.00000
Number of training cases misclassified = 0
Predicted class is 3
----------------------------
Node 109: Terminal node
Class Number ClassPrior
0 0 0.00000
1 0 0.00000
2 0 0.00000
3 9 1.00000
4 0 0.00000
Number of training cases misclassified = 0
Predicted class is 3
----------------------------
Node 55: Terminal node
Class Number ClassPrior
0 0 0.00000
1 0 0.00000
2 0 0.00000
3 27 1.00000
4 0 0.00000
Number of training cases misclassified = 0
Predicted class is 3
----------------------------
Node 7: Terminal node
Class Number ClassPrior
0 1 0.03226
1 0 0.00000
2 0 0.00000
3 0 0.00000
4 30 0.96774
Number of training cases misclassified = 1
Predicted class is 4
----------------------------
5 Regression
GUIDE can fit least-squares (LS), quantile, Poisson, proportional hazards, and least-
median-of-squares (LMS) regression tree models. We use the birthweight data in files
birthwt.dat and birthwt.dsc to demonstrate LS models. The data consist of ob-
servations from 50,000 live births. They are a subset of a larger dataset analyzed
in Koenker and Hallock (2001); see also Koenker (2005). The variables are weight
(infant birth weight), black (indicator of black mother), married (indicator of mar-
ried mother), boy (indicator of boy), visit (prenatal visit: 0 = no visits, 1 = visit
in 2nd trimester, 2 = visit in last trimester, 3 = visit in 1st trimester), ed (mother's education level: 0 = high school, 1 = some college, 2 = college, 3 = less than high school), smoke (indicator of smoking mother), cigsper (number of cigarettes smoked per day), age (mother's age), and wtgain (mother's weight gain during pregnancy).
The contents of birthwt.dsc are:
birthwt.dat
NA
1
1 weight d
2 black c
3 married c
4 boy c
5 age n
6 smoke c
7 cigsper n
8 wtgain n
9 visit c
10 ed c
11 lowbwt x
The last variable lowbwt is a derived indicator of low birthweight not used here.
5.1.2 Results
The contents of cons.out follow.
Least squares regression tree
Pruning by cross-validation
Data description file: birthwt.dsc
Training sample file: birthwt.dat
Missing value code: NA
Records in data file start on line 1
Warning: N variables changed to S
Dependent variable is weight
Piecewise constant model
Number of records in data file: 50000
Length of longest data entry: 4
50000 0 0 1 0 0 3 0 6
No weight variable in data file
No. cases used for training: 50000
Regression tree:
At each categorical variable split, values not in training data go right
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap solution, see
Loh et al. (2016), "Identification of subgroups with differential treatment effects
for longitudinal and multiresponse variables", Statistics in Medicine, v.35, 4837-4855.
Figure 10 shows the tree diagram. It is rather large and may not be so easy
to interpret. This is because the complexity of a piecewise-constant model rests
completely in the tree structure. GUIDE has 4 options that will reduce the tree size
by moving some of the complexity to the nodes of the tree.
[Figure 10 here: piecewise constant regression tree with splits on black and married; terminal node means of weight range from 3093 to 3669.]
Figure 10: GUIDE v.26.0 0.50-SE piecewise constant least-squares regression tree for predicting weight. At each split, an observation goes to the left branch if and only if the condition is satisfied. The symbol ≤* stands for ≤ or missing. For splits on categorical variables, values not present in the training sample go to the right. Sample size (in italics) and mean of weight printed below nodes. Second best split variable at root node is married.
(9) white
Input your choice ([1:9], <cr>=1):
You can store the variables and/or values used to split and fit in a file
Choose 1 to skip this step, 2 to store split and fit variables,
3 to store split variables and their values
Input your choice ([1:3], <cr>=1):3
Choose 3 to save split variable info to a file.
Input file name: lin.var
Input 2 to save regressor names in a file, 1 otherwise ([1:2], <cr>=1): 2
Option 2 saves names of regressors and their coefs to a file.
Input file name: lin.reg
Input 2 to save individual fitted values and node IDs, 1 otherwise ([1:2], <cr>=2):
Input name of file to store node ID and fitted value of each case: lin.it
Input 2 to save terminal node IDs for importance scoring; 1 otherwise ([1:2], <cr>=1):
Input 2 to write R function for predicting new cases, 1 otherwise ([1:2], <cr>=1):
Input file is created!
Run GUIDE with the command: guide < lin.in
5.2.2 Results
Warning: The p-values produced by GUIDE are not adjusted for split
selection. Therefore they are typically biased low. One way to adjust
the p-values to control for split selection is with the bootstrap method in
Loh et al. (2016). Our experience indicates, however, that any unadjusted
p-value less than 0.01 is likely to be significant at level 0.05 after the
bootstrap adjustment.
Least squares regression tree
Predictions truncated at global min and max of D sample values
The default option sets this truncation.
Pruning by cross-validation
Data description file: birthwt.dsc
Training sample file: birthwt.dat
Missing value code: NA
Records in data file start on line 1
Dependent variable is weight
Piecewise simple linear or constant model
Powers are dropped if they are not significant at level .0500
The default option sets non-significant regression coefs to 0.
Number of records in data file: 50000
Length of longest data entry: 4
Regression tree:
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
[Tree diagram: splits on married, cigsper $\le$ 0.5, and boy; each terminal node fits a simple linear model in wtgain.]
Figure 11: GUIDE v.26.0 0.50-SE piecewise simple linear least-squares regression
tree for predicting weight. At each split, an observation goes to the left branch if
and only if the condition is satisfied. The symbol $\le_*$ stands for "$\le$ or missing".
Sample size (in italics), mean of weight, and sign and name of regressor variables
printed below nodes. Nodes with negative and positive slopes are colored red and
green, respectively. Second best split variable at root node is black.
----------------------------
Node 7: Terminal node
Coefficients of least squares regression functions:
Regressor Coefficient t-stat p-val Minimum Mean Maximum
Constant 2.9060E+03 129.19 0.0000
wtgain 9.8392E+00 14.36 0.0000 0.0000E+00 2.9688E+01 9.8000E+01
Mean of weight = 3198.11469966797
Predicted values truncated at 240.000000000000 & 6350.00000000000
----------------------------
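As a check, the node 7 fitted value can be reproduced from the coefficients printed
above. The R sketch below hard-codes them together with the reported truncation
limits; predict_node7 is our own illustrative name.
## Sketch: node-7 fitted value from the printed coefficients, with the
## reported truncation at the global min and max of weight (240 and 6350).
predict_node7 <- function(wtgain) {
    yhat <- 2906.0 + 9.8392 * wtgain  # Constant and wtgain coefficients
    pmin(pmax(yhat, 240), 6350)       # truncate to [240, 6350]
}
predict_node7(29.688)  # at the node mean of wtgain; returns about 3198.1
Evaluating at the node mean of wtgain recovers the node mean of weight, as it must
for a least-squares fit.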
The tree model is shown in Figure 11. Besides being much smaller than the
piecewise-constant model, it shows that wtgain (mother's weight gain) is the best
linear predictor of weight in each terminal node.
4 boy 2 0
6 smoke 2 0
9 visit 4 0
10 ed 4 0
Re-checking data ...
Assigning codes to categorical and missing values
Finished processing 5000 of 50000 observations
Finished processing 10000 of 50000 observations
Finished processing 15000 of 50000 observations
Finished processing 20000 of 50000 observations
Finished processing 25000 of 50000 observations
Finished processing 30000 of 50000 observations
Finished processing 35000 of 50000 observations
Finished processing 40000 of 50000 observations
Finished processing 45000 of 50000 observations
Finished processing 50000 of 50000 observations
Data checks complete
Rereading data
Total #cases w/ #missing
#cases miss. D ord. vals #X-var #N-var #F-var #S-var #B-var #C-var
50000 0 0 1 3 0 0 0 6
No weight variable in data file
No. cases used for training: 50000
Finished reading data file
Choose how you wish to deal with missing values in training or test data:
Option 1: Fit separate models to complete and incomplete cases
Option 2: Impute missing F and N values at each node with means for regression
Option 3: Fit a piecewise constant model
Input selection: ([1:3], <cr>=2):
These options deal with missing values; they are irrelevant here because the data have none.
Default number of cross-validations: 10
Input 1 to accept the default, 2 to change it ([1:2], <cr>=1):
Best tree may be chosen based on mean or median CV estimate
Input 1 for mean-based, 2 for median-based ([1:2], <cr>=1):
Input number of SEs for pruning ([0.00:1000.00], <cr>=0.50):
Choose a split point selection method for numerical variables:
Choose 1 to use faster method based on sample quantiles
Choose 2 to use exhaustive search
Input 1 or 2 ([1:2], <cr>=2):
Default max. number of split levels: 30
Input 1 to accept this value, 2 to change it ([1:2], <cr>=1):
Default minimum node sample size is 2499
Input 1 to use the default value, 2 to change it ([1:2], <cr>=1):
Input 1 for LaTeX tree code, 2 to skip it ([1:2], <cr>=1):
Input file name to store LaTeX code (use .tex as suffix): mul.tex
Input 1 to include node numbers, 2 to omit them ([1:2], <cr>=1):
5.3.2 Results
Least squares regression tree
Predictions truncated at global min and max of D sample values
Truncation of predicted values can be changed by selecting the non-default option.
Pruning by cross-validation
Data description file: birthwt.dsc
Training sample file: birthwt.dat
Missing value code: NA
Records in data file start on line 1
Dependent variable is weight
Piecewise linear model
Number of records in data file: 50000
Length of longest data entry: 4
Regression tree:
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
[Tree diagram: root split on black, second split on boy; node sample sizes 8142, 20229, and 21629 with mean weights 3163, 3352, and 3467.]
Figure 12: GUIDE v.26.0 0.50-SE least-squares multiple linear regression tree for
predicting weight. At each split, an observation goes to the left branch if and only if
the condition is satisfied. Sample sizes (in italics) and means of weight are printed
below nodes. Second best split variable (based on curvature test) at root node is
boy.
50000 0 0 1 3 0 0 0 6
No weight variable in data file
No. cases used for training: 50000
Finished reading data file
Choose how you wish to deal with missing values in training or test data:
Option 1: Fit separate models to complete and incomplete cases
Option 2: Impute missing F and N values at each node with means for regression
Option 3: Fit a piecewise constant model
Input selection: ([1:3], <cr>=2):
Default number of cross-validations: 10
Input 1 to accept the default, 2 to change it ([1:2], <cr>=1):
Best tree may be chosen based on mean or median CV estimate
Input 1 for mean-based, 2 for median-based ([1:2], <cr>=1):
Input number of SEs for pruning ([0.00:1000.00], <cr>=0.50):
Choose fraction of cases for splitting
Larger values give more splits: 0 = median split and 1 = all possible splits
Default fraction is 1.0000
Choose 1 to accept default split fraction, 2 to change it
Input 1 or 2 ([1:2], <cr>=1):
Default max. number of split levels: 30
Input 1 to accept this value, 2 to change it ([1:2], <cr>=1):
Default minimum node sample size is 2499
Input 1 to use the default value, 2 to change it ([1:2], <cr>=1):
Input 1 for LaTeX tree code, 2 to skip it ([1:2], <cr>=1):
Input file name to store LaTeX code (use .tex as suffix): step.tex
Input 1 to include node numbers, 2 to omit them ([1:2], <cr>=1):
Input 1 to number all nodes, 2 to number leaves only ([1:2], <cr>=1):
Choose a color for the terminal nodes:
(1) white
(2) lightgray
(3) gray
(4) darkgray
(5) black
(6) yellow
(7) red
(8) blue
(9) green
(10) magenta
(11) cyan
Input your choice ([1:11], <cr>=6):
You can store the variables and/or values used to split and fit in a file
Choose 1 to skip this step, 2 to store split and fit variables,
3 to store split variables and their values
Input your choice ([1:3], <cr>=1):3
Input file name: step.var
Input 2 to save regressor names in a file, 1 otherwise ([1:2], <cr>=1):2
The result is the same as that for the multiple linear regression option in this
example.
4 boy b
5 age n
6 smoke b
7 cigsper n
8 wtgain n
9 visit b
10 ed b
11 lowbwt x
5.5.2 Results
Least squares regression tree
Predictions truncated at global min. and max. of D sample values
Pruning by cross-validation
Data description file: birthwtancova.dsc
Training sample file: birthwt.dat
Missing value code: NA
Records in data file start on line 1
Dependent variable is weight
Piecewise simple linear ANCOVA model
F-to-enter and F-to-delete: 4.000 3.990
Number of records in data file: 50000
Length of longest data entry: 4
Number of dummy variables created: 10
Regression tree:
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
The results show that the tree is trivial, with no splits after pruning. The model is
linear in wtgain and all indicator variables except those for visit.
5.6.3 Results
Quantile regression tree with quantile probability .0800
Pruning by cross-validation
Data description file: birthwt.dsc
Training sample file: birthwt.dat
Missing value code: NA
Records in data file start on line 1
Warning: N variables changed to S
Dependent variable is weight
Piecewise constant model
Number of records in data file: 50000
Length of longest data entry: 4
Regression tree:
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
[Tree diagram: root split on wtgain $\le$ 25.50, followed by splits on black in both branches.]
Figure 13: GUIDE v.26.0 0.50-SE piecewise constant 0.08-quantile regression tree
for predicting weight. At each split, an observation goes to the left branch if and
only if the condition is satisfied. The symbol $\le_*$ stands for "$\le$ or missing".
Sample size (in italics) and 0.08-quantiles of weight printed below nodes. Second
best split variable at root node is cigsper.
5.6.6 Results
Dual-quantile regression tree with .0800 and .1200 quantiles
Pruning by cross-validation
Data description file: birthwt.dsc
Training sample file: birthwt.dat
Missing value code: NA
Records in data file start on line 1
Warning: N variables changed to S
Dependent variable is weight
Piecewise constant model
Number of records in data file: 50000
Length of longest data entry: 4
#cases miss. D ord. vals #X-var #N-var #F-var #S-var #B-var #C-var
50000 0 0 1 0 0 3 0 6
No. cases used for training: 50000
Regression tree:
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
Figure 14 shows the tree. The sample size and 0.08 and 0.12 quantiles are printed
below each terminal node.
[Tree diagram: root split on wtgain $\le$ 25.50, followed by splits on black; e.g., one terminal node has 8942 cases with 0.12- and 0.08-quantiles 3060 and 2920, another has 2269 cases with quantiles 2920 and 2785.]
Figure 14: GUIDE v.26.0 0.50-SE piecewise constant 0.08- and 0.12-quantile regression
tree for predicting weight. At each split, an observation goes to the left branch
if and only if the condition is satisfied. The symbol $\le_*$ stands for "$\le$ or missing".
Sample size (in italics) and sample 0.12- and 0.08-quantiles of weight printed below
nodes. Second best split variable at root node is cigsper.
5.6.9 Results
Quantile regression tree with quantile probability .0800
No truncation of predicted values
Pruning by cross-validation
Data description file: birthwt.dsc
Training sample file: birthwt.dat
Missing value code: NA
Regression tree:
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
[Tree diagram: splits on black, cigsper $\le$ 1.50, and married; terminal nodes fit simple linear models in wtgain, with 0.08-quantiles ranging from 2381 to 2778.]
Figure 15: GUIDE v.26.0 0.50-SE piecewise simple linear 0.08-quantile regression
tree for predicting weight. At each split, an observation goes to the left branch
if and only if the condition is satisfied. The symbol $\le_*$ stands for "$\le$ or missing".
Sample size (in italics), 0.08-quantile of weight, and sign and name of best regressor
printed below nodes. Nodes with negative and positive slopes are colored red and
green, respectively. Second best split variable at root node is married.
----------------------------
Node 7: Terminal node
Coefficients of quantile regression function:
Regressor Coefficient Minimum Mean Maximum
Constant 2.0480E+03
wtgain 1.3612E+01 0.0000E+00 3.0389E+01 9.8000E+01
----------------------------
The tree is shown in Figure 15. Piecewise linear quantile regression with two
quantiles simultaneously is not available at the present time.
4 boy 2 0
6 smoke 2 0
9 visit 4 0
10 ed 4 0
Re-checking data ...
Assigning codes to categorical and missing values
Finished processing 5000 of 50000 observations
Finished processing 10000 of 50000 observations
Finished processing 15000 of 50000 observations
Finished processing 20000 of 50000 observations
Finished processing 25000 of 50000 observations
Finished processing 30000 of 50000 observations
Finished processing 35000 of 50000 observations
Finished processing 40000 of 50000 observations
Finished processing 45000 of 50000 observations
Finished processing 50000 of 50000 observations
Data checks complete
Rereading data
Total #cases w/ #missing
#cases miss. D ord. vals #X-var #N-var #F-var #S-var #B-var #C-var
50000 0 0 1 3 0 0 0 6
No. cases used for training: 50000
Finished reading data file
Input 1 for LaTeX tree code, 2 to skip it ([1:2], <cr>=1):
Input file name to store LaTeX code (use .tex as suffix): q08mul.tex
Input 2 to save individual fitted values and node IDs, 1 otherwise ([1:2], <cr>=2):
Input name of file to store node ID and fitted value of each case: q08mul.fit
Input file is created!
Run GUIDE with the command: guide < q08mul.in
5.6.12 Results
Quantile regression tree with quantile probability .0800
No truncation of predicted values
Pruning by cross-validation
Data description file: birthwt.dsc
Training sample file: birthwt.dat
Missing value code: NA
Records in data file start on line 1
Dependent variable is weight
Piecewise linear model
Number of records in data file: 50000
Length of longest data entry: 4
Regression tree:
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
[Tree diagram: root split on black, second split on married; node sample sizes 8142, 9053, and 32805 with 0.08-quantiles 2381, 2580, and 2750.]
Figure 16: GUIDE v.26.0 0.50-SE multiple linear .080-quantile regression tree for
predicting weight. At each split, an observation goes to the left branch if and only
if the condition is satisfied. Sample size (in italics) and 0.08-quantiles of weight
printed below nodes. Second best split variable at root node is married.
median of squares regression (Rousseeuw and Leroy, 1987). GUIDE can construct
tree models using this criterion. We use the birthwt data for illustration. A session
log of the input file generation is below, followed by the results and the LATEX tree
diagram in Figure 17.
5.7.1 Results
Least median of squares regression tree
Predictions truncated at global min. and max. of D sample values
Pruning by cross-validation
Data description file: birthwt.dsc
Training sample file: birthwt.dat
Missing value code: NA
Records in data file start on line 1
Dependent variable is weight
Piecewise simple linear or constant model
Powers are dropped if they are not significant at level 1.0000
Number of records in data file: 50000
Length of longest data entry: 4
Regression tree:
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
[Figure 17: GUIDE 0.50-SE least median of squares regression tree; splits on cigsper, black, boy, wtgain, and age, with simple linear fits in wtgain or age in several terminal nodes.]
2 sex b
3 agegp c
4 deaths d
5 pop x
6 logpop z
Our goal is to construct a Poisson regression tree for the gender-specific rate of lung
cancer deaths, where the rate is the expected number of deaths in a county divided by
its population size for each gender. That is, letting $\lambda$ denote the expected number
of gender-specific deaths in a county, we fit this model in each node of the tree:
\[ \log(\lambda/\mathrm{pop}) = \beta_0 + \beta_1 I(\mathrm{sex} = \mathrm{M}) \]
or, equivalently,
\[ \log\lambda = \beta_0 + \beta_1 I(\mathrm{sex} = \mathrm{M}) + \mathrm{logpop}. \]
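For readers who think in R, the model fitted within a single node corresponds to a
Poisson GLM with log(pop) as an offset. In the sketch below, node_data is a
hypothetical data frame holding one node's cases.
## Sketch: within one node, the model above is equivalent to this Poisson GLM.
## node_data (hypothetical) has columns deaths, sex, and pop for the node.
fit <- glm(deaths ~ I(sex == "M"), offset = log(pop),
           family = poisson, data = node_data)
coef(fit)  # beta0 and beta1 in log(lambda/pop) = beta0 + beta1 * I(sex = M)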
5.8.2 Results
Poisson regression tree
No truncation of predicted values
Pruning by cross-validation
Data description file: lungcancer.dsc
Training sample file: lungcancer.txt
Missing value code: NA
Regression tree:
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
The results show that the death rate increases with age and that the rate for
males is consistently higher than that for females. The tree diagram is given in
Figure 18.
[Tree diagram: splits on agegp (= 45-54, then = 55-64); node death rates increase with age from 5.493E-03 to 0.02.]
Figure 18: GUIDE v.26.0 0.50-SE multiple linear Poisson regression tree for predict-
ing rate of deaths. At each split, an observation goes to the left branch if and only
if the condition is satisfied. Sample size (in italics) and sample rate printed below
nodes. Second best split variable at root node is sex.
where $S_i$ is the set of $x$ values corresponding to node $i$ and $\beta_i$ is its associated
coefficient vector. See Loh et al. (2015) for more details.
We illustrate the piecewise-constant model
\[ \lambda(x, t) = \lambda_0(t) \sum_i I(x \in S_i) \exp(\beta_{i0}) \]
with a data set from the Worcester Heart Attack Study analyzed in Hosmer et al.
(2008). The data are in the file whas500.csv and the description file in whas500.dsc,
whose contents are repeated below.
whas500.csv
NA
1
1 id x
2 age n
3 gender c
4 hr n
5 sysbp n
6 diasbp n
7 bmi n
8 cvd c
9 afb c
10 sho c
11 chf c
12 av3 c
13 miord c
14 mitype c
15 year c
16 admitdate x
17 disdate x
18 fdate x
19 los n
20 dstat x
21 lenfol t
22 fstat d
The goal of the study is to examine survival following hospital admission for
acute myocardial infarction. The response variable is lenfol, the total length of
follow-up in days. Variable fstat is the status at last follow-up (0 = alive,
1 = dead) and variable chf indicates congestive heart complications (0 = no, 1 = yes).
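A brief R sketch for inspecting the follow-up data before running GUIDE; it assumes
the survival package, which GUIDE itself does not require.
## Sketch: examine follow-up time and event status in the WHAS500 data.
library(survival)
whas <- read.csv("whas500.csv", na.strings = "NA")
km <- survfit(Surv(lenfol, fstat) ~ chf, data = whas)  # Kaplan-Meier by chf
print(km)  # numbers of events and median survival by chf group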
0. Read the warranty disclaimer
1. Create an input file for model fitting or importance scoring (recommended)
2. Convert data to other formats without creating input file
Input your choice: 1
Name of batch input file: cons.in
Input 1 for model fitting, 2 for importance or DIF scoring,
3 for data conversion ([1:3], <cr>=1):
Name of batch output file: cons.out
Input 1 for single tree, 2 for ensemble ([1:2], <cr>=1):
Input 1 for classification, 2 for regression, 3 for propensity score grouping
(propensity score grouping is an experimental option)
Input your choice ([1:3], <cr>=1): 2
Choose type of regression model:
1=linear, 2=quantile, 3=Poisson, 4=proportional hazards,
5=multiresponse or itemresponse, 6=longitudinal data (with T variables).
Input choice ([1:6], <cr>=1): 4
Choose complexity of model to use at each node:
If R variable present (i.e., subgroup identification),
choose 1 for multiple regression (including R var.) in each node,
Input 2 to save individual fitted values and node IDs, 1 otherwise ([1:2], <cr>=2):
Input name of file to store node ID and fitted value of each case: cons.fit
Input file is created!
Run GUIDE with the command: guide < cons.in
5.9.1 Results
Proportional hazards regression with relative risk estimates
Pruning by cross-validation
Data description file: whas500.dsc
Training sample file: whas500.csv
Missing value code: NA
Records in data file start on line 1
Warning: N variables changed to S
Dependent variable is fstat
Piecewise constant model
Number of records in data file: 500
Length of longest data entry: 10
Smallest uncensored T: 1.0000
No. complete cases excluding censored T < smallest uncensored T: 500
No. cases used to compute baseline hazard: 500
No. cases with D=1 and T >= smallest uncensored: 215
Regression tree:
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
[Tree diagram: root split on age $\le$ 71.50, with splits on chf in both branches and a further split on age $\le$ 85.50; mean relative risks range from 0.21 to 3.32.]
Figure 19: GUIDE v.26.0 0.50-SE piecewise constant relative risk regression tree for
predicting fstat. At each split, an observation goes to the left branch if and only
if the condition is satisfied. The symbol $\le_*$ stands for "$\le$ or missing". Sample size
(in italics) and mean relative risks (relative to sample average ignoring covariates)
printed below nodes. Second best split variable at root node is bmi.
The tree model, given in Figure 19, shows that the risk of death is lowest (0.21
relative to the sample average for the whole data set) for those younger than 72 with
no congestive heart complications. The groups with the highest risks (3.03 to 3.32
relative to average) are those older than 71 with congestive heart complications and
those older than 85 without congestive heart complications.
The top few lines of the file whas500.fit and its column definitions are:
train node survivaltime logbasecumhaz relativerisk survivalprob mediansurvtime
y 14 2.178000E+03 -7.667985E-02 1.063335E+00 3.865048E-01 1.553833E+03
y 5 2.172000E+03 -7.667985E-02 2.123517E-01 8.270912E-01 2.354277E+03
y 5 2.190000E+03 -7.667985E-02 2.123517E-01 8.270912E-01 2.354277E+03
y 4 2.970000E+02 -1.320296E+00 1.109557E+00 7.512523E-01 1.534963E+03
logbasecumhaz: log of the estimated baseline cumulative hazard function,
$\log\hat\Lambda_0(t) = \log\int_0^t \hat\lambda_0(u)\,du$, at the observed time $t$.
relativerisk: $\exp(\hat\beta'x - \bar\gamma)$, the risk of death relative to the average for the
sample, where $x$ is the covariate vector of the observation, $\hat\beta$ is the estimated
regression coefficient vector in the node, and $\bar\gamma$ is the coefficient of the constant
model $\lambda_0(t)\exp(\bar\gamma)$ fitted to all the training cases in the root node. Because
a constant is fitted to each node here, $\bar\gamma = -0.035381$ is the value of $\hat\beta$
at the root node. For example, the first subject, which is in node 14, has
$\hat\beta = 0.026029$ and so relativerisk $= \exp(\hat\beta - \bar\gamma) = \exp(0.026029 + 0.035381) = 1.063335$.
survivalprob: probability that the subject survives up to the observed time $t$,
namely $\exp\{-\hat\Lambda_0(t)\exp(\hat\beta'x)\}$. For the first subject, this is
$\exp\{-e^{-0.076680}e^{0.026029}\} \approx 0.3865$, agreeing with the survivalprob column.
mediansurvtime: estimated median survival time, i.e., the value of $t$ such that
$\exp\{-\hat\Lambda_0(t)\exp(\hat\beta'x)\} = 0.5$ or, equivalently, $\hat\Lambda_0(t)\exp(\hat\beta'x) = \log 2$,
i.e., logbasecumhaz$(t) = \log\log(2) - \hat\beta'x$, using linear interpolation of
$\hat\Lambda_0(t)$. Median survival times greater than the largest observed time have a
trailing plus (+) sign.
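The arithmetic above can be verified directly. The R sketch below reconstructs
survivalprob for the first record of whas500.fit from its other columns.
## Sketch: check survivalprob for the first whas500.fit record.
logbasecumhaz <- -7.667985e-02  # column 4 of the record
relativerisk  <- 1.063335       # column 5
gammabar      <- -0.035381      # root-node constant reported above
## exp(beta'x) = relativerisk * exp(gammabar); multiply by Lambda0(t).
cumhaz <- exp(logbasecumhaz) * relativerisk * exp(gammabar)
exp(-cumhaz)  # 0.3865, matching the survivalprob column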
3 age n
4 race c
5 marstat c
6 educ n
7 income n
8 poverty c
9 famsize n
10 condlist c
11 health n
12 latotal n
13 wkclass c
14 indus c
15 occup c
16 raday d
17 visit d
18 nacute x
19 hda12 d
20 lnvisit x
5.10.2 Results
Multi-response or longitudinal data without T variables
Pruning by cross-validation
Data description file: phs.dsc
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
The tree is shown in Figure 20. The file mult.fit saves the mean values of the
dependent variables in each terminal node:
node raday visit hda12
2 0.28630E+01 0.12474E+02 0.30076E+01
3 0.36337E+00 0.31909E+01 0.34108E+00
[Tree diagram: root split on latotal $\le$ 2.50; sample sizes and predicted values as in the mult.fit listing above.]
Figure 20: GUIDE v.26.0 0.50-SE regression tree for predicting response variables
raday, visit, and hda12. PCA not used. At each split, an observation goes to
the left branch if and only if the condition is satisfied. Sample sizes (in italics) and
predicted values of raday, visit, and hda12 printed below nodes. Second best split
variable at root node is health.
The file mult.nid gives the terminal node number for each observation, including
those that are not used to construct the tree (indicated by the letter n in the train
column of the file).
7 exper5 t
8 exper6 t
9 exper7 t
10 exper8 t
11 exper9 t
12 exper10 t
13 exper11 t
14 exper12 t
15 exper13 t
16 postexp1 x
17 postexp2 x
18 postexp3 x
19 postexp4 x
20 postexp5 x
21 postexp6 x
22 postexp7 x
23 postexp8 x
24 postexp9 x
25 postexp10 x
26 postexp11 x
27 postexp12 x
28 postexp13 x
29 wage1 d
30 wage2 d
31 wage3 d
32 wage4 d
33 wage5 d
34 wage6 d
35 wage7 d
36 wage8 d
37 wage9 d
38 wage10 d
39 wage11 d
40 wage12 d
41 wage13 d
42 ged1 x
43 ged2 x
44 ged3 x
45 ged4 x
46 ged5 x
47 ged6 x
48 ged7 x
49 ged8 x
50 ged9 x
51 ged10 x
52 ged11 x
53 ged12 x
54 ged13 x
55 uerate1 x
56 uerate2 x
57 uerate3 x
58 uerate4 x
59 uerate5 x
60 uerate6 x
61 uerate7 x
62 uerate8 x
63 uerate9 x
64 uerate10 x
65 uerate11 x
66 uerate12 x
67 uerate13 x
68 race c
5.11.2 Results
Lowess smoothing
Longitudinal data with T variables
Pruning by cross-validation
Data description file: wagedsc.txt
Training sample file: wagedat.txt
Missing value code: NA
Records in data file start on line 1
Warning: N variables changed to S
Number of D variables = 13
D variables are:
wage1
wage2
wage3
wage4
wage5
wage6
wage7
wage8
wage9
wage10
wage11
wage12
wage13
T variables are:
exper1
exper2
exper3
exper4
exper5
exper6
exper7
exper8
exper9
exper10
exper11
exper12
exper13
Number of records in data file: 888
Length of longest data entry: 16
Model fitted to subset of observations with complete D values
68 race c 3
***************************************************************
Figure 21 shows the tree and Figure 22 plots lowess-smoothed curves of mean
wage in the terminal nodes. The plotting values are obtained from the result file
wage.fit, whose contents are given below. The first column gives the node number
and the next two columns the start and end of the times at which fitted values are
computed.
[Tree diagram: root split on hgc $\le$ 9.50 (577 cases go left), second split on race = black into nodes of 95 and 216 cases.]
Figure 21: GUIDE v.26.0 0.00-SE regression tree for predicting longitudinal variables
wage1, wage2, etc. At each split, an observation goes to the left branch if and only
if the condition is satisfied. The symbol $\le_*$ stands for "$\le$ or missing". For splits
on categorical variables, values not present in the training sample go to the right.
Sample sizes (in italics) printed below nodes.
The other columns give the fitted values at equally spaced times between the start
and end times.
node t.start t.end fitted1 fitted2 fitted3 fitted4 fitted5 fitted6 fitted7 fitted8 fitted9 fitted10
2 0.10000E-02 0.12700E+02 0.48875E+01 0.51221E+01 0.53241E+01 0.54668E+01 0.55738E+01 0
6 0.80000E-02 0.12558E+02 0.61270E+01 0.58648E+01 0.57522E+01 0.57674E+01 0.57653E+01 0
7 0.20000E-02 0.12045E+02 0.56786E+01 0.58892E+01 0.60859E+01 0.62420E+01 0.63533E+01 0
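A sketch of how Figure 22 can be reproduced in R from wage.fit; it assumes the
whitespace-delimited layout with the header row shown above.
## Sketch: each row gives a node, a time interval, and fitted values at
## equally spaced times in that interval; one curve is drawn per node.
z <- read.table("wage.fit", header = TRUE)
nfit  <- ncol(z) - 3  # number of fitted-value columns
times <- apply(z, 1, function(r)
    seq(r[["t.start"]], r[["t.end"]], length.out = nfit))
matplot(times, t(z[, -(1:3)]), type = "l", lty = 1,
        xlab = "exper (years)", ylab = "mean wage")
legend("bottomright", legend = paste("node", z$node), lty = 1, col = 1:nrow(z))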
The file wage.var below gives the type (t if node is terminal) and name of the
variable used to split each node and the split point (for n or s variables) or split
values (if c variable). The word NONE indicates a terminal node that cannot be split
by any variable. For a non-terminal node, the integer in the 5th column indicates
the number of split values to follow on the line.
1 s hgc hgc 1 0.9500000000E+01
2 t race race 0.0000000000E+00
3 c race race 1 "black"
6 t NONE NONE 0.0000000000E+00
7 t hgc hgc 0.0000000000E+00
[Plot: lowess-smoothed mean wage (y-axis, about 5 to 11) against exper (years, 0 to 12); one curve per terminal node.]
Figure 22: Lowess-smoothed mean wage curves in the terminal nodes of Figure 21.
cancer.txt
NA
1
1 horTh r
2 age n
3 menostat c
4 tsize n
5 tgrade c
6 pnodes n
7 progrec n
8 estrec n
9 time t
10 event d
686 0 0 0 0 0 5 0 2
Survival time variable in column: 9
Event indicator variable in column: 10
Proportion uncensored among nonmissing T and D variables: 0.445
No. cases used for training: 672
Regression tree:
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
[Tree diagram: root split on progrec $\le$ 21.50; node 2 (274 cases) has relative risks 1.660 (horTh = no) and 1.476 (horTh = yes), node 3 (398 cases) has 0.879 and 0.459.]
Figure 23: GUIDE v.26.0 0.50-SE Gi proportional hazards regression tree for differen-
tial treatment effects without linear prognostic control. At each split, an observation
goes to the left branch if and only if the condition is satisfied. Numbers beside
terminal nodes are estimated relative risks (relative to average for sample ignoring
covariates) corresponding to horTh levels no and yes, respectively; numbers (in ital-
ics) below are sample sizes. Nodes with negative and positive effects for horTh level
yes are colored red and green, respectively. Second best split variable at root node
is estrec.
Let $\lambda(u, x)$ denote the hazard function at time $u$ and predictor values $x$, and
let $\lambda_0(u)$ denote the baseline hazard function. The results show that the fitted
proportional hazards model multiplies $\lambda_0(u)$ by the node- and treatment-specific
relative risks printed beside the terminal nodes of Figure 23.
[Two-panel plot, one panel per terminal node (Node 2 and Node 3), of estimated survival probability against time (500 to 2000 days), comparing horTh = yes with horTh = no.]
Figure 24: Estimated survival probability functions for breast cancer data.
Estimated relative risks and survival probabilities The file nolin.fit gives
the terminal node number, estimated survival time, log baseline cumulative haz-
ard, relative risk (relative to the average for the data, ignoring covariates), survival
probability, and median survival time of each observation in the training sample
file cancer.txt. The results for the first few observations are shown below. See
Section 5.9 for definitions of the terms.
train node survivaltime logbasecumhaz relativerisk survivalprob mediansurvtime
y 3 1.814000E+03 -3.317667E-01 8.787636E-01 5.331186E-01 2.014420E+03
y 3 2.018000E+03 -2.024282E-01 4.587030E-01 6.882035E-01 2.659000E+03+
y 3 7.120000E+02 -1.300331E+00 4.587030E-01 8.828100E-01 2.659000E+03+
y 3 1.807000E+03 -3.550694E-01 4.587030E-01 7.255880E-01 2.659000E+03+
y 3 7.720000E+02 -1.176558E+00 8.787636E-01 7.631865E-01 2.014420E+03
y 2 4.480000E+02 -2.105688E+00 1.660293E+00 8.173929E-01 1.038277E+03
that includes the most important covariate as a linear predictor in each node. This
is accomplished by choosing the best simple polynomial option and specifying each
potential linear predictor as n in the description file (no change is needed in
cancerdsc.txt).
Regression tree:
***************************************************************
WARNING: p-values below not adjusted for split search. For a bootstrap
solution, see Loh et al. (2016), "Identification of subgroups with
differential treatment effects for longitudinal and multiresponse
variables", Statistics in Medicine, v.35, 4837-4855.
[Tree diagram: root split on progrec $\le$ 24.50; node 2 has 292 cases and relative risk 1.55, node 3 has 380 cases and relative risk 0.70, each with linear prognostic variable pnodes.]
Figure 25: GUIDE v.26.0 0.50-SE Gi proportional hazards regression tree for differ-
ential treatment effects with linear prognostic control. At each split, an observation
goes to the left branch if and only if the condition is satisfied. Sample size (in italics),
relative risks (relative to average for sample ignoring covariates), and name of linear
prognostic variable printed below nodes. Second best split variable at root node is
estrec.
Without controlling for split selection, the p-value of the effect of treatment
(horTh) is 0.0011 after adjustment with prognostic variable pnodes in terminal
node 3. The small size of the p-value suggests that the treatment effect in the node
will remain significant at the 0.05 level after controlling for split selection (one ap-
proach to adjusting for split selection is the bootstrap method proposed in Loh et al.
(2016)). The treatment effect in node 2, however, is not significant even without
controlling for split selection. The tree is shown in Figure 25.
6 Importance scoring
When there are numerous predictor variables, it may be useful to rank them in order
of their importance. GUIDE has a facility to do this. In addition, it provides a
threshold for distinguishing the important variables from the unimportant ones; see
Loh et al. (2015) and Loh (2012); the latter also shows that using GUIDE to find a
subset of variables can increase the prediction accuracy of a model.
You can also output the importance scores and variable names to a file
Input 1 to create such a file, 2 otherwise ([1:2], <cr>=1):
Input file name: imp.scr
Input file is created!
Run GUIDE with the command: guide < imp.in
The scores are also printed in the file imp.scr. Following is R code for graphing
them as in Figure 26.
z0 <- read.table("imp.scr", header=TRUE)  # columns: Rank, Score, Variable
par(mar=c(5,6,2,1), las=1)  # widen left margin; horizontal axis labels
barplot(z0$Score, names.arg=z0$Variable, col="cyan", horiz=TRUE, xlab="Importance scores")
abline(v=1, col="red", lty=2)  # threshold below which variables are unimportant
[Horizontal barplot of importance scores for the glaucoma variables, ordered from phct (most important) down to clv (least important); x-axis: importance scores, 0 to 10.]
Figure 26: Importance scores for glaucoma data; variables with bars shorter than
indicated by the red dashed line are considered unimportant.
11 chf 2 0
12 av3 2 0
13 miord 2 0
14 mitype 2 0
15 year 3 0
Re-checking data ...
Assigning codes to categorical and missing values
Data checks complete
Smallest uncensored T: 1.0000
No. complete cases excluding censored T < smallest uncensored T: 500
No. cases used to compute baseline hazard: 500
No. cases with D=1 and T >= smallest uncensored: 215
Rereading data
Total #cases w/ #missing
#cases miss. D ord. vals #X-var #N-var #F-var #S-var #B-var #C-var
500 0 0 5 0 0 6 0 9
Survival time variable in column: 21
Event indicator variable in column: 22
Proportion uncensored among nonmissing T and D variables: .430
No. cases used for training: 500
Finished reading data file
Input expected fraction of noise variables erroneously selected ([0.00:0.99], <cr>=0.01):
You can create a description file with the selected variables included or excluded
Input 2 to create such a file, 1 otherwise ([1:2], <cr>=1): 2
Input 1 to keep only selected variables, 2 to exclude selected variables ([1:2], <cr>=1):
Input file name: whas500new.dsc
You can also output the importance scores and variable names to a file
Input 1 to create such a file, 2 otherwise ([1:2], <cr>=1):
Input file name: whas500imp.scr
Input file is created!
Run GUIDE with the command: guide < whas500imp.in
Results The importance scores are given at the end of the output file whas500imp.out.
Variables with scores below 1.0 (i.e., below the cut-off line) are considered unimpor-
tant.
The scores are also contained in the file whas500imp.scr for input into another
computer program:
Rank Score Variable
1.00 1.23414E+01 age
2.00 9.38189E+00 chf
3.00 5.89639E+00 year
4.00 4.76455E+00 bmi
5.00 4.41122E+00 hr
6.00 2.92057E+00 diasbp
7.00 2.26483E+00 mitype
8.00 1.70883E+00 miord
9.00 1.59334E+00 sho
10.00 1.54693E+00 gender
11.00 1.49968E+00 afb
12.00 1.37849E+00 los
13.00 1.18834E+00 sysbp
14.00 7.82079E-01 cvd
15.00 2.67578E-01 av3
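The same cut-off can be applied programmatically. A short R sketch, assuming the
header row shown above:
## Sketch: read the scores and split variables at the 1.0 cut-off.
imp <- read.table("whas500imp.scr", header = TRUE)
imp$Variable[imp$Score >= 1]  # important variables
imp$Variable[imp$Score < 1]   # unimportant: cvd and av3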
Finally, here are the contents of the file whas500new.dsc. It puts an x against
the variables (cvd and av3 here) that are not important.
"whas500.csv"
"NA"
1
1 id x
2 age n
3 gender c
4 hr n
5 sysbp n
6 diasbp n
7 bmi n
8 cvd x
9 afb c
10 sho c
11 chf c
12 av3 x
13 miord c
14 mitype c
15 year c
16 admitdate x
17 disdate x
18 fdate x
19 los n
20 dstat x
21 lenfol t
22 fstat d
13 worth d
14 energy d
15 hope d
16 better d
17 total x
18 gender c
19 education n
20 age n
21 dxcurren x
22 sumscore x
Here is the session log to create an input file for identifying DIF items and the
important predictor variables:
0. Read the warranty disclaimer
1. Create an input file for model fitting or importance scoring (recommended)
2. Convert data to other formats without creating input file
Input your choice: 1
Name of batch input file: dif.in
Input 1 for model fitting, 2 for importance or DIF scoring,
3 for data conversion ([1:3], <cr>=1): 2
Name of batch output file: dif.out
Input 1 for classification, 2 for regression, 3 for propensity score grouping
(propensity score grouping is an experimental option)
Input your choice ([1:3], <cr>=1): 2
Choose type of regression model:
1=linear, 2=quantile, 3=Poisson, 4=proportional hazards,
5=multiresponse or itemresponse, 6=longitudinal data (with T variables).
Input choice ([1:6], <cr>=1): 5
Choose option 5 for item response data.
Input 1 for default options, 2 otherwise ([1:2], <cr>=1):
afraid
happy
help
home
memory
alive
worth
energy
hope
better
Multivariate or univariate split variable selection:
Choose multivariate if there is an order among the D variables; otherwise choose univariate
Input 1 for multivariate, 2 for univariate ([1:2], <cr>=1): 2
Input 1 to normalize D variables, 2 for no normalization ([1:2], <cr>=1): 2
Input 1 for equal, 2 for unequal weighting of D variables ([1:2], <cr>=1):
Reading data file ...
Number of records in data file: 1978
Length of longest data entry: 4
Checking for missing values ...
Total number of cases: 1978
Col. no. Categorical variable #levels #missing values
18 gender 2 0
Re-checking data ...
Allocating missing value information
Assigning codes to categorical and missing values
Data checks complete
Creating missing value indicators
Some D variables have missing values
You can use all the data or only those with complete D values
Using only cases with complete D values will reduce the sample size
but allows the option of using PCA for split selection
Input 1 to use all obs, 2 to use obs with complete D values ([1:2], <cr>=2):
Rereading data
PCA can be used for variable selection
Do not use PCA if differential item functioning (DIF) scores are wanted
Input 1 to use PCA, 2 otherwise ([1:2], <cr>=2):
Choose option 2 because DIF scoring is desired.
#cases w/ miss. D = number of cases with all D values missing
Total #cases w/ #missing
#cases miss. D ord. vals #X-var #N-var #F-var #S-var #B-var #C-var
1978 0 1 3 0 0 3 0 1
No. cases used for training: 1977
No. cases excluded due to 0 weight or missing D: 1
Finished reading data file
Warning: interaction tests skipped
Input expected fraction of noise variables erroneously selected ([0.00:0.99], <cr>=0.01):
Input 1 to save p-value matrix for differential item functioning (DIF), 2 otherwise ([1:2], <cr>=1)
Input file name to store DIF p-values: dif.pv
This file is useful for finding the items with DIF.
You can create a description file with the selected variables included or excluded
Input 2 to create such a file, 1 otherwise ([1:2], <cr>=1):
You can also output the importance scores and variable names to a file
Input 1 to create such a file, 2 otherwise ([1:2], <cr>=1):
Input file name: dif.scr
Input file is created!
Run GUIDE with the command: guide < dif.in
The importance scores are in the file dif.scr. They show that age is the most
important predictor and education the least important.
Rank Score Variable
1.00 4.59054E+00 age
2.00 3.43418E+00 gender
3.00 2.36410E+00 education
The last column of dif.pv below shows that items #4 and #10 (bored and
memory) have DIF.
Item Itemname education age gender DIF
1 satis 0.747E-01 0.359E-01 0.918E-01 no
2 drop 0.157E-01 0.202E+00 0.904E+00 no
3 empty 0.547E-03 0.373E-01 0.241E+00 no
4 bored 0.563E-07 0.319E+00 0.361E+00 yes
5 spirit 0.978E+00 0.827E+00 0.261E-01 no
6 afraid 0.479E-01 0.157E-02 0.280E-02 no
7 happy 0.838E+00 0.591E+00 0.330E-01 no
8 help 0.160E-01 0.849E+00 0.384E-02 no
9 home 0.238E+00 0.181E+00 0.155E-03 no
10 memory 0.486E+00 0.000E+00 0.614E-02 yes
11 alive 0.276E+00 0.243E+00 0.416E+00 no
12 worth 0.126E+00 0.931E+00 0.650E+00 no
13 energy 0.471E+00 0.765E+00 0.203E-04 no
14 hope 0.490E+00 0.620E+00 0.224E+00 no
15 better 0.432E+00 0.476E+00 0.439E+00 no
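A short R sketch for extracting the flagged items from dif.pv, assuming the
whitespace-delimited layout with the header row shown above:
## Sketch: read the DIF p-value table and list the flagged items.
pv <- read.table("dif.pv", header = TRUE)
subset(pv, DIF == "yes")[, c("Item", "Itemname")]  # items 4 (bored) and 10 (memory)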
8 Tree ensembles
A tree ensemble is a collection of trees. GUIDE has two methods of constructing an
ensemble. The preferred one is called GUIDE forest. Similar to Random Forest
(Breiman, 2001), it fits unpruned trees to bootstrap samples and randomly selects a
small subset of variables to search for splits at each node. There are, however, two
important differences:
1. GUIDE forest uses the unbiased GUIDE method for split selection and Random
Forest uses the biased CART method. As a result, GUIDE forest is very much
faster than Random Forest if the dependent variable is a class variable having
more than two distinct values and there are categorical predictor variables with
large numbers of categories. In addition, GUIDE forest is applicable to data
with missing values.
2. Random Forest (Liaw and Wiener, 2002) requires a priori imputation of missing
values in the predictor variables, but GUIDE forest does not need imputation.
Choose 1 for estimated priors, 2 for equal priors, 3 for priors from a file
Input 1, 2, or 3 ([1:3], <cr>=1):
Choose 1 for unit misclassification costs, 2 to input costs from a file
Input 1 or 2 ([1:2], <cr>=1):
Input name of file to store predicted class and probability: hepforest.fit
Input file is created!
Run GUIDE with the command: guide < hepforest.in
8.3 Results
Warning: Owing to the intrinsic randomness in forests, your results may differ from
those shown below.
Random forest of classification trees
No pruning
Data description file: hepdsc.txt
Training sample file: hepdat.txt
Missing value code: ?
Records in data file start on line 1
Warning: N variables changed to S
Dependent variable is CLASS
Number of records in data file: 155
Length of longest data entry: 6
Missing values found among categorical variables
Separate categories will be created for missing categorical variables
Number of classes: 2
Class proportions of dependent variable CLASS:
Class #Cases Proportion
die 32 0.20645161
live 123 0.79354839
10 FIRMLIVER c 2 11
11 SPLEEN c 2 5
12 SPIDERS c 2 5
13 ASCITES c 2 5
14 VARICES c 2 5
15 BILIRUBIN s 3.0000E-01 8.0000E+00 6
16 ALKPHOSPHATE s 2.6000E+01 2.9500E+02 29
17 SGOT s 1.4000E+01 6.4800E+02 4
18 ALBUMIN s 2.1000E+00 6.4000E+00 16
19 PROTIME s 0.0000E+00 1.0000E+02 67
20 HISTOLOGY c 2
Except for the number of observations misclassified, the above results are not
particularly useful; they mostly provide a record of the parameter values chosen to
construct the forest. The predicted class probabilities in the file hepforest.fit are
more useful, the top few lines of which are shown below. The first column indicates
whether or not the observation is used for training (labeled y vs. n), followed
by its predicted class probabilities. The last two columns give the predicted and
observed class labels. For example, observation 7 below has predicted probabilities
of 0.2702 and 0.7298 for being in class die and live, respectively, and its predicted
class is live.
train "die" "live" predicted observed
y 0.11211E-01 0.98879E+00 "live" "live"
y 0.63869E-01 0.93613E+00 "live" "live"
y 0.22287E-01 0.97771E+00 "live" "live"
y 0.84751E-02 0.99152E+00 "live" "live"
y 0.68997E-02 0.99310E+00 "live" "live"
y 0.81449E-02 0.99186E+00 "live" "live"
y 0.27020E+00 0.72980E+00 "live" "die"
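A short R sketch summarizing the training-set accuracy from hepforest.fit,
assuming the whitespace-delimited layout with the header row shown above:
## Sketch: confusion matrix and training error rate from hepforest.fit.
fit <- read.table("hepforest.fit", header = TRUE)
tr  <- subset(fit, train == "y")
table(tr$predicted, tr$observed)   # confusion matrix
mean(tr$predicted != tr$observed)  # training misclassification rate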
9 Other features
9.1 Pruning with test samples
GUIDE has three pruning options for deciding the size of the final tree:
(i) cross-validation, (ii) test sample, and (iii) no pruning. Test-sample pruning is
available only when there are no derived variables, such as the dummy indicator
variables created when b variables are present. If test-sample pruning is chosen, the
program will ask for the name of the file containing the test samples. This file must
have the same column format as the training sample file. Test-sample pruning and
no pruning are non-default options.
1. Use a weight variable (designated as W in the description file) that takes value
1 for each training observation and 0 for each test observation.
2. Replace the D values of the test observations with the missing value code.
For tree construction, GUIDE does not use observations in the training sample file
that have zero weight.
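A minimal R sketch of the weight-variable approach in item 1; dat (the combined
training and test cases) and is_test (a logical flag for the test rows) are
hypothetical names.
## Sketch: create the weight column before writing the data file for GUIDE.
## The description file must give the new column the W role.
dat$wt <- ifelse(is_test, 0, 1)  # 1 = training case, 0 = test case
write.table(dat, "combined.dat", quote = FALSE,
            row.names = FALSE, col.names = FALSE)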
1. Create a file (with name data.txt, say) containing one set of bootstrapped
data.
2. Create a data description file (with name desc.txt, say) that refers to data.txt.
3. Create an input file (with name input.txt, say) that refers to desc.txt.
If the files are not all in the same folder, full path names must be given. Here
log.txt is a text file that stores messages during execution. If GUIDE does
not run successfully, errors are also written to log.txt.
0 i p j q a
at the end of the data description file. Here i and j are integers giving the column
numbers of variables $X_1$ and $X_2$, respectively, in the data file, p and q are the
powers (so the derived variable is $X_1^p X_2^q$), and a is one of the letters n, s, or f
(corresponding to a numerical variable used for both splitting and fitting, splitting
only, or fitting only).
To demonstrate, suppose we wish to fit a piecewise quadratic model in the variable
wtgain in the birthweight data. This is easily done by adding one line to the file
birthwt.dsc. First we assign the s (for splitting only) designator to every numerical
predictor except wtgain. This prevents all variables other than wtgain from acting
as regressors in the piecewise quadratic models. To create the variable
$\mathrm{wtgain}^2$, add the line
0 8 2 8 0 f
to the end of birthwt.dsc. The two 8s in the line refer to the column number of
the variable wtgain in the data file, and the f tells the program to use the variable
$\mathrm{wtgain}^2$ for fitting terminal node models only. Note: the line defines
$\mathrm{wtgain}^2$ as $\mathrm{wtgain}^2 \times \mathrm{wtgain}^0$. Since we can equivalently define the variable
as $\mathrm{wtgain}^1 \times \mathrm{wtgain}^1$, we could also have used the line 0 8 1 8 1 f.
The resulting description file now looks like this:
birthwt.dat
NA
1
1 weight d
2 black c
3 married c
4 boy c
5 age s
6 smoke c
7 cigsper s
8 wtgain n
9 visit c
10 ed c
11 lowbwt x
0 8 2 8 0 f
When the program is given this description file, the output will show the regression
coefficients of wtgain and $\mathrm{wtgain}^2$ in each terminal node of the tree.
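For comparison, the terminal-node model implied by this description file is a
quadratic in wtgain. A hedged R analogue, with node_cases a hypothetical data
frame holding one node's cases:
## Sketch: R analogue of the quadratic terminal-node model.
fit <- lm(weight ~ wtgain + I(wtgain^2), data = node_cases)
coef(fit)  # compare with the wtgain and wtgain^2 coefficients in the output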
1. R/Splus: Fields are space delimited. Missing values are coded as NA. Each
record is written on one line. Variable names are given on the first line.
2. SAS: Fields are space delimited. Missing values are coded with periods. Char-
acter strings are truncated to eight characters. Spaces within character strings
are replaced with underscores (_).
3. TEXT: Fields are comma delimited. Empty fields denote missing values. Char-
acter strings longer than eight characters are truncated. Each record is written
on one line. Variable names are given on the first line.
5. SYSTAT: Fields are comma delimited. Strings are truncated to eight char-
acters. Missing character values are replaced with spaces, missing numerical
values with periods. Each record occupies one line.
6. BMDP: Fields are space delimited. Categorical values are sorted in alphabetic
order and then assigned integer codes. Missing values are indicated by asterisks.
Variable names longer than eight characters are truncated.
7. DataDesk: Fields are space delimited. Missing categorical values are coded
with question marks. Missing numerical values are coded with asterisks. Each
record is written on one line. Spaces within categorical values are replaced with
underscores. Variable names are given on the first line of the file.
8. MINITAB: Fields are space delimited. Categorical values are sorted in alpha-
betic order and then assigned integer codes. Missing values are coded with
asterisks. Variable names longer than eight characters are truncated.
9. NUMBERS: Same as TEXT option except that categorical values are con-
verted to integer codes.
10. C4.5: This is the format required by the C4.5 (Quinlan, 1993) program.
11. ARFF: This is the format required by the WEKA (Witten and Frank, 2000)
programs.
Following is a sample session where the hepatitis data are reformatted for R or
Splus.
0. Read the warranty disclaimer
1. Create an input file for model fitting or importance scoring (recommended)
2. Convert data to other formats without creating input file
Input your choice: 3
Input name of log file: log.txt
References
Breiman, L. (1996). Bagging predictors. Machine Learning, 24:123-140.
Breiman, L. (2001). Random forests. Machine Learning, 45:5-32.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification
and Regression Trees. Wadsworth, Belmont.
Broekman, B. F. P., Niti, M., Nyunt, M. S. Z., Ko, S. M., Kumar, R., and Ng, T. P.
(2011). Validation of a brief seven-item response bias-free Geriatric Depression
Scale. American Journal of Geriatric Psychiatry, 19:589-596.
Broekman, B. F. P., Nyunt, S. Z., Niti, M., Jin, A. Z., Ko, S. M., Kumar, R.,
Fones, C. S. L., and Ng, T. P. (2008). Differential item functioning of the Geriatric
Depression Scale in an Asian population. Journal of Affective Disorders, 108:285-290.
Chan, K.-Y. and Loh, W.-Y. (2004). LOTUS: An algorithm for building accurate and
comprehensible logistic regression trees. Journal of Computational and Graphical
Statistics, 13:826-852.
Choi, Y., Ahn, H., and Chen, J. J. (2005). Regression trees for analysis of count
data with extra Poisson variation. Computational Statistics & Data Analysis,
49(3):893-915.
Hosmer, D. W., Lemeshow, S., and May, S. (2008). Applied Survival Analysis. Wiley,
2nd edition.
Kim, H. and Loh, W.-Y. (2001). Classification trees with unbiased multiway
splits. Journal of the American Statistical Association, 96:589-604.
www.stat.wisc.edu/~loh/treeprogs/cruise/cruise.pdf.
Kim, H. and Loh, W.-Y. (2003). Classification trees with bivariate linear discriminant
node models. Journal of Computational and Graphical Statistics, 12:512-530.
www.stat.wisc.edu/~loh/treeprogs/cruise/jcgs.pdf.
Kim, H., Loh, W.-Y., Shih, Y.-S., and Chaudhuri, P. (2007). Visualizable and
interpretable regression models with good prediction power. IIE Transactions,
39:565-579. www.stat.wisc.edu/~loh/treeprogs/guide/iie.pdf.
Koenker, R. (2005). Quantile Regression. Cambridge University Press, Cambridge.
Koenker, R. and Hallock, K. F. (2001). Quantile regression. Journal of Economic
Perspectives, 15:143-156.
Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest.
R News, 2(3):18-22.
Loh, W.-Y. (2006a). Logistic regression tree analysis. In Pham, H., editor, Handbook
of Engineering Statistics, pages 537-549. Springer.
Loh, W.-Y. (2006b). Regression tree models for designed experiments. In Rojo, J.,
editor, The Second Erich L. Lehmann Symposium--Optimality, volume 49, pages
210-228. Institute of Mathematical Statistics Lecture Notes-Monograph Series.
arxiv.org/abs/math.ST/0611192.
Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Re-
views: Data Mining and Knowledge Discovery, 1:14-23.
Loh, W.-Y. (2012). Variable selection for classification and regression in large p,
small n problems. In Barbour, A., Chan, H. P., and Siegmund, D., editors, Prob-
ability Approximations and Beyond, volume 205 of Lecture Notes in Statistics--
Proceedings, pages 133-157, New York. Springer.
Loh, W.-Y. (2014). Fifty years of classification and regression trees (with discussion).
International Statistical Review, 82:329-348.
Loh, W.-Y., Fu, H., Man, M., Champion, V., and Yu, M. (2016). Identification of
subgroups with differential treatment effects for longitudinal and multiresponse
variables. Statistics in Medicine, 35:4837-4855.
Loh, W.-Y., He, X., and Man, M. (2015). A regression tree approach to identifying
subgroups with differential treatment effects. Statistics in Medicine, 34:1818-1833.
Loh, W.-Y. and Zheng, W. (2013). Regression trees for longitudinal and multire-
sponse data. Annals of Applied Statistics, 7:495-522.
Marc, L. G., Raue, P. J., and Bruce, M. L. (2008). Screening performance of the 15-
item Geriatric Depression Scale in a diverse elderly home care population. Amer-
ican Journal of Geriatric Psychiatry, 16:914-921.
Murnane, R. J., Boudett, K. P., and Willett, J. B. (1999). Do male dropouts benefit
from obtaining a GED, postsecondary education, and training? Evaluation Review,
23:475-502.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann,
San Mateo, CA.
Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection.
Wiley, New York.
Schmoor, C., Olschewski, M., and Schumacher, M. (1996). Randomized and non-
randomized patients in clinical trials: experiences with comprehensive cohort stud-
ies. Statistics in Medicine, 15:263-271.
Therneau, T., Atkinson, B., and Ripley, B. (2017). rpart: Recursive Partitioning
and Regression Trees. R package version 4.1-11.
Witten, I. and Frank, E. (2000). Data Mining: Practical Machine Learning Tools
and Techniques with Java Implementations. Morgan Kaufmann, San Francisco,
CA. www.cs.waikato.ac.nz/ml/weka.