AIMA Decision Trees
hypothesis space H to allow polynomials over both x and sin(x), and find that the data in
(c) can be fitted exactly by a simple function of the form ax + b + c sin(x). This shows the
importance of the choice of hypothesis space. We say that a learning problem is realizable if
the hypothesis space contains the true function. Unfortunately, we cannot always tell whether
a given learning problem is realizable, because the true function is not known.
In some cases, an analyst looking at a problem is willing to make more fine-grained
distinctions about the hypothesis space, to say—even before seeing any data—not just that a
hypothesis is possible or impossible, but rather how probable it is. Supervised learning can
be done by choosing the hypothesis h∗ that is most probable given the data:
h∗ = argmax_{h∈H} P(h | data) .
By Bayes’ rule this is equivalent to
h∗ = argmax_{h∈H} P(data | h) P(h) .
Then we can say that the prior probability P (h) is high for a degree-1 or -2 polynomial,
lower for a degree-7 polynomial, and especially low for degree-7 polynomials with large,
sharp spikes as in Figure 18.1(b). We allow unusual-looking functions when the data say we
really need them, but we discourage them by giving them a low prior probability.
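As a purely illustrative sketch, the maximum a posteriori choice above can be written in a few lines of Python; the `likelihood` and `prior` functions here are hypothetical placeholders for whatever probability model the analyst has chosen:

```python
import math

def map_hypothesis(hypotheses, data, likelihood, prior):
    """Return the hypothesis h maximizing P(data|h) P(h).

    Works in log space to avoid floating-point underflow on large data sets.
    `likelihood(h, data)` and `prior(h)` are assumed to return probabilities.
    """
    def log_posterior(h):
        return math.log(likelihood(h, data)) + math.log(prior(h))
    return max(hypotheses, key=log_posterior)
```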
Why not let H be the class of all Java programs, or Turing machines? After all, every
computable function can be represented by some Turing machine, and that is the best we
can do. One problem with this idea is that it does not take into account the computational
complexity of learning. There is a tradeoff between the expressiveness of a hypothesis space
and the complexity of finding a good hypothesis within that space. For example, fitting a
straight line to data is an easy computation; fitting high-degree polynomials is somewhat
harder; and fitting Turing machines is in general undecidable. A second reason to prefer
simple hypothesis spaces is that presumably we will want to use h after we have learned it,
and computing h(x) when h is a linear function is guaranteed to be fast, while computing
an arbitrary Turing machine program is not even guaranteed to terminate. For these reasons,
most work on learning has focused on simple representations.
We will see that the expressiveness–complexity tradeoff is not as simple as it first seems:
it is often the case, as we saw with first-order logic in Chapter 8, that an expressive language
makes it possible for a simple hypothesis to fit the data, whereas restricting the expressiveness
of the language means that any consistent hypothesis must be very complex. For example,
the rules of chess can be written in a page or two of first-order logic, but require thousands of
pages when written in propositional logic.
Decision tree induction is one of the simplest and yet most successful forms of machine
learning. We first describe the representation—the hypothesis space—and then show how to
learn a good hypothesis.
Any function in propositional logic can be expressed as a decision tree. As an example,
the rightmost path in Figure 18.2 is
Path = (Patrons = Full ∧ WaitEstimate = 0–10) .
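For instance, this path is just a Boolean test on the two attributes it mentions; a one-line Python rendering (using the attribute values from the formula above) might be:

```python
def rightmost_path(patrons, wait_estimate):
    """True exactly for examples that follow the rightmost path of Figure 18.2."""
    return patrons == "Full" and wait_estimate == "0-10"
```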
For a wide variety of problems, the decision tree format yields a nice, concise result. But
some functions cannot be represented concisely. For example, the majority function, which
returns true if and only if more than half of the inputs are true, requires an exponentially
large decision tree. In other words, decision trees are good for some kinds of functions and
bad for others. Is there any kind of representation that is efficient for all kinds of functions?
Unfortunately, the answer is no. We can show this in a general way. Consider the set of all
Boolean functions on n attributes. How many different functions are in this set? This is just
the number of different truth tables that we can write down, because the function is defined
by its truth table. A truth table over n attributes has 2^n rows, one for each combination of
values of the attributes. We can consider the “answer” column of the table as a 2^n-bit number
that defines the function. That means there are 2^(2^n) different functions (and there will be more
than that number of trees, since more than one tree can compute the same function). This is
a scary number. For example, with just the ten Boolean attributes of our restaurant problem
there are 2^1024 or about 10^308 different functions to choose from, and for 20 attributes there
are over 10^300,000. We will need some ingenious algorithms to find good hypotheses in such
a large space.
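The arithmetic is easy to check without ever constructing such a number; here is a short Python sketch in which the digit counts are computed from logarithms rather than by building 2^(2^n) explicitly:

```python
import math

def num_boolean_functions(n):
    """Number of distinct Boolean functions of n Boolean attributes: 2^(2^n)."""
    return 2 ** (2 ** n)

def decimal_digits(n):
    """Number of decimal digits of 2^(2^n), computed without building the number."""
    return math.floor(2 ** n * math.log10(2)) + 1

print(decimal_digits(10))   # 309 digits: 2^1024 is about 10^308
print(decimal_digits(20))   # 315653 digits: well over 10^300,000
```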
Figure 18.2 A decision tree for deciding whether to wait for a table.
is shown in Figure 18.3. The positive examples are the ones in which the goal WillWait is
true (x1, x3, . . .); the negative examples are the ones in which it is false (x2, x5, . . .).
We want a tree that is consistent with the examples and is as small as possible. Un-
fortunately, no matter how we measure size, it is an intractable problem to find the smallest
consistent tree; there is no way to efficiently search through the 2^(2^n) trees. With some simple
heuristics, however, we can find a good approximate solution: a small (but not smallest) con-
sistent tree. The DECISION-TREE-LEARNING algorithm adopts a greedy divide-and-conquer
strategy: always test the most important attribute first. This test divides the problem up into
smaller subproblems that can then be solved recursively. By “most important attribute,” we
mean the one that makes the most difference to the classification of an example. That way, we
hope to get to the correct classification with a small number of tests, meaning that all paths in
the tree will be short and the tree as a whole will be shallow.
Figure 18.4(a) shows that Type is a poor attribute, because it leaves us with four possible
outcomes, each of which has the same number of positive as negative examples. On the other
hand, in (b) we see that Patrons is a fairly important attribute, because if the value is None or
Some, then we are left with example sets for which we can answer definitively (No and Yes,
respectively). If the value is Full , we are left with a mixed set of examples. In general, after
the first attribute test splits up the examples, each outcome is a new decision tree learning
problem in itself, with fewer examples and one less attribute. There are four cases to consider
for these recursive problems:
1. If the remaining examples are all positive (or all negative), then we are done: we can
answer Yes or No. Figure 18.4(b) shows examples of this happening in the None and
Some branches.
2. If there are some positive and some negative examples, then choose the best attribute to
split them. Figure 18.4(b) shows Hungry being used to split the remaining examples.
Figure 18.4 Splitting the examples by testing on attributes. At each node we show the positive (light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type brings us no nearer to distinguishing between positive and negative examples. (b) Splitting on Patrons does a good job of separating positive and negative examples. After splitting on Patrons, Hungry is a fairly good second test.
3. If there are no examples left, it means that no example has been observed for this combination
of attribute values, and we return a default value calculated from the plurality
classification of all the examples that were used in constructing the node’s parent. These
are passed along in the variable parent_examples.
4. If there are no attributes left, but both positive and negative examples, it means that
these examples have exactly the same description, but different classifications. This can
happen because there is an error or noise in the data; because the domain is nondeter-
ministic; or because we can’t observe an attribute that would distinguish the examples.
The best we can do is return the plurality classification of the remaining examples.
The DECISION-TREE-LEARNING algorithm is shown in Figure 18.5. Note that the set of
examples is crucial for constructing the tree, but nowhere do the examples appear in the tree
itself. A tree consists of just tests on attributes in the interior nodes, values of attributes on
the branches, and output values on the leaf nodes. The details of the IMPORTANCE function
are given in Section 18.3.4. The output of the learning algorithm on our sample training
set is shown in Figure 18.6. The tree is clearly different from the original tree shown in
Figure 18.2. One might conclude that the learning algorithm is not doing a very good job
of learning the correct function. This would be the wrong conclusion to draw, however. The
learning algorithm looks at the examples, not at the correct function, and in fact, its hypothesis
(see Figure 18.6) not only is consistent with all the examples, but is considerably simpler
than the original tree! The learning algorithm has no reason to include tests for Raining and
Reservation, because it can classify all the examples without them. It has also detected an
interesting and previously unsuspected pattern: the first author will wait for Thai food on
weekends. It is also bound to make some mistakes for cases where it has seen no examples.
For example, it has never seen a case where the wait is 0–10 minutes but the restaurant is full.
Figure 18.5 The decision-tree learning algorithm. The function IMPORTANCE is de-
scribed in Section 18.3.4. The function PLURALITY-VALUE selects the most common output
value among a set of examples, breaking ties randomly.
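Since Figure 18.5 itself is not reproduced here, the following is a minimal Python sketch of the same recursive scheme, not the pseudocode verbatim. The representation is an assumption made for illustration: each example is a dict of attribute values, the classification is stored under a separate `target` key, and `importance` is any scoring function (such as the information gain defined in Section 18.3.4).

```python
from collections import Counter

def plurality_value(examples, target):
    """Most common output value among the examples (ties broken arbitrarily here)."""
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, target, importance, parent_examples=()):
    if not examples:                                   # case 3: no examples left
        return plurality_value(parent_examples, target)
    classes = {e[target] for e in examples}
    if len(classes) == 1:                              # case 1: all examples agree
        return classes.pop()
    if not attributes:                                 # case 4: no attributes left
        return plurality_value(examples, target)
    # case 2: split on the most important attribute
    a = max(attributes, key=lambda attr: importance(attr, examples, target))
    tree = {a: {}}
    for value in {e[a] for e in examples}:             # only values seen in the examples
        subset = [e for e in examples if e[a] == value]
        tree[a][value] = decision_tree_learning(
            subset, [x for x in attributes if x != a], target, importance, examples)
    return tree
```

Note one simplification: only attribute values that actually occur in the examples get a branch. A fuller version would loop over every possible value of the chosen attribute and, for values with no matching examples, attach a leaf labeled with the plurality classification of the current node's examples, which is exactly case 3 above.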
Figure 18.6 The decision tree induced from the 12-example training set.
In that case it says not to wait when Hungry is false, but I (SR) would certainly wait. With
more training examples the learning program could correct this mistake.
We note there is a danger of over-interpreting the tree that the algorithm selects. When
there are several variables of similar importance, the choice between them is somewhat arbi-
trary: with slightly different input examples, a different variable would be chosen to split on
first, and the whole tree would look completely different. The function computed by the tree
would still be similar, but the structure of the tree can vary widely.
We can evaluate the accuracy of a learning algorithm with a learning curve, as shown
in Figure 18.7. We have 100 examples at our disposal, which we split into a training set and
Figure 18.7 A learning curve for the decision tree learning algorithm on 100 randomly
generated examples in the restaurant domain. Each data point is the average of 20 trials.
(Horizontal axis: training set size; vertical axis: accuracy on the test set.)
a test set. We learn a hypothesis h with the training set and measure its accuracy with the test
set. We do this starting with a training set of size 1 and increasing one at a time up to size
99. For each size we actually repeat the process of randomly splitting 20 times, and average
the results of the 20 trials. The curve shows that as the training set size grows, the accuracy
increases. (For this reason, learning curves are also called happy graphs.) In this graph we
reach 95% accuracy, and it looks like the curve might continue to increase with more data.
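A learning-curve experiment of this kind is simple to script. The sketch below is generic: `learn` is assumed to return a hypothesis `h` that can be called on an input, and examples are (input, output) pairs; both names are placeholders rather than anything defined in the text.

```python
import random

def learning_curve(examples, learn, trials=20, max_size=None):
    """Average test-set accuracy as a function of training-set size."""
    max_size = max_size or len(examples) - 1
    curve = []
    for size in range(1, max_size + 1):
        accuracies = []
        for _ in range(trials):
            shuffled = random.sample(examples, len(examples))   # random split
            train, test = shuffled[:size], shuffled[size:]
            h = learn(train)
            correct = sum(1 for x, y in test if h(x) == y)
            accuracies.append(correct / len(test))
        curve.append((size, sum(accuracies) / trials))
    return curve
```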
In general, the entropy of a random variable V with values vk, each with probability
P(vk), is defined as

H(V) = ∑_k P(vk) log2(1/P(vk)) = − ∑_k P(vk) log2 P(vk) .
We can check that the entropy of a fair coin flip is indeed 1 bit:
H(Fair) = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1 .
If the coin is loaded to give 99% heads, we get
H(Loaded) = −(0.99 log2 0.99 + 0.01 log2 0.01) ≈ 0.08 bits.
It will help to define B(q) as the entropy of a Boolean random variable that is true with
probability q:
B(q) = −(q log2 q + (1 − q) log2 (1 − q)) .
Thus, H(Loaded ) = B(0.99) ≈ 0.08. Now let’s get back to decision tree learning. If a
training set contains p positive examples and n negative examples, then the entropy of the
goal attribute on the whole set is
H(Goal) = B(p/(p + n)) .
The restaurant training set in Figure 18.3 has p = n = 6, so the corresponding entropy is
B(0.5) or exactly 1 bit. A test on a single attribute A might give us only part of this 1 bit. We
can measure exactly how much by looking at the entropy remaining after the attribute test.
An attribute A with d distinct values divides the training set E into subsets E1 , . . . , Ed .
Each subset Ek has pk positive examples and nk negative examples, so if we go along that
branch, we will need an additional B(pk /(pk + nk )) bits of information to answer the ques-
tion. A randomly chosen example from the training set has the kth value for the attribute with
probability (pk + nk )/(p + n), so the expected entropy remaining after testing attribute A is
Remainder(A) = ∑_{k=1}^{d} (pk + nk)/(p + n) · B(pk/(pk + nk)) .
The information gain from the attribute test on A is the expected reduction in entropy:

Gain(A) = B(p/(p + n)) − Remainder(A) .
In fact Gain(A) is just what we need to implement the IMPORTANCE function. Returning to
the attributes considered in Figure 18.4, we have
Gain(Patrons) = 1 − [ (2/12) B(0/2) + (4/12) B(4/4) + (6/12) B(2/6) ] ≈ 0.541 bits,
Gain(Type) = 1 − [ (2/12) B(1/2) + (2/12) B(1/2) + (4/12) B(2/4) + (4/12) B(2/4) ] = 0 bits,
confirming our intuition that Patrons is a better attribute to split on. In fact, Patrons has
the maximum gain of any of the attributes and would be chosen by the decision-tree learning
algorithm as the root.
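These two numbers are easy to verify in code. The sketch below implements B, Remainder, and Gain exactly as defined above and plugs in the positive/negative counts shown in Figure 18.4:

```python
import math

def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def remainder(splits, p, n):
    """splits is a list of (pk, nk) pairs, one per attribute value."""
    return sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)

def gain(splits, p, n):
    """Information gain of splitting p positives and n negatives into the given subsets."""
    return B(p / (p + n)) - remainder(splits, p, n)

# Counts from Figure 18.4: 6 positive and 6 negative examples overall.
print(gain([(0, 2), (4, 0), (2, 4)], 6, 6))              # Patrons: about 0.541
print(gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6))      # Type: 0.0
```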
On some problems, the DECISION-TREE-LEARNING algorithm will generate a large tree
when there is actually no pattern to be found. Consider the problem of trying to predict
whether the roll of a die will come up as 6 or not. Suppose that experiments are carried out
with various dice and that the attributes describing each training example include the color
of the die, its weight, the time when the roll was done, and whether the experimenters had
their fingers crossed. If the dice are fair, the right thing to learn is a tree with a single node
that says “no.” But the DECISION-TREE-LEARNING algorithm will seize on any pattern it
can find in the input. If it turns out that there are 2 rolls of a 7-gram blue die with fingers
crossed and they both come out 6, then the algorithm may construct a path that predicts 6 in
that case. This problem is called overfitting. A general phenomenon, overfitting occurs with
all types of learners, even when the target function is not at all random. In Figure 18.1(b) and
(c), we saw polynomial functions overfitting the data. Overfitting becomes more likely as the
hypothesis space and the number of input attributes grow, and less likely as we increase the
number of training examples.
For decision trees, a technique called decision tree pruning combats overfitting. Prun-
ing works by eliminating nodes that are not clearly relevant. We start with a full tree, as
generated by DECISION-TREE-LEARNING. We then look at a test node that has only leaf
nodes as descendants. If the test appears to be irrelevant—detecting only noise in the data—
then we eliminate the test, replacing it with a leaf node. We repeat this process, considering
each test with only leaf descendants, until each one has either been pruned or accepted as is.
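In code, this bottom-up sweep might look like the sketch below. It assumes the dict-based tree representation used in the earlier learning sketch (and reuses its `plurality_value`), and it defers the relevance decision to a hypothetical `is_irrelevant` test, which the significance test discussed next can supply.

```python
def prune(tree, examples, target, is_irrelevant):
    """Bottom-up pruning of a dict-based tree of the form {attribute: {value: subtree}}.

    `is_irrelevant(attribute, examples, target)` decides whether the test at a node
    appears to be detecting only noise (e.g., via the chi-squared test sketched below).
    """
    if not isinstance(tree, dict):          # a leaf: nothing to prune
        return tree
    (attribute, branches), = tree.items()
    # First prune the children, passing each one the examples that reach it.
    for value in list(branches):
        subset = [e for e in examples if e[attribute] == value]
        branches[value] = prune(branches[value], subset, target, is_irrelevant)
    # If every child is now a leaf and the test looks irrelevant, collapse it to a leaf.
    if all(not isinstance(child, dict) for child in branches.values()) \
            and is_irrelevant(attribute, examples, target):
        return plurality_value(examples, target)
    return tree
```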
The question is, how do we detect that a node is testing an irrelevant attribute? Suppose
we are at a node consisting of p positive and n negative examples. If the attribute is irrelevant,
we would expect that it would split the examples into subsets that each have roughly the same
proportion of positive examples as the whole set, p/(p + n), and so the information gain will
be close to zero.² Thus, the information gain is a good clue to irrelevance. Now the question
is, how large a gain should we require in order to split on a particular attribute?
We can answer this question by using a statistical significance test. Such a test begins
by assuming that there is no underlying pattern (the so-called null hypothesis). Then the ac-
tual data are analyzed to calculate the extent to which they deviate from a perfect absence of
pattern. If the degree of deviation is statistically unlikely (usually taken to mean a 5% prob-
ability or less), then that is considered to be good evidence for the presence of a significant
pattern in the data. The probabilities are calculated from standard distributions of the amount
of deviation one would expect to see in random sampling.
In this case, the null hypothesis is that the attribute is irrelevant and, hence, that the
information gain for an infinitely large sample would be zero. We need to calculate the
probability that, under the null hypothesis, a sample of size v = n + p would exhibit the
observed deviation from the expected distribution of positive and negative examples. We can
measure the deviation by comparing the actual numbers of positive and negative examples in
each subset, pk and nk, with the expected numbers, p̂k and n̂k, assuming true irrelevance:

p̂k = p × (pk + nk)/(p + n),    n̂k = n × (pk + nk)/(p + n) .

² The gain will be strictly positive except for the unlikely case where all the proportions are exactly the same. (See Exercise 18.5.)
A convenient measure of the total deviation is given by
Δ = ∑_{k=1}^{d} [ (pk − p̂k)²/p̂k + (nk − n̂k)²/n̂k ] .
Under the null hypothesis, the value of Δ is distributed according to the χ² (chi-squared)
distribution with v − 1 degrees of freedom. We can use a χ² table or a standard statistical
library routine to see if a particular Δ value confirms or rejects the null hypothesis. For
example, consider the restaurant type attribute, with four values and thus three degrees of
freedom. A value of Δ = 7.82 or more would reject the null hypothesis at the 5% level (and a
value of Δ = 11.35 or more would reject at the 1% level). Exercise 18.8 asks you to extend the
DECISION-TREE-LEARNING algorithm to implement this form of pruning, which is known
as χ² pruning.
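As a sketch, here is the Δ computation and the comparison against a critical value. The 7.82 threshold quoted above is the 5% level for three degrees of freedom; for other attributes one would look up the appropriate value in a χ² table (or, for example, with scipy.stats.chi2.ppf).

```python
def chi_squared_delta(splits, p, n):
    """Total deviation Δ; splits is a list of (pk, nk) pairs, one per attribute value."""
    delta = 0.0
    for pk, nk in splits:
        expected_p = p * (pk + nk) / (p + n)   # p̂k under the null hypothesis
        expected_n = n * (pk + nk) / (p + n)   # n̂k under the null hypothesis
        delta += (pk - expected_p) ** 2 / expected_p
        delta += (nk - expected_n) ** 2 / expected_n
    return delta

# Example: the Type attribute of the restaurant data (four values, three degrees of freedom).
print(chi_squared_delta([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6))   # 0.0, far below 7.82,
# so the null hypothesis (Type is irrelevant) is not rejected and the test would be pruned.
```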
With pruning, noise in the examples can be tolerated. Errors in the example’s label (e.g.,
an example (x, Yes) that should be (x, No)) give a linear increase in prediction error, whereas
errors in the descriptions of examples (e.g., Price = $ when it was actually Price = $$) have
an asymptotic effect that gets worse as the tree shrinks down to smaller sets. Pruned trees
perform significantly better than unpruned trees when the data contain a large amount of
noise. Also, the pruned trees are often much smaller and hence easier to understand.
One final warning: You might think that χ² pruning and information gain look similar,
so why not combine them using an approach called early stopping—have the decision tree
algorithm stop generating nodes when there is no good attribute to split on, rather than going
to all the trouble of generating nodes and then pruning them away. The problem with early
stopping is that it stops us from recognizing situations where there is no one good attribute,
but there are combinations of attributes that are informative. For example, consider the XOR
function of two binary attributes. If there are roughly equal numbers of examples for all four
combinations of input values, then neither attribute will be informative, yet the correct thing
to do is to split on one of the attributes (it doesn’t matter which one), and then at the second
level we will get splits that are informative. Early stopping would miss this, but generate-
and-then-prune handles it correctly.
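The XOR point can be checked with the `gain` function from the earlier sketch: with one example per input combination, each value of either attribute covers one positive and one negative example, so the gain at the root is zero; one level down, the remaining attribute separates the examples perfectly.

```python
# At the root: splitting on either attribute of XOR gives two subsets,
# each with one positive and one negative example (p = n = 2 overall).
print(gain([(1, 1), (1, 1)], 2, 2))   # 0.0 bits of gain for either attribute

# Inside either branch (p = n = 1), the other attribute separates the examples completely.
print(gain([(1, 0), (0, 1)], 1, 1))   # 1.0 bit of gain
```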
How should one modify the information-gain formula when some examples have unknown
values for the attribute? These questions are addressed in Exercise 18.9.
• Multivalued attributes: When an attribute has many possible values, the information
gain measure gives an inappropriate indication of the attribute’s usefulness. In the ex-
treme case, an attribute such as ExactTime has a different value for every example,
which means each subset of examples is a singleton with a unique classification, and
the information gain measure would have its highest value for this attribute. But choos-
ing this split first is unlikely to yield the best tree. One solution is to use the gain ratio
(Exercise 18.10; a brief sketch follows below). Another possibility is to allow a Boolean test of the form A = vk, that
is, picking out just one of the possible values for an attribute, leaving the remaining
values to possibly be tested later in the tree.
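The gain ratio is not defined in this excerpt (it is left to Exercise 18.10); the sketch below uses the common C4.5-style definition, dividing the gain by the entropy of the split itself, and reuses the `gain` function from the earlier sketch. Treat it as one plausible reading of that exercise rather than a definition taken from the text.

```python
import math

def split_information(splits, p, n):
    """Entropy of the split itself: how finely the attribute divides the examples."""
    total = p + n
    return -sum((pk + nk) / total * math.log2((pk + nk) / total)
                for pk, nk in splits if pk + nk > 0)

def gain_ratio(splits, p, n):
    """Information gain normalized by the split's own entropy (C4.5-style)."""
    si = split_information(splits, p, n)
    return gain(splits, p, n) / si if si > 0 else 0.0
```

An attribute like ExactTime, which splits the examples into singletons, has maximal split information, so its gain ratio is pushed down even though its raw gain is maximal.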
A decision-tree learning system for real-world applications must be able to handle all of
these problems. Handling continuous-valued variables is especially important, because both
physical and financial processes provide numerical data. Several commercial packages have
been built that meet these criteria, and they have been used to develop thousands of fielded
systems. In many areas of industry and commerce, decision trees are usually the first method
tried when a classification method is to be extracted from a data set. One important property
of decision trees is that it is possible for a human to understand the reason for the output of the
learning algorithm. (Indeed, this is a legal requirement for financial decisions that are subject
to anti-discrimination laws.) This is a property not shared by some other representations,
such as neural networks.