AIMA Decision Trees

Decision trees are a simple yet effective machine learning method for classification problems. A decision tree represents a function that takes input attribute values and outputs a discrete decision/classification. Each internal node performs a test on an attribute value, with branches representing possible outputs. Leaf nodes specify the classification decision. The document describes decision trees in detail, including their expressiveness as a logical representation and how they can be induced from labeled example training data.


In Figure 18.1(d) we expand the hypothesis space H to allow polynomials over both x and
sin(x), and find that the data in (c) can be fitted exactly by a simple function of the form
ax + b + c sin(x). This shows the importance of the choice of hypothesis space. We say
that a learning problem is realizable if
the hypothesis space contains the true function. Unfortunately, we cannot always tell whether
a given learning problem is realizable, because the true function is not known.
In some cases, an analyst looking at a problem is willing to make more fine-grained
distinctions about the hypothesis space, to say—even before seeing any data—not just that a
hypothesis is possible or impossible, but rather how probable it is. Supervised learning can
be done by choosing the hypothesis h∗ that is most probable given the data:
h∗ = argmax_{h∈H} P(h | data) .
By Bayes' rule this is equivalent to
h∗ = argmax_{h∈H} P(data | h) P(h) .

Then we can say that the prior probability P (h) is high for a degree-1 or -2 polynomial,
lower for a degree-7 polynomial, and especially low for degree-7 polynomials with large,
sharp spikes as in Figure 18.1(b). We allow unusual-looking functions when the data say we
really need them, but we discourage them by giving them a low prior probability.
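To make this concrete, here is a minimal Python sketch (not from the text) that picks h∗ = argmax P(data|h) P(h) over a small, hypothetical hypothesis space of polynomial degrees; the priors, the residual-based likelihood, and all names are illustrative assumptions.

```python
import math
import numpy as np

# Hypothetical hypothesis space: polynomial degrees, with priors that
# favor simpler (lower-degree) models, as discussed above.
priors = {1: 0.5, 2: 0.3, 7: 0.2}

def log_likelihood(degree, xs, ys):
    """Toy stand-in for log P(data | h): a Gaussian-noise model scores a
    hypothesis by the squared residuals of its least-squares fit."""
    coeffs = np.polyfit(xs, ys, degree)
    residuals = np.polyval(coeffs, xs) - ys
    return -0.5 * float(residuals @ residuals)

def map_hypothesis(xs, ys):
    """h* = argmax_h P(data | h) P(h), computed in log space."""
    return max(priors, key=lambda d: log_likelihood(d, xs, ys) + math.log(priors[d]))

xs = np.arange(10.0)
ys = 0.4 * xs + 3.0            # points that lie exactly on a line
print(map_hypothesis(xs, ys))  # the prior breaks the tie in favor of degree 1
```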
Why not let H be the class of all Java programs, or Turing machines? After all, every
computable function can be represented by some Turing machine, and that is the best we
can do. One problem with this idea is that it does not take into account the computational
complexity of learning. There is a tradeoff between the expressiveness of a hypothesis space
and the complexity of finding a good hypothesis within that space. For example, fitting a
straight line to data is an easy computation; fitting high-degree polynomials is somewhat
harder; and fitting Turing machines is in general undecidable. A second reason to prefer
simple hypothesis spaces is that presumably we will want to use h after we have learned it,
and computing h(x) when h is a linear function is guaranteed to be fast, while computing
an arbitrary Turing machine program is not even guaranteed to terminate. For these reasons,
most work on learning has focused on simple representations.
We will see that the expressiveness–complexity tradeoff is not as simple as it first seems:
it is often the case, as we saw with first-order logic in Chapter 8, that an expressive language
makes it possible for a simple hypothesis to fit the data, whereas restricting the expressiveness
of the language means that any consistent hypothesis must be very complex. For example,
the rules of chess can be written in a page or two of first-order logic, but require thousands of
pages when written in propositional logic.

18.3 LEARNING DECISION TREES

Decision tree induction is one of the simplest and yet most successful forms of machine
learning. We first describe the representation—the hypothesis space—and then show how to
learn a good hypothesis.

18.3.1 The decision tree representation


A decision tree represents a function that takes as input a vector of attribute values and
returns a “decision”—a single output value. The input and output values can be discrete or
continuous. For now we will concentrate on problems where the inputs have discrete values
and the output has exactly two possible values; this is Boolean classification, where each
example input will be classified as true (a positive example) or false (a negative example).
A decision tree reaches its decision by performing a sequence of tests. Each internal
node in the tree corresponds to a test of the value of one of the input attributes, A_i, and
the branches from the node are labeled with the possible values of the attribute, A_i = v_ik.
Each leaf node in the tree specifies a value to be returned by the function. The decision tree
representation is natural for humans; indeed, many “How To” manuals (e.g., for car repair)
are written entirely as a single decision tree stretching over hundreds of pages.
As an example, we will build a decision tree to decide whether to wait for a table at a
restaurant. The aim here is to learn a definition for the goal predicate WillWait. First we
list the attributes that we will consider as part of the input:
1. Alternate: whether there is a suitable alternative restaurant nearby.
2. Bar : whether the restaurant has a comfortable bar area to wait in.
3. Fri/Sat : true on Fridays and Saturdays.
4. Hungry: whether we are hungry.
5. Patrons: how many people are in the restaurant (values are None, Some, and Full ).
6. Price: the restaurant’s price range ($, $$, $$$).
7. Raining : whether it is raining outside.
8. Reservation: whether we made a reservation.
9. Type: the kind of restaurant (French, Italian, Thai, or burger).
10. WaitEstimate: the wait estimated by the host (0–10 minutes, 10–30, 30–60, or >60).
Note that every variable has a small set of possible values; the value of WaitEstimate, for
example, is not an integer, rather it is one of the four discrete values 0–10, 10–30, 30–60, or
>60. The decision tree usually used by one of us (SR) for this domain is shown in Figure 18.2.
Notice that the tree ignores the Price and Type attributes. Examples are processed by the tree
starting at the root and following the appropriate branch until a leaf is reached. For instance,
an example with Patrons = Full and WaitEstimate = 0–10 will be classified as positive
(i.e., yes, we will wait for a table).
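To make the representation concrete, the following Python sketch (not part of the text) encodes the tree of Figure 18.2 as nested (attribute, branches) pairs and classifies an example by following branches from the root; the encoding is illustrative and the branch structure is reconstructed from the figure.

```python
# An encoding of the tree in Figure 18.2: an interior node is an
# (attribute, {value: subtree}) pair and a leaf is the string "Yes" or "No".
tree = ("Patrons", {
    "None": "No",
    "Some": "Yes",
    "Full": ("WaitEstimate", {
        ">60": "No",
        "30-60": ("Alternate", {
            "No": ("Reservation", {
                "No": ("Bar", {"No": "No", "Yes": "Yes"}),
                "Yes": "Yes"}),
            "Yes": ("Fri/Sat", {"No": "No", "Yes": "Yes"})}),
        "10-30": ("Hungry", {
            "No": "Yes",
            "Yes": ("Alternate", {
                "No": "Yes",
                "Yes": ("Raining", {"No": "No", "Yes": "Yes"})})}),
        "0-10": "Yes"})})

def classify(node, example):
    """Follow attribute tests from the root until a leaf ("Yes"/"No") is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[example[attribute]]
    return node

# The example from the text: Patrons = Full and WaitEstimate = 0-10 is positive.
print(classify(tree, {"Patrons": "Full", "WaitEstimate": "0-10"}))  # -> Yes
```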

18.3.2 Expressiveness of decision trees


A Boolean decision tree is logically equivalent to the assertion that the goal attribute is true
if and only if the input attributes satisfy one of the paths leading to a leaf with value true.
Writing this out in propositional logic, we have
Goal ⇔ (Path_1 ∨ Path_2 ∨ · · ·) ,
where each Path is a conjunction of attribute-value tests required to follow that path. Thus,
the whole expression is equivalent to disjunctive normal form (see page 283), which means

that any function in propositional logic can be expressed as a decision tree. As an example,
the rightmost path in Figure 18.2 is
Path = (Patrons = Full ∧ WaitEstimate = 0–10) .
For a wide variety of problems, the decision tree format yields a nice, concise result. But
some functions cannot be represented concisely. For example, the majority function, which
returns true if and only if more than half of the inputs are true, requires an exponentially
large decision tree. In other words, decision trees are good for some kinds of functions and
bad for others. Is there any kind of representation that is efficient for all kinds of functions?
Unfortunately, the answer is no. We can show this in a general way. Consider the set of all
Boolean functions on n attributes. How many different functions are in this set? This is just
the number of different truth tables that we can write down, because the function is defined
by its truth table. A truth table over n attributes has 2^n rows, one for each combination of
values of the attributes. We can consider the "answer" column of the table as a 2^n-bit number
that defines the function. That means there are 2^(2^n) different functions (and there will be
more than that number of trees, since more than one tree can compute the same function).
This is a scary number. For example, with just the ten Boolean attributes of our restaurant
problem there are 2^1024 or about 10^308 different functions to choose from, and for 20
attributes there are over 10^300,000. We will need some ingenious algorithms to find good hypotheses in such
a large space.
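As a quick sanity check on these counts (not in the text), Python's arbitrary-precision integers give the sizes directly:

```python
import math

# The number of distinct Boolean functions on n attributes is 2**(2**n).
n = 10
num_functions = 2 ** (2 ** n)           # 2**1024, an exact Python int
print(len(str(num_functions)))          # 309 decimal digits, i.e. about 10**308

# For 20 attributes the count is far too large to print; estimate its size instead.
digits_20 = math.floor((2 ** 20) * math.log10(2)) + 1
print(digits_20)                        # 315653 digits, i.e. over 10**300,000
```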

18.3.3 Inducing decision trees from examples


An example for a Boolean decision tree consists of an (x, y) pair, where x is a vector of values
for the input attributes, and y is a single Boolean output value. A training set of 12 examples

Figure 18.2   A decision tree for deciding whether to wait for a table. (The root tests Patrons:
None → No, Some → Yes, Full → WaitEstimate. Under WaitEstimate, >60 → No and 0–10 → Yes,
while the 30–60 and 10–30 branches test Alternate and Hungry respectively, with further tests
on Reservation, Fri/Sat, Bar, and Raining along the deeper branches.)

Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
x1       Yes  No   No   Yes  Some  $$$    No    Yes  French   0–10   y1 = Yes
x2       Yes  No   No   Yes  Full  $      No    No   Thai     30–60  y2 = No
x3       No   Yes  No   No   Some  $      No    No   Burger   0–10   y3 = Yes
x4       Yes  No   Yes  Yes  Full  $      Yes   No   Thai     10–30  y4 = Yes
x5       Yes  No   Yes  No   Full  $$$    No    Yes  French   >60    y5 = No
x6       No   Yes  No   Yes  Some  $$     Yes   Yes  Italian  0–10   y6 = Yes
x7       No   Yes  No   No   None  $      Yes   No   Burger   0–10   y7 = No
x8       No   No   No   Yes  Some  $$     Yes   Yes  Thai     0–10   y8 = Yes
x9       No   Yes  Yes  No   Full  $      Yes   No   Burger   >60    y9 = No
x10      Yes  Yes  Yes  Yes  Full  $$$    No    Yes  Italian  10–30  y10 = No
x11      No   No   No   No   None  $      No    No   Thai     0–10   y11 = No
x12      Yes  Yes  Yes  Yes  Full  $      No    No   Burger   30–60  y12 = Yes

Figure 18.3   Examples for the restaurant domain. (Alt through Est are the input attributes; WillWait is the goal predicate.)

is shown in Figure 18.3. The positive examples are the ones in which the goal WillWait is
true (x1, x3, . . .); the negative examples are the ones in which it is false (x2, x5, . . .).
We want a tree that is consistent with the examples and is as small as possible. Un-
fortunately, no matter how we measure size, it is an intractable problem to find the smallest
consistent tree; there is no way to efficiently search through the 2^(2^n) trees. With some simple
heuristics, however, we can find a good approximate solution: a small (but not smallest) con-
sistent tree. The DECISION-TREE-LEARNING algorithm adopts a greedy divide-and-conquer
strategy: always test the most important attribute first. This test divides the problem up into
smaller subproblems that can then be solved recursively. By “most important attribute,” we
mean the one that makes the most difference to the classification of an example. That way, we
hope to get to the correct classification with a small number of tests, meaning that all paths in
the tree will be short and the tree as a whole will be shallow.
Figure 18.4(a) shows that Type is a poor attribute, because it leaves us with four possible
outcomes, each of which has the same number of positive as negative examples. On the other
hand, in (b) we see that Patrons is a fairly important attribute, because if the value is None or
Some, then we are left with example sets for which we can answer definitively (No and Yes,
respectively). If the value is Full , we are left with a mixed set of examples. In general, after
the first attribute test splits up the examples, each outcome is a new decision tree learning
problem in itself, with fewer examples and one less attribute. There are four cases to consider
for these recursive problems:
1. If the remaining examples are all positive (or all negative), then we are done: we can
answer Yes or No. Figure 18.4(b) shows examples of this happening in the None and
Some branches.
2. If there are some positive and some negative examples, then choose the best attribute to
split them. Figure 18.4(b) shows Hungry being used to split the remaining examples.
3. If there are no examples left, it means that no example has been observed for this
combination of attribute values, and we return a default value calculated from the plurality
classification of all the examples that were used in constructing the node's parent. These
are passed along in the variable parent_examples.
4. If there are no attributes left, but both positive and negative examples, it means that
these examples have exactly the same description, but different classifications. This can
happen because there is an error or noise in the data; because the domain is nondeterministic;
or because we can't observe an attribute that would distinguish the examples.
The best we can do is return the plurality classification of the remaining examples.

Figure 18.4   Splitting the examples by testing on attributes. At each node we show the
positive (light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type
brings us no nearer to distinguishing between positive and negative examples. (b) Splitting
on Patrons does a good job of separating positive and negative examples. After splitting on
Patrons, Hungry is a fairly good second test. (In (b), the None branch holds x7, x11, all negative;
the Some branch holds x1, x3, x6, x8, all positive; the Full branch holds x4, x12 (positive) and
x2, x5, x9, x10 (negative), and Hungry then separates x5, x9 from x2, x4, x10, x12.)
The DECISION-TREE-LEARNING algorithm is shown in Figure 18.5. Note that the set of
examples is crucial for constructing the tree, but nowhere do the examples appear in the tree
itself. A tree consists of just tests on attributes in the interior nodes, values of attributes on
the branches, and output values on the leaf nodes. The details of the IMPORTANCE function
are given in Section 18.3.4. The output of the learning algorithm on our sample training
set is shown in Figure 18.6. The tree is clearly different from the original tree shown in
Figure 18.2. One might conclude that the learning algorithm is not doing a very good job
of learning the correct function. This would be the wrong conclusion to draw, however. The
learning algorithm looks at the examples, not at the correct function, and in fact, its hypothesis
(see Figure 18.6) not only is consistent with all the examples, but is considerably simpler
than the original tree! The learning algorithm has no reason to include tests for Raining and
Reservation, because it can classify all the examples without them. It has also detected an
interesting and previously unsuspected pattern: the first author will wait for Thai food on
weekends. It is also bound to make some mistakes for cases where it has seen no examples.
For example, it has never seen a case where the wait is 0–10 minutes but the restaurant is full.

function DECISION-TREE-LEARNING(examples, attributes, parent_examples) returns a tree
  if examples is empty then return PLURALITY-VALUE(parent_examples)
  else if all examples have the same classification then return the classification
  else if attributes is empty then return PLURALITY-VALUE(examples)
  else
      A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
      tree ← a new decision tree with root test A
      for each value v_k of A do
          exs ← {e : e ∈ examples and e.A = v_k}
          subtree ← DECISION-TREE-LEARNING(exs, attributes − A, examples)
          add a branch to tree with label (A = v_k) and subtree subtree
      return tree

Figure 18.5   The decision-tree learning algorithm. The function IMPORTANCE is described in Section 18.3.4. The function PLURALITY-VALUE selects the most common output value among a set of examples, breaking ties randomly.

Figure 18.6   The decision tree induced from the 12-example training set. (The root tests
Patrons: None → No, Some → Yes, Full → Hungry. Hungry = No → No; Hungry = Yes → Type,
with French → Yes, Italian → No, Burger → Yes, and Thai → Fri/Sat, where No → No and
Yes → Yes.)

In that case it says not to wait when Hungry is false, but I (SR) would certainly wait. With
more training examples the learning program could correct this mistake.
We note there is a danger of over-interpreting the tree that the algorithm selects. When
there are several variables of similar importance, the choice between them is somewhat arbi-
trary: with slightly different input examples, a different variable would be chosen to split on
first, and the whole tree would look completely different. The function computed by the tree
would still be similar, but the structure of the tree can vary widely.
We can evaluate the accuracy of a learning algorithm with a learning curve, as shown
in Figure 18.7. We have 100 examples at our disposal, which we split into a training set and
a test set. We learn a hypothesis h with the training set and measure its accuracy with the test
set. We do this starting with a training set of size 1 and increasing one at a time up to size
99. For each size we actually repeat the process of randomly splitting 20 times, and average
the results of the 20 trials. The curve shows that as the training set size grows, the accuracy
increases. (For this reason, learning curves are also called happy graphs.) In this graph we
reach 95% accuracy, and it looks like the curve might continue to increase with more data.

Figure 18.7   A learning curve for the decision tree learning algorithm on 100 randomly
generated examples in the restaurant domain. Each data point is the average of 20 trials.
(The curve plots proportion correct on the test set, rising from roughly 0.4 toward 0.95 as
the training set size grows from 1 to 99.)
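A learning-curve experiment of this shape is easy to reproduce. The sketch below is illustrative rather than the book's experiment; it assumes scikit-learn is available and uses synthetic data in place of the restaurant examples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 100 synthetic examples standing in for the restaurant data.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

sizes = range(5, 96, 5)
curve = []
for m in sizes:
    accs = []
    for trial in range(20):                       # average over 20 random splits
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=m, random_state=trial)
        model = DecisionTreeClassifier().fit(X_tr, y_tr)
        accs.append(model.score(X_te, y_te))      # proportion correct on the test set
    curve.append(np.mean(accs))

for m, acc in zip(sizes, curve):
    print(f"train size {m:3d}: accuracy {acc:.2f}")
```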

18.3.4 Choosing attribute tests


The greedy search used in decision tree learning is designed to approximately minimize the
depth of the final tree. The idea is to pick the attribute that goes as far as possible toward
providing an exact classification of the examples. A perfect attribute divides the examples
into sets, each of which are all positive or all negative and thus will be leaves of the tree. The
Patrons attribute is not perfect, but it is fairly good. A really useless attribute, such as Type,
leaves the example sets with roughly the same proportion of positive and negative examples
as the original set.
All we need, then, is a formal measure of “fairly good” and “really useless” and we can
implement the IMPORTANCE function of Figure 18.5. We will use the notion of information
gain, which is defined in terms of entropy, the fundamental quantity in information theory
(Shannon and Weaver, 1949).
Entropy is a measure of the uncertainty of a random variable; acquisition of information
corresponds to a reduction in entropy. A random variable with only one value—a coin that
always comes up heads—has no uncertainty, so its entropy is defined as zero; thus, we
gain no information by observing its value. A flip of a fair coin is equally likely to come up
heads or tails, 0 or 1, and we will soon show that this counts as “1 bit” of entropy. The roll
of a fair four-sided die has 2 bits of entropy, because it takes two bits to describe one of four
equally probable choices. Now consider an unfair coin that comes up heads 99% of the time.
Intuitively, this coin has less uncertainty than the fair coin—if we guess heads we’ll be wrong
only 1% of the time—so we would like it to have an entropy measure that is close to zero, but
positive. In general, the entropy of a random variable V with values v_k, each with probability
P(v_k), is defined as

Entropy: H(V) = Σ_k P(v_k) log2 (1/P(v_k)) = − Σ_k P(v_k) log2 P(v_k) .
We can check that the entropy of a fair coin flip is indeed 1 bit:
H(Fair) = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1 .
If the coin is loaded to give 99% heads, we get
H(Loaded) = −(0.99 log2 0.99 + 0.01 log2 0.01) ≈ 0.08 bits.
It will help to define B(q) as the entropy of a Boolean random variable that is true with
probability q:
B(q) = −(q log2 q + (1 − q) log2 (1 − q)) .
Thus, H(Loaded ) = B(0.99) ≈ 0.08. Now let’s get back to decision tree learning. If a
training set contains p positive examples and n negative examples, then the entropy of the
goal attribute on the whole set is

H(Goal) = B(p/(p + n)) .
The restaurant training set in Figure 18.3 has p = n = 6, so the corresponding entropy is
B(0.5) or exactly 1 bit. A test on a single attribute A might give us only part of this 1 bit. We
can measure exactly how much by looking at the entropy remaining after the attribute test.
An attribute A with d distinct values divides the training set E into subsets E_1, . . . , E_d.
Each subset E_k has p_k positive examples and n_k negative examples, so if we go along that
branch, we will need an additional B(p_k/(p_k + n_k)) bits of information to answer the ques-
tion. A randomly chosen example from the training set has the kth value for the attribute with
probability (p_k + n_k)/(p + n), so the expected entropy remaining after testing attribute A is

Remainder(A) = Σ_{k=1..d} (p_k + n_k)/(p + n) · B(p_k/(p_k + n_k)) .

The information gain from the attribute test on A is the expected reduction in entropy:

Gain(A) = B(p/(p + n)) − Remainder(A) .
In fact Gain(A) is just what we need to implement the IMPORTANCE function. Returning to
the attributes considered in Figure 18.4, we have

Gain(Patrons) = 1 − [ (2/12) B(0/2) + (4/12) B(4/4) + (6/12) B(2/6) ] ≈ 0.541 bits,
Gain(Type) = 1 − [ (2/12) B(1/2) + (2/12) B(1/2) + (4/12) B(2/4) + (4/12) B(2/4) ] = 0 bits,
confirming our intuition that Patrons is a better attribute to split on. In fact, Patrons has
the maximum gain of any of the attributes and would be chosen by the decision-tree learning
algorithm as the root.
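These numbers are easy to verify from the branch counts in Figure 18.4. The short Python check below is illustrative; its helpers (B, remainder, gain) are just the quantities defined above.

```python
from math import log2

def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * log2(q) + (1 - q) * log2(1 - q))

def remainder(splits, p, n):
    """Expected entropy after a test; `splits` lists (p_k, n_k) per branch."""
    return sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)

def gain(splits, p, n):
    return B(p / (p + n)) - remainder(splits, p, n)

# Branch counts (positive, negative) read off Figure 18.4 for the 12 examples.
patrons = [(0, 2), (4, 0), (2, 4)]            # None, Some, Full
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]    # French, Italian, Thai, Burger

print(round(gain(patrons, 6, 6), 3))   # 0.541
print(round(gain(type_, 6, 6), 3))     # 0.0
```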

18.3.5 Generalization and overfitting

On some problems, the DECISION-TREE-LEARNING algorithm will generate a large tree
when there is actually no pattern to be found. Consider the problem of trying to predict
whether the roll of a die will come up as 6 or not. Suppose that experiments are carried out
with various dice and that the attributes describing each training example include the color
of the die, its weight, the time when the roll was done, and whether the experimenters had
their fingers crossed. If the dice are fair, the right thing to learn is a tree with a single node
that says "no," but the DECISION-TREE-LEARNING algorithm will seize on any pattern it
can find in the input. If it turns out that there are 2 rolls of a 7-gram blue die with fingers
crossed and they both come out 6, then the algorithm may construct a path that predicts 6 in
that case. This problem is called overfitting. A general phenomenon, overfitting occurs with
all types of learners, even when the target function is not at all random. In Figure 18.1(b) and
(c), we saw polynomial functions overfitting the data. Overfitting becomes more likely as the
hypothesis space and the number of input attributes grows, and less likely as we increase the
number of training examples.
For decision trees, a technique called decision tree pruning combats overfitting. Prun-
ing works by eliminating nodes that are not clearly relevant. We start with a full tree, as
generated by DECISION-TREE-LEARNING. We then look at a test node that has only leaf
nodes as descendants. If the test appears to be irrelevant—detecting only noise in the data—
then we eliminate the test, replacing it with a leaf node. We repeat this process, considering
each test with only leaf descendants, until each one has either been pruned or accepted as is.
The question is, how do we detect that a node is testing an irrelevant attribute? Suppose
we are at a node consisting of p positive and n negative examples. If the attribute is irrelevant,
we would expect that it would split the examples into subsets that each have roughly the same
proportion of positive examples as the whole set, p/(p + n), and so the information gain will
be close to zero.2 Thus, the information gain is a good clue to irrelevance. Now the question
is, how large a gain should we require in order to split on a particular attribute?
We can answer this question by using a statistical significance test. Such a test begins
by assuming that there is no underlying pattern (the so-called null hypothesis). Then the ac-
tual data are analyzed to calculate the extent to which they deviate from a perfect absence of
pattern. If the degree of deviation is statistically unlikely (usually taken to mean a 5% prob-
ability or less), then that is considered to be good evidence for the presence of a significant
pattern in the data. The probabilities are calculated from standard distributions of the amount
of deviation one would expect to see in random sampling.
In this case, the null hypothesis is that the attribute is irrelevant and, hence, that the
information gain for an infinitely large sample would be zero. We need to calculate the
probability that, under the null hypothesis, a sample of size v = n + p would exhibit the
observed deviation from the expected distribution of positive and negative examples. We can
measure the deviation by comparing the actual numbers of positive and negative examples in
each subset, p_k and n_k, with the expected numbers, p̂_k and n̂_k, assuming true irrelevance:

p̂_k = p × (p_k + n_k)/(p + n) ,        n̂_k = n × (p_k + n_k)/(p + n) .

A convenient measure of the total deviation is given by

Δ = Σ_{k=1..d} [ (p_k − p̂_k)² / p̂_k + (n_k − n̂_k)² / n̂_k ] .

____________
2. The gain will be strictly positive except for the unlikely case where all the proportions are exactly the same.
(See Exercise 18.5.)

Under the null hypothesis, the value of Δ is distributed according to the χ2 (chi-squared)
distribution with v − 1 degrees of freedom. We can use a χ2 table or a standard statistical
library routine to see if a particular Δ value confirms or rejects the null hypothesis. For
example, consider the restaurant type attribute, with four values and thus three degrees of
freedom. A value of Δ = 7.82 or more would reject the null hypothesis at the 5% level (and a
value of Δ = 11.35 or more would reject at the 1% level). Exercise 18.8 asks you to extend the
DECISION-TREE-LEARNING algorithm to implement this form of pruning, which is known
as χ2 pruning.
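For concreteness, here is a small Python sketch of the χ2 test behind this pruning rule; it is an illustration rather than the book's code, and it assumes SciPy is available for the critical value.

```python
from scipy.stats import chi2

def chi_squared_delta(splits, p, n):
    """Total deviation Δ between observed branch counts (p_k, n_k) and the
    counts expected if the attribute were irrelevant."""
    delta = 0.0
    for pk, nk in splits:
        expected_p = p * (pk + nk) / (p + n)
        expected_n = n * (pk + nk) / (p + n)
        delta += (pk - expected_p) ** 2 / expected_p
        delta += (nk - expected_n) ** 2 / expected_n
    return delta

def irrelevant(splits, p, n, alpha=0.05):
    """True if the deviation is not significant at level alpha, i.e. the
    null hypothesis of irrelevance is not rejected and the node can be pruned."""
    critical = chi2.ppf(1 - alpha, df=len(splits) - 1)
    return chi_squared_delta(splits, p, n) < critical

# The four-valued Type attribute has 3 degrees of freedom; the 5% critical
# value is about 7.81, close to the 7.82 quoted in the text.
print(round(chi2.ppf(0.95, df=3), 2))                      # ~7.81
print(irrelevant([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6))  # True: prune Type
```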
With pruning, noise in the examples can be tolerated. Errors in the example’s label (e.g.,
an example (x, Yes) that should be (x, No)) give a linear increase in prediction error, whereas
errors in the descriptions of examples (e.g., Price = $ when it was actually Price = $$) have
an asymptotic effect that gets worse as the tree shrinks down to smaller sets. Pruned trees
perform significantly better than unpruned trees when the data contain a large amount of
noise. Also, the pruned trees are often much smaller and hence easier to understand.
One final warning: You might think that χ2 pruning and information gain look similar,
so why not combine them using an approach called early stopping—have the decision tree
algorithm stop generating nodes when there is no good attribute to split on, rather than going
to all the trouble of generating nodes and then pruning them away. The problem with early
stopping is that it stops us from recognizing situations where there is no one good attribute,
but there are combinations of attributes that are informative. For example, consider the XOR
function of two binary attributes. If there are roughly equal numbers of examples for all four
combinations of input values, then neither attribute will be informative, yet the correct thing
to do is to split on one of the attributes (it doesn’t matter which one), and then at the second
level we will get splits that are informative. Early stopping would miss this, but generate-
and-then-prune handles it correctly.
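The XOR point is easy to check numerically. The sketch below (illustrative, with hypothetical helpers) shows that on balanced XOR data each attribute by itself has zero information gain, even though the two attributes together determine the output exactly.

```python
from math import log2

def B(q):
    return 0.0 if q in (0.0, 1.0) else -(q * log2(q) + (1 - q) * log2(1 - q))

def gain(examples, attribute):
    """Information gain of a single Boolean attribute on (x1, x2, y) triples."""
    p = sum(1 for e in examples if e[2])
    n = len(examples) - p
    remainder = 0.0
    for value in (0, 1):
        subset = [e for e in examples if e[attribute] == value]
        pk = sum(1 for e in subset if e[2])
        remainder += len(subset) / len(examples) * B(pk / len(subset))
    return B(p / (p + n)) - remainder

# Balanced XOR data: y = x1 XOR x2, one example per input combination.
xor_examples = [(x1, x2, x1 ^ x2) for x1 in (0, 1) for x2 in (0, 1)]
print(gain(xor_examples, 0))   # 0.0: attribute x1 alone is uninformative
print(gain(xor_examples, 1))   # 0.0: attribute x2 alone is uninformative
```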

18.3.6 Broadening the applicability of decision trees


In order to extend decision tree induction to a wider variety of problems, a number of issues
must be addressed. We will briefly mention several, suggesting that a full understanding is
best obtained by doing the associated exercises:
• Missing data: In many domains, not all the attribute values will be known for every
example. The values might have gone unrecorded, or they might be too expensive to
obtain. This gives rise to two problems: First, given a complete decision tree, how
should one classify an example that is missing one of the test attributes? Second, how
should one modify the information-gain formula when some examples have unknown
values for the attribute? These questions are addressed in Exercise 18.9.

• Multivalued attributes: When an attribute has many possible values, the information
gain measure gives an inappropriate indication of the attribute’s usefulness. In the ex-
treme case, an attribute such as ExactTime has a different value for every example,
which means each subset of examples is a singleton with a unique classification, and
the information gain measure would have its highest value for this attribute. But choos-
ing this split first is unlikely to yield the best tree. One solution is to use the gain ratio
(Exercise 18.10). Another possibility is to allow a Boolean test of the form A = vk , that
is, picking out just one of the possible values for an attribute, leaving the remaining
values to possibly be tested later in the tree.

• Continuous and integer-valued input attributes: Continuous or integer-valued at-
tributes, such as Height and Weight, have an infinite set of possible values. Rather than
generate infinitely many branches, decision-tree learning algorithms typically find the
split point that gives the highest information gain. For example, at a given node in
the tree, it might be the case that testing on Weight > 160 gives the most informa-
tion. Efficient methods exist for finding good split points: start by sorting the values
of the attribute, and then consider only split points that are between two examples in
sorted order that have different classifications, while keeping track of the running totals
of positive and negative examples on each side of the split point (a sketch of this search
appears after this list). Splitting is the most expensive part of real-world decision tree
learning applications.

• Continuous-valued output attributes: If we are trying to predict a numerical output
value, such as the price of an apartment, then we need a regression tree rather than a
classification tree. A regression tree has at each leaf a linear function of some subset
of numerical attributes, rather than a single value. For example, the branch for two-
bedroom apartments might end with a linear function of square footage, number of
bathrooms, and average income for the neighborhood. The learning algorithm must
decide when to stop splitting and begin applying linear regression (see Section 18.6)
over the attributes.
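As mentioned in the continuous-attributes item above, here is a minimal sketch of the split-point search; it is illustrative rather than a production algorithm, and the example data are hypothetical.

```python
from math import log2

def B(q):
    return 0.0 if q in (0.0, 1.0) else -(q * log2(q) + (1 - q) * log2(1 - q))

def best_split_point(values, labels):
    """Return (threshold, gain) maximizing information gain for a test
    value > threshold, considering only midpoints between neighbors in
    sorted order whose classifications differ."""
    pairs = sorted(zip(values, labels))
    total_p = sum(labels)
    total_n = len(labels) - total_p
    base = B(total_p / (total_p + total_n))
    best = (None, 0.0)
    p_left = n_left = 0                    # running totals left of the split
    for i in range(len(pairs) - 1):
        v, y = pairs[i]
        p_left += y
        n_left += 1 - y
        if y == pairs[i + 1][1] or v == pairs[i + 1][0]:
            continue                       # only consider class changes
        left, right = i + 1, len(pairs) - (i + 1)
        remainder = (left / len(pairs)) * B(p_left / left) \
                  + (right / len(pairs)) * B((total_p - p_left) / right)
        split_gain = base - remainder
        if split_gain > best[1]:
            best = ((v + pairs[i + 1][0]) / 2, split_gain)
    return best

# Hypothetical Weight values with Boolean labels (1 = positive example).
weights = [120, 135, 150, 155, 162, 170, 180, 200]
labels  = [1,   1,   1,   0,   0,   1,   0,   0]
print(best_split_point(weights, labels))   # best threshold lies between 150 and 155
```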

A decision-tree learning system for real-world applications must be able to handle all of
these problems. Handling continuous-valued variables is especially important, because both
physical and financial processes provide numerical data. Several commercial packages have
been built that meet these criteria, and they have been used to develop thousands of fielded
systems. In many areas of industry and commerce, decision trees are usually the first method
tried when a classification method is to be extracted from a data set. One important property
of decision trees is that it is possible for a human to understand the reason for the output of the
learning algorithm. (Indeed, this is a legal requirement for financial decisions that are subject
to anti-discrimination laws.) This is a property not shared by some other representations,
such as neural networks.
