03 02 Decision Trees
Decision Trees
Introduction
■ Decision trees classify instances by sorting them down the tree from the
root to some leaf node, which provides the classification of the instance.
■ Each node in the tree specifies a test of some attribute of the instance, and
each branch descending from that node corresponds to one of the possible
values for this attribute.
■ I.e. an instance is classified by starting at the root node of the tree, testing
the attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute in the given example.
● This process is repeated for the subtree rooted at the new node.
Decision tree for playing tennis
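As an illustration of this top-down classification procedure, here is a minimal Python sketch (not part of the original slides). The tree is stored as nested dictionaries, and the example tree reproduces the usual playing-tennis tree with attributes Outlook, Humidity and Wind; treat the exact attributes and labels as an assumption.

# Minimal sketch: classify an instance by walking a decision tree.
# Internal nodes are dicts {attribute: {value: subtree}}; leaves are class labels.

def classify(tree, instance):
    if not isinstance(tree, dict):
        return tree                                 # reached a leaf: return its label
    attribute, branches = next(iter(tree.items()))  # attribute tested at this node
    value = instance[attribute]                     # the instance's value for it
    return classify(branches[value], instance)      # follow the matching branch

# Assumed playing-tennis tree (attributes and values are illustrative).
tennis_tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

print(classify(tennis_tree, {"Outlook": "Rain", "Wind": "Weak"}))  # -> Yes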
Introduction
■ A learned decision tree can also be read as a set of rules relating the attributes (or
features) of an instance to its classification.
■ Each path from the root of the tree to a leaf corresponds to a conjunction of attribute
tests (a rule), and the tree as a whole represents the combination (disjunction) of these
rules.
Appropriate Problems for Decision Trees
The ID3 algorithm
ID3
■ Most algorithms that have been developed for learning decision trees are
variations on a core algorithm that employs a top-down, greedy search
through the space of possible decision trees.
■ ID3 learns decision trees by constructing them top down starting with the
question:
● Which attribute should be tested at the root of the tree?
■ To answer this question, each instance attribute is evaluated using a
statistical test to determine how well it alone classifies the training
examples.
ID3
■ A descendant of the root node is then created for each possible value of
this attribute, and the training examples are sorted to the appropriate
descendant node
● i.e., down the branch corresponding to the example's value for this
attribute.
■ The entire process is then repeated using the training examples associated
with each descendant node to select the best attribute to test at that point
in the tree. This forms a greedy search for an acceptable decision tree, in
which the algorithm never backtracks to reconsider earlier choices.
Let's see it with an example
ID3(Examples, Target_Attribute, Attributes)
  Create a root node Root for the tree
  If all examples are positive, return the single-node tree Root with label = +
  If all examples are negative, return the single-node tree Root with label = -
  If the set of predicting attributes is empty, return the single-node tree Root
    with label = the most common value of the target attribute in the examples
  Otherwise begin
    A ← the attribute that best classifies the examples
    Decision tree attribute for Root = A
    For each possible value vi of A:
      Add a new tree branch below Root, corresponding to the test A = vi
      Let Examples(vi) be the subset of examples that have the value vi for A
      If Examples(vi) is empty:
        Below this new branch add a leaf node with label = the most common target value in the examples
      Else:
        Below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes − {A})
  End
  Return Root
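The pseudocode above can be turned into running code fairly directly. The following Python sketch (our own, not from the slides) implements the same recursion; the attribute-selection function is passed in as a parameter and is defined with information gain in the next slides. For simplicity it branches only on attribute values that actually occur in the examples, so the empty-subset case of the pseudocode does not arise.

from collections import Counter

def most_common(labels):
    # Most frequent target value among the examples.
    return Counter(labels).most_common(1)[0][0]

def id3(examples, target, attributes, choose_attribute):
    # examples: list of dicts mapping attribute names (and the target) to values.
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:            # all examples share one class: leaf
        return labels[0]
    if not attributes:                   # no attributes left: majority-class leaf
        return most_common(labels)
    # The attribute that best classifies the examples (e.g. by information gain).
    best = choose_attribute(examples, target, attributes)
    remaining = [a for a in attributes if a != best]
    node = {best: {}}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        node[best][value] = id3(subset, target, remaining, choose_attribute)
    return node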
ID3
■ The central choice in the ID3 algorithm is selecting which attribute to test
at each node in the tree.
● We want to select the most useful attribute to classify the examples
■ We will define a statistical property, called information gain
● This measures how well a given attribute separates the training
examples according to their target classification
■ ID3 uses this information gain measure to select among the candidate
attributes at each step while growing the tree.
ID3: Selecting the best classifier attribute
■ Information gain is defined in terms of entropy, a measure of the impurity of a collection
of examples. For a collection S in which pi is the proportion of examples belonging to
class i:
Entropy(S) = − Σi pi · log2(pi)
■ The information gain of an attribute A relative to a collection of examples S is the
expected reduction in entropy caused by partitioning the examples according to A:
Gain(S, A) = Entropy(S) − Σ v ∈ Values(A) (|Sv| / |S|) · Entropy(Sv)
■ Where
● Values(A) is the set of all possible values for attribute A
● Sv is the subset of S for which attribute A has value v
■ Note that the first term is just the entropy of the original collection S and the second
term is the expected value of the entropy after S is partitioned using attribute A
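A sketch of these two quantities in Python, using the same list-of-dicts representation as the earlier ID3 sketch (the function names are our own):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = - sum_i p_i * log2(p_i), over the class proportions in S.
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(examples, target, attribute):
    # Gain(S, A) = Entropy(S) - sum over v in Values(A) of |S_v|/|S| * Entropy(S_v)
    labels = [ex[target] for ex in examples]
    total = len(examples)
    expected = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        expected += (len(subset) / total) * entropy(subset)
    return entropy(labels) - expected

def choose_attribute(examples, target, attributes):
    # ID3's choice: the candidate attribute with the highest information gain.
    return max(attributes, key=lambda a: information_gain(examples, target, a))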
Exercise
ID3: Hypothesis Space Search in DTL
ID3: Capabilities and limitations
■ ID3 uses all training examples at each step in the search to make
statistically based decisions regarding how to refine its current hypothesis.
● This contrasts with methods that make decisions incrementally, based
on individual training examples.
● One advantage of using statistical properties of all the examples (e.g.,
information gain) is that the resulting search is much less sensitive to
errors in individual training examples.
● ID3 can be easily extended to handle noisy training data by modifying
its termination criterion to accept hypotheses that imperfectly fit the
training data.
Exercises
The CART algorithm
CART: Gini index
■ The selection criterion in CART is the Gini index instead of the information gain used in
ID3
■ The Gini index measures the impurity of D, a data partition or set of training tuples:
Gini(D) = 1 − Σi pi²
■ Where:
● pi is the probability that a tuple in D belongs to class Ci and is estimated by
pi = |Ci,D| / |D|
CART: Gini index
■ The Gini index considers a binary split for each attribute. When a binary split on
attribute A partitions D into D1 and D2, the Gini index of D given that partitioning is the
weighted sum:
GiniA(D) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2)
■ The reduction in impurity that would be obtained by a binary split on attribute A is:
ΔGini(A) = Gini(D) − GiniA(D)
CART: Gini index
■ The attribute that maximizes the reduction in impurity (or, equivalently, has
the minimum Gini index) is selected as the splitting attribute.
■ This attribute and either its splitting subset (for a discrete-valued splitting
attribute) or split-point (for a continuous-valued splitting attribute)
together form the splitting criterion.
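Before the worked example, here is a minimal Python sketch of these two impurity formulas, assuming the class counts of each partition are known (the function names are our own):

def gini(class_counts):
    # Gini(D) = 1 - sum_i p_i^2, given the class counts of partition D.
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def gini_split(counts_d1, counts_d2):
    # Gini_A(D) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2) for a binary split.
    n1, n2 = sum(counts_d1), sum(counts_d2)
    return (n1 / (n1 + n2)) * gini(counts_d1) + (n2 / (n1 + n2)) * gini(counts_d2)

# The whole-data impurity from the example that follows: 9 "yes" vs 5 "no" tuples.
print(round(gini([9, 5]), 3))   # -> 0.459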
CART: Gini index (Example)
[Training data table: 14 tuples with attributes age, income, student, and credit rating, and
class attribute buys_computer]
■ Let D be the training data shown in the table, where nine tuples belong to the class
buys_computer = yes and the remaining five tuples belong to the class buys_computer = no.
■ A (root) node N is created for the tuples in D.
■ We first use the Gini index to compute the impurity of D:
Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
CART: Gini index (Example)
■ To find the splitting criterion for the tuples in D, we need to compute the
Gini index for each attribute.
■ Let’s start with the attribute income and consider each of the possible
splitting subsets.
■ Consider the subset {low, medium}
● This would result in 10 tuples in partition D1 satisfying the condition
income ∈ {low, medium}
● The remaining four tuples of D would be assigned to partition D2
● The Gini index value computed based on this partitioning would be…
CART: Gini index (Example)
■ The Gini index value computed based on this partitioning would be:
Gini_income ∈ {low, medium}(D) = (10/14) · Gini(D1) + (4/14) · Gini(D2) = 0.443
■ Note that this is also the Gini index value for the subset {high}, since the complementary
subsets define the same binary partition of D.
CART: Gini index (Example)
■ Similarly, the Gini index values for splits on the remaining subsets are 0.458
(for the subsets {low, high} and {medium}) and 0.450 (for the subsets
{medium, high} and {low}).
■ Therefore, the best binary split for attribute income is on {low, medium} (or
{high}) because it minimizes the Gini index.
■ Evaluating age, we obtain {youth, senior} (or {middle aged}) as the best split for age,
with a Gini index of 0.357.
■ The attributes student and credit rating are both binary, with Gini index
values of 0.367 and 0.429, respectively.
CART: Gini index (Example)
■ The attribute age and splitting subset {youth, senior} therefore give the
minimum Gini index overall, with a reduction in impurity of 0.459 − 0.357 =
0.102.
■ This binary split results in the maximum reduction in impurity of the tuples
in D and is returned as the splitting criterion.
■ Node N is labeled with the criterion, two branches are grown from it, and
the tuples are partitioned accordingly.
Tree pruning
CART: Tree pruning
■ When a decision tree is built, many of the branches will reflect anomalies in
the training data due to noise or outliers.
■ Tree pruning methods address this problem of overfitting the data.
■ Such methods typically use statistical measures to remove the least-reliable
branches.
■ Pruned trees tend to be smaller and less complex and, thus, easier to
comprehend.
■ They are usually faster and better at correctly classifying independent test data (i.e.,
previously unseen tuples) than unpruned trees.
CART: Tree pruning
■ CART uses post-pruning with an approach called cost complexity.
■ This approach considers the cost complexity of a tree to be a function of
the number of leaves in the tree and the error rate of the tree.
● Where the error rate is the percentage of tuples misclassified by the
tree.
■ It starts from the bottom of the tree.
■ For each internal node, N, it computes the cost complexity of the subtree at
N, and the cost complexity of the subtree at N if it were to be pruned (i.e.,
replaced by a leaf node).
■ The two values are compared. If pruning the subtree at node N would result
in a smaller cost complexity, then the subtree is pruned. Otherwise, it is
kept.
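As an illustration of this bottom-up comparison, here is a simplified Python sketch (our own, not CART's exact procedure). It reuses classify and most_common from the earlier sketches, measures error as a misclassification count rather than a rate, and uses a fixed complexity penalty alpha; all of these are simplifying assumptions.

def count_leaves(tree):
    # Number of leaf nodes in a nested-dict tree (leaves are plain labels).
    if not isinstance(tree, dict):
        return 1
    (_, branches), = tree.items()
    return sum(count_leaves(subtree) for subtree in branches.values())

def misclassified(tree, examples, target):
    # How many of the given examples the (sub)tree gets wrong.
    return sum(classify(tree, ex) != ex[target] for ex in examples)

def cost_complexity(tree, examples, target, alpha):
    # Cost complexity as a function of the error and the number of leaves.
    return misclassified(tree, examples, target) + alpha * count_leaves(tree)

def prune(tree, examples, target, alpha=1.0):
    # Work bottom-up: prune the children first, then decide about this node.
    if not isinstance(tree, dict) or not examples:
        return tree
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        subset = [ex for ex in examples if ex[attribute] == value]
        branches[value] = prune(subtree, subset, target, alpha)
    # Candidate replacement: a single leaf predicting the majority class here.
    leaf = most_common([ex[target] for ex in examples])
    if cost_complexity(leaf, examples, target, alpha) <= \
            cost_complexity(tree, examples, target, alpha):
        return leaf
    return tree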
CART: Exercises
Random forests
Random forest: Bagging
Suppose that you are a patient and would like to have a diagnosis made based
on your symptoms. Instead of asking one doctor, you may choose to ask several.
If a certain diagnosis occurs more than any other, you may choose this as the
final or best diagnosis. That is, the final diagnosis is made based on a majority
vote, where each doctor gets an equal vote. Now replace each doctor by a
classifier, and you have the basic idea behind bagging. Intuitively, a majority vote
made by a large group of doctors may be more reliable than a majority vote
made by a small group.
Random forest: Bagging
■ Notice that with bagging we are not subsetting the training data into
smaller chunks and training each tree on a different chunk.
■ If we have a sample of size N, we are still feeding each tree a training set of
size N (unless specified otherwise).
■ But instead of the original training data, we take a random sample of size N
with replacement.
● E.g., if our training data was [1, 2, 3, 4, 5, 6] then we might give one of
our trees the following list [1, 2, 2, 3, 6, 6].
● Notice that both lists are of length six and that “2” and “6” are both
repeated in the randomly selected training data we give to our tree
(because we sample with replacement).
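A tiny Python sketch of this sampling step; in bagging, one tree is then trained on each such bootstrap sample and their predictions are combined by majority vote:

import random

def bootstrap_sample(data, rng=random):
    # A sample of the same size as the data, drawn with replacement:
    # some items appear more than once, others not at all.
    return [rng.choice(data) for _ in data]

training_data = [1, 2, 3, 4, 5, 6]
print(bootstrap_sample(training_data))   # e.g. [1, 2, 2, 3, 6, 6]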
Random forest: Feature randomness
■ In addition to bagging, each tree in a random forest considers only a random subset of
the features when choosing the attribute to split on at each node; this further
decorrelates the individual trees, so their combined vote is more robust.
Further reading
Exercises
We have our first machine learning model trained…
Do you have any questions?
[email protected]
Thanks!