
Rubén Sánchez Corcuera

[email protected]

Decision Trees
Decision trees

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree.

2
Introduction

■ Learned trees can be represented as sets of if-then rules to improve human readability
■ These learning methods are among the most popular inductive inference algorithms and have been successfully applied to a broad range of tasks, from learning to diagnose medical cases to learning to assess the credit risk of loan applicants.
● We will see that Random Forest, an ensemble method built on decision trees, gives very good results and is commonly used.

3
Introduction

■ Decision trees classify instances by sorting them down the tree from the
root to some leaf node, which provides the classification of the instance.
■ Each node in the tree specifies a test of some attribute of the instance, and
each branch descending from that node corresponds to one of the possible
values for this attribute.
■ That is, an instance is classified by starting at the root node of the tree, testing the attribute specified by this node, and then moving down the tree branch corresponding to the value of the attribute in the given example.
● This process is repeated for the subtree rooted at the new node.

4
Decision tree for playing tennis

5
Introduction

■ Decision trees are made up of groups of rules that describe how attributes
(or features) of instances relate to each other.
■ Each path from the root of the tree to a leaf is like a set of conditions (rules)
based on the attributes. The entire tree is a combination of these different
sets of conditions.

What is the set of rules of the previous decision tree?

Express it as a logic statement (5 mins)

6
Appropriate Problems for Decision Trees

■ Instances are represented by attribute-value pairs


● Instances are described by a fixed set of attributes and their values (e.g., Temperature: Hot).
● The easiest situation for decision tree learning is when each attribute takes on a small number of possible values (Hot, Mild, Cold)
■ The target function has discrete output values
● E.g., a Boolean classification (true, false) assigned to each example
● Decision tree methods easily extend to learning functions with more than two possible output values

7
Appropriate Problems for Decision Trees

■ Disjunctive descriptions may be required


● Decision trees naturally represent disjunctive expressions
■ The training data may contain errors
● Decision tree methods are robust to errors, both in training samples
and in the attribute values
■ The training data may contain missing attribute values
● Decision tree methods can be used even when some training
examples have unknown values

8
The ID3 algorithm
9
ID3

■ Most algorithms that have been developed for learning decision trees are
variations on a core algorithm that employs a top-down, greedy search
through the space of possible decision trees.
■ ID3 learns decision trees by constructing them top down starting with the
question:
● Which attribute should be tested at the root of the tree?
■ To answer this question, each instance attribute is evaluated using a
statistical test to determine how well it alone classifies the training
examples.

10
ID3

■ A descendant of the root node is then created for each possible value of
this attribute, and the training examples are sorted to the appropriate
descendant node
● i.e., down the branch corresponding to the example's value for this
attribute.
■ The entire process is then repeated using the training examples associated
with each descendant node to select the best attribute to test at that point
in the tree. This forms a greedy search for an acceptable decision tree, in
which the algorithm never backtracks to reconsider earlier choices.
Let's see it with an example

11
ID3(Examples, Target_Attribute, Attributes)
    Create a root node Root for the tree
    If all examples are positive, Return the single-node tree Root, with label = +
    If all examples are negative, Return the single-node tree Root, with label = -
    If the set of predicting attributes is empty, Return the single-node tree Root,
        with label = most common value of the target attribute in the examples
    Otherwise Begin
        A ← the attribute that best classifies the examples
        Decision tree attribute for Root = A
        For each possible value, vi, of A:
            Add a new tree branch below Root, corresponding to the test A = vi
            Let Examples(vi) be the subset of examples that have the value vi for A
            If Examples(vi) is empty
                Then below this new branch add a leaf node with label = most common target value in the examples
                Else below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes – {A})
    End
    Return Root

12
ID3

■ The central choice in the ID3 algorithm is selecting which attribute to test
at each node in the tree.
● We want to select the most useful attribute to classify the examples
■ We will define a statistical property, called information gain
● This measures how well a given attribute separates the training
examples according to their target classification
■ ID3 uses this information gain measure to select among the candidate
attributes at each step while growing the tree.

13
ID3: Selecting the best classifier attribute

■ In order to define information gain precisely, we begin by defining a measure commonly used in information theory → ENTROPY
● Entropy characterizes the (im)purity of an arbitrary collection of examples
■ Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is:

Entropy(S) = -p+ log2 p+ - p- log2 p-

■ Where p+ is the proportion of positive examples and p- is the proportion of negative examples
● If we have 20 examples and 5 are positive, p+ = 5/20
● We define 0 · log2(0) as 0

14
ID3: Selecting the best classifier attribute

■ Entropy is 0 if all members of S belong to the same class


■ Entropy is 1 when the collection contains an equal number of positive and
negative examples
■ The previous formula can only be applied to Boolean classifications. More generally, if the target attribute can take on c different values, then the entropy of S relative to this c-wise classification is:

Entropy(S) = - Σ (i = 1..c) pi log2 pi,  where pi is the proportion of S belonging to class i

■ The logarithm is still base 2 because entropy is a measure of the expected encoding length measured in bits (binary)
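As a quick illustration of this measure, here is a minimal Python sketch (the function name and the list-of-counts interface are just illustrative choices):

import math

def entropy(class_counts):
    # Entropy of a collection, given the number of examples in each class
    total = sum(class_counts)
    h = 0.0
    for count in class_counts:
        if count > 0:              # by the convention above, 0 * log2(0) is 0
            p = count / total
            h -= p * math.log2(p)
    return h

print(entropy([10, 10]))   # 1.0   -> equal number of positive and negative examples
print(entropy([20, 0]))    # 0.0   -> all members belong to the same class
print(entropy([5, 15]))    # ~0.811, the earlier case with p+ = 5/20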

15
ID3: Selecting the best classifier attribute

■ Given entropy as a measure of the impurity in a collection of training examples, we can now define a measure of the effectiveness of an attribute in classifying the training data.
■ The measure we will use, called information gain, is simply the expected
reduction in entropy caused by partitioning the examples according to this
attribute.

16
ID3: Selecting the best classifier attribute

■ The information gain, Gain(S, A), of an attribute A relative to a collection of examples S is defined as:

Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)

■ Where
● Values(A) is the set of all possible values for attribute A
● Sv is the subset of S for which attribute A has value v
■ Note that the first term is just the entropy of the original collection S and
the second term is the expected value of entropy after S is partitioned
using attribute A
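A minimal Python sketch of Gain(S, A), assuming S is given as (value of A, target label) pairs; the representation and function names are just illustrative choices:

import math
from collections import Counter

def entropy(labels):
    # Entropy of a sequence of class labels (0 * log2(0) is taken as 0)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(pairs):
    # Gain(S, A): entropy of S minus the weighted entropy of S partitioned by A
    gain = entropy([label for _, label in pairs])
    for v in {value for value, _ in pairs}:
        subset = [label for value, label in pairs if value == v]
        gain -= len(subset) / len(pairs) * entropy(subset)
    return gain

# An attribute that separates the classes perfectly has maximal gain
print(information_gain([("a", "+"), ("a", "+"), ("b", "-"), ("b", "-")]))  # 1.0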

17
ID3: Selecting the best classifier attribute

■ Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A.
■ Put another way, Gain(S, A) is the information provided about the target
function value, given the value of some other attribute A.

Let’s see it with an example

18
ID3: Selecting the best classifier attribute

■ Suppose S is a collection of training-example days described by attributes including Wind, which can have the values Weak or Strong
■ Assume S is a collection containing 14 examples [9+, 5-]
■ Of these 14 examples, suppose:
● 6 of the positive and 2 of the negative examples have Wind = Weak
● The remainder have Wind = Strong

19
ID3: Selecting the best classifier attribute
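Working through these numbers gives Gain(S, Wind) ≈ 0.048; a small self-contained Python sketch of the calculation (the helper name H is just shorthand for the binary entropy defined earlier):

import math

def H(p_pos, p_neg):
    # Binary entropy; a term with probability 0 contributes 0
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

entropy_S = H(9/14, 5/14)        # ~0.940 for the whole collection [9+, 5-]
entropy_weak = H(6/8, 2/8)       # ~0.811 for Wind = Weak   [6+, 2-]
entropy_strong = H(3/6, 3/6)     #  1.0   for Wind = Strong [3+, 3-]

gain = entropy_S - (8/14) * entropy_weak - (6/14) * entropy_strong
print(round(gain, 3))            # ~0.048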

20
Exercise

■ Let's try calculating it ourselves:


● Suppose S is a collection of training-example days described by
attributes including Wind, which can have the values Weak and Strong
● Assume S is a collection containing 20 examples [16+, 4-]
● Suppose 9 of the positive and 1 of the negative have Wind = Weak and
the remainder Wind = Strong

Calculate the information gain using the previous formula

21
ID3: Hypothesis Space Search in DTL

■ ID3 can be characterized as searching a space of hypotheses for one that fits the training examples provided
● The hypothesis space searched by ID3 is the set of possible decision
trees.
■ ID3 performs a simple-to-complex, hill-climbing search through this
hypothesis space, beginning with the empty tree, then considering
progressively more elaborate hypotheses in search of a decision tree that
correctly classifies the training data.
■ The evaluation function that guides this hill-climbing search is the
information gain measure.

22
ID3: Capabilities and limitations

■ ID3 maintains only a single current hypothesis as it searches through the space of decision trees.
● By determining only a single hypothesis, ID3 loses the capabilities that
follow from explicitly representing all consistent hypotheses.
● For example, it does not have the ability to determine how many
alternative decision trees are consistent with the available training
data, or to pose new instance queries that optimally resolve among
these competing hypotheses

23
ID3: Capabilities and limitations

■ ID3 does not perform backtracking in its search


● Once it selects an attribute to test at a particular level in the tree, it
never backtracks to reconsider this choice.
● It is susceptible to the usual risks of hill-climbing search without
backtracking: converging to locally optimal solutions that are not
globally optimal.
● In the case of ID3, a locally optimal solution corresponds to the
decision tree it selects along the single search path it explores.
● However, this locally optimal solution may be less desirable than trees
that would have been encountered along a different branch of the
search.

24
ID3: Capabilities and limitations

■ ID3 uses all training examples at each step in the search to make
statistically based decisions regarding how to refine its current hypothesis.
● This contrasts with methods that make decisions incrementally, based
on individual training examples.
● One advantage of using statistical properties of all the examples (e.g.,
information gain) is that the resulting search is much less sensitive to
errors in individual training examples.
● ID3 can be easily extended to handle noisy training data by modifying
its termination criterion to accept hypotheses that imperfectly fit the
training data.

25
Exercises

■ Now that we understand how ID3 works, let’s implement it.


■ Follow the pseudocode to implement the algorithm in a Colab notebook.
■ Use the data in 03_a_ID3_dataset.csv to try it.
● Some useful code on how to access external files from colab:
https://colab.research.google.com/notebooks/snippets/accessing_files.
ipynb

26
The CART algorithm
27
CART algorithm

■ CART is the decision tree algorithm implemented in the sklearn library
■ It is a greedy algorithm like ID3 (no backtracking)
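A minimal usage sketch of that implementation (the iris dataset is just a convenient stand-in; criterion="gini" is sklearn's default, and "entropy" switches to an information-gain-style criterion):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(criterion="gini", random_state=0)  # CART-style tree
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out data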

28
CART: Gini index

■ The selection criterion in CART is the Gini index instead of the information
gain used in ID3
■ The Gini index measures the impurity of D, a data partition or set of training tuples:

Gini(D) = 1 − Σ (i = 1..m) pi²

■ Where:
● pi is the probability that a tuple in D belongs to class Ci and is estimated by pi = |Ci,D| / |D|
● The sum is computed over m classes

29
CART: Gini index

■ When considering a binary split, we compute a weighted sum of the impurity of each resulting partition
■ For example, if a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning is:

Gini_A(D) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2)
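A minimal Python sketch of both formulas (function names and the counts-per-class interface are just illustrative choices):

def gini(class_counts):
    # Gini(D) = 1 - sum_i pi^2, from the number of tuples in each class
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def gini_split(counts_d1, counts_d2):
    # Weighted Gini index of a binary split of D into partitions D1 and D2
    n1, n2 = sum(counts_d1), sum(counts_d2)
    return n1 / (n1 + n2) * gini(counts_d1) + n2 / (n1 + n2) * gini(counts_d2)

print(gini([9, 5]))                 # ~0.459
print(gini_split([7, 3], [2, 2]))   # weighted impurity of an illustrative split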

30
CART: Gini index

■ The Gini index considers a binary split for each attribute.


● I.e., we will end with a binary tree
■ For each attribute, each of the possible binary splits is considered
■ For discrete-valued attributes, the subset that gives the minimum Gini
index for that attribute is selected as its splitting subset
■ For continuous-valued attributes, each possible split-point must be
considered. The strategy is similar to that described earlier for information
gain, where the midpoint between each pair of (sorted) adjacent values is
taken as a possible split-point.
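A tiny sketch of those candidate split-points (the attribute values are made up):

values = sorted([48.0, 52.5, 60.0, 61.0])    # observed values of a continuous attribute
split_points = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(split_points)                          # [50.25, 56.25, 60.5]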

32
CART: Gini index

■ The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is:

ΔGini(A) = Gini(D) − Gini_A(D)

■ The attribute that maximizes the reduction in impurity (or, equivalently, has
the minimum Gini index) is selected as the splitting attribute.
■ This attribute and either its splitting subset (for a discrete-valued splitting
attribute) or split-point (for a continuous-valued splitting attribute)
together form the splitting criterion.

33
CART: Gini index (Example)

34
CART: Gini index (Example)

■ Let D be the training data shown in the table, where there are nine tuples belonging to the class buys computer = yes and the remaining five tuples belong to the class buys computer = no.
■ A (root) node N is created for the tuples in D.
■ We first use the Gini index to compute the impurity of D:

Gini(D) = 1 − (9/14)² − (5/14)² = 0.459

35
CART: Gini index (Example)

■ To find the splitting criterion for the tuples in D, we need to compute the
Gini index for each attribute.
■ Let’s start with the attribute income and consider each of the possible
splitting subsets.
■ Consider the subset {low, medium}
● This would result in 10 tuples in partition D1 satisfying the condition
income ∈ {low, medium}
● The remaining four tuples of D would be assigned to partition D2
● The Gini index value computed based on this partitioning would be…

36
CART: Gini index (Example)

■ The Gini index value computed based on this partitioning would be:

Gini income ∈ {low, medium}(D) = (10/14) · Gini(D1) + (4/14) · Gini(D2) = 0.443

37
CART: Gini index (Example)

■ Similarly, the Gini index values for splits on the remaining subsets are 0.458
(for the subsets {low, high} and {medium}) and 0.450 (for the subsets
{medium, high} and {low}).
■ Therefore, the best binary split for attribute income is on {low, medium} (or
{high}) because it minimizes the Gini index.
■ Evaluating age, we obtain {youth, senior} (or {middle aged}) as the best split for age with a Gini index of 0.357.
■ The attributes student and credit rating are both binary, with Gini index
values of 0.367 and 0.429, respectively.

38
CART: Gini index (Example)

■ The attribute age and splitting subset {youth, senior} therefore give the minimum Gini index overall, with a reduction in impurity of 0.459 − 0.357 = 0.102.
■ This binary split results in the maximum reduction in impurity of the tuples
in D and is returned as the splitting criterion.
■ Node N is labeled with the criterion, two branches are grown from it, and
the tuples are partitioned accordingly.

39
Tree pruning
40
CART: Tree pruning

■ When a decision tree is built, many of the branches will reflect anomalies in
the training data due to noise or outliers.
■ Tree pruning methods address this problem of overfitting the data.
■ Such methods typically use statistical measures to remove the least-reliable
branches.
■ Pruned trees tend to be smaller and less complex and, thus, easier to
comprehend.
■ They are usually faster and better at correctly classifying independent test
data (i.e., of previously unseen tuples) than unpruned trees.

41
CART: Tree pruning

■ There are two common approaches to tree pruning: pre-pruning and post-pruning.
■ In the pre-pruning approach, a tree is “pruned” by halting its construction
early
● If partitioning the tuples at a node would result in a split that falls
below a prespecified threshold, then further partitioning of the given
subset is halted.
■ Upon halting, the node becomes a leaf.
■ The leaf may hold the most frequent class among the subset tuples or the
probability distribution of those tuples.
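In sklearn's trees, this kind of early stopping is exposed through threshold parameters such as max_depth, min_samples_split and min_impurity_decrease; a hedged sketch with arbitrary threshold values:

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning style thresholds: stop growing deep, don't split small nodes,
# and require a minimum impurity decrease before accepting a split
pre_pruned = DecisionTreeClassifier(
    max_depth=4,
    min_samples_split=20,
    min_impurity_decrease=0.01,
)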

42
CART: Tree pruning

■ Post-pruning removes subtrees from a “fully grown” tree.


■ A subtree at a given node is pruned by removing its branches and replacing
it with a leaf.
■ The leaf is labeled with the most frequent class among the subtree being
replaced.

43
44
CART: Tree pruning
■ CART uses post-pruning with an approach called cost complexity.
■ This approach considers the cost complexity of a tree to be a function of
the number of leaves in the tree and the error rate of the tree.
● Where the error rate is the percentage of tuples misclassified by the
tree.
■ It starts from the bottom of the tree.
■ For each internal node, N, it computes the cost complexity of the subtree at
N, and the cost complexity of the subtree at N if it were to be pruned (i.e.,
replaced by a leaf node).
■ The two values are compared. If pruning the subtree at node N would result
in a smaller cost complexity, then the subtree is pruned. Otherwise, it is
kept.
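sklearn exposes minimal cost-complexity pruning through the ccp_alpha parameter and cost_complexity_pruning_path; a sketch of the usual pattern (the dataset and the held-out split standing in for a pruning set are just illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_prune, y_train, y_prune = train_test_split(X, y, random_state=0)

# Candidate alpha values along the pruning path of the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Fit one progressively more pruned tree per alpha and keep the one that
# does best on data held out from training (playing the role of a pruning set)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_prune, y_prune),
)
print(best.get_n_leaves(), best.score(X_prune, y_prune))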
45
CART: Tree pruning

■ A pruning set of class-labeled tuples is used to estimate cost complexity.


■ This set is independent of the training set used to build the unpruned tree
and of any test set used for accuracy estimation.
■ The algorithm generates a set of progressively pruned trees.
■ In general, the smallest decision tree that minimizes the cost complexity is
preferred.

46
CART: Exercises

■ Open the 03_b_CART Colab notebook to do the exercises.

47
Random forests
48
Random forest

■ A Random Forest model is composed of a large number of individual decision trees that operate as an ensemble.
■ An ensemble for classification is a composite model, made up of a
combination of classifiers.
■ The individual classifiers vote, and a class label prediction is returned by the
ensemble based on the collection of votes.
■ Ensembles tend to be more accurate than their component classifiers.
● Wisdom of crowds
■ This is one of my go-to algorithms for the first tests with a new dataset.

49
Random forest

50
Random forest

■ The low correlation between models is the key.


● Similar to how, in investing, a portfolio of low-correlation stocks is a better idea than any of its parts on their own.
■ The ensemble reduces the errors that arise from individual trees (as long as they don't all fail)
● While some trees may be wrong, many other trees will be right, so as a
group the trees are able to move in the correct direction.

51
Random forest

52
Random forest

■ The prerequisites for random forest to perform well are:


● There needs to be some actual signal in our features so that models
built using those features do better than random guessing.
● The predictions (and therefore the errors) made by the individual trees
need to have low correlations with each other.
■ How do we ensure that the behavior of each individual tree is not too
correlated with the behavior of any of the other trees in the model?
● Bagging and feature randomness

53
Random forest: Bagging

Suppose that you are a patient and would like to have a diagnosis made based
on your symptoms. Instead of asking one doctor, you may choose to ask several.
If a certain diagnosis occurs more than any other, you may choose this as the
final or best diagnosis. That is, the final diagnosis is made based on a majority
vote, where each doctor gets an equal vote. Now replace each doctor by a
classifier, and you have the basic idea behind bagging. Intuitively, a majority vote
made by a large group of doctors may be more reliable than a majority vote
made by a small group.

54
Random forest: Bagging

■ Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, … , k), a training set, Di, of d tuples is sampled with replacement from the original set of tuples, D.
● Note that the term bagging stands for bootstrap aggregation.
■ Because sampling with replacement is used, some of the original tuples of
D may not be included in Di , whereas others may occur more than once.
■ A classifier model, Mi , is learned for each training set, Di.
■ Random forest takes advantage of this by allowing each individual tree to
randomly sample from the dataset with replacement, resulting in different
trees.

55
Random forest: Bagging

■ Notice that with bagging we are not subsetting the training data into
smaller chunks and training each tree on a different chunk.
■ If we have a sample of size N, we are still feeding each tree a training set of
size N (unless specified otherwise).
■ But instead of the original training data, we take a random sample of size N
with replacement.
● E.g., if our training data was [1, 2, 3, 4, 5, 6] then we might give one of
our trees the following list [1, 2, 2, 3, 6, 6].
● Notice that both lists are of length six and that “2” and “6” are both
repeated in the randomly selected training data we give to our tree
(because we sample with replacement).
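A minimal numpy sketch of drawing one such bootstrap sample (the toy data mirrors the list above; the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(seed=0)
training_data = np.array([1, 2, 3, 4, 5, 6])

# Sample N items with replacement: same length as the original,
# with some values repeated and others left out
bootstrap_sample = rng.choice(training_data, size=len(training_data), replace=True)
print(bootstrap_sample)   # e.g. something like [1 2 2 3 6 6]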

56
Random forest: Feature randomness

■ In a decision tree, when building a node, we choose the feature that provides the best result according to the metric used.
■ In contrast, each tree in a random forest can pick only from a random
subset of features. This forces even more variation amongst the trees in the
model and ultimately results in lower correlation across trees and more
diversification.
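In sklearn's RandomForestClassifier, this random feature subset is controlled by max_features; a minimal sketch (the wine dataset and parameter values are just illustrative choices):

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# n_estimators = number of trees in the ensemble; max_features = size of the
# random feature subset each tree may consider at every split
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())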

57
Further reading

■ Chapter 3 in [Mitchell, 1997]


■ Sections 8.2 and 8.6 in [Han and Kamber, 2006]
Extra material
■ Decision trees in sklearn:
https://scikit-learn.org/stable/modules/tree.html
■ Random Forests in sklearn:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble
.RandomForestClassifier.html

58
Exercises

■ Use Random Forest to process the Iris and Wine datasets.


■ Use a Random Forest to classify the CSV file of the previous
exercise.

59
We have our first machine learning model trained…

but how do we know how well it is working?

60
Do you have any questions?
[email protected]

Thanks!
61
