
Classification and regression trees

Pierre Geurts & Louis Wehenkel

Institut Montefiore, University of Liège, Belgium

ELEN062-1
Introduction to Machine Learning
September 2022

Acknowledgment: These slides have been reformatted by Yann Claes in July 2020.
1 / 66
Outline

1 Supervised learning

2 Principles of decision trees

3 Extensions

4 Regression trees

5 By-products

6 Conclusions, research and further reading

2 / 66
Outline

1 Supervised learning

2 Principles of decision trees

3 Extensions

4 Regression trees

5 By-products

6 Conclusions, research and further reading

3 / 66
Database/Dataset/Sample

A database/dataset/sample is a collection of objects/observations


(rows) described by attributes/features/variables (columns).

4 / 66
Supervised learning
[Figure: a database with input columns A1, . . . , A1000 and an output column; automatic learning produces a model Ŷ = f(A1, . . . , A1000).]

Goal: from the database, find a function f(·) of the inputs that best
approximates the output.

2 cases:
- Discrete output → classification problem
- Continuous output → regression problem
5 / 66
Application (i)

I Predict whether or not a bank client will be a good or a bad


debtor.

I Image classification:
- Face recognition
- Handwritten character recognition

6 / 66
Application (ii)

I Classification of cancer types from gene expression profiles

No. patient Gene 1 Gene 2 ... Gene 7129 Leukemia


1 -134 28 ... 123 AML
2 -123 0 ... 17 AML
3 56 -123 ... -23 ALL
... ... ... ... ... ...
72 89 -123 ... 12 ALL
Source: [1]

7 / 66
Learning algorithm
A learning algorithm receives as input a learning sample and returns a
function h(·).

It is defined by:
- A hypothesis space H, which is a set of candidate models.
- A quality measure for a model.
- An optimisation strategy.

The result is a model h(·) ∈ H obtained by automatic learning.

8 / 66
Outline

1 Supervised learning

2 Principles of decision trees


Tree representation
Tree learning

3 Extensions

4 Regression trees

5 By-products

6 Conclusions, research and further reading

9 / 66
Classification trees (aka Decision trees)

A supervised learning algorithm that can handle:


- Classification problems, which can be binary or multi-valued.
- Discrete (binary or multi-valued) or continuous attributes.

Classification trees were invented several times:


I By statisticians: e.g. CART (Breiman et al.)
I By the AI community: e.g. ID3, C4.5 (Quinlan et al.)

10 / 66
Hypothesis space
A decision tree is a tree where:
- Each interior node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf is labelled with a class

A decision tree allowing one to predict whether a customer will buy a given
kind of computer. Source: [2]

11 / 66
Illustration - Should I play tennis? (i)

Adapted from [3]. (NB: first column is ’dummy’; output Y ≡ ’Play Tennis’)
12 / 66
Illustration - Should I play tennis? (ii)

Source: [3]

13 / 66
Tree learning
The tree learning problem consists in choosing the tree structure and
determining the predictions at the leaf nodes.

For each leaf, the prediction (or label) is chosen so as to minimize the
misclassification error on the part of the LS reaching that leaf, i.e. the
label is the majority class in that part of the LS.
[Figure: an example tree with class counts attached to its nodes: 25 yes / 40 no, 15 yes / 10 no, 14 yes / 2 no.]

14 / 66
How to generate trees? (i)

What properties do we want a decision tree to have?

I It should be consistent with the learning sample (for the moment):

- Trivial algorithm: construct a decision tree that has one path to a


leaf for each example.
Problem: it does not capture useful information from the database.

15 / 66
How to generate trees (ii)

What properties do we want decision trees to have?

I It should at the same time be as simple as possible.


- Trivial algorithm: generate all trees and pick the simplest one that
is consistent with the learning sample.
Problem: there are too many trees.

16 / 66
Top-down induction of decision trees (i)

Idea: Choose the best attribute, split the learning sample accordingly
and proceed recursively until each object is correctly classified.

17 / 66
Top-down induction of decision trees (ii)

Algorithm 1: learn_dt(LS)
if all objects from LS have the same class, or all objects have the same
values for every attribute then
    Create a leaf labelled with the majority class of the objects of LS;
else
    Use LS to find the best splitting attribute A∗;
    Create a test node for that attribute;
    forall different values a of A∗ do
        Build LSa = {o ∈ LS | A∗(o) = a};
        Use learn_dt(LSa) to grow a subtree from LSa;
    end forall
end if
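A minimal Python sketch of this recursive procedure, assuming the learning sample is a list of (attribute-dict, class) pairs and that a `best_split` function implementing the score measure of the following slides is supplied; the names `learn_dt` and `majority_class` simply mirror the pseudocode above.

```python
from collections import Counter

def majority_class(ls):
    """Most frequent class among the (features, label) pairs in ls."""
    return Counter(label for _, label in ls).most_common(1)[0][0]

def learn_dt(ls, attributes, best_split):
    """Top-down induction of a decision tree from the learning sample ls."""
    labels = {label for _, label in ls}
    same_attributes = all(features == ls[0][0] for features, _ in ls)
    if len(labels) == 1 or same_attributes or not attributes:
        # Stopping condition: create a leaf labelled with the majority class.
        return {"leaf": True, "label": majority_class(ls)}
    a_star = best_split(ls, attributes)          # best splitting attribute A*
    children = {}
    for a in {features[a_star] for features, _ in ls}:
        ls_a = [(f, y) for f, y in ls if f[a_star] == a]
        remaining = [att for att in attributes if att != a_star]
        children[a] = learn_dt(ls_a, remaining, best_split)
    return {"leaf": False, "attribute": a_star, "children": children}
```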

18 / 66
Properties of the top-down approach

I Hill-climbing algorithm in the space of possible decision trees:


- It adds a sub-tree to the current tree and continues its search
- It does not backtrack

I Highly dependent upon the criterion for selecting attributes to test


(what we called above the “best splitting attribute A∗ ”)

I Sub-optimal (heuristic) but very fast

19 / 66
Which attribute is the best splitting attribute?

A1 = ? [29+, 35−]: T → [21+, 5−], F → [8+, 30−]
A2 = ? [29+, 35−]: T → [18+, 33−], F → [11+, 2−]

We want a small tree. Therefore, we should maximize the class
separation at each step, i.e. make successors as pure as possible.
⇒ This favours short paths in the tree.

20 / 66
Impurity measures

Let:
- LS be a sample of objects
- pj (∀j = 1, . . . , J) the proportion of objects in the LS belonging
to output-class j.

An impurity measure I(LS) should satisfy the following properties:


I I(LS) is minimum only when ∃i such that pi = 1 and pj = 0 for all j ≠ i
(pure sample);
I I(LS) is maximum only when ∀j: pj = 1/J (uniform distribution of the
objects among the classes);
I I(LS) is a symmetric function of its arguments p1, . . . , pJ.

21 / 66
Shannon entropy as an impurity measure (i)

Definition of Shannon entropy:


    H(LS) ≜ − ∑_{j=1}^{J} p_j log2 p_j ≜ I_Sh(LS)

If there are only two classes we have (since p2 = 1 − p1 )

H(LS) = −p1 log2 p1 − (1 − p1 ) log2 (1 − p1 ).

Notice that I_Sh satisfies the above properties (see graphic).

(NB: Shannon entropy is the basis of information theory.)
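As a sketch, the definition translates directly into Python; the class proportions p_j are computed from the labels of the sample.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """I_Sh(LS) = -sum_j p_j log2 p_j over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# For the [29+, 35-] sample used later in the slides:
print(round(shannon_entropy(["+"] * 29 + ["-"] * 35), 2))   # 0.99
```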


22 / 66
Other examples of impurity measures (ii)

I Gini index:

    I_Gi(LS) ≜ ∑_{j=1}^{J} p_j (1 − p_j)

I (Misclassification) error rate:

    I_ER(LS) ≜ 1 − max_j p_j

[Figure: two-class case; respectively the Shannon entropy, the Gini index
and the error rate, normalized between 0 and 1.]
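The other two measures follow the same pattern as the entropy sketch above (again a sketch, reusing Counter):

```python
from collections import Counter

def gini_index(labels):
    """I_Gi(LS) = sum_j p_j * (1 - p_j)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def error_rate(labels):
    """I_ER(LS) = 1 - max_j p_j."""
    return 1 - max(Counter(labels).values()) / len(labels)
```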

23 / 66
Reduction of impurity achieved by a split

For a given impurity measure, the best splitting attribute is the one
which maximizes the expected reduction of impurity defined by
    ΔI(LS, A) = I(LS) − ∑_{a ∈ A(LS)} (|LS_a| / |LS|) · I(LS_a),

where LSa is the subset of objects o from LS such that A(o) = a, and
where A(LS) is the set of different values of A observed in LS.

∆I is also called a score measure or a splitting criterion.

NB: There are other ways to define a splitting criterion that do not rely on an
impurity measure.

NB: The reduction of Shannon entropy is called the information gain.
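A sketch of the score measure, parameterized by any of the impurity functions above (with `shannon_entropy` it computes the information gain); the learning sample is again assumed to be a list of (attribute-dict, class) pairs.

```python
def impurity_reduction(ls, attribute, impurity):
    """Delta I(LS, A) = I(LS) - sum_a (|LS_a| / |LS|) * I(LS_a)."""
    labels = [y for _, y in ls]
    score = impurity(labels)
    n = len(ls)
    for a in {features[attribute] for features, _ in ls}:
        subset = [y for features, y in ls if features[attribute] == a]
        score -= len(subset) / n * impurity(subset)
    return score
```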

24 / 66
Illustration (with Shannon entropy as impurity measure)

Which attribute is the best one to split?

A1 = ? [29+, 35−] (I_Sh = 0.99): T → [21+, 5−] (I_Sh = 0.71), F → [8+, 30−] (I_Sh = 0.75)
A2 = ? [29+, 35−] (I_Sh = 0.99): T → [18+, 33−] (I_Sh = 0.94), F → [11+, 2−] (I_Sh = 0.62)

→ ΔI_Sh(LS, A1) = 0.99 − (26/64) × 0.71 − (38/64) × 0.75 = 0.25
→ ΔI_Sh(LS, A2) = 0.99 − (51/64) × 0.94 − (13/64) × 0.62 = 0.12
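These figures can be re-derived with the entropy sketch from above; the small differences with the values on the slide come from rounding the intermediate entropies.

```python
h = shannon_entropy                      # from the earlier sketch
parent = ["+"] * 29 + ["-"] * 35         # I_Sh ~ 0.99

gain_a1 = h(parent) - 26/64 * h(["+"]*21 + ["-"]*5) - 38/64 * h(["+"]*8 + ["-"]*30)
gain_a2 = h(parent) - 51/64 * h(["+"]*18 + ["-"]*33) - 13/64 * h(["+"]*11 + ["-"]*2)
print(round(gain_a1, 2), round(gain_a2, 2))   # ~0.27 and ~0.12 -> A1 is the better split
```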

25 / 66
Application to the tennis problem

Which attribute should be tested here?


- ΔI_Sh(LS, Temp.) = 0.970 − (3/5) × 0.918 − (1/5) × 0.0 − (1/5) × 0.0 = 0.419
- ΔI_Sh(LS, Hum.) = 0.970 − (3/5) × 0.0 − (2/5) × 0.0 = 0.970
- ΔI_Sh(LS, Wind) = 0.970 − (2/5) × 1.0 − (3/5) × 0.918 = 0.019
The best attribute is thus humidity.
26 / 66
Overfitting (i)

For now, trees are perfectly consistent with the learning sample.
However, often, we would like them to be good at predicting classes of
unseen data from the same distribution, which is called generalization.

A tree T overfits the learning sample if and only if ∃T′ such that:


1. Error_LS(T) < Error_LS(T′)
2. Error_unseen(T) > Error_unseen(T′)

27 / 66
Overfitting (ii)

In practice, Error_unseen(T) is estimated from a separate test sample.


28 / 66
Why do trees overfit the learning sample? (i)
Data is noisy or attributes do not completely predict the outcome.

29 / 66
Reasons for overfitting (ii)
Data is incomplete, i.e. not all cases are covered.

We do not have enough data in some part of the learning sample to


make a good decision.
30 / 66
How to avoid overfitting?

I Pre-pruning: stop growing the tree earlier, before it reaches the


point where it perfectly classifies the learning sample.

I Post-pruning: allow the tree to overfit, then post-prune it.

I Ensemble methods: these will be covered later in the course.

31 / 66
Pre-pruning

Idea: stop splitting a node if either:


a. the local sample size is < Nmin
b. the local sample impurity < Ith
c. the impurity reduction ∆I of the best test is not large enough,
according to some statistical hypothesis test at level α.

Caveats:
- for criteria [a,b,c] suitable values of the meta-parameters, Nmin ,
Ith , α, are often problem dependent;
- criterion [c] may recommend stopping the splitting too early.

32 / 66
Post-pruning (i)

Idea: split the learning sample into two sets:


- a growing sample GS to build the tree
- a validation sample V S to evaluate its generalization error
Then, build a complete tree from GS.

Next, compute a sequence of trees {T1 , T2 , . . .} where:


- T1 is the complete tree
- Ti is obtained by removing some test nodes from Ti−1
Finally, select the tree Ti∗ from the sequence that minimizes the error
on the validation sample V S.

33 / 66
Post-pruning (ii)

34 / 66
Post-pruning (iii)

How to build the sequence of trees?

I Reduced error pruning: at each step, remove the node that most
decreases the error on the validation sample.

I Cost-complexity pruning: define a cost-complexity criterion

Error_GS(T) + β · Complexity(T)

and build the sequence of trees that minimizes this criterion for
increasing values of β (see the sketch below).
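In scikit-learn, cost-complexity pruning is exposed through the `ccp_alpha` parameter, which plays the role of β; a sketch on synthetic placeholder data, with GS and VS standing for the growing and validation samples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a learning sample, split into growing (GS) and validation (VS) parts.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_gs, X_vs, y_gs, y_vs = train_test_split(X, y, test_size=0.3, random_state=0)

# Values of beta (ccp_alpha) at which the optimal pruned subtree changes.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_gs, y_gs)

# The corresponding sequence of trees T1, T2, ... (from the full tree down to the root).
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_gs, y_gs)
         for alpha in path.ccp_alphas]

# Select the tree with the lowest error (highest accuracy) on the validation sample.
best = max(trees, key=lambda t: t.score(X_vs, y_vs))
```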

35 / 66
Post-pruning (iv)

[Figure: a pruning sequence; the candidate subtrees have (Error_GS, Error_VS)
of (13%, 15%), (0%, 10%), (6%, 8%), (27%, 25%) and (33%, 35%); the subtree
with the lowest Error_VS (8%) would be selected.]

36 / 66
Post-pruning (v)

Problem: it requires dedicating a part of the learning sample to a


validation set, which may be a problem in the case of a small database.

⇒ Solution: K-fold cross-validation


- Split the training set into K parts, often 10
- Generate K trees, each one leaving out one part among K
- Make a prediction for each learning object with the single tree built
without this object
- Estimate the error of this prediction

K-fold cross-validation may be combined with pruning.
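A sketch of this estimate with scikit-learn, on synthetic placeholder data: `cross_val_predict` predicts each object with the one model, out of K = 10, that was trained without it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)  # placeholder data

predictions = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=10)
estimated_error = (predictions != y).mean()   # 10-fold estimate of the generalization error
print(estimated_error)
```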

37 / 66
How to use decision trees?

I Large data sets (ideal case):


- Split the data set into three parts: GS, V S, T S
- Grow a tree from GS
- Post-prune it from V S
- Test it on T S

I Small data sets (often):


- Grow a tree from the whole database
- Pre-prune it with default parameters (risky) or post-prune it by
10-fold cross-validation (costly)
- Estimate its accuracy by 10-fold cross-validation

38 / 66
Outline

1 Supervised learning

2 Principles of decision trees

3 Extensions
Continuous attributes
Attributes with many values
Missing values

4 Regression trees

5 By-products

6 Conclusions, research and further reading

39 / 66
Continuous attributes (i)

Example: temperature as a number instead of a discrete value.

There are two solutions:


- Pre-discretize: cold if the temperature is below 70, mild if
between 70 and 75, hot if above 75.
- Discretize during tree growing:

[Figure: a test node on Temperature with two branches, ≤ 65.4 → no and > 65.4 → yes.]

How to find the cut-point?

40 / 66
Continuous attributes (ii)

Unsorted sample (Temp., Play): 80 No, 85 No, 83 Yes, 75 Yes, 68 Yes, 65 No,
64 Yes, 72 No, 75 Yes, 70 Yes, 69 Yes, 72 Yes, 81 Yes, 71 No

Sorted by Temp.: 64 Yes, 65 No, 68 Yes, 69 Yes, 70 Yes, 71 No, 72 No,
72 Yes, 75 Yes, 75 Yes, 80 No, 81 Yes, 83 Yes, 85 No

Candidate cut-points (midpoints between consecutive distinct values) and
their impurity reductions:

Temp. < 64.5: ΔI = 0.048
Temp. < 66.5: ΔI = 0.010
Temp. < 68.5: ΔI = 0.000
Temp. < 69.5: ΔI = 0.015
Temp. < 70.5: ΔI = 0.045
Temp. < 71.5: ΔI = 0.001
Temp. < 73.5: ΔI = 0.001
Temp. < 77.5: ΔI = 0.025
Temp. < 80.5: ΔI = 0.000
Temp. < 82.0: ΔI = 0.010
Temp. < 84.0: ΔI = 0.113
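A sketch of the threshold search illustrated above: sort the values, evaluate the impurity reduction at each midpoint between consecutive distinct values, and keep the best cut-point. With `shannon_entropy` from the earlier snippet, the temperature data above would select 84.0, the threshold with ΔI = 0.113.

```python
def best_cut_point(values, labels, impurity):
    """Return (threshold, delta_I) maximizing the impurity reduction for a
    numerical attribute, scanning midpoints between consecutive distinct values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = impurity([label for _, label in pairs])
    best_threshold, best_gain = None, -1.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no cut between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [label for value, label in pairs if value < threshold]
        right = [label for value, label in pairs if value >= threshold]
        gain = parent - len(left) / n * impurity(left) - len(right) / n * impurity(right)
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain
```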

41 / 66
Continuous attributes (iii)

42 / 66
Continuous attributes (iv)

43 / 66
Attributes with many values (i)

[Figure: a multi-way test on the attribute Letter, with one branch per value a, b, c, ..., z.]

Problems:
- Such splits are not good: they fragment the data too quickly, leaving
insufficient data at the next level.
- The reduction of impurity of such tests is often high (e.g. splitting
on the object ID).

There are two solutions:


- Change the splitting criterion to penalize attributes with many
values.
- Consider only binary splits (preferable)
44 / 66
Attributes with many values (ii)

Modified splitting criteria:


- GainRatio(LS, A) = ΔI_Sh(LS, A) / SplitInformation(LS, A)
- SplitInformation(LS, A) = − ∑_a (|LS_a| / |LS|) log2 (|LS_a| / |LS|)

The split information is high when there are many values.

Example: outlook in the tennis problem.


- ΔI_Sh(LS, outlook) = 0.246
- SplitInformation(LS, outlook) = 1.577
- GainRatio(LS, outlook) = 0.246 / 1.577 = 0.156 < 0.246
Problem: the gain ratio favours unbalanced tests.
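A sketch of the modified criterion, reusing `impurity_reduction` and `shannon_entropy` from the earlier snippets.

```python
import math
from collections import Counter

def split_information(ls, attribute):
    """-sum_a (|LS_a| / |LS|) * log2(|LS_a| / |LS|)."""
    n = len(ls)
    counts = Counter(features[attribute] for features, _ in ls)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(ls, attribute):
    return impurity_reduction(ls, attribute, shannon_entropy) / split_information(ls, attribute)
```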

45 / 66
Attributes with many values (iii)

Source: [3]

46 / 66
Attributes with many values (iv)

Allow binary tests only:

[Figure: a binary test on Letter: {a, d, o, m, t} versus all other letters.]

There are 2^(N−1) − 1 non-trivial binary partitions for N values. If N is


small, we can use enumeration.
However, if N is large, a heuristic is needed.
Example: Greedy approach

47 / 66
Missing attribute values

Not all attribute values are known for every object during learning or
testing.

Day Outlook Temperature Humidity Wind Play


D15 Sunny Hot ? Strong No

There are three strategies:


- Assign the most common value in the learning sample (see the sketch below)
- Assign the most common value among the cases reaching the node in the tree
- Assign a probability to each possible value
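A sketch of the first strategy with scikit-learn's SimpleImputer; the humidity column below is a hypothetical stand-in for the table above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical humidity column with a missing entry, as for day D15.
humidity = pd.DataFrame({"Humidity": ["High", "High", "Normal", np.nan]})

# Strategy 1: replace the missing value with the most common value in the learning sample.
imputer = SimpleImputer(strategy="most_frequent")
print(imputer.fit_transform(humidity))   # the missing entry becomes "High"
```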

48 / 66
Outline

1 Supervised learning

2 Principles of decision trees

3 Extensions

4 Regression trees

5 By-products

6 Conclusions, research and further reading

49 / 66
Regression trees (i)

A regression tree is exactly the same model as a decision tree but with
a number in each leaf instead of a class.

50 / 66
Regression trees (ii)

A regression tree is a piecewise constant function of the input


attributes.

[Figure: a regression tree with tests X1 ≤ t1, X2 ≤ t2, X2 ≤ t3, X2 ≤ t4 and
leaf predictions r1, ..., r5, next to the corresponding partition of the
(X1, X2) input space into five rectangular regions.]

51 / 66
Regression tree growing

To minimize the squared error on the learning sample, the prediction at


a leaf is the average output of the learning cases reaching that leaf.

The impurity of a sample is defined by the variance of the output in


that sample:
    I(LS) = var_{y|LS}{y} = E_{y|LS}{ (y − E_{y|LS}{y})² }

where E_{y|LS}{f(y)} denotes the average of f(y) in the sample LS.

The best split is the one that reduces the most variance:
    ΔI(LS, A) = var_{y|LS}{y} − ∑_a (|LS_a| / |LS|) · var_{y|LS_a}{y}
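A sketch of these two formulas, mirroring the classification score above; the learning sample is a list of (attribute-dict, output) pairs with numerical outputs.

```python
def variance(outputs):
    """Impurity of a sample for regression: the variance of the output."""
    mean = sum(outputs) / len(outputs)
    return sum((y - mean) ** 2 for y in outputs) / len(outputs)

def variance_reduction(ls, attribute):
    """Delta I(LS, A) with the variance as impurity measure."""
    outputs = [y for _, y in ls]
    score, n = variance(outputs), len(ls)
    for a in {features[attribute] for features, _ in ls}:
        subset = [y for features, y in ls if features[attribute] == a]
        score -= len(subset) / n * variance(subset)
    return score
```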

52 / 66
Regression tree pruning

The methods are exactly the same: pre-pruning and post-pruning.

In post-pruning, the tree that minimizes the squared error on V S is


selected.

In practice, pruning is more important in regression problems because


full trees are much more complex: often, all objects have a different
output value and hence the full tree has as many leaves as objects in
the learning sample.

53 / 66
Outline

1 Supervised learning

2 Principles of decision trees

3 Extensions

4 Regression trees

5 By-products
Interpretability
Variable selection
Variable importance

6 Conclusions, research and further reading

54 / 66
Interpretability (i)

With decision trees,


interpretability is obvious.

Source: [3]
With neural networks, interpretability is more complex.

[Figure: the tennis decision tree next to a neural network built on the
attributes Outlook, Humidity, Temperature and Wind, with outputs
Play / Don't play.]
55 / 66
Interpretability (ii)

Source: [3]

A tree may be converted into a set of rules:


I If (Outlook = Sunny) and (Humidity = High) then
PlayTennis = No
I If (Outlook = Sunny) and (Humidity = Normal) then
PlayTennis = Yes
I If Outlook = Overcast then PlayTennis = Yes
I ...
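With scikit-learn, a fitted tree can be printed as nested rules via `export_text`; a sketch on the standard iris dataset rather than the tennis data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Prints the tree as indented "if attribute <= threshold" rules.
print(export_text(tree, feature_names=list(iris.feature_names)))
```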
56 / 66
Attribute selection

If some attributes are not useful for classification, they will not be
selected in the pruned tree.

Attribute selection is of practical importance if measuring the value of


an attribute is costly.
Example: Medical diagnosis

Decision trees are often used as a pre-processing step for other learning
algorithms that suffer more when there are irrelevant variables.

57 / 66
Variable importance

In many applications, all variables do not contribute equally in


predicting the output.

We can evaluate variable importance by computing the total reduction


of impurity brought by each variable:
    I(A) = ∑_{nodes where A is tested} |LS_node| · ΔI(LS_node, A)

[Figure: bar chart of the importances of the attributes Temperature, Wind,
Humidity and Outlook in the tennis problem.]
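scikit-learn exposes this quantity (normalized so that the importances sum to one) as `feature_importances_`; a sketch on the iris dataset.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Total (normalized) impurity reduction brought by each variable.
for name, importance in sorted(zip(iris.feature_names, tree.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.3f}")
```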

58 / 66
Outline

1 Supervised learning

2 Principles of decision trees

3 Extensions

4 Regression trees

5 By-products

6 Conclusions, research and further reading

59 / 66
Decision trees - Conclusions

Advantages:
- They are very fast: they can handle very large datasets with many
attributes.
→ Complexity: O(nN log N ) where n is the number of attributes
and N the number of samples.
- They are flexible: they can handle several attribute types,
classification and regression problems, missing values, . . .
- They have a good interpretability: they provide rules and attribute
importance.

Disadvantages:
- They are quite unstable, due to their high variance.
- They are not always competitive with other algorithms in terms of
accuracy.

60 / 66
Research

I Cost and un-balanced learning sample.


I Oblique trees (tests of the form ∑_i α_i A_i < a_th).
I Using predictive models in leaves, e.g. using linear regression.
I Induction graphs.
I Fuzzy decision trees (from a crisp partition to a fuzzy partition of
the learning sample).

61 / 66
Further reading

I Hastie et al., The elements of statistical learning: data mining,


inference, and prediction, [4]:
- chapter 9, Section 9.2
I Gilles Louppe, Understanding random forests: from theory to
practice [5].
I L. Breiman et al., Classification and regression trees, [6]
I J.R. Quinlan, C4.5: programs for machine learning, [7]
I D. Zighed and R. Rakotomalala, Graphes d’induction: apprentissage
et data mining, [8]
I Supplementary slides are also available on the course website.

62 / 66
Software

I scikit-learn: http://scikit-learn.org
I Weka: https://www.cs.waikato.ac.nz/ml/weka/
I R: packages tree and rpart

63 / 66
References I

Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard, Michelle


Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh, James R Downing, Mark A
Caligiuri, et al.
Molecular classification of cancer: class discovery and class prediction by gene
expression monitoring.
Science, 286(5439):531–537, 1999.
Decision tree and entropy algorithm.
https://zhengtianyu.wordpress.com/2013/12/13/
decision-trees-and-entropy-algorithm/.
Accessed: 2020-07-29.
A tutorial to understand decision tree id3 learning algorithm.
https://nulpointerexception.com/2017/12/16/
a-tutorial-to-understand-decision-tree-id3-learning-algorithm/.
Accessed: 2020-07-29.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
The elements of statistical learning: data mining, inference, and prediction.
Springer Science & Business Media, 2009.
Gilles Louppe.
Understanding random forests : from theory to practice.
Université de Liège, 2014.

64 / 66
References II

Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone.


Classification and regression trees.
Wadsworth International Group, 1984.
J Ross Quinlan.
C4.5: Programs for machine learning.
Elsevier, 2014.
Djamel A Zighed and Ricco Rakotomalala.
Graphes d’induction: apprentissage et data mining.
Hermes Science Publications Paris, 2000.

65 / 66
Frequently asked questions

I What is the computational complexity of the learning algorithm?
I How do we handle (continuous) numerical input variables?
I Explain the optimal splitting algorithm.
I What are the 2-3 main changes to make to implement regression
tree learning with respect to decision tree learning?
I How to adapt this approach to general loss functions ℓ(·, ·)?
I Discuss theoretical asymptotic (N → ∞) properties.
I Discuss computational complexity and interpretability.

66 / 66
