IML Trees
ELEN062-1
Introduction to Machine Learning
September 2022
Acknowledgment: These slides have been reformatted by Yann Claes in July 2020.
Outline
1 Supervised learning
2 Classification trees
3 Extensions
4 Regression trees
5 By-products
Database/Dataset/Sample
Supervised learning
(Figure: a database with input attributes A1, …, A1000 and an output Y; automatic
learning produces a model Ŷ = f(A1, …, A1000).)
Goal: from the database, find a function f(·) of the inputs that best
approximates the output.
2 cases:
- Discrete output → classification problem
- Continuous output → regression problem
Application (i)
- Image classification:
  - Face recognition
  - Handwritten character recognition
Application (ii)
Learning algorithm
A learning algorithm receives as input a learning sample and returns a
function h(·).
It is defined by:
- A hypothesis space H, which is a set of candidate models.
- A quality measure for a model.
- An optimisation strategy.
Classification trees (aka Decision trees)
Hypothesis space
A decision tree is a tree where:
- Each interior node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf is labelled with a class
A decision tree allowing one to predict whether a customer will buy a given
kind of computer. Source: [2]
Illustration - Should I play tennis? (i)
(Table adapted from [3]. NB: the first column is a dummy attribute; the output Y
is 'Play Tennis'.)
Illustration - Should I play tennis? (ii)
Source: [3]
Tree learning
The tree learning problem consists in choosing the tree structure and
determining the predictions at the leaf nodes.
For each leaf, the prediction (or label) is chosen so as to minimize the
misclassification error in the part of the LS reaching that leaf, i.e. it is
the majority class among the objects of the LS reaching that leaf.
(Figure: an example tree with class counts such as "25 yes, 40 no",
"15 yes, 10 no" and "14 yes, 2 no" at its nodes; each leaf is labelled with its
majority class.)
How to generate trees? (i)
How to generate trees? (ii)
Top-down induction of decision trees (i)
Idea: Choose the best attribute, split the learning sample accordingly
and proceed recursively until each object is correctly classified.
Top-down induction of decision trees (ii)
Algorithm 1: learn_dt(LS)
if all objects from LS have the same class, or all objects have the same values
for every attribute then
    Create a leaf with a label corresponding to the majority class of the
    objects of LS;
else
    Use LS to find the best splitting attribute A∗;
    Create a test node for that attribute;
    forall different values a of A∗ do
        Build LSa = {o ∈ LS | A∗(o) = a};
        Use learn_dt(LSa) to grow a subtree from LSa;
    end forall
end if
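A minimal Python sketch of this procedure (not taken from the slides): objects are assumed to be dicts of categorical attribute values, and the choice of the best splitting attribute is delegated to a user-supplied score function, e.g. the impurity reduction defined a few slides further.

from collections import Counter

def learn_dt(objects, labels, attributes, score):
    # objects: list of dicts {attribute: value}; labels: list of classes.
    # score(objects, labels, a): quality of splitting on attribute a.
    majority = Counter(labels).most_common(1)[0][0]
    same_class = len(set(labels)) == 1
    same_values = all(len({o[a] for o in objects}) == 1 for a in attributes)
    if same_class or same_values:
        return {"leaf": majority}              # leaf labelled with the majority class
    best = max(attributes, key=lambda a: score(objects, labels, a))
    node = {"test": best, "children": {}}
    for value in {o[best] for o in objects}:   # one branch per observed value of A*
        keep = [i for i, o in enumerate(objects) if o[best] == value]
        node["children"][value] = learn_dt(
            [objects[i] for i in keep], [labels[i] for i in keep],
            [a for a in attributes if a != best],  # drop A*: simplification for categorical splits
            score)
    return node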
Properties of the top-down approach
Which attribute is the best splitting attribute?
Impurity measures
Let:
- LS be a sample of objects,
- pj (j = 1, . . . , J) be the proportion of objects in LS belonging to
output class j.
Shannon entropy as an impurity measure (i)
- Shannon entropy:
  ISh(LS) ≜ − Σj pj log2 pj
- Gini index:
  IGi(LS) ≜ Σj pj (1 − pj)
- (Misclassification) Error rate:
  IER(LS) ≜ 1 − maxj pj
(Figure: two-class case; respectively, the Shannon entropy, the Gini index and
the Error rate, normalized between 0 and 1.)
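An illustrative sketch (not part of the slides) computing the three impurity measures from a list of class labels; the class counts below are invented, chosen so that the entropy comes out close to the 0.99 used in the later illustration.

import math
from collections import Counter

def proportions(labels):
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def entropy(labels):       # Shannon entropy, in bits
    return -sum(p * math.log2(p) for p in proportions(labels) if p > 0)

def gini(labels):          # Gini index
    return sum(p * (1 - p) for p in proportions(labels))

def error_rate(labels):    # misclassification error rate
    return 1 - max(proportions(labels))

sample = ["yes"] * 29 + ["no"] * 35       # hypothetical 64-object learning sample
print(entropy(sample), gini(sample), error_rate(sample))   # ~0.99, ~0.50, ~0.45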
Reduction of impurity achieved by a split
For a given impurity measure, the best splitting attribute is the one
which maximizes the expected reduction of impurity defined by
∆I(LS, A) = I(LS) − Σa∈A(LS) (|LSa| / |LS|) I(LSa),
where LSa is the subset of objects o from LS such that A(o) = a, and
where A(LS) is the set of different values of A observed in LS.
NB: There are other ways to define a splitting criterion that do not rely on an
impurity measure.
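A direct, self-contained transcription of this formula in Python (a sketch; the attribute and label columns at the end are invented purely to exercise the function).

import math
from collections import Counter

def impurity(labels):   # Shannon entropy here; Gini or error rate could be used instead
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def impurity_reduction(values, labels):
    # Delta I(LS, A) for a categorical attribute given as the column `values`.
    n = len(labels)
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    return impurity(labels) - sum(len(sub) / n * impurity(sub) for sub in by_value.values())

A = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]   # made-up column
Y = ["no", "no", "yes", "no", "yes", "yes"]                      # made-up labels
print(round(impurity_reduction(A, Y), 3))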
Illustration (with Shannon entropy as impurity measure)
→ ∆ISh(LS, A1) = 0.99 − (26/64) × 0.71 − (38/64) × 0.75 ≈ 0.25
→ ∆ISh(LS, A2) = 0.99 − (51/64) × 0.94 − (13/64) × 0.62 ≈ 0.12
Application to the tennis problem
For now, trees are perfectly consistent with the learning sample. In practice,
however, we want them to predict well the classes of unseen data drawn from the
same distribution; this ability is called generalization.
Overfitting (ii)
Reasons for overfitting (ii)
Data is incomplete, i.e. not all cases are covered.
Pre-pruning
Caveats:
- for criteria [a], [b] and [c], suitable values of the meta-parameters Nmin,
Ith and α are often problem-dependent;
- criterion [c] may recommend stopping the splitting too early.
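In scikit-learn (listed under Software at the end of these slides), pre-pruning is controlled through constructor parameters of DecisionTreeClassifier; the mapping to Nmin and Ith below is only my reading of those parameters, and the values are arbitrary.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_split=10,        # roughly plays the role of Nmin
    min_impurity_decrease=0.01,  # roughly plays the role of Ith
    max_depth=5,                 # another commonly used stopping criterion
)
# clf.fit(X, y) would then grow the pre-pruned tree on a learning sample (X, y).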
Post-pruning (i)
Post-pruning (ii)
Post-pruning (iii)
- Reduced error pruning: at each step, remove the node whose removal most
decreases the error on the validation sample.
- Cost-complexity pruning: define the criterion
  ErrorGS(T) + β · Complexity(T)
and build the sequence of trees that minimizes this criterion, for increasing
values of β.
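scikit-learn implements this family of pruned trees as minimal cost-complexity pruning (its parameter ccp_alpha playing the role of β); the sketch below, on a standard dataset chosen only for illustration, selects β on a validation sample.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_gs, X_vs, y_gs, y_vs = train_test_split(X, y, random_state=0)  # growing / validation samples

# Candidate values of beta (called alpha in scikit-learn) along the pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_gs, y_gs)
scores = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_gs, y_gs).score(X_vs, y_vs)
          for a in path.ccp_alphas]
print(path.ccp_alphas[int(np.argmax(scores))])   # value of beta minimizing the validation error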
Post-pruning (iv)
Post-pruning (v)
How to use decision trees?
Outline
1 Supervised learning
2 Classification trees
3 Extensions
  Continuous attributes
  Attributes with many values
  Missing values
4 Regression trees
5 By-products
Continuous attributes (i)
(Figure: a binary split on the numerical attribute Temperature, with branches
Temperature ≤ 65.4 → no and Temperature > 65.4 → yes.)
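The following slides detail how such thresholds are chosen; as a sketch of the usual approach (sort the observed values and score candidate cut points between consecutive distinct values; this is the standard recipe, not necessarily the slides' exact procedure):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Returns (threshold, impurity reduction) for the binary split value <= t / value > t.
    xs = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2                       # candidate cut point between two distinct values
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = entropy(labels) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best[1]:
            best = (t, gain)
    return best

# Made-up temperatures around the 65.4 threshold shown above.
print(best_threshold([64, 65, 68, 69, 70, 72], ["yes", "no", "no", "yes", "yes", "yes"]))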
Continuous attributes (ii)
Continuous attributes (iii)
Continuous attributes (iv)
Attributes with many values (i)
(Figure: a split on the attribute Letter, with one branch per value a, b, c, …, z.)
Problems:
- These are not good splits: they fragment the data too quickly, leaving
insufficient data at the next level.
- The reduction of impurity of such tests is often artificially high
(e.g. splitting on the object ID).
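The remedies adopted in these slides appear on the next pages; one common alternative (not necessarily the one used here) is C4.5's gain ratio, which divides the impurity reduction by the "split information" of the attribute and thereby penalizes many-valued splits.

import math
from collections import Counter

def entropy_of_counts(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(values, labels):
    n = len(labels)
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    gain = entropy_of_counts(list(Counter(labels).values())) - sum(
        len(sub) / n * entropy_of_counts(list(Counter(sub).values())) for sub in by_value.values())
    split_info = entropy_of_counts([len(sub) for sub in by_value.values()])  # high for many-valued splits
    return gain / split_info if split_info > 0 else 0.0

# An ID-like attribute (all values distinct): maximal gain on the LS, but a heavily penalized ratio.
print(gain_ratio(list("abcdef"), ["yes", "no", "yes", "no", "yes", "no"]))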
Attributes with many values (iii)
Source: [3]
Attributes with many values (iv)
(Figure: a binary split on Letter, grouping the values {a, d, o, m, t} against
all other letters.)
Missing attribute values
Not all attribute values are known for every object during learning or
testing.
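A pragmatic baseline (an assumption, not necessarily the strategy of these slides) is to impute missing values before growing the tree, e.g. with scikit-learn's SimpleImputer; the tiny arrays below are invented.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0, np.nan], [2.0, 0.0], [np.nan, 1.0], [3.0, 1.0]])  # NaN marks a missing value
y = np.array([0, 0, 1, 1])

model = make_pipeline(SimpleImputer(strategy="most_frequent"), DecisionTreeClassifier())
model.fit(X, y)
print(model.predict([[2.5, np.nan]]))   # missing values are also imputed at prediction time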
Regression trees (i)
A regression tree is exactly the same model as a decision tree but with
a number in each leaf instead of a class.
Regression trees (ii)
(Figure: a regression tree with threshold tests on X1 and X2 (thresholds
t1, …, t4) and leaf predictions r1, …, r5, shown next to the corresponding
rectangular partition of the (X1, X2) input space.)
Regression tree growing
In regression, the impurity of a sample is measured by the variance of the
output in that sample:
I(LS) = vary|LS {y} ≜ Ey|LS {(y − Ey|LS {y})²},
where Ey|LS {f(y)} denotes the average of f(y) in the sample LS.
The best split is the one that reduces the variance the most:
∆I(LS, A) = vary|LS {y} − Σa (|LSa| / |LS|) vary|LSa {y}
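A small self-contained sketch of this variance-based splitting score (the attribute values and outputs are made up).

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def variance_reduction(values, ys):
    # Delta I(LS, A) with the output variance as impurity measure.
    n = len(ys)
    by_value = {}
    for v, y in zip(values, ys):
        by_value.setdefault(v, []).append(y)
    return variance(ys) - sum(len(sub) / n * variance(sub) for sub in by_value.values())

# The attribute separates small outputs from large ones fairly well, hence a large reduction.
print(variance_reduction(["a", "a", "b", "b", "b"], [1.0, 1.5, 9.0, 10.0, 11.0]))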
Regression tree pruning
Outline
1 Supervised learning
2 Classification trees
3 Extensions
4 Regression trees
5 By-products
  Interpretability
  Variable selection
  Variable importance
Interpretability (i)
Source: [3]
With neural networks, it is more complex.
(Figure: models for the tennis problem, built from the attributes Outlook,
Temperature, Humidity and Wind to predict Play / Don't play.)
Interpretability (ii)
Source: [3]
If some attributes are not useful for classification, they will not be
selected in the pruned tree.
Decision trees are therefore often used as a pre-processing step for other
learning algorithms that are more sensitive to irrelevant variables.
Variable importance
(Figure: bar chart of attribute importances for the tennis problem, for the
attributes Temperature, Wind, Humidity and Outlook.)
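scikit-learn exposes this by-product directly as feature_importances_ (computed from the total impurity reduction contributed by each attribute); a quick sketch on a standard dataset chosen only for illustration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)
for name, imp in sorted(zip(data.feature_names, tree.feature_importances_),
                        key=lambda p: p[1], reverse=True):
    print(f"{name}: {imp:.2f}")   # attributes ranked by importance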
Decision trees - Conclusions
Advantages:
- They are very fast: they can handle very large datasets with many
attributes.
→ Complexity: O(nN log N), where n is the number of attributes and N the
number of samples.
- They are flexible: they can handle several attribute types,
classification and regression problems, missing values, . . .
- They have a good interpretability: they provide rules and attribute
importance.
Disadvantages:
- They are quite unstable, due to their high variance.
- They are not always competitive with other algorithms in terms of
accuracy.
Research
Further reading
Software
- scikit-learn: http://scikit-learn.org
- Weka: https://www.cs.waikato.ac.nz/ml/weka/
- R: packages tree and rpart
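A minimal usage sketch with the first of these packages (dataset, depth limit and train/test split chosen arbitrarily).

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(tree.score(X_te, y_te))   # accuracy on the held-out data
print(export_text(tree))        # the learned tree as readable if/else rules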
References I
References II
Frequently asked questions