7_DecisionTree
Decision Trees
What are trees?
Decision Trees
• Classify between lemons and apples
Images from https://machinelearning.school.blog/2017/01/12/introduction-to-decision-trees/
Decision Trees
[Diagram: the lemons vs. apples decision tree, labelling the root node, the branches, and the leaves]
Images from https://machinelearning.school.blog/2017/01/12/introduction-to-decision-trees/
Rules for classifying data using attributes
• The tree consists of decision nodes and leaf nodes.
• A decision node has two or more branches, each representing a value of the attribute being tested.
• A leaf node produces a homogeneous result (all examples in one class), which requires no further classification testing.
• Each internal node (including the root node) tests one feature Xi
Images from https://machinelearning.school.blog/2017/01/12/introduction-to-decision-trees/
• Features can be discrete, continuous or categorical
Example: What to do this Weekend?
• If my parents are visiting
– We'll go to the cinema
• If not
– Then, if it's sunny, I'll play tennis
– But if it's windy and I'm rich, I'll go shopping
– If it's windy and I'm poor, I'll go to the cinema
– If it's rainy, I'll stay in
[Handwritten sketch: the same rules drawn as a tree, branching on the weather and on whether I'm rich]
Written as a Decision Tree
[Diagram: the weekend rules drawn as a decision tree, from the root of the tree down to the leaves]
Using the Decision Tree
(No parents on a Sunny Day)
[Diagram: tracing the path through the tree for a sunny day with no parents visiting]
From Decision Trees to Logic
• Read from the root to every tip
– If this and this and this … and this, then do this
• In our example:
– If no_parents and sunny_day, then play_tennis
– no_parents ∧ sunny_day → play_tennis
How to design a decision tree
• A decision tree can be seen as a set of rules for performing a categorisation
– E.g., “what kind of weekend will this be?”
Training and Visualizing a Decision Tree
• Try it in the Jupyter notebook: an Iris decision tree with max_depth=2
• White box model – easy to interpret
• Not a black box model like neural networks
[Figure: the fitted Iris tree drawn node by node, showing each node's split condition, number of samples and class counts]
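A rough sketch of what the notebook does (assuming scikit-learn is available; the variable names here are ours, not from the slides):

# Sketch: train a depth-2 decision tree on the Iris petal features and print its rules
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data[:, 2:], iris.target      # petal length and petal width only

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# White-box model: the learned splits can be printed and read directly
print(export_text(tree_clf, feature_names=iris.feature_names[2:]))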
Decision Tree Boundaries
• For max_depth=2
• For max_depth=3
[Figure: decision boundaries in the petal length / petal width plane for max_depth=2 and max_depth=3]
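A minimal sketch of how such boundary plots can be produced (assuming matplotlib and scikit-learn; change max_depth to 3 for the second panel):

# Sketch: visualise the tree's axis-aligned decision boundaries on the Iris data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target              # petal length, petal width
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Evaluate the tree on a grid and colour each region by its predicted class
xx, yy = np.meshgrid(np.linspace(0, 7.5, 300), np.linspace(0, 3, 300))
Z = tree_clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, s=15)          # training points
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()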
CART Training Algorithm
• Splits the training set into two subsets using a single feature k and a threshold tk – in this case, petal length less than 2.45 cm.
[Figure: the first split of the 150 Iris samples into subsets of 50 and 100]
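In the standard formulation, CART chooses k and tk to minimise the weighted impurity of the two resulting subsets, J(k, tk) = (m_left/m) G_left + (m_right/m) G_right. A small sketch of that computation for the root split (the helper below is ours; Gini impurity is used as the measure):

# Sketch: weighted Gini impurity of the split "petal length < 2.45" on Iris
import numpy as np
from sklearn.datasets import load_iris

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

iris = load_iris()
X, y = iris.data[:, 2:], iris.target              # petal length, petal width

left, right = y[X[:, 0] < 2.45], y[X[:, 0] >= 2.45]
cost = len(left) / len(y) * gini(left) + len(right) / len(y) * gini(right)
print(len(left), len(right), cost)                # the 50 / 100 split described above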
Decision Tree Regularisation
• Recall from the data pre-processing lecture that the decision tree gave a model with 0 error – decision trees have a high tendency to overfit. Hence regularisation!
• max_depth: Maximum depth of the tree in terms of layers
• max_features: Maximum number of features that are evaluated for splitting at each node
• max_leaf_nodes: Maximum number of leaf nodes
• min_samples_split: Minimum number of samples a node must have before it can be split
• min_samples_leaf: Minimum number of samples a leaf node must have to be created
• min_weight_fraction_leaf: Same as min_samples_leaf but expressed as a fraction of the total number of weighted instances
Non-regularized vs. Regularized Tree
• Test accuracy:
– No restrictions: 0.898
– Restricted: 0.92
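A sketch of the kind of comparison behind these numbers (illustrative only – the dataset here is synthetic, and the exact accuracies on the slide come from the lecture's own notebook):

# Sketch: unrestricted vs. regularised tree on a noisy synthetic dataset
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Xm, ym = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(Xm, ym, random_state=42)

free_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
reg_tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=42).fit(X_train, y_train)

print("no restrictions:", free_tree.score(X_test, y_test))
print("restricted:     ", reg_tree.score(X_test, y_test))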
Regression with Decision Trees

CART for Regression
• Instead of trying to minimize impurity, it tries to split the training data in order to minimize the MSE.
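A minimal regression sketch (the data below are illustrative, not from the lecture):

# Sketch: CART for regression on a noisy quadratic
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
Xr = rng.uniform(-1, 1, size=(200, 1))
yr = Xr.ravel() ** 2 + rng.normal(scale=0.1, size=200)

tree_reg = DecisionTreeRegressor(max_depth=2).fit(Xr, yr)

# Each leaf outputs a constant: the mean target of the training samples that
# fall into it, with splits chosen to minimise the resulting MSE
print(tree_reg.predict([[0.0], [0.8]]))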
Importance of Regularization in Regression
Sensitivity to Axis Orientation
• Decision trees love orthogonal decision boundaries.
• Rotate the data by 45° and note the convoluted boundary.
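A quick way to see the effect (a sketch on synthetic data; the dataset and rotation are ours):

# Sketch: the same linearly separable data before and after a 45-degree rotation
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
Xs = rng.uniform(-1, 1, size=(300, 2))
ys = (Xs[:, 0] > 0).astype(int)                   # boundary aligned with an axis

theta = np.deg2rad(45)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_rot = Xs @ R.T                                  # rotate every point by 45 degrees

for name, data in [("axis-aligned", Xs), ("rotated", X_rot)]:
    tree = DecisionTreeClassifier(random_state=0).fit(data, ys)
    print(name, "leaves:", tree.get_n_leaves())   # the rotated version needs many more splits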
High Variance in Decision Trees
• Decision trees have high variance.
• Small changes in hyperparameters lead to very different models.
• Since the training algorithm used by Scikit-Learn is stochastic in nature, retraining the same model on the same data can produce a very different model (unless the random_state is fixed).
• By averaging predictions over many trees, it's possible to reduce the variance. Such an ensemble of trees is called a random forest.
• The next slide shows an example of the same dataset and two different tree configurations.
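A minimal sketch of the averaging idea (assuming scikit-learn; the dataset is illustrative):

# Sketch: averaging many trees (a random forest) reduces variance
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Xm, ym = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(Xm, ym, random_state=42)

single_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

print("single tree:", single_tree.score(X_test, y_test))
print("forest:     ", forest.score(X_test, y_test))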
High Variance
The ID3 Algorithm
Entropy – Formulae
• Entropy(S) = −Σk pk log2(pk), where pk is the proportion of examples in S belonging to class k
• For example, in a binary categorization:
– Entropy(S) = −p+ log2(p+) − p− log2(p−)
– Where p+ is the proportion of positives
– And p− is the proportion of negatives
Entropy Example
[Handwritten worked example: splitting the 14 examples on Outlook gives Sunny, Overcast and Rain branches, with E(Sunny) ≈ 0.97, E(Overcast) = 0 and E(Rain) ≈ 0.97, so
Gain(S, Outlook) = 0.94 − (5/14)(0.97) − (4/14)(0) − (5/14)(0.97) ≈ 0.94 − 0.69 ≈ 0.25.
Exercise: find the information gain for the other attributes: 1) Temp, 2) Humidity, 3) Wind.]
Entropy Example
Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
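As a quick check of the arithmetic (a small helper of ours, not from the slides):

# Sketch: entropy of a set with 9 positive and 5 negative examples
import math

def entropy(counts):
    total = sum(counts)
    # sum of -p * log2(p) over the non-empty classes
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))   # ~0.940, as on the slide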
Information Gain (IG)
• Information gain is based on the decrease in entropy after a dataset is
split on an attribute.
• Which attribute creates the most homogeneous branches?
Information Gain (cont’d)
• Calculate Gain(S,A)
– Estimate the reduction in entropy we obtain if we know
the value of attribute A for the examples in S
An Example Calculation of Information Gain
• Suppose we have a set of examples S = {S1, S2, S3, S4}, with proportion of positives p+
• And Attribute A
– Which takes values v1, v2, v3
• S1 takes value v2 for A, S2 takes value v2 for A
• S3 takes value v3 for A, S4 takes value v1 for A
First Calculate Entropy(S)
• Recall that
Entropy(S) = -p+log2(p+) – p-log2(p-)
Calculate Gain for each Value of A
• Remember that
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) × Entropy(Sv)
[Handwritten: e.g. for a subset with one positive and one negative example, Entropy(Sv) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1]

Final Calculation
A Worked Example

Weekend Weather Parents Money Decision (Category)
W1 Sunny Yes Rich Cinema
W2 Sunny No Rich Tennis
W3 Windy Yes Rich Cinema
W4 Rainy Yes Poor Cinema
W5 Rainy No Rich Stay in
W6 Rainy Yes Poor Cinema
W7 Windy No Poor Cinema
W8 Windy No Rich Shopping
W9 Windy Yes Rich Cinema
W10 Sunny No Rich Tennis

[Handwritten tally: 6 Cinema, 2 Tennis, 1 Shopping, 1 Stay in]
[Handwritten: Entropy(S) = −0.6 log2(0.6) − 0.2 log2(0.2) − 0.1 log2(0.1) − 0.1 log2(0.1) = 0.442 + 0.464 + 0.332 + 0.332 = 1.571]
Information Gain for All of S
• S = {W1,W2,…,W10}
• Firstly, we need to calculate:
– Entropy(S) = … = 1.571
[Handwritten: Gain(S, money) = Entropy(S) − (7/10) Entropy(Srich) − (3/10) Entropy(Spoor), with Srich = {W1, W2, W3, W5, W8, W9, W10} and Spoor = {W4, W6, W7}.
Entropy(Srich) = −(3/7) log2(3/7) − (2/7) log2(2/7) − 2 × (1/7) log2(1/7) = 0.524 + 0.516 + 0.802 = 1.842
Entropy(Spoor) = −1 × log2(1) = 0
Gain(S, money) = 1.571 − (7/10)(1.842) − (3/10)(0) = 1.571 − 1.289 = 0.2816]
The ID3 Algorithm
• Given a set of examples, S
– Described by a set of attributes Ai
– Categorised into categories cj
1. Choose the root node to be attribute A
– Such that A scores highest for information gain
• Relative to S, i.e., gain(S,A) is the highest over all
attributes
2. For each value v that A can take
– Draw a branch and label each with corresponding v
The ID3 Algorithm
• For each branch you’ve just drawn (for value v)
– If Sv only contains examples in category c
• Then put that category as a leaf node in the tree
– If Sv is empty
• Then find the default category (which contains the most
examples from S)
– Put this default category as a leaf node in the tree
– Otherwise
• Remove A from attributes which can be put into nodes
• Replace S with Sv
• Find new attribute A scoring best for Gain(S, A)
• Start again at part 2
• Make sure you replace S with Sv
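A compact sketch of the procedure just described (our own illustrative implementation; examples are dictionaries mapping attribute names to values, and for simplicity it only branches on attribute values actually observed in S, so the empty-Sv default case is rarely triggered):

# Sketch: ID3 on lists of example dictionaries, returning a nested dict as the tree
import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(ex[target] for ex in examples)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def gain(examples, attr, target):
    remainder = 0.0
    for v in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target, default=None):
    if not examples:                                  # empty Sv: use the default category
        return default
    categories = {ex[target] for ex in examples}
    if len(categories) == 1:                          # homogeneous: make a leaf
        return categories.pop()
    if not attributes:                                # nothing left to test: majority leaf
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    majority = Counter(ex[target] for ex in examples).most_common(1)[0][0]
    tree = {best: {}}
    for v in {ex[best] for ex in examples}:           # one branch per value of the best attribute
        subset = [ex for ex in examples if ex[best] == v]
        tree[best][v] = id3(subset, [a for a in attributes if a != best],
                            target, default=majority)
    return tree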
Explanatory Diagram
Information Gain for All of S
• S = {W1,W2,…,W10}
• Firstly, we need to calculate:
– Entropy(S) = … = 1.571
• Next, we need to calculate information gain
– For all the attributes we currently have available
• (which is all of them at the moment)
– Gain(S, weather) = … = 0.7
– Gain(S, parents) = … = 0.61
– Gain(S, money) = … = 0.2816
• Hence, the weather is the first attribute to split on
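These numbers can be checked directly (a self-contained sketch; the rows are taken from the worked-example table above):

# Sketch: information gain of each attribute on the weekend data
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

# (weather, parents, money, decision) for W1..W10
weekends = [
    ("Sunny", "Yes", "Rich", "Cinema"), ("Sunny", "No", "Rich", "Tennis"),
    ("Windy", "Yes", "Rich", "Cinema"), ("Rainy", "Yes", "Poor", "Cinema"),
    ("Rainy", "No", "Rich", "Stay in"), ("Rainy", "Yes", "Poor", "Cinema"),
    ("Windy", "No", "Poor", "Cinema"), ("Windy", "No", "Rich", "Shopping"),
    ("Windy", "Yes", "Rich", "Cinema"), ("Sunny", "No", "Rich", "Tennis"),
]

decisions = [w[3] for w in weekends]
for i, attr in enumerate(["weather", "parents", "money"]):
    remainder = sum(
        sum(1 for w in weekends if w[i] == v) / len(weekends)
        * entropy([w[3] for w in weekends if w[i] == v])
        for v in {w[i] for w in weekends}
    )
    print(attr, round(entropy(decisions) - remainder, 4))
# -> weather ~0.70, parents ~0.61, money ~0.2816, so weather is split on first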
Top of the Tree
• So, this is the top of our tree:
• Now, we look at each branch in turn
– In particular, we look at the examples with the attribute
prescribed by the branch
• Ssunny = {W1,W2,W10}
– Categorisations are cinema, tennis and tennis for W1,W2
and W10
– What does the algorithm say?
• Set is neither empty, nor a single category
• So we have to replace S by Ssunny and start again
Getting to the leaf nodes
• If it’s sunny and the parents have turned up
– Then, looking at the table in previous slide
• There’s only one answer: go to cinema
• If it’s sunny and the parents haven’t turned up
– Then, again, there’s only one answer: play tennis
• Hence our decision tree looks like this:
What is the optimal Tree Depth?
• We need to be careful to pick an appropriate tree depth.
• If the tree is too deep, we can overfit.
• If the tree is too shallow, we underfit.
• Max depth is a hyper-parameter that should be tuned using the data. An alternative strategy is to grow a very deep tree and then prune it.
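One way to tune the depth (a sketch with an illustrative synthetic dataset; any labelled data will do):

# Sketch: choose max_depth by cross-validation
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

Xm, ym = make_moons(n_samples=1000, noise=0.3, random_state=42)

param_grid = {"max_depth": [2, 3, 4, 5, 6, None]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(Xm, ym)

print(search.best_params_, search.best_score_)    # best depth and its CV accuracy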
Control the size of the tree
• If we stop early, not all
training samples would
be classified correctly.
• How do we classify a new
instance:
– We label the leaves of this
smaller tree with the
majority of training
samples’ labels
Summary of learning classification trees
• Advantages:
– Easily interpretable by humans (as long as the tree is not too big)
– Computationally efficient
– Handles both numerical and categorical data
– It is parametric, thus compact: unlike Nearest Neighbour classification, we do not have to carry our training instances around
– Building block for various ensemble methods (more on this later)
• Disadvantages:
– Heuristic training techniques
– Finding the partition of space that minimizes empirical error is NP-hard.
– We resort to greedy approaches with limited theoretical underpinning.
Feature Space
• Suppose that we have p explanatory variables X1, …, Xp and n observations.
Measures of Impurity
• At each node i of a classification tree, we have a probability distribution p_{ik} over k classes.
• Deviance:
• Entropy:
• Gini index:
• Residual sum of squares:
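For reference, one standard way to write these measures (our notation, consistent with the entropy and Gini forms used earlier; n_{ik} is the number of class-k training examples reaching node i, and μ_i is the mean response in node i for regression trees):

\[
D_i = -2 \sum_k n_{ik} \log p_{ik}, \qquad
H_i = -\sum_k p_{ik} \log_2 p_{ik}, \qquad
G_i = 1 - \sum_k p_{ik}^2, \qquad
\mathrm{RSS}_i = \sum_{j \in \text{node } i} (y_j - \mu_i)^2
\]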