
Machine Learning

Chapter 04
Decision Trees
Prepared by: Ziad Doughan
Email: [email protected]



Introduction

Decision Trees are versatile ML algorithms that can perform both classification and regression tasks, and even multi-output tasks.
They are powerful algorithms, capable of fitting complex datasets.
In the previous chapter, we saw a Decision Tree Regressor fitted to the California housing dataset: it fit the training data perfectly, but overfit it.



Introduction

How do we build a decision tree from tabular data?
How do we use a decision tree as a classifier?
What is ensemble learning, and what are its advantages?
We will start by discussing how to train, visualize, and make predictions with Decision Trees.



Trees in Machine Learning

The model is composed of a collection of "questions" organized hierarchically in the shape of a tree.
The questions are usually called a condition, a split, or a test.



Trees in Machine Learning

Decision trees are usually represented with the root at the top.
Each non-leaf node contains a condition, and each leaf node contains a prediction.
Input and output values can be continuous or discrete.
When the output values are discrete with exactly two possible values, we have a Boolean classification.
A tree is a non-parametric approach for classification and regression.



Node types

A decision tree reaches its decision by performing a sequence of tests. A decision tree has two kinds of nodes:
• Each internal node is a question on the features: it corresponds to a test on one of the input attributes and branches out according to the answers.
• Each leaf node has a class label, determined by a majority vote of the training examples reaching that leaf.
There are different types of conditions used to build decision trees.



Types of Conditions

Conditions can be divided into two main categories: axis-aligned and oblique conditions.
An axis-aligned condition involves only a single feature, while an oblique condition involves multiple features.
Example (a small sketch of evaluating both kinds of condition follows below):
• axis-aligned condition: num_legs ≥ 2
• oblique condition: num_legs ≥ num_fingers

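To make the distinction concrete, here is a minimal Python sketch (the animal record and its values are made up for illustration):

# Evaluating an axis-aligned and an oblique condition on one example.
example = {"num_legs": 4, "num_fingers": 0}   # hypothetical record

# Axis-aligned condition: a single feature compared to a constant.
axis_aligned = example["num_legs"] >= 2                     # True

# Oblique condition: a comparison involving multiple features.
oblique = example["num_legs"] >= example["num_fingers"]     # True

print(axis_aligned, oblique)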


Types of Conditions

Decision trees are most often trained with axis-aligned conditions.
However, oblique splits are more powerful because they can express more complex patterns.
Oblique splits sometimes produce better results, but they are more expensive to train.



Types of Conditions

Binary vs. non-binary conditions:
Decision trees containing only binary conditions (true or false) are called binary decision trees.
Decision trees containing non-binary conditions are called non-binary decision trees.
Conditions with too much power are also more likely to overfit. For this reason, decision forests generally use binary decision trees.





Common Threshold Conditions

The most common type of condition is the threshold condition, expressed as: feature ≥ threshold.



Decision Trees Learning Process

The decision tree algorithm adopts a greedy divide-and-conquer strategy: always test the most important attribute first.
The ID3 algorithm uses a top-down greedy approach to build a decision tree.
We start building the tree from the top, and at each iteration we select the best feature to create a node.
Given a collection of examples, the model builds a decision tree that represents them.
Finally, it uses this representation to classify new examples.



Case Study

Decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60).
Attribute-based Representations

Examples by attributes (Boolean, discrete, continuous):



Constructing Decision Trees

One possible representation for the hypotheses:



Hypothesis Spaces

How many distinct decision trees are there with n Boolean attributes?
Hypothesis space size = number of Boolean functions
= number of distinct truth tables with $2^n$ rows $= 2^{2^n}$.
Example:
With 6 Boolean attributes, there are $2^{2^6} = 18{,}446{,}744{,}073{,}709{,}551{,}616$ trees.

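As a quick numeric check of this count, a trivial Python snippet (not from the slides):

# Number of Boolean functions of n attributes is 2 ** (2 ** n).
n = 6
print(2 ** (2 ** n))   # 18446744073709551616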


Which attribute is most discriminating?

Information Gain tells us how important a given attribute of the feature vectors is.
We use it to decide the ordering of attributes in the nodes of a decision tree.
Information Gain is the expected reduction in the entropy of the target variable Y for a data sample S, due to sorting S on that attribute.
A good split is one after which we are more certain about the classification.



From Entropy to Information Gain

Entropy is a measure of disorder or impurity in a node.
The entropy H(X) of a random variable X is:
$$H(X) = -\sum_{i=1}^{n} P(X=i)\,\log_2 P(X=i)$$
The specific conditional entropy H(X|Y=v) of a random variable X given Y=v is:
$$H(X\mid Y=v) = -\sum_{i=1}^{n} P(X=i\mid Y=v)\,\log_2 P(X=i\mid Y=v)$$



From Entropy to Information Gain

The conditional entropy H(X|Y) of a random variable X given Y is:
$$H(X\mid Y) = \sum_{v} P(Y=v)\,H(X\mid Y=v)$$
(the minus sign is already inside each H(X|Y=v) term).
The mutual information, or Information Gain, of X and Y is:
$$I(X, Y) = H(X) - H(X\mid Y) = H(Y) - H(Y\mid X)$$

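These three quantities are straightforward to estimate from data; below is a minimal Python sketch (the function names and the toy sample are mine, not from the slides):

import math
from collections import Counter

def entropy(values):
    # H(X) = -sum_i P(X = i) * log2 P(X = i), estimated from a sample.
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    # H(X|Y) = sum_v P(Y = v) * H(X | Y = v), estimated from paired samples.
    n = len(ys)
    total = 0.0
    for v in set(ys):
        subset = [x for x, y in zip(xs, ys) if y == v]
        total += (len(subset) / n) * entropy(subset)
    return total

def information_gain(xs, ys):
    # I(X, Y) = H(X) - H(X|Y).
    return entropy(xs) - conditional_entropy(xs, ys)

# Toy sample: X is the class label, Y is an attribute value.
X = ["yes", "yes", "no", "no"]
Y = ["full", "some", "full", "none"]
print(entropy(X))               # 1.0
print(information_gain(X, Y))   # 0.5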


Entropy of a 2-Classes Problem

Entropy: $H(X) = -\sum_{i=1}^{n} P(X=i)\,\log_2 P(X=i)$
What is the entropy of a problem containing only one class?
$$H(X) = -\sum_{i=1}^{n} P(X=i)\,\log_2 P(X=i) = -1 \cdot \log_2 1 = 0$$
This is not a good training set for a classification model.
Low entropy → minimum impurity → predictable samples.



Entropy of a 2-Classes Problem

What is the entropy of a problem containing two classes with 50% of the samples in each class?
$$H(X) = -\sum_{i=1}^{n} P(X=i)\,\log_2 P(X=i) = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1$$
This is a good training set for a classification model.
High entropy → maximum impurity → less predictable samples.



How Decision Trees Make Decisions

The x-axis measures the proportion of the positive class in a node, and the y-axis measures its entropy.
In the resulting arch-shaped curve, entropy is lowest at the extremes, when a node contains either no positive instances or only positive instances: when the node is pure, the disorder is 0.
Entropy is highest in the middle, when the node is evenly split between positive and negative instances; this is extreme disorder, because there is no majority.

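The shape of this curve can be checked numerically; a short Python sketch of the two-class entropy at a few positive-class proportions:

import math

def binary_entropy(p):
    # Entropy of a node whose positive-class proportion is p.
    if p in (0.0, 1.0):
        return 0.0              # by convention, 0 * log2(0) = 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:.2f}  H = {binary_entropy(p):.3f}")
# Lowest (0) at the pure extremes, highest (1) at the 50/50 split.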


Using Information Gain to Construct a DT

We start with the full training set X.
We choose the attribute A with the highest information gain over the full training set and place it at the root of the tree.
We construct a child node for each value of A. Each child has an associated subset of the training vectors in which A takes that particular value:
$$X'_1 = \{\,x \in X \mid \mathrm{value}(A) = v_1\,\},\ \dots$$
We repeat recursively … but when do we stop?



Understanding Information Theory

Information Content (also called self-information, surprisal, or Shannon information) is a basic quantity derived from the probability of a particular event occurring for a random variable.
It is closely related to entropy, which is the expected value of the information content of a random variable, quantifying how surprising the random variable is "on average".
It can be interpreted as quantifying the level of "surprise" of a particular outcome. It can be expressed in various units of information, most commonly in bits.



Information Content

Claude Shannon's definition:
• An event with probability 100% is perfectly unsurprising and yields no information.
• The less probable an event is, the more surprising it is and the more information it yields.
• If two independent events are measured separately, the total amount of information is the sum of the information of the individual events.
$$I(V) = I\big(P(v_1), \dots, P(v_n)\big) = \sum_{i=1}^{n} -P(v_i)\,\log_2 P(v_i)$$



Information Content

Additivity of independent events:
Let X and Y be two independent events. The information content of the joint outcome is:
$$I(X, Y) = I(X) + I(Y)$$
For a training set containing p positive examples and n negative examples:
$$I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$



Information Content

Now suppose a chosen attribute A divides the training set E into subsets E1, …, Ev according to their values for A, where A has v distinct values.
In this case, the Information Gain of the split is calculated by subtracting the weighted entropies of the branches from the original entropy.



Information Content

Information Gain (IG) is the expected reduction in entropy from the attribute test:
$$IG(A) = I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{remainder}(A)$$
where
$$\mathrm{remainder}(A) = \sum_{i=1}^{v} \frac{p_i+n_i}{p+n}\, I\!\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)$$
Finally, choose the attribute with the largest IG.





Back to the Example

For the training set, p = n = 6 and p + n = 12, therefore:
$$I\!\left(\frac{6}{12}, \frac{6}{12}\right) = -\frac{6}{12}\log_2\frac{6}{12} - \frac{6}{12}\log_2\frac{6}{12} = 0.5 + 0.5 = 1 \text{ bit}$$
Consider the attributes Patrons and Type:
$$IG(\mathit{Patrons}) = I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{remainder}(\mathit{Patrons})$$
$$IG(\mathit{Type}) = I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{remainder}(\mathit{Type})$$



Back to the Example

$$IG(\mathit{Patrons}) = I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{remainder}(\mathit{Patrons})$$
$$= 1 - \sum_{i=1}^{v} \frac{p_i+n_i}{p+n}\, I\!\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)$$
$$= 1 - \left[\frac{2}{12}\, I\!\left(\frac{0}{2}, \frac{2}{2}\right) + \frac{4}{12}\, I\!\left(\frac{4}{4}, \frac{0}{4}\right) + \frac{6}{12}\, I\!\left(\frac{2}{6}, \frac{4}{6}\right)\right]$$
$$= 1 - \left[\frac{2}{12}\cdot 0 + \frac{4}{12}\cdot 0 + \frac{6}{12}\left(-\frac{2}{6}\log_2\frac{2}{6} - \frac{4}{6}\log_2\frac{4}{6}\right)\right]$$


Back to the Example

$$IG(\mathit{Patrons}) \approx 0.541 \text{ bits}$$
$$IG(\mathit{Type}) = 1 - \left[\frac{2}{12}\, I\!\left(\frac{1}{2}, \frac{1}{2}\right) + \frac{2}{12}\, I\!\left(\frac{1}{2}, \frac{1}{2}\right) + \frac{4}{12}\, I\!\left(\frac{2}{4}, \frac{2}{4}\right) + \frac{4}{12}\, I\!\left(\frac{2}{4}, \frac{2}{4}\right)\right]$$
$$IG(\mathit{Type}) = 1 - \left[\frac{2}{12} + \frac{2}{12} + \frac{4}{12} + \frac{4}{12}\right] = 0 \text{ bits}$$
Doing the same for all the other attributes shows that Patrons has the highest IG of all.
So Patrons is chosen by the decision tree learning algorithm as the root.

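These numbers can be verified with a few lines of Python (the per-value counts (p_i, n_i) are the ones used in the slides; the helper names are mine):

import math

def I(p, n):
    # Information content of a (p, n) class split, in bits.
    total, h = p + n, 0.0
    for k in (p, n):
        if k:
            h -= (k / total) * math.log2(k / total)
    return h

def remainder(groups, p, n):
    # Weighted information remaining after splitting into (p_i, n_i) groups.
    return sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in groups)

p, n = 6, 6
# Patrons: None -> (0, 2), Some -> (4, 0), Full -> (2, 4)
print(I(p, n) - remainder([(0, 2), (4, 0), (2, 4)], p, n))          # ~0.541
# Type: French (1, 1), Italian (1, 1), Thai (2, 2), Burger (2, 2)
print(I(p, n) - remainder([(1, 1), (1, 1), (2, 2), (2, 2)], p, n))  # 0.0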


Recap on ID3 Algorithm

1. Calculate the Information Gain of each feature.
2. If the rows do not all belong to the same class, split the dataset into subsets using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information Gain.
4. If all rows belong to the same class, make the current node a leaf node with that class as its label.
5. Repeat for the remaining features until we run out of features or the decision tree consists entirely of leaf nodes.
A minimal sketch of this procedure is given below.

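The sketch (Python) assumes the examples are dicts of attribute values with a separate list of labels; the helper names are mine:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Expected reduction in entropy from splitting on attr.
    gain, n = entropy(labels), len(rows)
    for v in {r[attr] for r in rows}:
        sub = [y for r, y in zip(rows, labels) if r[attr] == v]
        gain -= (len(sub) / n) * entropy(sub)
    return gain

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                     # step 4: pure node -> leaf
        return labels[0]
    if not attrs:                                 # no features left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))   # steps 1-3
    tree = {best: {}}
    for v in {r[best] for r in rows}:             # step 5: recurse on each subset
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[best][v] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                            [a for a in attrs if a != best])
    return tree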


Gini Index

The Gini Index is another measure of impurity.
Gini Index equation:
$$Gini = 1 - \sum_{c\,\in\,\text{classes}} P(c)^2$$
Example:

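As an illustration of the formula, here is a minimal Gini computation in Python (the class counts below are made up):

def gini(counts):
    # Gini = 1 - sum_c P(c)^2, computed from a list of per-class counts.
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([6, 6]))    # 0.5   -> maximally impure two-class node
print(gini([12, 0]))   # 0.0   -> pure node
print(gini([2, 4]))    # ~0.444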




Reduction in Gini-index

NOTE (for a two-class problem):
• Gini impurity ranges from 0 to 0.5.
• Entropy ranges from 0 to 1.



Decision Trees and Bias

Decision trees, like any other ML method, are biased towards certain patterns and representations.
Poor performance is often due to a bias mismatch.
Can overfitting be avoided with model aggregation?
~ Decision trees will overfit !!! ~
We must use tricks to find simple trees, such as a fixed depth, early stopping, and pruning (see the sketch below).
Alternatively, we can use ensembles of different trees, called Random Forests.
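With scikit-learn, for example, these tricks map onto constructor parameters; the sketch below is illustrative (the dataset and parameter values are arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: typically fits the training set perfectly (overfits).
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# "Simple tree" tricks: fixed depth and minimum leaf size (early stopping),
# plus cost-complexity pruning via ccp_alpha.
small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, ccp_alpha=0.01,
                               random_state=0).fit(X_tr, y_tr)

print(full.score(X_te, y_te), small.score(X_te, y_te))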
Boosting Technique in Machine Learning

Boosting is considered one of the most powerful yet simple learning approaches in ML.
Its underlying idea is that finding many weak rules of thumb is easier than finding a single highly accurate prediction rule.
The key is to combine the weak rules together to get better performance.



Advantages of Boosting

Boosting offers the following advantages (an illustrative AdaBoost sketch is given below):
• It improves classification accuracy.
• It can be used with many different base classifiers.
• It is commonly used in many areas.
• It is simple to implement.
• It is relatively resistant to overfitting.

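As one concrete boosting algorithm, AdaBoost combines many very shallow trees ("decision stumps"); a minimal scikit-learn sketch (the dataset is synthetic, and the estimator parameter name follows recent scikit-learn versions):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each weak learner is a depth-1 tree; boosting re-weights the training
# examples so that later stumps focus on the earlier mistakes.
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())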


Boosting with Decision Trees

One of the best-known ensemble techniques for decision trees, often introduced alongside boosting, is Bagging, short for Bootstrap Aggregating (unlike boosting, bagging trains its classifiers independently).
Bagging works as follows (a minimal sketch is given after this list):
1. Create k bootstrap samples D1 ... Dk.
• Bootstrap sampling: given a set D containing N training examples, create Di by selecting N examples at random with replacement (repetition allowed) from D.
2. Train a distinct classifier on each Di.
3. Classify a new instance by majority vote / averaging.
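The sketch (Python) uses a manual bootstrap loop for illustration; scikit-learn's BaggingClassifier packages the same idea:

import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
k, N = 25, len(X)

# Steps 1 and 2: draw k bootstrap samples (N draws with replacement)
# and train a distinct tree on each one.
trees = []
for _ in range(k):
    idx = rng.integers(0, N, size=N)      # sampling with repetition
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Step 3: classify a new instance by majority vote.
def predict(x):
    votes = [int(t.predict(x.reshape(1, -1))[0]) for t in trees]
    return Counter(votes).most_common(1)[0][0]

print(predict(X[0]), y[0])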
Example of Bagging

Sampling with repetition:

Then, we build a classifier on each bootstrap sample.



Some Probability of Bootstrap Samples

If we have n data points in the training set:
• The probability that a particular data point is selected in one draw of the bootstrap sample is $\frac{1}{n}$.
• The probability that a particular data point is not selected in one draw is $1 - \frac{1}{n}$.
• So the probability that a given data point is never selected in the bootstrap sample (n independent draws) is $\left(1 - \frac{1}{n}\right)^{n}$.

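For large n, this probability approaches a well-known constant:
$$\lim_{n \to \infty}\left(1 - \frac{1}{n}\right)^{n} = e^{-1} \approx 0.368$$
so, on average, each bootstrap sample contains roughly 63.2% of the distinct training points.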


General Overview of Random Forest



Ensemble Method

The Random Forest ensemble method is designed specifically for decision tree classifiers.
This approach introduces two sources of randomness (see the sketch below):
• Bagging Method: each tree is grown using a bootstrap sample of the training data.
• Random Vector Method: at each node, the best split is chosen from a random sample of m attributes instead of all the attributes.

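Both sources of randomness correspond directly to parameters of scikit-learn's RandomForestClassifier; a minimal sketch (dataset and parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# bootstrap=True      -> each tree is grown on a bootstrap sample (bagging)
# max_features="sqrt" -> each split considers a random subset of attributes
forest = RandomForestClassifier(n_estimators=200, bootstrap=True,
                                max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())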




Remarks on Random Forests

When learning over a large number of features, training a single decision tree is difficult, and the resulting tree may become very large → overfitting.
Instead, we can train small decision trees with limited depth.
Treat these small DT models as experts: each is correct, but only on a small region of the domain. Then combine them with other simple models, typically linear functions, as in boosting.



Advantages and Disadvantages of DTs

Advantages of Decision Trees:
• They can generate understandable rules.
• They perform classification without much computation.
• They can handle both continuous and categorical variables.
• They indicate which fields are important for prediction or classification.



Advantages and Disadvantages of DTs

Disadvantages of Decision Trees:
• They are not well suited to predicting continuous attributes.
• They perform poorly with many classes and a small amount of data.
• They are computationally expensive to train: at each node, each candidate splitting field must be sorted before its best split can be found.
• They do not handle non-rectangular regions well.



End of Chapter 04

