Lec-3-Decision Trees

DECISION TREES

Introduction

Decision tree learning is a method that induces concepts from examples (inductive learning).

It is among the most widely used and practical learning methods.

The learning is supervised: the classes or categories of the data instances are known.

It represents concepts as decision trees, which can be rewritten as if-then rules.

The target function can be Boolean or discrete valued.
Training a Decision Tree – The ID3 and CART Algorithms

• Ross Quinlan, computer science (ID3: 1986, C4.5: 1993)
  • Uses entropy as the impurity function
  • Meant primarily for categorical attributes and for classification
• Breiman et al., statistics (CART: 1984)
  • Uses Gini impurity
  • Meant for classification and regression

Both impurity measures are sketched below.
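For concreteness, here is a minimal sketch of the two impurity functions named above, assuming class labels arrive as a plain Python list; the function names and representation are illustrative choices of this write-up, not code from the slides. Later snippets in this section reuse the entropy helper.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels: the impurity used by ID3/C4.5."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels: the impurity used by CART."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

# A perfectly mixed two-class sample is maximally impure under both measures:
print(entropy(["Yes", "No"]))  # 1.0
print(gini(["Yes", "No"]))     # 0.5
```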
Decision Tree Representation

1. Each node corresponds to an attribute
2. Each branch corresponds to an attribute value
3. Each leaf node assigns a classification

A minimal data structure capturing this representation is sketched below.
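As an illustration of this representation (not code from the slides), a decision tree can be stored with one small node type: an internal node records the attribute it tests and one child per attribute value, while a leaf records a classification.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    attribute: Optional[str] = None                              # attribute tested at this node (None for a leaf)
    branches: Dict[str, "Node"] = field(default_factory=dict)    # attribute value -> child subtree
    label: Optional[str] = None                                  # classification assigned at a leaf
```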
Example

[Figure 1: the PlayTennis training data (not reproduced)]
Example

[Figure: a decision tree for the concept PlayTennis. The root node tests Outlook (Sunny, Overcast, Rain); the Sunny branch tests Humidity (High, Normal) and the Rain branch tests Wind (Strong, Weak).]

An unknown observation is classified by testing its attributes and reaching a leaf node.
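Using the Node sketch from the previous section, the PlayTennis tree above can be written out and an unknown observation classified by walking from the root to a leaf. The leaf labels here follow the standard PlayTennis example, since the figure itself is not reproduced.

```python
def classify(node: Node, observation: dict) -> str:
    """Repeatedly follow the branch matching the observation's value for the tested attribute."""
    while node.label is None:                        # stop once a leaf is reached
        node = node.branches[observation[node.attribute]]
    return node.label

play_tennis_tree = Node("Outlook", {
    "Sunny":    Node("Humidity", {"High": Node(label="No"), "Normal": Node(label="Yes")}),
    "Overcast": Node(label="Yes"),
    "Rain":     Node("Wind", {"Strong": Node(label="No"), "Weak": Node(label="Yes")}),
})

print(classify(play_tennis_tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # No
```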
Decision Tree Representation

Decision trees represent a disjunction (OR) of conjunctions (AND) of constraints on the attribute values of instances.

Each path from the tree root to a leaf corresponds to a conjunction of attribute tests (one rule for classification).

The tree itself corresponds to a disjunction of these conjunctions (a set of rules for classification).
Basic Decision Tree Learning Algorithm

Most algorithms for growing decision trees are variants of a single basic algorithm.

An example of this core algorithm is the ID3 algorithm developed by Quinlan (1986).

It employs a top-down, greedy search through the space of possible decision trees: at each step it chooses the split that looks best locally for the current node, without reconsidering earlier choices and without any guarantee of reaching the globally optimal tree. A sketch of this top-down loop is given below.
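A minimal sketch of that greedy loop, assuming each training example is a dict of attribute values plus a "label" key and reusing the Node type from the earlier sketch. The attribute-selection criterion is passed in as a function, since the information gain measure it relies on is only introduced later in these slides.

```python
from collections import Counter

def id3(examples, attributes, choose_best_attribute):
    """Grow a decision tree top-down by greedily splitting on the 'best' attribute."""
    labels = [e["label"] for e in examples]
    if len(set(labels)) == 1:                        # all examples agree: perfect leaf
        return Node(label=labels[0])
    if not attributes:                               # nothing left to test: majority-vote leaf
        return Node(label=Counter(labels).most_common(1)[0][0])
    best = choose_best_attribute(examples, attributes)   # locally optimal choice, never revisited
    node = Node(attribute=best)
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        node.branches[value] = id3(subset, [a for a in attributes if a != best],
                                   choose_best_attribute)
    return node
```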
First of all, we select the best attribute to be tested at the root of the tree.

To make this selection, each attribute is evaluated using a statistical test that determines how well it alone classifies the training examples.
We have:

- 14 observations (D1 to D14)
- 4 attributes
  • Outlook
  • Temperature
  • Humidity
  • Wind
- 2 classes (Yes, No)
[Figure: the 14 training examples partitioned by the root test on Outlook into the Sunny, Overcast and Rain branches]
The selection process is then repeated, using the training examples associated with each descendant node to select the best attribute to test at that point in the tree.
[Figure: the Outlook partition from the previous slide, with the descendant nodes still to be expanded]

What is the "best" attribute to test at this point? The possible choices are Temperature, Wind and Humidity.
Which Attribute is the Best Classifier?

The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree.

We would like to select the attribute that is most useful for classifying examples.

For this we need a good quantitative measure.

For this purpose a statistical property called information gain is used.
Entropy

To define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy.

Entropy tells us how impure a collection of data is; "impure" here means non-homogeneous.

Given a collection of examples S, containing positive and negative examples of some target concept, the entropy of S relative to this Boolean classification is:

Entropy(S) = - p+ log2(p+) - p- log2(p-)

where p+ and p- are the proportions of positive and negative examples in S (taking 0 · log2 0 = 0).
To illustrate this equation, let us calculate the entropy of the data set of Figure 1. The data set has 9 positive instances and 5 negative instances, therefore:

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
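Using the entropy helper sketched earlier, the 0.940 figure can be checked directly:

```python
S = ["Yes"] * 9 + ["No"] * 5        # 9 positive and 5 negative examples
print(f"{entropy(S):.3f}")          # 0.940
```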
Observing equations 1.2, 1.3 and 1.4 closely, we can conclude that if the data set is completely homogeneous then the impurity is 0 and the entropy is 0 (equation 1.4), whereas if the data set can be divided equally into two classes then it is completely non-homogeneous, the impurity is 100%, and the entropy is 1 (equation 1.3).
Information Gain

Given that entropy is a measure of the impurity in a collection of data, we can now measure the effectiveness of an attribute in classifying the training set.

Information gain is simply the expected reduction in entropy caused by partitioning the data set according to an attribute.

The information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of possible values of attribute A, and S_v is the subset of S for which A has value v.
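A direct transcription of this formula into Python, again with examples as dicts carrying a "label" key and reusing the entropy helper from the earlier sketch (names are illustrative):

```python
def information_gain(examples, attribute):
    """Expected reduction in entropy from partitioning the examples on one attribute."""
    total = len(examples)
    labels = [e["label"] for e in examples]
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset_labels = [e["label"] for e in examples if e[attribute] == value]
        remainder += (len(subset_labels) / total) * entropy(subset_labels)
    return entropy(labels) - remainder
```

Plugged into the id3 sketch above as `lambda ex, attrs: max(attrs, key=lambda a: information_gain(ex, a))`, this is the greedy selection criterion these slides describe.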
To make this clearer, let us use this equation to measure the information gain of the attribute Wind on the dataset of Figure 1.

The dataset has 14 instances, of which 9 are positive and 5 are negative.
The attribute Wind can take the values Weak or Strong.

Therefore,
Values(Wind) = {Weak, Strong}

In the standard PlayTennis data, 8 of the 14 instances have Wind = Weak (6 positive, 2 negative) and 6 have Wind = Strong (3 positive, 3 negative), so:

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
              = 0.940 - (8/14)(0.811) - (6/14)(1.000)
              = 0.048
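The arithmetic can be checked with the entropy helper from the earlier sketch; the 6/2 and 3/3 class counts for the two Wind partitions are taken from the standard PlayTennis table, since Figure 1 is not reproduced here.

```python
S      = ["Yes"] * 9 + ["No"] * 5     # the full sample: 9 positive, 5 negative
weak   = ["Yes"] * 6 + ["No"] * 2     # Wind = Weak: 8 examples
strong = ["Yes"] * 3 + ["No"] * 3     # Wind = Strong: 6 examples
gain_wind = entropy(S) - (8/14) * entropy(weak) - (6/14) * entropy(strong)
print(f"{gain_wind:.3f}")             # 0.048
```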
So the information gain of the Wind attribute is 0.048. Calculating the information gain of the Outlook attribute in the same way gives 0.246.
These two examples should make it clear how information gain is calculated.
The information gains of the four attributes of the Figure 1 dataset are:

Gain(S, Outlook)     = 0.246
Gain(S, Humidity)    = 0.151
Gain(S, Wind)        = 0.048
Gain(S, Temperature) = 0.029
Remember, the main goal of measuring information gain is to find the attribute that is most useful for classifying the training set. The ID3 algorithm uses that attribute as the root of the decision tree, and then computes information gain again to choose each subsequent node.

As calculated above, the most useful attribute is Outlook, since it gives us more information than the others. So Outlook will be the root of our tree.

[Figures: the partially built tree with Outlook at the root, and the calculation of Gain(S_Sunny, Humidity) for the Sunny descendant (not reproduced)]
We can now measure the information gain of Temperature and Wind in the same way we measured Gain(S_Sunny, Humidity). Finally, we get:

Gain(S_Sunny, Humidity)    ≈ 0.970
Gain(S_Sunny, Temperature) ≈ 0.570
Gain(S_Sunny, Wind)        ≈ 0.019
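As a spot check, the Humidity figure follows from the class counts on the Sunny branch (5 examples: 2 Yes, 3 No; Humidity = High holds 3 No, Humidity = Normal holds 2 Yes, counts taken from the standard PlayTennis table), reusing the entropy helper:

```python
s_sunny = ["Yes"] * 2 + ["No"] * 3
gain = entropy(s_sunny) - (3/5) * entropy(["No"] * 3) - (2/5) * entropy(["Yes"] * 2)
print(f"{gain:.3f}")   # 0.971 (quoted as 0.970 above, where entropy is rounded first)
```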
So Humidity gives us the most information at this stage: the node below Outlook on the Sunny branch will be Humidity.

The High descendant has only negative examples and the Normal descendant has only positive examples, so both become leaf nodes and cannot be expanded further.

If we expand the Rain descendant by the same procedure, we will see that the Wind attribute provides the most information.

That calculation is left for the reader. The final decision tree is shown in Figure 4:

[Figure 4: the final decision tree for PlayTennis (not reproduced)]
Decision Boundary for Decision Trees
[Figures: decision boundaries produced by decision trees (not reproduced)]
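Since a decision tree tests one attribute at a time, its decision boundary over numeric features is a union of axis-parallel regions. As a hedged illustration (using scikit-learn, which these slides do not mention and which is assumed to be available), fitting a small tree on two numeric features and querying it on a grid makes the axis-aligned regions visible:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy 2-D data set, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = ((X[:, 0] > 0.0) & (X[:, 1] > 0.5)).astype(int)   # a target a depth-2 tree can represent

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)

# Predictions on a grid change only at axis-parallel thresholds (rectangular regions)
xs = np.linspace(-2, 2, 9)
grid = np.array([[a, b] for a in xs for b in xs])
print(tree.predict(grid).reshape(9, 9))
```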
From Decision Trees to Rules

Next step: make rules from the decision tree.

After building the decision tree, we trace each path from the root node to a leaf node, recording the test outcomes as antecedents and the leaf node's classification as the consequent.

For our example we have:

If the Outlook is Sunny and the Humidity is High then No
If the Outlook is Sunny and the Humidity is Normal then Yes
...
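That tracing can be written directly against the Node structure and the play_tennis_tree from the earlier sketches: every root-to-leaf path yields one rule, with the path's tests as antecedents and the leaf label as consequent.

```python
def tree_to_rules(node, antecedents=()):
    """Yield one (antecedents, consequent) pair per root-to-leaf path."""
    if node.label is not None:
        yield list(antecedents), node.label
        return
    for value, child in node.branches.items():
        yield from tree_to_rules(child, antecedents + ((node.attribute, value),))

for tests, outcome in tree_to_rules(play_tennis_tree):
    clauses = " and ".join(f"the {attr} is {val}" for attr, val in tests)
    print(f"If {clauses} then {outcome}")
```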
Hypothesis Space Search

ID3 can be characterized as searching a space of hypotheses for one that fits the training examples.

The space searched is the set of possible decision trees.

ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space.
It begins with the empty tree, then considers progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data.

The evaluation function that guides this hill-climbing search is the information gain measure.
• ID3 searches the space of possible decision trees, doing hill-climbing on information gain.

• It maintains only one hypothesis (unlike Candidate-Elimination), so it cannot tell us how many other viable trees there are.

• It does no backtracking, so it can get stuck in local optima.

• It uses all training examples at each step, which makes the result less sensitive to errors in individual examples.
Reference

Sections 3.4.2 – 3.5 of T. Mitchell, Machine Learning.
