Unit EN: Decision Tree Algorithms
Lecture notes by Stein/Lettmann, 2005-2015.
Source: http://www.uni-weimar.de/medien/webis/teaching/lecturenotes/machine-learning/unit-en-decision-trees-algorithms.pdf

Chapter ML:III

III. Decision Trees


❑ Decision Trees Basics
❑ Impurity Functions
❑ Decision Tree Algorithms
❑ Decision Tree Pruning


Decision Tree Algorithms


ID3 Algorithm

[Quinlan 1986] [CART Algorithm]

Characterization of the model (model world) [ML Introduction]:

❑ X is a set of feature vectors, also called feature space.
❑ C is a set of classes.
❑ c : X → C is the ideal classifier for X.
❑ D = {(x1, c(x1)), ..., (xn, c(xn))} ⊆ X × C is a set of examples.

Task: Based on D, construction of a decision tree T to approximate c.



Characteristics of the ID3 algorithm:

1. Each splitting is based on one nominal feature and considers its complete domain.
   Splitting based on feature A with domain {a1, ..., ak}:

     X = {x ∈ X : x|A = a1} ∪ ... ∪ {x ∈ X : x|A = ak}

2. Splitting criterion is the information gain.


Decision Tree Algorithms


ID3 Algorithm

[Mitchell 1997] [algorithm template]

ID3(D, Attributes, Target)


❑ Create a node t for the tree.
❑ Label t with the most common value of Target in D.
❑ If all examples in D are positive, return the single-node tree t, with label +.
❑ If all examples in D are negative, return the single-node tree t, with label −.
❑ If Attributes is empty, return the single-node tree t.
❑ Otherwise:
  – Let A* be the attribute from Attributes that best classifies the examples in D.
  – Assign t the decision attribute A*.
  – For each possible value a of A* do:
    – Add a new tree branch below t, corresponding to the test A* = a.
    – Let D_a be the subset of D that has value a for A*.
    – If D_a is empty,
      then add a leaf node with label the most common value of Target in D,
      else add the subtree ID3(D_a, Attributes \ {A*}, Target).
❑ Return t.


Decision Tree Algorithms


ID3 Algorithm (pseudo code)

[algorithm template]

ID3(D, Attributes, Target)


1. t = createNode()
2. label(t) = mostCommonClass(D, Target)
3. IF ∀ ⟨x, c(x)⟩ ∈ D : c(x) = c  THEN return(t) ENDIF
4. IF Attributes = ∅  THEN return(t) ENDIF
5. A* = argmax_{A ∈ Attributes} ( informationGain(D, A) )
6. FOREACH a ∈ A* DO
     D_a = {(x, c(x)) ∈ D : x|A* = a}
     IF D_a = ∅ THEN
       t′ = createNode()
       label(t′) = mostCommonClass(D, Target)
       createEdge(t, a, t′)
     ELSE
       createEdge(t, a, ID3(D_a, Attributes \ {A*}, Target))
     ENDIF
   ENDDO
7. return(t)
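
To make the control flow concrete, here is a minimal Python sketch of the pseudo code above. It is an illustration, not the original implementation: the names id3, information_gain, and entropy as well as the dictionary-based example representation are chosen for this sketch, and the attribute domains are passed explicitly so that empty subsets D_a are handled as in Step 6.

from collections import Counter
from math import log2

def entropy(examples, target):
    # H(C), estimated from the relative class frequencies in `examples`
    n = len(examples)
    counts = Counter(x[target] for x in examples)
    return -sum((k / n) * log2(k / n) for k in counts.values())

def information_gain(examples, attribute, target):
    # H(C) - H(C | attribute), estimated on `examples`
    n = len(examples)
    remainder = 0.0
    for value in {x[attribute] for x in examples}:
        subset = [x for x in examples if x[attribute] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, domains, target):
    # Returns a class label (leaf) or a pair (attribute, {value: subtree}).
    majority = Counter(x[target] for x in examples).most_common(1)[0][0]
    classes = {x[target] for x in examples}
    if len(classes) == 1:                    # Step 3: D is pure
        return classes.pop()
    if not attributes:                       # Step 4: no attributes left
        return majority
    best = max(attributes,                   # Step 5: maximize the information gain
               key=lambda a: information_gain(examples, a, target))
    branches = {}
    for value in domains[best]:              # Step 6: complete domain of A*
        subset = [x for x in examples if x[best] == value]
        if not subset:                       # D_a is empty: leaf with majority class of D
            branches[value] = majority
        else:
            branches[value] = id3(subset, [a for a in attributes if a != best],
                                  domains, target)
    return (best, branches)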


Remarks:

❑ Target designates the feature (= attribute) that is comprised of the labels according to
  which an example can be classified. Within Mitchell's algorithm the respective class labels
  are + and −, modeling the binary classification situation. In the pseudo code version,
  Target may be comprised of multiple (more than two) classes.

❑ Step 3 of the ID3 algorithm checks the purity of D and, given this case, assigns the
  unique class c, c ∈ dom(Target), as label to the respective node.


Decision Tree Algorithms


ID3 Algorithm: Example
Example set D for mushrooms, implicitly defining a feature space X over the three
dimensions color, size, and points:

     Color    Size    Points   Eatability
 1   red      small   yes      toxic
 2   brown    small   no       eatable
 3   brown    large   yes      eatable
 4   green    small   no       eatable
 5   red      large   no       eatable

Decision Tree Algorithms


ID3 Algorithm: Example

(continued)

Top-level call of ID3. Analyze a splitting with regard to the feature color :

  D|color    toxic   eatable
  red          1        1
  brown        0        2
  green        0        1

  |D_red| = 2,  |D_brown| = 2,  |D_green| = 1

Estimated a-priori probabilities:

  p_red = 2/5 = 0.4,   p_brown = 2/5 = 0.4,   p_green = 1/5 = 0.2

Conditional entropy values for all attributes (0·log2(0) is taken as 0):

  H(C | color)  = −(0.4·(1/2·log2(1/2) + 1/2·log2(1/2)) +
                    0.4·(0/2·log2(0/2) + 2/2·log2(2/2)) +
                    0.2·(0/1·log2(0/1) + 1/1·log2(1/1)))  =  0.4

  H(C | size)   ≈ 0.55
  H(C | points) = 0.4

Remarks:

❑ The smaller H(C | feature) is, the larger the information gain becomes. Hence, the
  difference H(C) − H(C | feature) need not be computed, since H(C) is constant within
  each recursion step.

❑ In the example, the information gain in the first recursion step is maximum for the two
  features color and points.
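
As a quick numerical check of these values, the conditional entropies can be recomputed from the example set D. This is a small self-contained sketch; the tuple layout and the helper name cond_entropy are chosen here, not taken from the slides.

from collections import Counter
from math import log2

# Example set D: (color, size, points, eatability)
D = [("red",   "small", "yes", "toxic"),
     ("brown", "small", "no",  "eatable"),
     ("brown", "large", "yes", "eatable"),
     ("green", "small", "no",  "eatable"),
     ("red",   "large", "no",  "eatable")]
FEATURES = {"color": 0, "size": 1, "points": 2}

def cond_entropy(feature):
    # H(C | feature) = sum over values a of p(a) * H(C | feature = a)
    idx, n = FEATURES[feature], len(D)
    h = 0.0
    for a in {row[idx] for row in D}:
        labels = [row[3] for row in D if row[idx] == a]
        counts = Counter(labels)
        h_a = -sum(k / len(labels) * log2(k / len(labels)) for k in counts.values())
        h += len(labels) / n * h_a
    return h

for f in FEATURES:
    print(f, round(cond_entropy(f), 2))   # color: 0.4, size: 0.55, points: 0.4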


Decision Tree Algorithms


ID3 Algorithm: Example

(continued)

Decision tree before the first recursion step:


  attribute: points
  ├─ yes → remaining examples: (red, small, toxic), (brown, large, eatable)
  └─ no  → remaining examples: (brown, small, eatable), (green, small, eatable), (red, large, eatable)

The feature points was chosen in Step 5 of the ID3 algorithm.


Decision Tree Algorithms


ID3 Algorithm: Example

(continued)

Decision tree before the second recursion step:


  attribute: points
  ├─ yes → attribute: color
  │        ├─ red   → remaining examples: (small, toxic)
  │        ├─ brown → remaining examples: (large, eatable)
  │        └─ green → -/- (no examples)
  └─ no  → remaining examples: (brown, small, eatable), (green, small, eatable), (red, large, eatable)

The feature color was chosen in Step 5 of the ID3 algorithm.


Decision Tree Algorithms


ID3 Algorithm: Example

(continued)

Final decision tree after the second recursion step:

  attribute: points
  ├─ yes → attribute: color
  │        ├─ red   → label: toxic
  │        ├─ brown → label: eatable
  │        └─ green → label: toxic
  └─ no  → label: eatable

Breaking a tie: the class toxic is chosen for D_green in Step 6 of the ID3 algorithm.
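
Classifying an unseen mushroom with this tree amounts to following one branch per test. A small traversal helper, assuming the (attribute, {value: subtree}) representation from the id3 sketch above:

def classify(tree, example):
    # Leaves are class labels; inner nodes are (attribute, {value: subtree}) pairs.
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

# e.g. a small green mushroom with points follows points = yes, color = green:
# classify(tree, {"color": "green", "size": "small", "points": "yes"})  ->  "toxic"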


Decision Tree Algorithms


ID3 Algorithm: Hypothesis Space

[Figure omitted: the greedy search through the hypothesis space, shown as a sequence of
increasingly refined partial decision trees that test candidate attributes A1, A2, A3, A4 at
their nodes.]

Decision Tree Algorithms


ID3 Algorithm: Inductive Bias
Inductive bias is the rigidity in applying the (little bit of) knowledge learned from a
training set for the classification of unseen feature vectors.

Observations:

❑ Decision tree search happens in the space of all hypotheses.
  → The target concept is a member of the hypothesis space.

❑ To generate a decision tree, the ID3 algorithm needs per branch at most as many
  decisions as features are given.
  → no backtracking takes place
  → local optimization of decision trees


Where the inductive bias of the ID3 algorithm becomes manifest:

❑ Small decision trees are preferred.
❑ Highly discriminative features tend to be closer to the root.

Is this justified?

Remarks:

❑ Let dom(Aj) be the finite domain (the possible values) of feature Aj, j = 1, ..., p, and let C
  be a set of classes. Then a hypothesis space H that is comprised of all decision trees
  corresponds to the set of all functions h, h : dom(A1) × ... × dom(Ap) → C. Typically,
  C = {0, 1}.

❑ The inductive bias of the ID3 algorithm is of a different kind than the inductive bias of the
  candidate elimination algorithm (version space algorithm):

  1. The underlying hypothesis space H of the candidate elimination algorithm is
     incomplete. H corresponds to a coarsened view onto the space of all hypotheses since
     H contains only conjunctions of attribute-value pairs as hypotheses. However, this
     restricted hypothesis space is searched completely by the candidate elimination
     algorithm. Keyword: restriction bias

  2. The underlying hypothesis space H of the ID3 algorithm is complete. H corresponds to
     the set of all discrete functions (from the Cartesian product of the feature domains onto
     the set of classes) that can be represented in the form of a decision tree. However, this
     complete hypothesis space is searched incompletely (following a preference).
     Keyword: preference bias or search bias

❑ The inductive bias of the ID3 algorithm renders the algorithm robust with respect to noise.
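
For illustration (a worked count, not from the slides): in the mushroom example the features color, size, and points have 3, 2, and 2 possible values, so the feature space contains 3 · 2 · 2 = 12 distinct feature vectors; with C = {toxic, eatable} the complete hypothesis space hence comprises |C|^12 = 2^12 = 4096 different functions h, each of which can be represented as a decision tree.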


Decision Tree Algorithms


CART Algorithm

[Breiman 1984] [ID3 Algorithm]

Characterization of the model (model world) [ML Introduction]:

❑ X is a set of feature vectors, also called feature space. No restrictions are presumed
  for the measurement scales of the features.
❑ C is a set of classes.
❑ c : X → C is the ideal classifier for X.
❑ D = {(x1, c(x1)), ..., (xn, c(xn))} ⊆ X × C is a set of examples.

Task: Based on D, construction of a decision tree T to approximate c.



Characteristics of the CART algorithm:
1. Each splitting is binary and considers one feature at a time.
2. Splitting criterion is the information gain or the Gini index.


Decision Tree Algorithms


CART Algorithm

(continued)

1. Let A be a feature with domain dom(A). Ensure a finite number of binary splittings of X
   by applying the following domain partitioning rules:

   – If A is nominal, choose A′ ⊂ dom(A) such that 0 < |A′| ≤ |dom(A) \ A′|.
   – If A is ordinal, choose a ∈ dom(A) such that x_min < a < x_max, where x_min, x_max
     are the minimum and maximum values of feature A in D.
   – If A is numeric, choose a ∈ dom(A) such that a = (x_k + x_l)/2, where x_k, x_l are
     consecutive elements in the ordered value list of feature A in D.

2. For node t of a decision tree generate all splittings of the above type.

3. Choose a splitting from the set of splittings that maximizes the impurity reduction Δι:

     Δι(D(t), {D(t_L), D(t_R)}) = ι(t) − (|D_L| / |D|) · ι(t_L) − (|D_R| / |D|) · ι(t_R),

   where t_L and t_R denote the left and right successor of t.
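
The split generation and selection can be sketched in Python as follows. This is a minimal illustration rather than the CART reference implementation: the names gini, candidate_splits, and best_split are chosen here, only the values observed in D are enumerated, and ordinal values are assumed to be encoded so that < reflects their order.

from collections import Counter
from itertools import combinations

def gini(labels):
    # Gini index: 1 - sum_c p(c)^2, estimated from the class frequencies in `labels`
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def candidate_splits(values, scale):
    # Yield predicates "goes to the left child?" for the binary splittings of one feature.
    vals = sorted(set(values))
    if scale == "nominal":
        # non-empty subsets A' with |A'| <= |dom(A) \ A'| (up to complement symmetry)
        for r in range(1, len(vals) // 2 + 1):
            for subset in combinations(vals, r):
                yield (lambda s: (lambda v: v in s))(set(subset))
    else:
        # ordinal/numeric: one cut point between each pair of consecutive observed values
        for lo, hi in zip(vals, vals[1:]):
            a = (lo + hi) / 2 if scale == "numeric" else hi
            yield (lambda t: (lambda v: v < t))(a)

def best_split(examples, labels, scales):
    # Choose the (feature, predicate) pair maximizing the impurity reduction
    # gini(t) - |D_L|/|D| * gini(t_L) - |D_R|/|D| * gini(t_R).
    parent, n = gini(labels), len(labels)
    best, best_reduction = None, -1.0
    for feature, scale in scales.items():
        column = [x[feature] for x in examples]
        for pred in candidate_splits(column, scale):
            left = [c for v, c in zip(column, labels) if pred(v)]
            right = [c for v, c in zip(column, labels) if not pred(v)]
            if not left or not right:
                continue
            reduction = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
            if reduction > best_reduction:
                best, best_reduction = (feature, pred), reduction
    return best, best_reduction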


Decision Tree Algorithms


CART Algorithm

(continued)

Illustration for two numeric features; i.e., the feature space X corresponds to a
two-dimensional plane:
[Figure omitted: a splitting tree with nodes t1, ..., t9 and the induced partitioning of the
feature space X = X(t1) into regions X(t2), ..., X(t9), labeled with the classes c1, c2, c3.]

❑ By a sequence of splittings the feature space X is partitioned into rectangles that are
  parallel to the two axes.
