3.1 C4.5 Algorithm
C4.5 Decision Tree Example
https://sefiks.com/2018/05/13/a-step-by-step-c4-5-decision-tree-example/
Decision trees are still a hot topic in the data science world. ID3 is the most common conventional decision tree algorithm, but it has bottlenecks: attributes must be nominal, the dataset must not include missing data, and the algorithm tends to overfit. Ross Quinlan, the inventor of ID3, addressed these bottlenecks and created a new algorithm named C4.5. The new algorithm can create more generalized models, handle continuous data, and tolerate missing values. Additionally, some resources such as Weka refer to this algorithm as J48; it is actually a re-implementation of C4.5 release 8.
Day  Outlook  Temp.  Hum.  Wind    Decision
1    Sunny    85     85    Weak    No
2    Sunny    80     90    Strong  No
...
8    Sunny    72     95    Weak    No
14   Rain     71     80    Strong  No
We will do what we did in the ID3 example. First, we need to calculate the global entropy. There are 14 examples; 9 instances refer to the yes decision and 5 instances refer to the no decision.
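Recomputing this with the standard entropy formula (my own arithmetic, shown here as a check):

Entropy(Decision) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940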
In the ID3 algorithm, we calculated gains for each attribute. Here, we need to calculate gain ratios instead of gains.
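For reference, C4.5's gain ratio normalizes the information gain by the split information of the partition (this is the standard definition; the notation below is mine, since the post's own formulas are not reproduced in this extract):

GainRatio(Decision, A) = Gain(Decision, A) / SplitInfo(Decision, A)
SplitInfo(Decision, A) = -Σi (|Di| / |D|) · log2(|Di| / |D|)

where D1, ..., Dk are the subsets produced by splitting the dataset D on attribute A.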
Wind Attribute
Wind is a nominal attribute. Its possible values are weak and strong.
There are 8 weak wind instances; 2 of them are concluded as no and 6 of them as yes. The remaining 6 instances have strong wind; 3 of them are concluded as yes and 3 as no.
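Recomputing these values with the counts above (approximate, my own arithmetic):

Entropy(Decision | Wind=Weak) = -(6/8)·log2(6/8) - (2/8)·log2(2/8) ≈ 0.811
Entropy(Decision | Wind=Strong) = -(3/6)·log2(3/6) - (3/6)·log2(3/6) = 1
Gain(Decision, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1 ≈ 0.048
SplitInfo(Decision, Wind) = -(8/14)·log2(8/14) - (6/14)·log2(6/14) ≈ 0.985
GainRatio(Decision, Wind) ≈ 0.048 / 0.985 ≈ 0.049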
Outlook Attribute
Outlook is a nominal attribute, too. Its possible values are sunny, overcast and rain.
Notice that log2(0) is actually equal to -∞, but we treat the term 0·log2(0) as 0, since lim (x->0) x·log2(x) = 0. If you wonder about the proof, please look at the referenced post.
There are 5 instances for sunny (2 yes, 3 no), 4 instances for overcast (all yes), and 5 instances for rain (3 yes, 2 no).
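Recomputing the outlook split in the same way (approximate, my own arithmetic):

Entropy(Decision | Outlook=Sunny) = -(2/5)·log2(2/5) - (3/5)·log2(3/5) ≈ 0.971
Entropy(Decision | Outlook=Overcast) = -(4/4)·log2(4/4) - 0·log2(0) = 0
Entropy(Decision | Outlook=Rain) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) ≈ 0.971
Gain(Decision, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0 - (5/14)·0.971 ≈ 0.246
SplitInfo(Decision, Outlook) = -(5/14)·log2(5/14) - (4/14)·log2(4/14) - (5/14)·log2(5/14) ≈ 1.577
GainRatio(Decision, Outlook) ≈ 0.246 / 1.577 ≈ 0.156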
Humidity Attribute
As an exception, humidity is a continuous attribute. We need to convert continuous
values to nominal ones. C4.5 proposes to perform binary split based on a threshold
value. Threshold should be a value which offers maximum gain for that attribute. Let’s
focus on humidity attribute. Firstly, we need to sort humidity values smallest to largest.
Day  Humidity  Decision
7    65        Yes
6    70        No
9    70        Yes
11   70        Yes
13   75        Yes
3    78        Yes
5    80        Yes
10   80        Yes
14   80        No
1    85        No
2    90        No
12   90        Yes
8    95        No
4    96        Yes
Now, we need to iterate over all humidity values and separate the dataset into two parts: instances less than or equal to the current value, and instances greater than the current value. We calculate the gain (or gain ratio) for every candidate split. The value that maximizes the gain becomes the threshold.
* The split described above refers to the two branches of the decision tree: instances with humidity less than or equal to 65, and instances with humidity greater than 65. It does not mean that humidity must not be equal to 65!
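To make the search concrete, here is a small Python sketch (my own illustration, not code from the original post; the rows, entropy, and split_scores names are hypothetical). It replays the threshold search on the humidity/decision pairs from the sorted table above and reports, for example, a gain of about 0.048 at threshold 65 and about 0.102 at threshold 80.

```python
import math
from collections import Counter

# (day, humidity, decision) tuples copied from the sorted table above
rows = [
    (7, 65, "Yes"), (6, 70, "No"), (9, 70, "Yes"), (11, 70, "Yes"),
    (13, 75, "Yes"), (3, 78, "Yes"), (5, 80, "Yes"), (10, 80, "Yes"),
    (14, 80, "No"), (1, 85, "No"), (2, 90, "No"), (12, 90, "Yes"),
    (8, 95, "No"), (4, 96, "Yes"),
]

def entropy(labels):
    """Shannon entropy of a list of class labels; 0 for an empty or pure list."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def split_scores(data, threshold):
    """Gain and gain ratio of the binary split <= threshold vs. > threshold."""
    labels = [decision for _, _, decision in data]
    left = [decision for _, hum, decision in data if hum <= threshold]
    right = [decision for _, hum, decision in data if hum > threshold]
    n = len(data)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - remainder
    # split information: entropy of the partition sizes themselves
    split_info = entropy(["left"] * len(left) + ["right"] * len(right))
    gain_ratio = gain / split_info if split_info > 0 else 0.0
    return gain, gain_ratio

# every distinct humidity value except the largest is a candidate threshold
candidates = sorted({hum for _, hum, _ in rows})[:-1]
for t in candidates:
    gain, ratio = split_scores(rows, t)
    print(f"threshold {t}: gain={gain:.3f}, gain ratio={ratio:.3f}")

best = max(candidates, key=lambda t: split_scores(rows, t)[0])
print("gain-maximizing threshold:", best)  # expected: 80
```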
I think these calculation demonstrations are enough, so I will skip the remaining calculations and write only the results.
As seen, the gain is maximized when the threshold is equal to 80 for humidity. This means that, to create a branch in our tree, we compare the other nominal attributes against the binary split of humidity at 80.
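Recomputing that winning split from the sorted table (9 instances with humidity <= 80, 7 yes / 2 no; 5 instances with humidity > 80, 2 yes / 3 no; approximate, my own arithmetic):

Entropy(Decision | Humidity <= 80) = -(7/9)·log2(7/9) - (2/9)·log2(2/9) ≈ 0.764
Entropy(Decision | Humidity > 80) = -(2/5)·log2(2/5) - (3/5)·log2(3/5) ≈ 0.971
Gain(Decision, Humidity split at 80) = 0.940 - (9/14)·0.764 - (5/14)·0.971 ≈ 0.102
SplitInfo(Decision, Humidity split at 80) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940
GainRatio(Decision, Humidity split at 80) ≈ 0.102 / 0.940 ≈ 0.108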
Let's summarize the calculated gains and gain ratios. The outlook attribute comes with both the maximum gain and the maximum gain ratio. This means that we need to put the outlook decision at the root of the decision tree.
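Collecting the values recomputed above (approximate; the temperature attribute is omitted here because its values are not reproduced in this extract):

Attribute               Gain    Gain ratio
Outlook                 0.246   0.156
Humidity (split at 80)  0.102   0.108
Wind                    0.048   0.049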
Outlook = Sunny
We've split humidity into greater than 80 and less than or equal to 80. Surprisingly, the decision is always no if humidity is greater than 80 when the outlook is sunny. Similarly, the decision is always yes if humidity is less than or equal to 80 for a sunny outlook.
Outlook = Overcast
If the outlook is overcast, then no matter what the temperature, humidity, or wind are, the decision will always be yes.
Outlook = Rain
We've just filtered the rain outlook instances. As seen, the decision is yes when the wind is weak, and it is no when the wind is strong. For example, the two strong-wind instances below are both concluded as no.
Day  Outlook  Temp.  Hum. > 80  Wind    Decision
6    Rain     65     No         Strong  No
14   Rain     71     No         Strong  No
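Putting the three branches together, the resulting tree can be sketched as follows (my own plain-text rendering of the rules stated above):

Outlook = Sunny:
    Humidity > 80  -> No
    Humidity <= 80 -> Yes
Outlook = Overcast -> Yes
Outlook = Rain:
    Wind = Strong -> No
    Wind = Weak   -> Yes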
So, the C4.5 algorithm solves most of the problems in ID3. The algorithm uses gain ratios instead of gains. In this way, it creates more generalized trees and is less prone to overfitting. Moreover, the algorithm transforms continuous attributes into nominal ones based on gain maximization, and in this way it can handle continuous data. Additionally, it can ignore instances with missing values, so it can handle datasets that contain missing data. On the other hand, both ID3 and C4.5 have high CPU and memory demands. Besides, many authorities consider decision tree algorithms to belong to the data mining field rather than machine learning.