ID3 - Formula Based
import pandas as pd
from pandas import DataFrame
from collections import Counter # to hold count of each element
In [5]: df_tennis=pd.read_csv('play_tennis.csv')
In [16]: df_tennis.head()
Out[16]:
day outlook temp humidity wind play
In [17]: df_tennis
Out[17]:
day outlook temp humidity wind play
In [6]: df_tennis.keys()[4]
Out[6]: 'wind'
In [7]: # function to compute the entropy of a list of class probabilities
        def entropy(probs):
            import math
            return sum([-prob * math.log(prob, 2) for prob in probs])
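A quick check of the entropy function on a few probability lists may help. This is a minimal sketch, not the notebook's own cell: it adds a guard for zero probabilities (an assumption on my part, since math.log(0, 2) raises ValueError), and the 9/14 and 5/14 values are a hypothetical class split, not figures read from play_tennis.csv.

```python
import math

def entropy(probs):
    """Shannon entropy (base 2) of a list of class probabilities."""
    # Zero probabilities are skipped: math.log(0, 2) raises ValueError,
    # and p * log(p) tends to 0 as p tends to 0.
    return sum(-p * math.log(p, 2) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # evenly split classes give 1 bit
print(entropy([1.0, 0.0]))     # a pure node gives 0
print(entropy([9/14, 5/14]))   # hypothetical 9 Yes / 5 No split, about 0.94
```

The closer the class probabilities are to an even split, the higher the entropy; a node with a single class has entropy 0.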
In [9]: # wind---strong----yes
# wind---strong---no
# let's separate the independent and dependent variables, i.e. X & Y
# here Y is binary (play: Yes/No)
print('\n Input dataset for entropy calculation:\n',df_tennis['play'])
Classes: No Yes
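To get from a column of labels such as df_tennis['play'] to an entropy value, the labels first have to be turned into class probabilities. The sketch below does this with Counter, which the notebook imports for holding counts. The function name entropy_of_list and the sample labels are assumptions for illustration; they do not come from the text above.

```python
import math
from collections import Counter

def entropy_of_list(values):
    """Entropy (base 2) of the class labels in `values` (hypothetical helper)."""
    counts = Counter(values)      # label -> number of occurrences
    total = len(values)
    probs = [n / total for n in counts.values()]
    # Every counted label has n >= 1, so no probability is zero here.
    return sum(-p * math.log(p, 2) for p in probs)

labels = ['Yes', 'Yes', 'No', 'Yes', 'No', 'No']   # made-up labels
print(entropy_of_list(labels))
```

Because Counter and len both accept any iterable with a length, the same function can be called on a pandas Series such as df_tennis['play'].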
Information Gain = Entropy before splitting - Entropy after splitting

$$IG(S, a) = H(S) - H(S \mid a)$$

$$H(S \mid a) = \sum_{v \in a} \frac{|S_a(v)|}{|S|} \, H(S_a(v))$$

where
IG(S, a) is the information gain for the dataset S when it is split on the variable a,
H(S) is the entropy of the dataset before any split,
H(S | a) is the conditional entropy of the dataset given the variable a,
S_a(v) is the subset of S in which the variable a takes the value v.
NOBS = number of observations.
The .agg function lets you apply a function along one axis.
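The information_gain function called in the next cell is not defined in this text. A minimal sketch that follows the formula above, using groupby with .agg for the per-value entropies and the group sizes as NOBS, might look like the following. The signature is inferred from the call information_gain(df_tennis, 'outlook', 'play'); the body, the helper entropy_of_list, and the toy DataFrame are assumptions, not the notebook's original code or data.

```python
import math
from collections import Counter
import pandas as pd

def entropy_of_list(values):
    """Entropy (base 2) of the class labels in `values` (hypothetical helper)."""
    counts = Counter(values)
    total = len(values)
    return sum(-(n / total) * math.log(n / total, 2) for n in counts.values())

def information_gain(df, split_attribute, target_attribute):
    """IG(S, a) = H(S) - H(S | a) for one attribute of a DataFrame."""
    grouped = df.groupby(split_attribute)[target_attribute]
    entropies = grouped.agg(entropy_of_list)   # H(S_a(v)) for each value v
    nobs = grouped.size()                      # NOBS: observations per value v
    weights = nobs / nobs.sum()                # |S_a(v)| / |S|
    conditional_entropy = (weights * entropies).sum()   # H(S | a)
    return entropy_of_list(df[target_attribute]) - conditional_entropy

# Made-up toy data, not the contents of play_tennis.csv.
toy = pd.DataFrame({
    'wind': ['weak', 'weak', 'strong', 'strong'],
    'play': ['Yes', 'Yes', 'No', 'No'],
})
print(information_gain(toy, 'wind', 'play'))
```

In the toy frame, wind separates the two classes perfectly, so the conditional entropy is 0 and the gain equals the full entropy of play.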
In [12]: print('Information Gain for Outlook is:'
+str(information_gain(df_tennis,'outlook','play')))
Classes: No Yes