International Journal of Advancements in Research & Technology, Volume 2, Issue 4, April-2013


ISSN 2278-7763

Construction of Decision Tree: Attribute Selection Measures
R. Aruna devi¹, Dr. K. Nirmala²

¹Research Scholar, Manonmaniam Sundaranar University & Asst. Professor, Department of Computer Science, Vidhya Sagar Women's College, Chengalpattu, Chennai, Tamil Nadu, India. Email: [email protected]

²Associate Professor, Department of Computer Science, Quaid-e-Millath Government College for Women (A), Chennai, Tamil Nadu, India. Email: [email protected]

ABSTRACT

An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes. It determines how the tuples at a given node are to be split. The attribute selection measure provides a ranking for each attribute describing the given training tuples, and the attribute having the best score for the measure is chosen as the splitting attribute for the given tuples. This paper performs a comparative study of two attribute selection measures. Information gain is used to select the splitting attribute at each node in the tree; the attribute with the highest information gain is chosen as the splitting attribute for the current node. The Gini index measure uses a binary split for each attribute; the attribute with the minimum gini index is selected as the splitting attribute. The results indicate that attribute selection using the Gini index is more effective and simpler than using Information gain.

Keywords: Heuristics, Information Gain, Gini Index, Attribute selection.

I. INTRODUCTION

Data mining is the extraction of implicit, previously unknown, and potentially useful information from large databases. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans. Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories: Descriptive and Predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.

II. DECISION TREE

Decision trees are powerful and popular tools for classification and prediction. Decision trees represent rules, which can be understood by humans and used in knowledge systems such as databases. Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. The topmost node in a tree is the root node. A tree can be "learned" by splitting the source set into subsets based on an attribute-value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions. In data mining, decision trees can be described as the combination of mathematical and computational techniques that aid the description, categorization and generalization of a given set of data.

The construction of a decision tree classifier does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery. Decision trees can handle high-dimensional data, and in general a decision tree classifier has good accuracy. Decision tree induction is a typical inductive approach to learning knowledge for classification. The key requirements for mining with decision trees are: (1) Attribute-value description: an object or case must be expressible in terms of a fixed collection of properties or attributes. (2) Predefined classes (target attribute values): the categories to which examples are to be assigned must have been established beforehand (supervised data). (3) Discrete classes: a case does or does not belong to a particular class, and there must be more cases than classes. (4) Sufficient data: usually hundreds or even thousands of training cases are needed. Decision tree induction is the learning of decision trees from class-labeled training tuples. During tree construction, attribute selection measures are used to select the attributes that partition the tuples into distinct classes.
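To make the flowchart-like structure and the recursive-partitioning procedure concrete, the following is a minimal Python sketch. It is illustrative only and not from the paper: the Node and build_tree names and the select_attribute callback are ours, and the stopping criteria are reduced to "the node is pure" or "no attributes remain".

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """A tree node: a leaf holds a class label; an internal node tests one
    attribute and has one child per outcome of the test."""
    label: Optional[str] = None          # set only for leaves
    attribute: Optional[int] = None      # index of the attribute tested here
    children: dict = field(default_factory=dict)  # attribute value -> child Node

def build_tree(rows, labels, attributes, select_attribute):
    """Recursive partitioning over (rows, labels); attributes is the list of
    column indices still available for splitting."""
    if len(set(labels)) == 1:            # all tuples in this subset share a class
        return Node(label=labels[0])
    if not attributes:                   # nothing left to split on: majority class
        return Node(label=Counter(labels).most_common(1)[0][0])
    best = select_attribute(rows, labels, attributes)  # e.g. highest gain, lowest gini
    node = Node(attribute=best)
    remaining = [a for a in attributes if a != best]
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[best], []).append((row, y))
    for value, part in partitions.items():
        node.children[value] = build_tree([r for r, _ in part],
                                          [y for _, y in part],
                                          remaining, select_attribute)
    return node
```

The select_attribute callback is where the attribute selection measures of Sections III and IV plug in.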
III. INFORMATION GAIN

This measure is based on pioneering work by Claude Shannon on information theory, which studied the value or "information content" of messages. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. The expected information needed to classify a tuple in D is given by

Info(D) = - Σ (i = 1, ..., m) pi log2(pi)

where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|. Info(D) is the average amount of information needed to identify the class label of a tuple in D; Info(D) is also known as the entropy of D. The expected information required to classify a tuple from D, based on the partitioning of D by attribute A into v partitions D1, ..., Dv, is calculated by

InfoA(D) = Σ (j = 1, ..., v) (|Dj| / |D|) × Info(Dj)

Information gain is defined as the difference between the original information requirement (i.e. based on the classes) and the new requirement (i.e. obtained after partitioning on A):

Gain(A) = Info(D) - InfoA(D)
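As a concrete rendering of these three formulas, here is a minimal Python sketch (the names entropy, info_a and gain are ours, and the data is assumed to be given as parallel lists of attribute values and class labels):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_a(values, labels):
    """InfoA(D): entropy of each partition Dj, weighted by |Dj| / |D|."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    return sum((len(dj) / n) * entropy(dj) for dj in partitions.values())

def gain(values, labels):
    """Gain(A) = Info(D) - InfoA(D)."""
    return entropy(labels) - info_a(values, labels)
```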

IV. GINI INDEX

The Gini index considers a binary split for each attribute. The Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 - Σ (i = 1, ..., m) pi²

where pi is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|. When considering a binary split, we compute a weighted sum of the impurity of each resulting partition. For example, if a binary split on A partitions D into D1 and D2, the gini index of D given that partitioning is

GiniA(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

For each attribute, each of the possible binary splits is considered. For a discrete-valued attribute, the subset that gives the minimum gini index for that attribute is selected as its splitting attribute.
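A matching Python sketch of Gini(D) and GiniA(D), again with our own illustrative names (subset is the set of values of A routed to partition D1; the remaining tuples form D2):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of pi squared over the classes present in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_a(values, labels, subset):
    """GiniA(D) for the binary split D1 = {tuples with A in subset}, D2 = the rest."""
    d1 = [y for v, y in zip(values, labels) if v in subset]
    d2 = [y for v, y in zip(values, labels) if v not in subset]
    n = len(labels)
    return (len(d1) / n) * gini(d1) + (len(d2) / n) * gini(d2)
```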

V. DATASET DESCRIPTION

The main objective of this paper is to select the best attribute measure to construct a decision tree. The dataset used is given in Table 1.

Table 1

Owns home   Married   Gender   Employed   Class
Yes         Yes       Male     Yes        B
No          No        Female   Yes        A
Yes         Yes       Female   Yes        C
Yes         No        Male     No         B
No          Yes       Female   Yes        C
No          No        Female   Yes        A
No          No        Male     No         B
Yes         No        Female   Yes        A
No          Yes       Female   Yes        C
Yes         Yes       Female   Yes        C

VI. EXPERIMENTAL RESULTS AND DISCUSSIONS

In this paper we wish to select the best attribute measure to construct a decision tree, given the data in Table 1. The data tuples are described by the attributes Owns home, Married, Gender, Employed and Class.

6.1 INFORMATION GAIN ATTRIBUTE MEASURE

|D| = 10 tuples, with class A = 3, class B = 3, class C = 4 (M = 3 classes).

Info(D) = -3/10 log2(3/10) - 3/10 log2(3/10) - 4/10 log2(4/10) = 0.521 + 0.521 + 0.529 = 1.57

We can compute the attribute "Owns home":

Info_ownshome(D) = 5/10 [-1/5 log2(1/5) - 2/5 log2(2/5) - 2/5 log2(2/5)] + 5/10 [-2/5 log2(2/5) - 1/5 log2(1/5) - 2/5 log2(2/5)] = 0.761 + 0.761 = 1.52

Gain(ownshome) = Info(D) - Info_ownshome(D) = 1.57 - 1.52 = 0.05

Similarly we can compute the attributes Married, Gender and Employed.

Table 2

Attribute    Info    Gain
Owns home    1.52    0.05
Married      0.847   0.72
Gender       0.69    0.88
Employed     1.12    0.45

Hence, Gender has the highest information gain among the attributes, so it is selected as the splitting attribute.
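For checking, the whole of Table 2 can be reproduced with a short, self-contained script over the Table 1 tuples (the row encoding and the helper below are ours; the printed values agree with Table 2 up to rounding):

```python
import math
from collections import Counter

# Table 1, one tuple per row: (Owns home, Married, Gender, Employed, Class)
rows = [
    ("Yes", "Yes", "Male",   "Yes", "B"), ("No",  "No",  "Female", "Yes", "A"),
    ("Yes", "Yes", "Female", "Yes", "C"), ("Yes", "No",  "Male",   "No",  "B"),
    ("No",  "Yes", "Female", "Yes", "C"), ("No",  "No",  "Female", "Yes", "A"),
    ("No",  "No",  "Male",   "No",  "B"), ("Yes", "No",  "Female", "Yes", "A"),
    ("No",  "Yes", "Female", "Yes", "C"), ("Yes", "Yes", "Female", "Yes", "C"),
]
labels = [r[-1] for r in rows]

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

print(f"Info(D) = {entropy(labels):.2f}")                     # 1.57
for i, name in enumerate(["Owns home", "Married", "Gender", "Employed"]):
    parts = {}
    for r in rows:
        parts.setdefault(r[i], []).append(r[-1])
    info = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    print(f"{name}: Info = {info:.2f}, Gain = {entropy(labels) - info:.2f}")
# Gender gives the largest gain (0.88), so it is chosen as the splitting attribute.
```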
346

International Journal of Advancements in Research & Technology, Volume 2, Issue 4, April-2013


ISSN 2278-7763

6.2 GINI INDEX ATTRIBUTE MEASURE

Total tuples (S) = 10, total classes (M) = 3, with class A = 3, class B = 3, class C = 4.

Now, compute the gini index for each of the attributes.

Attribute = "Owns home"

Gini(D1) = 1 - (1/5)² - (2/5)² - (2/5)² = 0.64
Gini(D2) = 1 - (2/5)² - (1/5)² - (2/5)² = 0.64
Gini_ownshome(D) = 5/10 (0.64) + 5/10 (0.64) = 0.64

Similarly we can compute the attributes Married, Gender and Employed.

Table 3

Attribute    Gini index
Owns home    0.64
Married      0.40
Gender       0.34
Employed     0.47

Here, Gender has the smallest gini index among the attributes, so it is selected as the splitting attribute.
COMPARISON AND RESULTS

For the comparison in our study, we first used Information gain as the attribute selection measure. Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect. A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values.

Secondly, we used the Gini index as the attribute selection measure; it is less time consuming and is particularly suitable for multivalued attributes.

VII. CONCLUSION AND FUTURE DEVELOPMENT

In this paper, a comparative study of two attribute selection measures is presented. The Gini index measure makes it very easy to select the best attribute for constructing a decision tree because of its simplicity, elegance, and robustness. The results indicate that attribute selection using the gini index is very easy and simple compared to information gain. A possible extension of this work is to use other attribute selection measures such as CHAID, C-SEP and MDL-based measures.