Construction of Decision Tree Attribute Selection Measures
¹Research Scholar, Manonmaniam Sundaranar University & Asst. Professor, Department of Computer Science, Vidhya Sagar Women's College, Chengalpattu, Chennai, Tamil Nadu, India. Email: [email protected]
²Associate Professor, Department of Computer Science, Quaid-e-Millath Government College for Women (A), Chennai, Tamil Nadu, India. Email: [email protected]
ABSTRACT
An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes. It determines how the tuples at a given node are to be split. The attribute selection measure provides a ranking for each attribute describing the given training tuples, and the attribute with the best score for the measure is chosen as the splitting attribute for the given tuples. This paper performs a comparative study of two attribute selection measures. Information gain is used to select the splitting attribute at each node in the tree: the attribute with the highest information gain is chosen as the splitting attribute for the current node. The Gini index measure uses a binary split for each attribute: the attribute with the minimum Gini index is selected as the splitting attribute. The results indicate that attribute selection with the Gini index is more effective and simpler than attribute selection with information gain.
Data mining is the extraction of implicit, previously unknown, and potentially useful information from large databases. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans. Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.

Decision trees are powerful and popular tools for classification and prediction. Decision trees represent rules, which can be understood by humans and used in knowledge systems such as databases. Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. The topmost node in a tree is the root node. A tree can be "learned" by splitting the source set into subsets based on an attribute-value test.
This process is repeated on each derived subset in a recursive manner, called recursive partitioning. The recursion is completed when the subsets at a node all have the same value of the target variable, or when splitting no longer adds value to the predictions. In data mining, decision trees can be described as the combination of mathematical and computational techniques that aid the description, categorization and generalization of a given set of data.
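To make the recursive partitioning procedure concrete, the following is a minimal Python sketch of tree growing. The list-of-dictionaries data representation, the tree representation and the names build_tree and select_attribute are illustrative assumptions rather than definitions from this paper; any attribute selection measure (information gain or the Gini index) can be supplied as select_attribute.

from collections import Counter

def build_tree(rows, attributes, target, select_attribute):
    # Stop when every tuple at this node has the same class label.
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:
        return labels[0]
    # Otherwise ask the attribute selection measure for the best split;
    # stop with a majority-class leaf when no useful split remains.
    best = select_attribute(rows, attributes, target) if attributes else None
    if best is None:
        return Counter(labels).most_common(1)[0][0]
    # Recursive partitioning: split on each observed value of the chosen
    # attribute and grow a subtree from each derived subset.
    remaining = [a for a in attributes if a != best]
    subtree = {}
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        subtree[value] = build_tree(subset, remaining, target, select_attribute)
    return (best, subtree)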
The construction of a decision tree classifier does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery. Decision trees can handle high-dimensional data, and in general a decision tree classifier has good accuracy. Decision tree induction is a typical inductive approach to learning classification knowledge. The key requirements for mining with decision trees are: (1) Attribute-value description: an object or case must be expressible in terms of a fixed collection of properties or attributes. (2) Predefined classes (target attribute values): the categories to which examples are to be assigned must have been established beforehand (supervised data). (3) Discrete classes: a case does or does not belong to a particular class, and there must be more cases than classes. (4) Sufficient data: usually hundreds or even thousands of training cases. Decision tree induction is the learning of decision trees from class-labeled training tuples. During tree construction, attribute selection measures are used to select the attributes that partition the tuples into distinct classes.

III. INFORMATION GAIN

This measure is based on pioneering work by Claude Shannon on information theory, which studied the value or "information content" of messages. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. The expected information needed to classify a tuple in D is given by

Info(D) = - Σ_{i=1..m} p_i log2(p_i)

where p_i is the probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|. Info(D) is the average amount of information needed to identify the class label of a tuple in D; Info(D) is also known as the entropy of D. The expected information required to classify a tuple from D, based on the partitioning by attribute A, is calculated by

Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)

Information gain is defined as the difference between the original information requirement (i.e., based on the classes) and the new requirement (i.e., obtained after partitioning on A):

Gain(A) = Info(D) - Info_A(D)
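The quantities Info(D), Info_A(D) and Gain(A) defined above can be computed with a short Python sketch; the representation of the data as a list of dictionaries and the function names are assumptions made here for illustration, not definitions from the paper.

import math
from collections import Counter

def info(labels):
    # Info(D) = -sum over classes of p_i * log2(p_i): the entropy of D.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_a(rows, attribute, target):
    # Info_A(D): entropy of each partition D_j induced by the attribute,
    # weighted by |D_j| / |D|.
    total = len(rows)
    result = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        result += (len(subset) / total) * info(subset)
    return result

def gain(rows, attribute, target):
    # Gain(A) = Info(D) - Info_A(D).
    return info([row[target] for row in rows]) - info_a(rows, attribute, target)

The splitting attribute for a node would then simply be the candidate attribute that maximizes gain.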
IV. GINI INDEX

The Gini index considers a binary split for each attribute. The Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 - Σ_{i=1..m} p_i²

where p_i is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|. When considering a binary split, we compute a weighted sum of the impurity of each resulting partition. For example, if a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning is

Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

For each attribute, each of the possible binary splits is considered. For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as its splitting subset.
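A corresponding Python sketch of the Gini computations is given below; as before, the function names and the label-list representation are illustrative assumptions.

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum over classes of p_i^2.
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    # Gini_A(D) for a binary split of D into partitions D1 and D2,
    # weighted by the relative size of each partition.
    total = len(left_labels) + len(right_labels)
    return (len(left_labels) / total) * gini(left_labels) \
         + (len(right_labels) / total) * gini(right_labels)

For a discrete-valued attribute, gini_split would be evaluated for every grouping of the attribute's values into two subsets, and the grouping with the smallest value kept.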
V. DATASET DESCRIPTION

The main objective of this paper is to select the best attribute measure to construct a decision tree. The data are given in Table 1; each data tuple is described by the attributes Owns Home, Married, Gender, Employed, and Class.

Table 1

6.1 INFORMATION GAIN ATTRIBUTE MEASURE

|D| = 10, A = 3, B = 3, C = 4, m = 3

Info(D) = -3/10 log2(3/10) - 3/10 log2(3/10) - 4/10 log2(4/10) = 0.521 + 0.521 + 0.529 = 1.57

We can compute the attribute "owns home" as

Info_ownshome(D) = 5/10 [-1/5 log2(1/5) - 2/5 log2(2/5) - 2/5 log2(2/5)] + 5/10 [-2/5 log2(2/5) - 1/5 log2(1/5) - 2/5 log2(2/5)] = 0.761 + 0.761 = 1.52

Gain(ownshome) = Info(D) - Info_ownshome(D) = 1.57 - 1.52 = 0.05

Similarly we can compute the gains for the attributes Married, Gender, and Employed.
Table 2
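The arithmetic of Section 6.1 can be checked with a few lines of Python. The class counts in each partition are read off the expressions above (the full contents of Table 1 are not reproduced in this excerpt), and the helper name is an assumption.

import math

def entropy(counts):
    # Entropy of a class-count vector, ignoring empty classes.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# Info(D): 10 tuples with class counts 3, 3 and 4.
info_d = entropy([3, 3, 4])                                          # ~1.571, reported as 1.57

# Info_ownshome(D): two partitions of 5 tuples each, with the class
# distributions used in the computation above.
info_ownshome = 0.5 * entropy([1, 2, 2]) + 0.5 * entropy([2, 1, 2])  # ~1.522, reported as 1.52

gain_ownshome = info_d - info_ownshome                               # ~0.05
print(round(info_d, 2), round(info_ownshome, 2), round(gain_ownshome, 2))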
6.2 GINI INDEX ATTRIBUTE MEASURE

VII. CONCLUSION AND FUTURE DEVELOPMENT