International Journal of Advancements in Research & Technology, Volume 2, Issue 4, April-2013


ISSN 2278-7763

Construction of Decision Tree: Attribute Selection Measures
R. Aruna devi¹, Dr. K. Nirmala²

¹Research Scholar, Manonmaniam Sundaranar University & Asst. Professor, Department of Computer Science, Vidhya Sagar Women's College, Chengalpattu, Chennai, Tamil Nadu, India. Email: [email protected]

²Associate Professor, Department of Computer Science, Quaid-e-Millath Government College for Women (A), Chennai, Tamil Nadu, India. Email: [email protected]

ABSTRACT

An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes. It determines how the tuples at a given node are to be split. The attribute selection measure provides a ranking for each attribute describing the given training tuples, and the attribute having the best score for the measure is chosen as the splitting attribute for the given tuples. This paper performs a comparative study of two attribute selection measures. Information gain is used to select the splitting attribute at each node in the tree; the attribute with the highest information gain is chosen as the splitting attribute for the current node. The Gini index measure uses a binary split for each attribute; the attribute with the minimum gini index is selected as the splitting attribute. The results indicate that attribute selection using the Gini index is more effective and simpler than using Information gain.

Keywords: Heuristics, Information Gain, Gini Index, Attribute selection.

I. INTRODUCTION

Data mining is the extraction of implicit, previously unknown, and potentially useful information from large databases. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans. Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories: Descriptive and Predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.

II. DECISION TREE

Decision trees are powerful and popular tools for classification and prediction. Decision trees represent rules, which can be understood by humans and used in knowledge systems such as databases. Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. The topmost node in a tree is the root node. A tree can be "learned" by splitting the source set into subsets based on an attribute-value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions. In data mining, decision trees can be described as the combination of mathematical and computational techniques that aid the description, categorization and generalization of a given set of data.

The construction of a decision tree classifier does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery. Decision trees can handle high-dimensional data, and in general a decision tree classifier has good accuracy. Decision tree induction is a typical inductive approach to learning knowledge for classification. The key requirements for mining with decision trees are: (1) Attribute-value description: an object or case must be expressible in terms of a fixed collection of properties or attributes. (2) Predefined classes (target attribute values): the categories to which examples are to be assigned must have been established beforehand (supervised data). (3) Discrete classes: a case does or does not belong to a particular class, and there must be more cases than classes. (4) Sufficient data: usually hundreds or even thousands of training cases are needed. Decision tree induction is the learning of decision trees from class-labeled training tuples. During tree construction, attribute selection measures are used to select the attributes that partition the tuples into distinct classes.
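To make the flowchart-like structure and the recursive-partitioning procedure concrete, the following is a minimal Python sketch. It is illustrative only and not from the paper: the Node and build_tree names and the select_attribute callback are ours, and the stopping criteria are reduced to "the node is pure" or "no attributes remain".

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """A tree node: a leaf holds a class label; an internal node tests one
    attribute and has one child per outcome of the test."""
    label: Optional[str] = None          # set only for leaves
    attribute: Optional[int] = None      # index of the attribute tested here
    children: dict = field(default_factory=dict)  # attribute value -> child Node

def build_tree(rows, labels, attributes, select_attribute):
    """Recursive partitioning over (rows, labels); attributes is the list of
    column indices still available for splitting."""
    if len(set(labels)) == 1:            # all tuples in this subset share a class
        return Node(label=labels[0])
    if not attributes:                   # nothing left to split on: majority class
        return Node(label=Counter(labels).most_common(1)[0][0])
    best = select_attribute(rows, labels, attributes)  # e.g. highest gain, lowest gini
    node = Node(attribute=best)
    remaining = [a for a in attributes if a != best]
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[best], []).append((row, y))
    for value, part in partitions.items():
        node.children[value] = build_tree([r for r, _ in part],
                                          [y for _, y in part],
                                          remaining, select_attribute)
    return node
```

The select_attribute callback is where the attribute selection measures of Sections III and IV plug in.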
III. INFORMATION GAIN

This measure is based on pioneering work by Claude Shannon on information theory, which studied the value or "information content" of messages. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. The expected information needed to classify a tuple in D is given by

Info(D) = - Σ (i = 1, ..., m) pi log2(pi)

where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|. Info(D) is the average amount of information needed to identify the class label of a tuple in D; Info(D) is also known as the entropy of D. The expected information required to classify a tuple from D, based on the partitioning of D by attribute A into v partitions D1, ..., Dv, is calculated by

InfoA(D) = Σ (j = 1, ..., v) (|Dj| / |D|) × Info(Dj)

Information gain is defined as the difference between the original information requirement (i.e. based on the classes) and the new requirement (i.e. obtained after partitioning on A):

Gain(A) = Info(D) - InfoA(D)
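As a concrete rendering of these three formulas, here is a minimal Python sketch (the names entropy, info_a and gain are ours, and the data is assumed to be given as parallel lists of attribute values and class labels):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_a(values, labels):
    """InfoA(D): entropy of each partition Dj, weighted by |Dj| / |D|."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    return sum((len(dj) / n) * entropy(dj) for dj in partitions.values())

def gain(values, labels):
    """Gain(A) = Info(D) - InfoA(D)."""
    return entropy(labels) - info_a(values, labels)
```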

IV. GINI INDEX

The Gini index considers a binary split for each attribute. The Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 - Σ (i = 1, ..., m) pi²

where pi is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|. When considering a binary split, we compute a weighted sum of the impurity of each resulting partition. For example, if a binary split on A partitions D into D1 and D2, the gini index of D given that partitioning is

GiniA(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

For each attribute, each of the possible binary splits is considered. For a discrete-valued attribute, the subset that gives the minimum gini index for that attribute is selected as its splitting attribute.
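A matching Python sketch of Gini(D) and GiniA(D), again with our own illustrative names (subset is the set of values of A routed to partition D1; the remaining tuples form D2):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of pi squared over the classes present in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_a(values, labels, subset):
    """GiniA(D) for the binary split D1 = {tuples with A in subset}, D2 = the rest."""
    d1 = [y for v, y in zip(values, labels) if v in subset]
    d2 = [y for v, y in zip(values, labels) if v not in subset]
    n = len(labels)
    return (len(d1) / n) * gini(d1) + (len(d2) / n) * gini(d2)
```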

V. DATASET DESCRIPTION

The main objective of this paper is to select the best attribute measure to construct a decision tree. The dataset used is given in Table 1.

Table 1

Owns home   Married   Gender   Employed   Class
Yes         Yes       Male     Yes        B
No          No        Female   Yes        A
Yes         Yes       Female   Yes        C
Yes         No        Male     No         B
No          Yes       Female   Yes        C
No          No        Female   Yes        A
No          No        Male     No         B
Yes         No        Female   Yes        A
No          Yes       Female   Yes        C
Yes         Yes       Female   Yes        C

VI. EXPERIMENTAL RESULTS AND DISCUSSIONS

In this paper we wish to select the best attribute measure to construct a decision tree, given the data in Table 1. The data tuples are described by the attributes Owns home, Married, Gender, Employed and Class.

6.1 INFORMATION GAIN ATTRIBUTE MEASURE

|D| = 10 tuples, with class A = 3, class B = 3, class C = 4 (M = 3 classes).

Info(D) = -3/10 log2(3/10) - 3/10 log2(3/10) - 4/10 log2(4/10) = 0.521 + 0.521 + 0.529 = 1.57

We can compute the attribute "Owns home":

Info_ownshome(D) = 5/10 [-1/5 log2(1/5) - 2/5 log2(2/5) - 2/5 log2(2/5)] + 5/10 [-2/5 log2(2/5) - 1/5 log2(1/5) - 2/5 log2(2/5)] = 0.761 + 0.761 = 1.52

Gain(ownshome) = Info(D) - Info_ownshome(D) = 1.57 - 1.52 = 0.05

Similarly we can compute the attributes Married, Gender and Employed.

Table 2

Attribute    Info    Gain
Owns home    1.52    0.05
Married      0.847   0.72
Gender       0.69    0.88
Employed     1.12    0.45

Hence, Gender has the highest information gain among the attributes, so it is selected as the splitting attribute.
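For checking, the whole of Table 2 can be reproduced with a short, self-contained script over the Table 1 tuples (the row encoding and the helper below are ours; the printed values agree with Table 2 up to rounding):

```python
import math
from collections import Counter

# Table 1, one tuple per row: (Owns home, Married, Gender, Employed, Class)
rows = [
    ("Yes", "Yes", "Male",   "Yes", "B"), ("No",  "No",  "Female", "Yes", "A"),
    ("Yes", "Yes", "Female", "Yes", "C"), ("Yes", "No",  "Male",   "No",  "B"),
    ("No",  "Yes", "Female", "Yes", "C"), ("No",  "No",  "Female", "Yes", "A"),
    ("No",  "No",  "Male",   "No",  "B"), ("Yes", "No",  "Female", "Yes", "A"),
    ("No",  "Yes", "Female", "Yes", "C"), ("Yes", "Yes", "Female", "Yes", "C"),
]
labels = [r[-1] for r in rows]

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

print(f"Info(D) = {entropy(labels):.2f}")                     # 1.57
for i, name in enumerate(["Owns home", "Married", "Gender", "Employed"]):
    parts = {}
    for r in rows:
        parts.setdefault(r[i], []).append(r[-1])
    info = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    print(f"{name}: Info = {info:.2f}, Gain = {entropy(labels) - info:.2f}")
# Gender gives the largest gain (0.88), so it is chosen as the splitting attribute.
```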
346

International Journal of Advancements in Research & Technology, Volume 2, Issue 4, April-2013


ISSN 2278-7763

6.2 GINI INDEX ATTRIBUTE MEASURE

Total tuples (S) = 10, total classes (M) = 3, with class A = 3, class B = 3, class C = 4.

Now, compute the gini index for each of the attributes.

Attribute = "Owns home"

Gini(D1) = 1 - (1/5)² - (2/5)² - (2/5)² = 0.64
Gini(D2) = 1 - (2/5)² - (1/5)² - (2/5)² = 0.64
Gini_ownshome(D) = 5/10 (0.64) + 5/10 (0.64) = 0.64

Similarly we can compute the attributes Married, Gender and Employed.

Table 3

Attribute    Gini index
Owns home    0.64
Married      0.40
Gender       0.34
Employed     0.47

Here, Gender has the smallest gini index among the attributes, so it is selected as the splitting attribute.
COMPARISON AND RESULTS

For the comparison in our study, we first used Information gain as the attribute selection measure. Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect. A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values.

Secondly, we used the Gini index as the attribute selection measure; it is less time consuming and is particularly suitable for multivalued attributes.

VII. CONCLUSION AND FUTURE DEVELOPMENT

In this paper, a comparative study of two attribute selection measures is presented. The Gini index measure makes it very easy to select the best attribute for constructing a decision tree because of its simplicity, elegance, and robustness. The results indicate that attribute selection using the gini index is very easy and simple compared to information gain. A possible extension of this work is to use other attribute selection measures such as CHAID, C-SEP and MDL-based measures.