Chapter – 6
Decision Tree Learning
6.1 INTRODUCTION TO DECISION TREE LEARNING MODEL
• Decision tree learning model, one of the most popular supervised
predictive learning models, classifies data instances with high
accuracy and consistency.
• The model performs an inductive inference that reaches a general
conclusion from observed examples.
• This model is used for solving a variety of complex classification applications.
• A decision tree is a concept tree that summarizes the information contained in the training dataset in the form of a tree structure.
• Once the concept model is built, test data can be easily classified.
• This model can be used to classify both categorical target
variables and continuous-valued target variables.
• Given a training dataset X, this model computes a hypothesis function f(X) in the form of a decision tree.
• Inputs to the model are data instances or objects with a set of features or attributes, which can be discrete or continuous, and the output of the model is a decision tree which predicts or classifies the target class for a test data object (a minimal usage sketch follows this list).
• In statistical terms, attributes or features are called independent variables.
• The target feature or target class is called the response variable, which indicates the category we need to predict for a test object.
• The decision tree learning model generates a complete hypothesis space in the form of a tree structure from the given training dataset and allows us to search through the set of possible hypotheses, preferring smaller decision trees as we walk through the tree.
• This kind of search bias is called a preference bias.
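The bullets above describe the model's input/output contract: training instances with features go in, and a tree-shaped hypothesis f(X) comes out that classifies test objects. The following is a minimal sketch of that workflow; scikit-learn's DecisionTreeClassifier and the toy data are assumptions made only for illustration, not the chapter's own example.

from sklearn.tree import DecisionTreeClassifier

# Toy training data: each row is a data instance with two features
# (independent variables); y_train holds the target class (response variable).
X_train = [[25, 40000], [35, 60000], [45, 80000], [20, 20000]]
y_train = ["No", "Yes", "Yes", "No"]

# Fit the hypothesis f(X) as a decision tree from the training dataset.
model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X_train, y_train)

# Classify a test data object.
print(model.predict([[30, 50000]]))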
6.1.1 Structure of a Decision Tree
• A decision tree has a structure that consists of a root node, internal
nodes/decision nodes, branches, and terminal nodes/leaf nodes.
The topmost node in the tree is the root node.
• Internal nodes are the test nodes and are also called decision nodes. These nodes represent a choice or test of an input attribute, and the outcomes of the test condition are the branches emanating from the decision node.
• The branches are labelled as per the outcomes or output values of the test condition. Each branch represents a sub-tree or sub-section of the entire tree.
• Every decision node is part of a path to a leaf node. The leaf nodes
represent the labels or the outcome of a decision path. The labels of
the leaf nodes are the different target classes a data instance can
belong to.
• Every path from the root to a leaf node represents a logical rule that corresponds to a conjunction of attribute tests, and the whole tree represents a disjunction of these conjunctions (a short sketch at the end of this subsection illustrates this).
• The decision tree model, in general, represents a collection of logical classification rules in the form of a tree structure.
• Decision networks, otherwise called influence diagrams, have a directed graph structure with nodes and links.
• A decision network is an extension of Bayesian belief networks that represents information about each node's current state, its possible actions, the possible outcomes of those actions, and their utility.
• Figure 6.1 shows symbols that are used in this book to
represent different nodes in the construction of a decision tree.
• A circle is used to represent a root node, a diamond symbol is
used to represent a decision node or the internal nodes, and all
leaf nodes are represented with a rectangle.
• A decision tree involves two major procedures: building the tree from the training dataset and using it to classify (infer the class of) test instances.
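To make the path-as-rule view concrete, here is a small sketch that walks a hypothetical weather-style tree (represented as nested Python dictionaries, an assumption made only for illustration) and prints one IF-THEN rule per root-to-leaf path; the printed rules together form the disjunction of conjunctions described above.

# Hypothetical tree: each internal node is a dict keyed by its test attribute,
# branches are the attribute's outcome values, and leaves are class labels.
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def print_rules(node, conditions=()):
    # Each root-to-leaf path is a conjunction of attribute tests; the full set
    # of printed rules (one per leaf) is the disjunction of these conjunctions.
    if not isinstance(node, dict):          # leaf node: emit the finished rule
        print("IF " + " AND ".join(conditions) + " THEN class = " + node)
        return
    (attribute, branches), = node.items()   # one test attribute per decision node
    for value, subtree in branches.items():
        print_rules(subtree, conditions + (attribute + " = " + value,))

print_rules(tree)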
Advantages of Decision Trees
• 1. Easy to model and interpret
• 2. Simple to understand
• 3. The input and output attributes can be discrete or continuous
predictor variables.
• 4. Can model a high degree of nonlinearity in the relationship
between the target variables and the predictor variables
• 5. Quick to train
Disadvantages of Decision Trees
• Some of the issues that generally arise with decision tree learning are:
• 1. It is difficult to determine how deeply a decision tree can be grown
or when to stop growing it.
• 2. If training data has errors or missing attribute values, then the
decision tree constructed may become unstable or biased.
• 3. If the training data has continuous-valued attributes, handling them is computationally complex; they have to be discretized.
• 4. A complex decision tree may also be over-fitting with the training
data.
• 5. Decision tree learning is not well suited for classifying multiple
output classes.
• 6. Learning an optimal decision tree is also known to be
NP-complete.
6.1.2 Fundamentals of Entropy
• Given the training dataset with a set of attributes or features, the
decision tree is constructed by finding the attribute or feature
that best describes the target class for the given test instances.
• The best split feature is the one which contains more
information about how to split the dataset among all features so
that the target class is accurately identified for the test
instances.
• In other words, the best split attribute is more informative to
split the dataset into sub datasets and this process is continued
until the stopping criterion is reached.
• The split should be as pure as possible at every stage of selecting the best feature.
• The best feature is selected based on the amount of information it carries, which is calculated from probabilities.
• Quantifying information is closely related to information theory. In
the field of information theory, the features are quantified by a
measure called Shannon Entropy which is calculated based on
the probability distribution of the events.
• Entropy is the amount of uncertainty or randomness in the
outcome of a random variable or an event.
• Moreover, entropy describes the homogeneity of the data instances.
• The best feature is selected based on the entropy value.
• For example, when a coin is flipped, heads or tails are the only two outcomes, so its entropy is lower than that of rolling a die, which has six outcomes (see the sketch below).
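As a rough illustration of this coin-versus-die comparison, the following sketch computes Shannon entropy for uniform outcome distributions; the helper function and probabilities are assumptions for illustration only.

import math

def entropy(probabilities):
    # Shannon entropy H = -sum(p * log2(p)) over the outcome probabilities.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair coin has two equally likely outcomes; a fair die has six, so the
# die roll is more uncertain and its entropy is higher.
print(entropy([1/2, 1/2]))   # 1.0 bit
print(entropy([1/6] * 6))    # about 2.585 bits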
6.2 DECISION TREE INDUCTION ALGORITHMS
• There are many decision tree algorithms, such as ID3, C4.5,
CART, CHAID, QUEST, GUIDE, CRUISE, and CTREE, that are
used for classification in real-time environment.
• The most commonly used decision tree algorithms are ID3 (Iterative Dichotomizer 3), developed by J. R. Quinlan in 1986, and C4.5, an advancement of ID3 presented by the same author in 1993.
• CART, which stands for Classification and Regression Trees, is another algorithm, developed by Breiman et al. in 1984.
• The accuracy of the tree constructed depends upon the selection
of the best split attribute.
• Different algorithms are used for building decision trees which
use different measures to decide on the splitting criterion.
• Algorithms such as ID3, C4.5 and CART are popular algorithms
used in the construction of decision trees.
• The algorithm ID3 uses 'Information Gain' as the splitting criterion, whereas the algorithm C4.5 uses 'Gain Ratio'.
• The CART algorithm is popularly used for classifying both categorical and continuous-valued target variables. CART uses the Gini Index to construct a decision tree (these three measures are sketched after this list).
• Decision trees constructed using ID3 and C4.5 are also called univariate decision trees, which consider only one feature/attribute to split at each decision node, whereas decision trees constructed using the CART algorithm are multivariate decision trees, which consider a conjunction of univariate splits.
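The following is a minimal sketch of the three splitting measures named above, computed for a hypothetical candidate split of labelled instances; the toy labels and helper names are assumptions made only for illustration.

import math
from collections import Counter

def entropy(labels):
    # Entropy of the class distribution in a list of target labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini_index(labels):
    # Gini Index of the class distribution (the measure used by CART).
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    # Information Gain (ID3): parent entropy minus the weighted entropy of the
    # subsets produced by a candidate split.
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

def gain_ratio(parent_labels, subsets):
    # Gain Ratio (C4.5): information gain normalised by the split information,
    # which penalises attributes with many distinct values.
    n = len(parent_labels)
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets if s)
    return information_gain(parent_labels, subsets) / split_info if split_info else 0.0

# Hypothetical split of ten labelled instances into two subsets by some attribute.
parent = ["Yes"] * 6 + ["No"] * 4
subsets = [["Yes"] * 5 + ["No"], ["Yes"] + ["No"] * 3]
print(information_gain(parent, subsets), gain_ratio(parent, subsets), gini_index(parent))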
6.2.1 ID3 Tree
• ID3 is a supervised learning algorithm which uses a training
dataset with labels and constructs a decision tree.
• ID3 is an example of univariate decision trees as it considers
only one feature at each decision node.
• This leads to axis-aligned splits. The tree is then used to classify
the future test instances.
• It constructs the tree using a greedy approach in a top-down
fashion by identifying the best attribute at each level of the
tree.
• ID3 works well if the attributes or features are discrete/categorical. If some attributes are continuous, they have to be partitioned, that is, discretized into nominal attributes or features.
An axis-aligned split function uses only one feature at a time to separate the feature space of training samples by a hyper-plane that is aligned to the feature axes.
• The algorithm builds the tree using a purity measure called 'Information Gain' with the given training data instances and then uses the constructed tree to classify the test data (a minimal construction sketch follows this list).
• It is applied to training sets with only nominal or categorical attributes and with no missing values for classification.
• ID3 works well for a large dataset.
• If the dataset is small, overfitting may occur. Moreover, it is not
accurate if the dataset has missing attribute values.
• No pruning is done during or after construction of the tree and it
is prone to outliers.
• C4.5 and CART can handle both categorical attributes and
continuous attributes.
• Both C4.5 and CART can also handle missing values, but C4.5
is prone to outliers whereas CART can handle outliers as well.
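Below is a minimal sketch of the greedy, top-down, information-gain-driven construction described in the bullets above, restricted to nominal attributes and with no pruning, as ID3 expects. The toy records and helper names are assumptions made only for illustration, not the textbook's worked example.

import math
from collections import Counter

def entropy(rows, target):
    # Entropy of the target-class distribution over a list of records (dicts).
    n = len(rows)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(r[target] for r in rows).values())

def information_gain(rows, attr, target):
    # Parent entropy minus the weighted entropy of the subsets formed by attr.
    n = len(rows)
    gain = entropy(rows, target)
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        gain -= len(subset) / n * entropy(subset, target)
    return gain

def id3(rows, attributes, target):
    # Greedy top-down construction: choose the attribute with the highest
    # information gain, split on it, and recurse on each branch (no pruning).
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:                 # pure node -> leaf
        return classes[0]
    if not attributes:                         # no tests left -> majority class
        return Counter(classes).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    branches = {}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        branches[value] = id3(subset, [a for a in attributes if a != best], target)
    return {best: branches}

# Hypothetical toy dataset with nominal attributes only, as ID3 expects.
data = [
    {"Outlook": "Sunny",    "Wind": "Weak",   "Play": "No"},
    {"Outlook": "Sunny",    "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Wind": "Strong", "Play": "No"},
]
print(id3(data, ["Outlook", "Wind"], "Play"))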
The algorithm C4.5 is based on Occam's Razor, which says that, given two correct solutions, the simpler solution should be chosen. Moreover, the algorithm requires a larger training set for better accuracy. It uses Gain Ratio as a measure during the construction of decision trees. ID3 is more biased towards attributes with a larger number of values.
Inductive bias is the set of assumptions that a machine learning algorithm makes about the relationship between input variables (features) and output variables (labels) based on the training data.
• ID3 applies a hill-climbing search that does not backtrack and may converge to a locally optimal solution that is not globally optimal.
• A shorter tree is preferred, following Occam's razor principle, which states that the simplest solution is the best solution.
• Overfitting is also a general problem with decision trees.
• Once the decision tree is constructed, it must be validated for better
accuracy and to avoid over-fitting and under-fitting.
• There is always a tradeoff between accuracy and complexity of the
tree.
• The tree must be simple and accurate.
• If the tree is more complex, it can classify the training instances accurately, but when test data is given, the constructed tree may perform poorly, meaning misclassifications are higher and accuracy is reduced.
• This problem is called over-fitting.
• To avoid overfitting of the tree, we need to prune the trees and
construct an optimal decision tree.
• Trees can be pre-pruned or post-pruned.
• If tree nodes are pruned during construction, or the construction is stopped early without exploring a node's branches, it is called pre-pruning, whereas if tree nodes are pruned after the construction is over, it is called post-pruning.
• Basically, the dataset is split into three sets called training dataset,
validation dataset and test dataset.
Pruning reduces the size of decision trees by removing parts of the tree that do not provide power to classify instances.
• Generally, 40% of the dataset is used for training the
decision tree and the remaining 60% is used for validation and
testing.
• Once the decision tree is constructed, it is validated with the
validation dataset and the misclassifications are identified.
• Using the number of instances correctly classified and number
of instances wrongly classified, Average Squared Error (ASE)
is computed.
• The tree nodes are pruned based on these computations and
the resulting tree is validated until we get a tree that performs
better.
• Cross validation is another way to construct an optimal decision
tree. Here, the dataset is split into k-folds, among which k–1 folds
are used for training the decision tree and the kth fold is used
for validation and errors are computed.
• The process is repeated k times, each time holding out a different fold for validation, and the mean of the errors is computed for the different trees.
• The tree with the lowest error is chosen, which improves the performance of the tree (a minimal sketch follows this list).
• This tree can now be tested with the test dataset and predictions
are made.
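As a rough sketch of using cross-validation to pick the better-performing (more pruned) tree, the following assumes scikit-learn, its cost-complexity pruning parameter ccp_alpha, and the bundled Iris dataset; none of these are named in the chapter, which describes the procedure only in general terms.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

best_alpha, best_score = None, -1.0
for alpha in [0.0, 0.01, 0.02, 0.05]:   # candidate amounts of pruning
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    # 5-fold cross-validation: train on k-1 folds, validate on the held-out
    # fold, and average the accuracy over the k repetitions.
    score = cross_val_score(tree, X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)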
For Understanding
Another approach is that, after the tree is constructed using the training set, statistical tests such as error estimation and the Chi-square test are used to estimate whether pruning or splitting is required for a particular node in order to obtain a more accurate tree.