Lecture #4 — Decision Tree
Machine Learning I (Apprentissage Automatique I)
Professor: Abdelouahab Moussaoui
1 — Introduction
The decision tree algorithm is a popular supervised machine learning algorithm thanks to its simple approach to dealing with complex datasets. Decision trees get their name from their resemblance to a tree, with roots, branches and leaves represented as nodes and edges. They are used for decision analysis, much like a flowchart of if-else decisions that leads to the required prediction. The tree learns these if-else decision rules in order to split the dataset and build a tree-like model.
Decision trees are used to predict discrete outcomes in classification problems and continuous numeric outcomes in regression problems. Many algorithms have been developed over the years, such as CART and C4.5, as well as tree ensembles such as Random Forests and Gradient Boosted Trees.
2 — Characteristics
• A Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for classification. It is a tree-structured classifier in which internal nodes represent features of the dataset, branches represent decision rules and each leaf node represents an outcome.
• A decision tree contains two types of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
• It is a graphical representation of all the possible solutions to a problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which expands into further branches and constructs a tree-like structure.
• To build the tree, we can use the CART algorithm, which stands for Classification And Regression Tree.
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
• The diagram below illustrates the general structure of a decision tree.
Note: A decision tree can handle categorical data (Yes/No) as well as numeric data.
• Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
Example: suppose a candidate has received a job offer and wants to decide whether to accept it or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by an attribute selection measure, ASM). The root node splits further into the next decision node (distance from the office) and one leaf node, based on the corresponding labels. The next decision node splits again into one decision node (cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the diagram below:
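As a quick illustration of how such a tree can be built in practice, here is a minimal sketch using scikit-learn's DecisionTreeClassifier (a CART-style tree, as discussed later). The toy "job offer" data, feature names and values below are invented for illustration and are not the course's example.

```python
# Minimal sketch: fit a small decision tree on an invented "job offer" dataset
# and print its learned if-else rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: salary (k$), distance to the office (km), cab facility (1 = yes, 0 = no)
X = [
    [30, 5, 0], [30, 25, 1], [45, 30, 0], [60, 10, 0],
    [60, 35, 1], [75, 40, 0], [75, 8, 0], [90, 20, 1],
]
y = [0, 0, 0, 1, 1, 0, 1, 1]  # 1 = offer accepted, 0 = offer declined

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

# The learned tree is exactly a flowchart of if-else decision rules.
print(export_text(clf, feature_names=["salary", "distance", "cab_facility"]))

# Predict for a new offer: salary 65k, 12 km away, no cab facility.
print(clf.predict([[65, 12, 0]]))
```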
3 — Attribute Selection Measures
To decide which attribute to place at each node, decision tree algorithms rely on attribute selection measures (ASM) such as information gain and the Gini index. These criteria compute a value for every attribute; the values are sorted and the attributes are placed in the tree in that order, i.e. the attribute with the highest value (in the case of information gain) is placed at the root.
When Information Gain is used as the criterion, the attributes are assumed to be categorical, while for the Gini index the attributes are assumed to be continuous.
1 — Entropy
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. Flipping a fair coin is an example of an action whose outcome is random.
Plotting the entropy H(X) against the probability of one class shows that H(X) is zero when the probability is either 0 or 1. Entropy is maximal when the probability is 0.5, because this corresponds to perfect randomness in the data and there is no chance of perfectly determining the outcome.
ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch with entropy greater than zero needs further splitting.
Mathematically, the entropy for a single attribute is represented as:
Entropy(S) = − Σᵢ pᵢ · log₂(pᵢ),   i = 1, …, c
where pᵢ is the proportion of examples belonging to class i and c is the number of classes.
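As a quick numerical check of this behaviour (a sketch added here, not part of the original notes), the binary entropy is 0 at p = 0 or p = 1 and maximal at p = 0.5:

```python
# Binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p), illustrative only.
import math

def binary_entropy(p: float) -> float:
    """Entropy of a two-class distribution whose positive-class probability is p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p = {p:>4}: H = {binary_entropy(p):.3f}")
# p = 0 or 1 gives H = 0 (outcome fully determined); p = 0.5 gives H = 1 (maximum randomness)
```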
2 — Information Gain
Information gain (IG) is a statistical property that measures how well a given attribute separates the training examples according to their target classification. Constructing a decision tree is all about finding the attribute that returns the highest information gain and the smallest entropy.
Consider, for example, a table recording 14 days of weather conditions together with whether you played tennis on each day. With this table, other people could use your intuition to decide whether they should play tennis by looking up what you did under a given weather pattern, but even after just 14 days it is a little unwieldy to match the current weather situation with one of the rows in the table. We can represent this tabular data in the form of a tree.
Algorithm
1. Start with a training dataset, which we will call S. It should contain attributes and a classification (target) for each example.
2. Determine the best attribute in the dataset (the definition of "best" is given below).
3. Split S into subsets that contain the possible values of the best attribute.
4. Create a decision tree node containing the best attribute.
5. Recursively generate new decision trees using the subsets of data created in step 3, until a stage is reached where the data cannot be classified further. Represent the class as a leaf node.
In the decision tree algorithm, "best" means the attribute with the highest information gain. For instance, a split has low information gain when the resulting subsets each contain almost equal numbers of '+' and '-' examples, and high information gain when one subset contains mostly '+' examples and the other mostly '-' examples. In order to identify the best attribute, we use entropy.
Entropy
In the machine learning sense, and especially in this case, entropy is a measure of the impurity (lack of homogeneity) of the data. For binary classification its value ranges from 0 to 1: it is close to 0 if all the examples belong to the same class, and close to 1 if the data is split almost equally between the classes.
The formula used to calculate entropy is:
Entropy(S) = − Σᵢ pᵢ · log₂(pᵢ),   i = 1, …, c
Here pᵢ represents the proportion of the data with the i-th classification and c represents the number of different classes.
Information gain measures the reduction in entropy achieved by splitting the data on a particular attribute. The formula for the gain obtained by splitting a dataset S on an attribute A is:
Gain(S, A) = Entropy(S) − Σᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ),   v ∈ Values(A)
Here Entropy(S) is the entropy of the dataset and the second term on the right is the weighted entropy of the subsets obtained after the split. The goal is to maximize this information gain: the attribute with the maximum information gain is selected as the node, and the data is then successively split on that node.
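To make this concrete, the sketch below (added for illustration, not part of the original notes) computes Entropy(S) and the information gain of each attribute of the classic 14-day play-tennis dataset referred to earlier; the rows are the standard ones used with ID3 in the literature.

```python
# Sketch: information gain on the classic 14-day play-tennis dataset.
from collections import Counter
from math import log2

# Each row: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(rows):
    """Entropy of the class labels (last column) of a list of rows."""
    counts = Counter(row[-1] for row in rows)
    return -sum((n / len(rows)) * log2(n / len(rows)) for n in counts.values())

def information_gain(rows, attr_index):
    """Gain(S, A) = Entropy(S) minus the weighted entropy of the subsets split on A."""
    weighted = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row for row in rows if row[attr_index] == value]
        weighted += (len(subset) / len(rows)) * entropy(subset)
    return entropy(rows) - weighted

print(f"Entropy(S) = {entropy(data):.3f}")                # ~0.940
for i, name in enumerate(attributes):
    print(f"Gain(S, {name}) = {information_gain(data, i):.3f}")
# Outlook (~0.247) > Humidity (~0.152) > Wind (~0.048) > Temperature (~0.029)
```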
As we can see, the Outlook attribute has the maximum information gain and hence it is placed at the top of the tree.
Information gain is a decrease in entropy. It computes the difference between the entropy before the split and the weighted average entropy after the split of the dataset, based on the values of the given attribute. The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses information gain.
Mathematically, IG is represented as:
IG = Entropy(before) − Σⱼ (Nⱼ / N) · Entropy(j, after),   j = 1, …, K
where "before" is the dataset before the split, K is the number of subsets generated by the split, (j, after) is subset j after the split, N is the number of examples before the split and Nⱼ is the number of examples in subset j.
3 — Gini Index
The Gini index can be understood as a cost function used to evaluate splits in the dataset. It is calculated by subtracting the sum of the squared probabilities of each class from one:
Gini = 1 − Σᵢ pᵢ²
It favors larger partitions and is easy to implement, whereas information gain favors smaller partitions with distinct values.
The Gini index works with a categorical target variable ("Success" or "Failure") and performs only binary splits. A higher value of the Gini index implies higher inequality and higher heterogeneity.
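As a small sketch (with invented class counts, purely for illustration), the Gini index of a node and the weighted Gini of a binary split can be computed as follows:

```python
# Gini impurity of a node and weighted Gini of a binary split; lower is better.
from collections import Counter

def gini(labels):
    """Gini = 1 - sum of squared class probabilities."""
    counts = Counter(labels)
    return 1.0 - sum((n / len(labels)) ** 2 for n in counts.values())

def gini_of_split(left, right):
    """Size-weighted Gini of the two partitions produced by a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

pure = ["Success"] * 6                      # all one class
mixed = ["Success"] * 3 + ["Failure"] * 3   # perfectly mixed
print(gini(pure))                           # 0.0 -> perfectly pure node
print(gini(mixed))                          # 0.5 -> maximum impurity for two classes
print(gini_of_split(["Success"] * 4, ["Failure"] * 2))  # 0.0 -> ideal binary split
```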
4 — Gain Ratio
Information gain is biased towards choosing attributes with a large number of values as root nodes: it prefers attributes with many distinct values.
C4.5, an improvement of ID3, uses the gain ratio, a modification of information gain that reduces this bias and is usually the best option. The gain ratio overcomes the problem by taking into account the number of branches that would result before making the split: it corrects information gain by taking the intrinsic information of a split into account.
Consider, for example, a dataset of users and their movie genre preferences described by variables such as gender, age group, rating, and so on. With information gain you would split at 'Gender' (assuming it has the highest information gain), and the variables 'Age Group' and 'Rating' could then be equally important; the gain ratio penalizes variables with more distinct values, which helps us decide the split at the next level.
The gain ratio divides the information gain by the intrinsic (split) information of the attribute:
GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A),   with   SplitInfo(S, A) = − Σⱼ (|Sⱼ| / |S|) · log₂(|Sⱼ| / |S|),   j = 1, …, K
where K is the number of subsets generated by the split and Sⱼ is subset j after the split.
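As a sketch of the computation (reusing the information gain of Outlook, about 0.247, obtained in the earlier example), the split information and gain ratio can be computed as follows:

```python
# Gain ratio = information gain / split information (intrinsic information).
from collections import Counter
from math import log2

def split_info(branch_values):
    """Entropy of the branch sizes produced by splitting on an attribute."""
    counts = Counter(branch_values)
    total = len(branch_values)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Outlook column of the 14-day play-tennis data: 5 Sunny, 4 Overcast, 5 Rain.
outlook = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rain"] * 5
gain_outlook = 0.247                       # information gain computed earlier

print(f"SplitInfo(Outlook) = {split_info(outlook):.3f}")                 # ~1.577
print(f"GainRatio(Outlook) = {gain_outlook / split_info(outlook):.3f}")  # ~0.157
```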
5 — Reduction in Variance
Reduction in variance is a criterion used for continuous target variables (regression problems). It uses the standard formula of variance to choose the best split: the split with the lower (weighted) variance is selected to split the population:
Variance = Σ (X − X̄)² / n
where X̄ is the mean of the values, X is an actual value and n is the number of values.
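A minimal sketch, with invented target values, of how a candidate split would be scored by reduction in variance:

```python
# Reduction in variance for a candidate split on a continuous target.

def variance(values):
    """Variance = sum((x - mean)^2) / n."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted

parent = [10, 12, 11, 30, 32, 31]               # target values reaching the node
left, right = [10, 12, 11], [30, 32, 31]        # a candidate split
print(variance(parent))                         # high variance before the split
print(variance_reduction(parent, left, right))  # large reduction -> good split
```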
6 — Chi-Square
The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods. It assesses the statistical significance of the differences between sub-nodes and the parent node, measured by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.
It works with a categorical target variable ("Success" or "Failure") and can perform two or more splits. The higher the value of Chi-square, the higher the statistical significance of the differences between a sub-node and the parent node. The resulting tree is called a CHAID tree.
Mathematically, Chi-square is represented as:
χ² = Σ (Observed − Expected)² / Expected
where, for each sub-node and class, Observed is the observed frequency and Expected is the frequency expected from the parent node's class distribution.
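A small sketch of the Chi-square computation for a candidate split, using invented Success/Failure counts; the expected frequencies of each sub-node are derived from the parent node's class distribution:

```python
# Chi-square statistic for a candidate split: sum over sub-nodes and classes of
# (observed - expected)^2 / expected, with expected counts taken from the parent.

def chi_square(sub_node_counts, parent_counts):
    parent_total = sum(parent_counts.values())
    stat = 0.0
    for counts in sub_node_counts:
        node_total = sum(counts.values())
        for cls, parent_count in parent_counts.items():
            expected = node_total * parent_count / parent_total
            observed = counts.get(cls, 0)
            stat += (observed - expected) ** 2 / expected
    return stat

parent = {"Success": 10, "Failure": 10}            # 50/50 split at the parent node
split = [{"Success": 8, "Failure": 2},             # left sub-node
         {"Success": 2, "Failure": 8}]             # right sub-node
print(chi_square(split, parent))                   # 7.2 -> a highly significant split
```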
4 — Decision Tree Algorithms
ID3
ID3 (Iterative Dichotomiser 3) is an algorithm that builds a decision tree using a top-down approach. The tree is built from the top down, and at each iteration the best remaining feature is used to create a node.
Here are the steps:
• The root node starts with the full training set S.
• Each iteration of the algorithm goes through the unused attributes and calculates the entropy and information gain of each.
• It selects the attribute with the smallest entropy or, equivalently, the highest information gain.
• Set S is divided on the chosen attribute to produce subsets of the data.
• The algorithm continues recursively on each subset, considering only attributes that have not been chosen before.
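The following is a compact sketch of these steps (an illustrative implementation, not the course's reference code); the tiny dataset at the bottom is invented:

```python
# ID3 sketch: pick the attribute with the highest information gain, split, recurse.
# Simplified: no pruning and no handling of attribute values unseen at training time.
from collections import Counter
from math import log2

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum((n / len(rows)) * log2(n / len(rows)) for n in counts.values())

def info_gain(rows, a):
    subsets = {}
    for r in rows:
        subsets.setdefault(r[a], []).append(r)
    weighted = sum((len(s) / len(rows)) * entropy(s) for s in subsets.values())
    return entropy(rows) - weighted

def id3(rows, attrs):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:                 # pure node -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a))
    tree = {"attribute": best, "branches": {}}
    for value in {r[best] for r in rows}:     # one branch per attribute value
        subset = [r for r in rows if r[best] == value]
        tree["branches"][value] = id3(subset, [a for a in attrs if a != best])
    return tree

# Tiny invented dataset: (Outlook, Wind, PlayTennis)
rows = [("Sunny", "Weak", "No"), ("Sunny", "Strong", "No"), ("Overcast", "Weak", "Yes"),
        ("Rain", "Weak", "Yes"), ("Rain", "Strong", "No")]
print(id3(rows, attrs=[0, 1]))
```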
C4.5
The C4.5 algorithm is an improved version of ID3. The 'C' indicates that the algorithm was written in the C programming language, and 4.5 is the algorithm's version. It is one of the more popular machine learning algorithms and is used as a decision tree classifier and to generate decision trees. Unlike ID3, it uses the gain ratio (described above) as its splitting criterion.
CART
Classification and Regression Trees (CART) is a predictive algorithm used to generate predictions from already available values. It also serves as the basis for ensemble methods such as bagged decision trees, boosted decision trees and random forests.
There are marked differences between regression trees and classification trees:
• Regression trees predict continuous values based on the available information or previous data. For instance, to predict the price of an item, previous price data is analyzed.
• Classification trees determine whether an event occurred, usually with a yes/no outcome. This type of decision tree is often used in real-world decision-making.
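As an illustration, scikit-learn's tree module (which implements an optimised version of CART) exposes both flavours through the same API; the toy data below is invented:

```python
# CART for classification and for regression with scikit-learn.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: did the event occur (1) or not (0)?
X_cls = [[25, 0], [47, 1], [52, 0], [33, 1], [61, 1]]
y_cls = [0, 1, 1, 0, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X_cls, y_cls)
print(clf.predict([[40, 1]]))      # -> a discrete class label (0 or 1)

# Regression tree: predict a continuous value (e.g. a price).
X_reg = [[1], [2], [3], [4], [5]]
y_reg = [10.0, 12.5, 15.0, 21.0, 26.0]
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_reg, y_reg)
print(reg.predict([[3.5]]))        # -> a continuous prediction
```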
CHAID
Chi-square Automatic Interaction Detection (CHAID) is a tree classification method that evaluates the statistical significance of the differences between sub-nodes and the parent node. It is measured by adding the squares of the standardized differences between the expected and observed frequencies of the target variable.
It works with categorical target variables (Success or Failure) and can produce two or more splits per node. The higher the Chi-square value, the higher the statistical significance of the difference between a sub-node and the parent node. The resulting tree is called a CHAID tree.
MARS
Multivariate Adaptive Regression Splines (MARS) is a more complex algorithm that helps solve non-linear regression problems. It finds the set of simple linear functions that, taken together, provide the best prediction: the model is a combination (weighted sum) of simple piecewise-linear regression functions.
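To make "a combination of simple linear regression functions" concrete, here is a hand-built sketch of a MARS-style model using hinge functions; the knots and coefficients are invented, not fitted by an actual MARS procedure:

```python
# A MARS-style model is a weighted sum of hinge functions max(0, x - t) and
# max(0, t - x). The knots (3 and 7) and coefficients below are hand-picked.

def hinge_plus(x, t):
    return max(0.0, x - t)

def hinge_minus(x, t):
    return max(0.0, t - x)

def mars_like_model(x):
    # f(x) = 5 + 2 * max(0, x - 3) - 1.5 * max(0, 7 - x)
    return 5.0 + 2.0 * hinge_plus(x, 3.0) - 1.5 * hinge_minus(x, 7.0)

for x in (1, 3, 5, 7, 9):
    print(x, mars_like_model(x))   # piecewise-linear, hence non-linear overall
```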
Advantages
1. Compared to other algorithms, decision trees require less effort for data preparation during pre-processing.
2. A decision tree does not require normalization of the data.
3. A decision tree does not require scaling of the data either.
4. Missing values in the data do not affect the process of building a decision tree to any considerable extent.
5. A decision tree model is very intuitive and easy to explain to technical teams as well as to stakeholders.
Disadvantages
1. A small change in the data can cause a large change in the structure of the decision tree, causing instability.
2. For a decision tree, the calculations can sometimes become far more complex than for other algorithms.
3. Decision trees often take more time to train than other models.
4. Decision tree training is therefore relatively expensive because of this added complexity and training time.
5. Decision trees are often inadequate for regression and for predicting continuous values, since their predictions are piecewise constant.