
Apprentissage Automatique I
Cours #4 — Decision Tree
Professeur Abdelouahab Moussaoui


1 — Definitions and concepts

1 — Introduction
The decision tree algorithm is a popular supervised machine learning algorithm, thanks to its simple approach to dealing with complex datasets. Decision trees get their name from their resemblance to a tree with roots, branches and leaves, in the form of nodes and edges. They are used for decision analysis much like a flowchart of if-else decisions that lead to the required prediction. The tree learns these if-else decision rules in order to split the dataset and build a tree-like model.
Decision trees are used to predict discrete outcomes for classification problems and continuous numeric outcomes for regression problems. Many different algorithms have been developed over the years, such as CART and C4.5, as well as ensembles such as Random Forests and Gradient Boosted Trees.


Decision tree example

2 — Characteristics
• Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
• In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
• It is a graphical representation of all the possible solutions to a problem/decision based on given conditions.
• It is called a decision tree because, like a tree, it starts with the root node, which expands into further branches and constructs a tree-like structure.
• In order to build the tree, we use the CART algorithm, which stands for Classification And Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
• The diagram below shows the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.


3 — Why use Decision Trees?


There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to keep in mind when creating a machine learning model. Below are two reasons for using a decision tree:
• Decision trees usually mimic the way humans think when making a decision, so they are easy to understand.
• The logic behind a decision tree can be easily understood because it has a tree-like structure.

4 — Decision Tree Terminologies


• Root Node: The node where the decision tree starts. It is the starting node at the top of the tree and contains all the attribute values. It represents the entire dataset, which then gets divided into two or more homogeneous sets. The root node splits into decision nodes based on the decision rules the algorithm has learnt.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further once a leaf node is reached. Leaf nodes are terminal nodes that represent the target prediction. These nodes do not split any further.
• Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes according to the given conditions. In binary splits, the branches denote the true and false paths.
• Branch/Sub-Tree: A subtree formed by splitting the tree. Branches are connectors between nodes that correspond to the values of attributes. Internal nodes are decision nodes between the root node and the leaf nodes that correspond to decision rules and their answer paths. Nodes denote questions and branches show the paths taken based on the answers to those questions.
• Pruning: Pruning is the process of removing unwanted branches from the tree.


• Parent/Child node: A node that splits into sub-nodes is the parent node of those sub-nodes, and the sub-nodes are its child nodes. The root node is the top-most parent node of the tree.

5 — How does the Decision Tree algorithm Work?


In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with those of the sub-nodes and moves further down. It continues this process until it reaches a leaf node of the tree.
The complete process can be better understood using the algorithm below (a minimal code sketch follows these steps):
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively build new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such a final node is called a leaf node.
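The following minimal Python sketch illustrates Steps 1 to 5 with information gain as the ASM, assuming a small in-memory dataset of categorical attributes (all function and variable names are illustrative, not part of the course material):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Information gain obtained by splitting the rows on the attribute `attr`
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

def build_tree(rows, labels, attrs):
    # Step-5 stopping rule: a pure node or no attributes left becomes a leaf
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Steps 2 and 4: pick the attribute with the highest information gain
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    # Step-3: split the dataset on each value of the best attribute
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return tree

Here each row is a dict of attribute values; calling build_tree(rows, labels, list(rows[0].keys())) returns a nested dict representing the learnt tree.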

Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by the ASM). The root node splits further into the next decision node (Distance from the office) and one leaf node, based on the corresponding labels. The next decision node further splits into one decision node (Cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the diagram below:


2 — Attribute Selection Measures


While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve this problem, we use a technique called an Attribute Selection Measure (ASM). With this measurement, we can easily select the best attribute for each node of the tree. The most popular ASM criteria are:
§ Entropy,
§ Information gain,
§ Gini index,
§ Gain Ratio,
§ Reduction in Variance,
§ Chi-Square

These criteria calculate a value for every attribute. The values are sorted and the attributes are placed in the tree following that order, i.e., the attribute with the highest value (in the case of information gain) is placed at the root.
While using Information Gain as a criterion, we assume attributes to be categorical, and for the Gini index, attributes are assumed to be continuous.

1 — Entropy
Entropy is a measure of the randomness in the information being processed. The higher the
entropy, the harder it is to draw any conclusions from that information. Flipping a coin is an
example of an action that provides information that is random.

From the above graph, it is quite evident that the entropy H(X) is zero when the probability
is either 0 or 1. The Entropy is maximum when the probability is 0.5 because it projects
perfect randomness in the data and there is no chance if perfectly determining the outcome.
ID3 follows the rule — A branch with an entropy of zero is a leaf node and A brach
with entropy more than zero needs further splitting.
Mathematically Entropy for 1 attribute is represented as:
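E(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)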


where S → the current state, and p_i → the probability of an event i of state S, or the percentage of class i in a node of state S.
Mathematically, the Entropy for multiple attributes is represented as:
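E(T, X) = \sum_{c \in X} P(c) \, E(c)

where the sum runs over the values c taken by the attribute X.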

where T→ Current state and X → Selected attribute

2 — Information Gain
Information gain or IG is a statistical property that measures how well a given attribute
separates the training examples according to their target classification. Constructing a
decision tree is all about finding an attribute that returns the highest information gain and
the smallest entropy.


Information Gain Example


Decision Trees in Machine Learning
Trees occupy an important place in the life of man. Trees provide us with flowers, fruits, fodder for animals, wood for fire and furniture, and cool shade from the scorching sun. They give us so many good things and yet expect nothing in return. Besides being such an important element for the survival of human beings, trees have also inspired a wide variety of algorithms in machine learning, for both classification and regression.

Representation of Algorithm as a Tree


The Decision Tree learning algorithm generates decision trees from the training data to solve classification and regression problems. Suppose you would like to go out for a game of tennis. The question is how one would decide whether it is a good idea to go out and play. This depends on various factors such as time, weather and temperature. We call these factors features, and they influence our decision. If you recorded all the factors and the decisions you took, you would get a table something like this.


Data collected over the last 14 days

With this table, other people would be able to use your intuition to decide whether they should play tennis by looking up what you did given a certain weather pattern. But even after just 14 days, it is a little unwieldy to match the current weather situation with one of the rows in the table. We can instead represent this tabular data in the form of a tree.


Creating a tree from the data


Here all the information is represented in the form of a tree. The rectangular boxes represent the nodes of the tree. Splitting of the data is done by asking a question at a node. The branches represent the various possible outcomes of the question asked at that node. The end nodes are the leaves; they represent the various classes into which the data can be classified. The two classes in this example are Yes and No. Thus, to obtain the class/final output, ask the question at the node and, using the answer, travel along the corresponding branch until a leaf node is reached.

Algorithm

1. Start with a training data set, which we'll call S. It should contain attributes and a class label.
2. Determine the best attribute in the dataset. (The definition of "best" is given below.)
3. Split S into subsets that contain the possible values of the best attribute.
4. Make a decision tree node that contains the best attribute.
5. Recursively generate new decision trees using the subsets of data created in step 3, until a stage is reached where the data cannot be classified further. Represent the class as a leaf node.

Deciding the “BEST ATTRIBUTE”


Now, the most important part of the Decision Tree algorithm is deciding the best attribute. But what does "best" actually mean?

In the Decision Tree algorithm, the best attribute is the one with the highest information gain.


The left split has less information gain because the data is split into two groups that each contain an almost equal number of '+' and '-' examples, while the split on the right has more '+' examples in one group and more '-' examples in the other. In order to determine the best attribute we will use Entropy.

Entropy
In the machine learning sense, and especially in this case, Entropy is a measure of the impurity (lack of homogeneity) in the data. For two classes, its value ranges from 0 to 1: it is close to 0 if all the examples belong to the same class, and close to 1 if the data is split almost equally between the classes.
Now the formula to calculate entropy is:
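Entropy(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)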

Here p_i represents the proportion of the data belonging to the i-th class, and c represents the number of different classes.

Now, Information Gain measures the reduction in entropy obtained by splitting the data on a particular attribute. The formula for the Gain obtained by splitting the dataset 'S' on the attribute 'A' is:
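Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)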

Here Entropy(S) represents the entropy of the dataset, and the second term on the right is the weighted entropy of the subsets obtained after the split. The goal is to maximize this information gain. The attribute with the maximum information gain is selected for the node, and the data is then successively split on that node.

Entropy of the dataset (Entropy(S)) is :
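Assuming the classic 14-day play-tennis table with 9 "Yes" and 5 "No" examples, this comes out as:

Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94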


Now we calculate the Information Gain for each attribute.


Applying the information gain formula to all the different attributes gives the following.
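With the same classic play-tennis data assumed above, the gains are approximately: Gain(S, Outlook) ≈ 0.247, Gain(S, Humidity) ≈ 0.151, Gain(S, Wind) ≈ 0.048 and Gain(S, Temperature) ≈ 0.029.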

As we can see, the Outlook attribute has the maximum information gain, and hence it is placed at the top of the tree.

Information gain is a decrease in entropy. It computes the difference between the entropy before the split and the weighted average entropy after the split of the dataset, based on the given attribute values. The ID3 (Iterative Dichotomiser) decision tree algorithm uses information gain.
Mathematically, IG is represented as:
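IG(T, X) = Entropy(T) - Entropy(T, X)

where Entropy(T, X) is the weighted entropy of T after splitting on the attribute X, as defined earlier.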

In a much simpler way, we can conclude that:
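Information\ Gain = Entropy(\text{before}) - \sum_{j=1}^{K} Entropy(j, \text{after})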

Where “before” is the dataset before the split, K is the number of subsets generated by the
split, and (j, after) is subset j after the split.

3 — Gini Index
You can understand the Gini index as a cost function used to evaluate splits in the dataset. It is calculated by subtracting the sum of the squared class probabilities from one. It favors larger partitions and is easy to implement, whereas information gain favors smaller partitions with distinct values.
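For c classes with probabilities p_i, the Gini index of a node is:

Gini = 1 - \sum_{i=1}^{c} p_i^2

For a binary target with success probability p and failure probability q, this becomes 1 - (p² + q²).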

The Gini index works with a categorical target variable such as "Success" or "Failure". It performs only binary splits. A higher value of the Gini index implies higher inequality and higher heterogeneity.

Steps to Calculate Gini index for a split


1. Calculate the Gini value for each sub-node, using the above formula with the success (p) and failure (q) probabilities, 1 - (p² + q²).
2. Calculate the Gini index for the split using the weighted Gini score of each node of that split.
CART (Classification and Regression Tree) uses the Gini index method to create split points.
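A minimal Python sketch of these two steps, assuming each candidate child node of a split is described by its class counts (the function names below are illustrative):

def gini(counts):
    # Gini impurity of a node given its class counts, e.g. {'yes': 3, 'no': 1}
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(children):
    # Weighted Gini index of a split, given a list of per-child class counts
    total = sum(sum(c.values()) for c in children)
    return sum(sum(c.values()) / total * gini(c) for c in children)

# Example: a split producing one pure child and one mixed child
print(gini_split([{'yes': 4, 'no': 0}, {'yes': 2, 'no': 4}]))  # about 0.267

The split with the lowest weighted Gini index would be chosen as the split point.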


4 — Gain ratio
Information gain is biased towards choosing attributes with a large number of values as root
nodes. It means it prefers the attribute with a large number of distinct values.
C4.5, an improvement of ID3, uses Gain ratio which is a modification of Information gain
that reduces its bias and is usually the best option. Gain ratio overcomes the problem with
information gain by taking into account the number of branches that would result before
making the split. It corrects information gain by taking the intrinsic information of a split into
account.
Let us consider a dataset with users and their movie genre preferences, based on variables such as gender, age group and rating. With the help of information gain, you split at 'Gender' (assuming it has the highest information gain). At the next level, the variables 'Age Group' and 'Rating' could be equally important; the gain ratio penalizes the variable with more distinct values, which helps us decide the split at the next level.
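Using the same notation as for information gain, the gain ratio divides the information gain by the split (intrinsic) information of the attribute:

SplitInfo = -\sum_{j=1}^{K} \frac{|j,\text{after}|}{|\text{before}|} \log_2 \frac{|j,\text{after}|}{|\text{before}|}, \qquad Gain\ Ratio = \frac{Information\ Gain}{SplitInfo}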

Where “before” is the dataset before the split, K is the number of subsets generated by the
split, and (j, after) is subset j after the split.

5 — Reduction in Variance
Reduction in variance is an algorithm used for continuous target variables (regression problems). This algorithm uses the standard formula of variance to choose the best split. The split with the lower variance is selected as the criterion for splitting the population:
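Variance = \frac{\sum (X - \bar{X})^2}{n}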

Above, X̄ (X-bar) is the mean of the values, X is an actual value and n is the number of values.

Steps to calculate Variance:


1. Calculate variance for each node.
2. Calculate variance for each split as the weighted average of each node variance.


6 — Chi-Square
The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods. It finds the statistical significance of the differences between the sub-nodes and the parent node. We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.
It works with a categorical target variable such as "Success" or "Failure". It can perform two or more splits. The higher the value of Chi-square, the higher the statistical significance of the differences between a sub-node and its parent node.
It generates a tree called CHAID (Chi-square Automatic Interaction Detector).
Mathematically, Chi-square is represented as:
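\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

where the sum runs over the target classes (Success and Failure) in each node, and Expected is the frequency expected under the parent node's class distribution.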

Steps to Calculate Chi-square for a split:


1. Calculate the Chi-square for an individual node by calculating the deviation for both Success and Failure.
2. Calculate the Chi-square of the split as the sum of the Chi-square values of Success and Failure for each node of the split.


3 — Algorithms used in Decision Trees

The following are the algorithms used in decision trees.

ID3
ID3, or Iterative Dichotomiser 3, is an algorithm used to build a decision tree using a top-down approach. The tree is built from the top down, and at each iteration the best feature is used to create a node.
Here are the steps:
• The root node is the starting point, containing the full set S.
• Each iteration of the algorithm goes through the unused attributes of the set and calculates the entropy and information gain (IG) of each.
• It selects the attribute with the smallest entropy or, equivalently, the highest information gain.
• Set S is divided on the chosen attribute to produce the data subsets.
• The algorithm continues recursively on each subset, considering only attributes that have not been chosen before.

C4.5
The C4.5 algorithm is an improved version of ID3. C in the algorithm indicates that it uses C
programming language and 4.5 is the algorithm’s version. It is one of the more popular
algorithms for machine learning. It is also used as a decision tree classifier and to generate
a decision tree.


CART
Classification and Regression Tree, or CART, is a predictive algorithm used to generate predictions based on already available values. These trees serve as the basis of ensemble machine learning methods such as bagged decision trees, boosted decision trees and random forests.
There are marked differences between regression trees and classification trees.
• Regression trees: Predict continuous values based on information sources or previous data. For instance, to predict the price of an item, previous data needs to be analyzed.
• Classification trees: Determine whether an event occurred. They usually have outcomes of either yes or no. This type of decision tree algorithm is often used in real-world decision-making.

CHAID
Chi-square automatic interaction detection (CHAID) is a tree classification method that finds the statistical significance of the differences between a parent node and its sub-nodes. It is measured by adding the squares of the standardized differences between the expected and observed frequencies of the target variable.
It works with categorical target variables, such as Success or Failure, and can produce two or more splits. If the Chi-square value is high, the statistical significance of the variation between the parent node and its sub-nodes will also be high. The resulting tree is called a CHAID tree.

MARS
Multivariate adaptive regression splines, or MARS, is a more complex algorithm that helps solve non-linear regression problems. It lets us find a set of simple piecewise linear functions whose combination provides the best prediction.


4 — Advantages and Disadvantages of Decision Trees

Advantages
1. Compared to other algorithms, decision trees require less effort for data preparation during pre-processing.
2. A decision tree does not require normalization of the data.
3. A decision tree does not require scaling of the data either.
4. Missing values in the data do not affect the process of building a decision tree to any considerable extent.
5. A decision tree model is very intuitive and easy to explain to technical teams as well as to stakeholders.

Disadvantages
1. A small change in the data can cause a large change in the structure of the decision tree, causing instability.
2. For a decision tree, the calculations can sometimes become far more complex than for other algorithms.
3. Decision trees often take more time to train.
4. Decision tree training is relatively expensive, as it involves greater complexity and takes more time.
5. The Decision Tree algorithm is inadequate for applying regression and predicting continuous values.
