
Received 26 May 2024, accepted 17 June 2024, date of publication 19 June 2024, date of current version 27 June 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3416838

A Survey of Decision Trees: Concepts, Algorithms, and Applications

IBOMOIYE DOMOR MIENYE, (Member, IEEE), AND NOBERT JERE
Department of Information Technology, Walter Sisulu University, Buffalo City Campus, East London 5200, South Africa
Corresponding author: Ibomoiye Domor Mienye ([email protected])

ABSTRACT Machine learning (ML) has been instrumental in solving complex problems and significantly
advancing different areas of our lives. Decision tree-based methods have gained significant popularity
among the diverse range of ML algorithms due to their simplicity and interpretability. This paper presents a
comprehensive overview of decision trees, covering their core concepts, algorithms, and applications, from their early development to the recent high-performing ensemble algorithms, together with their mathematical and algorithmic representations, which are lacking in the literature and will be beneficial to ML researchers and industry experts. Some of the algorithms include classification and regression tree (CART), Iterative Dichotomiser 3
(ID3), C4.5, C5.0, Chi-squared Automatic Interaction Detection (CHAID), conditional inference trees, and
other tree-based ensemble algorithms, such as random forest, gradient-boosted decision trees, and rotation
forest. Their utilisation in recent literature is also discussed, focusing on applications in medical diagnosis
and fraud detection.

INDEX TERMS Algorithms, CART, C4.5, C5.0, decision tree, ensemble learning, ID3, machine learning.

I. INTRODUCTION
Machine learning-based applications are revolutionising various industries and sectors, including healthcare, finance, and marketing [1], [2], [3], [4]. With the advancement of technology and the availability of large datasets, ML algorithms have become increasingly powerful and accurate in making predictions and informed decisions. These applications are transforming how organisations operate and paving the way for a more efficient and data-driven future.
Decision tree-based algorithms have been employed in diverse applications, including but not limited to classification, regression, and feature selection [5], [6], [7]. The basic idea behind decision tree-based algorithms is that they recursively partition the data into subsets based on the values of different attributes until a stopping criterion is met. This process results in a tree-like structure, where each node represents a decision or a split based on a specific attribute [8]. The algorithm determines the best attribute to use for each split based on certain criteria, such as information gain, gain ratio, and Gini index.
Furthermore, decision trees are known for their interpretability [9], [10]. The resulting tree structure allows users to understand and interpret the decision-making process easily. This is especially valuable in domains where transparency and explainability are crucial, making it easier for stakeholders to trust and validate the results. Another significance of decision tree-based algorithms is their ability to handle categorical and numerical data. Traditional statistical methods often struggle with categorical variables, requiring them to be converted into numerical values. Decision trees, on the other hand, can directly handle both types of data, eliminating the need for data preprocessing. This makes decision tree-based algorithms more versatile and efficient in a wide range of applications.
There are a few reviews of decision trees in the literature; for example, Che et al. [11] presented a review of decision trees and ensemble classifiers with specific applications to bioinformatics. The review focused on ID3, CART, and ensemble methods such as bagging, boosting, and stacked generalization. Cañete-Sifuentes et al. [12] reviewed multivariate decision trees (MDT) and compared the performance of several MDT induction classifiers. Anuradha and Gupta [13] presented a review of decision tree classifiers, focusing on a high-level description of key concepts, such as node splitting and tree pruning. Meanwhile,
Costa and Pedreira [14] reviewed recent decision tree-based
classifier advances. The paper covered three main issues: how
decision trees fit the training data, their generalization, and
interpretability.
However, most of the existing surveys and reviews of
decision trees focus on their applications in specific domains
or a high-level overview of the decision tree concept. There-
fore, the current literature lacks a comprehensive overview of
decision tree algorithms, their early developments, succinct
mathematical formulations, and algorithmic representations
in a single peer-reviewed paper. Therefore, it is essential to
have a review that fills this gap in view of the continuous use
and prevalence of decision tree-based algorithms and their
application in today’s technological advancements. Hence,
in this study, we present a detailed review of decision tree-
based algorithms. Specifically, the paper aims to cover the
different decision tree algorithms, including ID3, C4.5, C5.0,
CART, conditional inference trees, and CHAID, together with other tree-based ensemble algorithms, such as random forest, rotation forest, and gradient boosting decision trees.
The paper aims to present their mathematical formulations and algorithmic representations clearly and concisely.
The rest of the paper is structured as follows: Section II presents a comprehensive overview of the decision tree, covering key areas such as splitting criteria and tree pruning methods. Section III discusses different decision tree algorithms, their learning process, splitting criteria, and mathematical formulations. Section IV reviews decision tree applications in recent literature, including applications in medical diagnosis and fraud detection. Section V discusses key findings and future research directions, and Section VI concludes the paper.

II. OVERVIEW OF DECISION TREE
This section provides a comprehensive overview of decision trees, focusing on the main building blocks and splitting criteria. Decision trees, as a concept in ML, have a history that dates back to the mid-20th century. Initial decision tree studies were started by Charles J. Clopper and Egon S. Pearson in 1934, who introduced the concept of binary decision processes [15], [16]. However, the modern implementation of decision trees in the context of ML started decades later. Breiman [17] developed the CART algorithm in 1984, introducing concepts such as the Gini index and binary splitting, which are now widespread in decision tree designs. Quinlan [18] developed ID3, one of the first notable decision tree algorithms, in 1986. Furthermore, Quinlan [19] enhanced the ID3, introducing the C4.5 decision tree in 1993. These developments and integration of decision trees into ensemble methods like random forests and boosting algorithms have solidified their place as fundamental algorithms in machine learning.

FIGURE 1. A decision tree example.

The learning procedure of decision trees involves a series of steps where the data is split into homogenous subsets, as shown in Figure 1. The root node, which is the starting point of the tree, represents the entire dataset. The algorithm identifies the feature and the threshold that leads to the best split based on a specific criterion [20]. The process continues recursively, with each subset of the data being further split at each child node. This continues until a stopping criterion is reached, typically when the nodes are pure (i.e., all data points in a node belong to the same class) or when a predefined depth of the tree is reached. The nodes where the tree ends, called leaf nodes or terminal nodes, represent the outcomes or class labels. The decision to split at each node is made using mathematical formulations such as information gain, Gini impurity, or variance reduction.
Furthermore, the success of decision tree techniques mainly depends on several factors contributing to their performance, interpretability, and applicability to a wide range of problems. These factors include data quality, tree depth, splitting criteria, and tree pruning method. According to Piramuthu [21], the effectiveness of decision trees is highly dependent on the training data quality. Hence, it is necessary to use clean or preprocessed data not containing missing values and outliers, which can significantly enhance the performance of the resulting models. Additionally, feature selection and feature engineering are necessary because inputting relevant and well-transformed features can lead to more efficient and accurate splits.
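To make this learning procedure concrete, the following minimal Python sketch, which assumes scikit-learn and its bundled Iris dataset are available, fits a small tree and prints the learned splitting rules; it is an illustration only, not part of the algorithms surveyed here.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small benchmark dataset (feature matrix X, class labels y)
iris = load_iris()
X, y = iris.data, iris.target

# Grow a tree: at each node the best feature/threshold split is chosen
# according to a criterion (here the Gini index), until the nodes are
# pure or the maximum depth is reached.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# The fitted tree is a readable sequence of if/else splitting rules.
print(export_text(tree, feature_names=iris.feature_names))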
A. SPLITTING RULES
The term splitting criteria, or splitting rules, describes the methods used to determine where a tree should make a split in its nodes, effectively deciding how to divide the dataset into subsets based on different conditions [22], [23]. The choice of splitting criterion is crucial as it directly impacts the tree's structure and, ultimately, its performance.
Different decision tree algorithms use different criteria for this purpose, including the following:

1) GINI INDEX
Gini Index, also called Gini Impurity, is a well-known splitting criterion used in the CART algorithm. It measures the probability of a randomly chosen sample being incorrectly classified if it was randomly labelled [24]. It is used to evaluate the quality of a split in the tree and is calculated for each potential split in the dataset. The Gini Index for a set can be represented mathematically as:

Gini(S) = 1 - \sum_{i=1}^{n} p_i^2    (1)

where S, n, and p_i represent a set of samples, the number of unique classes in the set, and the proportion of the samples in the set that belong to class i, respectively. This formula calculates the probability of incorrectly classifying a randomly chosen element from the set S based on the distribution of classes in it. The value of Gini Impurity ranges from 0 (perfect purity) to 1 (maximal impurity) [25]. When the algorithm evaluates where to split the data, it calculates the Gini index for each potential split and typically chooses the split that results in the lowest weighted Gini Impurity for the resulting subsets.
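As an illustration of Equation (1), the short Python sketch below (assuming only NumPy) computes the Gini impurity of a set of class labels and the weighted impurity of a candidate binary split; it is a didactic sketch rather than a production implementation.

import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_i p_i^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(left_labels, right_labels):
    """Weighted Gini impurity of a binary split; lower is better."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

# A pure node has impurity 0, a 50/50 node has impurity 0.5.
print(gini([1, 1, 1, 1]))   # 0.0
print(gini([0, 0, 1, 1]))   # 0.5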
2) INFORMATION GAIN
Information Gain (IG), a criterion used in ID3 and C4.5, is based on the notion of entropy in information theory. Entropy measures the unpredictability or randomness in a set of data [26]. The IG technique searches for a split that maximizes the difference in certainty or decreases uncertainty before and after the split. It determines the effectiveness of an attribute in splitting the training data into homogenous sets. Meanwhile, the entropy (E) of a set S is given by the formula:

E(S) = - \sum_{i=1}^{n} p_i \log_2(p_i)    (2)

where n is the number of unique classes in the set, and p_i is the proportion of the samples in the set that belong to class i. Therefore, the IG for a split on a dataset S with an attribute A can be computed as follows:

IG = E(S) - \sum_{v \in Values(A)} (|S_v| / |S|) E(S_v)    (3)

where Values(A) are the different values that attribute A can take, and S_v is the subset of S for which attribute A has the value v [27]. This formula calculates the change in entropy from the original set S to the sets S_v created after the split. A higher IG indicates a more effective attribute for splitting the data, as it results in more homogeneous subsets.
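The following sketch (assuming NumPy) implements Equations (2) and (3) for a categorical attribute; it is provided for illustration only.

import numpy as np

def entropy(labels):
    """E(S) = -sum_i p_i * log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """IG = E(S) - sum_v (|S_v|/|S|) * E(S_v) for a categorical attribute."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    total = entropy(labels)
    n = len(labels)
    remainder = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder

# Example: an attribute that perfectly separates the classes has IG = E(S).
print(information_gain(["a", "a", "b", "b"], [0, 0, 1, 1]))   # 1.0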
3) INFORMATION GAIN RATIO
The information gain ratio (IGR), an extension of information gain, is a splitting criterion mainly used in the C4.5 decision tree to overcome the bias of information gain towards features that have several distinct values by considering the number and size of branches when choosing an attribute. The IGR normalises the information gain by dividing it by the intrinsic information or split information (SplitInfo) of the split. This normalisation reduces the bias towards the multi-valued attributes, resulting in more balanced and effective decision trees [26], [27]. The IGR criterion is calculated as:

IGR(S, A) = InformationGain(S, A) / SplitInfo(S, A)    (4)

4) CHI-SQUARE
The Chi-Square (χ^2) splitting criterion measures the independence between an attribute and the class [28]. The χ^2 test assesses whether the distribution of sample observations across different categories deviates significantly from what would be expected if the categories were independent of the class. Given an attribute A with different categories and a target class C, the χ^2 can be computed as:

χ^2 = \sum_{i=1}^{r} \sum_{j=1}^{k} (O_{ij} - E_{ij})^2 / E_{ij}    (5)

where r is the number of categories of the attribute A, k is the number of classes, O_{ij} is the observed frequency in cell (i, j), and E_{ij} is the expected frequency in cell (i, j) under the null hypothesis of independence, calculated as E_{ij} = (row_total_i × column_total_j) / total_samples. A high χ^2 value indicates a significant association between the attribute and the class, suggesting that the attribute is a good predictor for splitting the dataset [29], [30]. This criterion is useful for categorical data, and it identifies the most significant splits based on the chi-square test of independence.
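A minimal sketch of Equation (5), computing the chi-square statistic from a contingency table of attribute categories versus classes, is shown below; it assumes NumPy, and scipy.stats.chi2_contingency offers an equivalent ready-made implementation.

import numpy as np

def chi_square(observed):
    """Chi-square statistic for an r x k contingency table O_ij."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / observed.sum()
    return np.sum((observed - expected) ** 2 / expected)

# Rows: categories of attribute A, columns: classes of C.
table = [[30, 10],
         [10, 30]]
print(chi_square(table))   # a large value suggests A and C are associated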
B. TREE PRUNING METHODS
1) PRE-PRUNING
Pre-pruning or early stopping techniques are used to effectively limit the size of the tree and reduce the possibility of overfitting [31], [32]. The main benefit of pre-pruning is its simplicity and the reduction in computational cost due to the construction of smaller trees. However, setting the pre-pruning parameters too aggressively may lead to underfitting. Meanwhile, this strategy halts the tree's growth according to predefined criteria, such as maximum depth, minimum number of instances in a node, minimum information gain, and maximum number of leaf nodes [33].

2) POST-PRUNING
Post-pruning, also called backward pruning, is a technique used to trim down a fully grown tree to improve its generalization capabilities. Unlike pre-pruning, which stops the tree from fully growing, post-pruning allows the tree to first grow to its full size and then prunes it back [34]. Common post-pruning techniques include reduced error pruning, pessimistic error pruning, error-based pruning, minimum error pruning, and cost complexity pruning [33]. Post-pruning primarily removes sections of the tree that contribute little to predicting the target variable. It often requires a separate validation dataset to assess the impact of pruning [35]. This dataset tests the tree's performance as it undergoes pruning.
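Both strategies can be illustrated with scikit-learn (the parameter names below are scikit-learn's, not part of the original algorithms): pre-pruning is expressed through growth limits passed to the constructor, while cost complexity post-pruning grows the full tree and then trims it back using an alpha chosen on held-out validation data. This is a hedged sketch, not a prescription.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early via depth / leaf-size / gain thresholds.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                    min_impurity_decrease=0.001)
pre_pruned.fit(X_train, y_train)

# Post-pruning: compute the cost complexity pruning path of a fully grown
# tree, then keep the alpha that scores best on the validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post_pruned = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(pre_pruned.get_depth(), post_pruned.get_depth())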
C. INTERPRETABILITY OF DECISION TREES
Decision trees are known for their inherent interpretability, making them valuable in various domains where understanding the decision-making process is crucial [14], [36]. Unlike many other ML algorithms that produce black-box models, decision trees offer transparency by representing the decision process as a sequence of simple, intuitive rules. Specifically, each node in a decision tree corresponds to a feature and a decision threshold, and the path from the root to a leaf node represents a series of decisions based on the feature values. This clear structure allows stakeholders to easily comprehend and interpret how the model arrives at its predictions.
Furthermore, while complex models such as deep neural networks and ensemble methods may achieve high accuracy, their black-box nature makes it challenging to understand how they arrive at their predictions [37], [38]. In contrast, decision trees provide a visual representation of the decision-making process, allowing stakeholders to trace each decision back to specific features and thresholds. For instance, in a medical diagnosis application, a decision tree model may reveal which symptoms or risk factors are most influential in predicting a particular disease. This transparency enables domain experts to validate the model's decisions and identify potential biases or errors, thereby improving trust in the model's predictions.
Additionally, decision trees can facilitate feature selection and variable importance analysis, aiding in feature engineering and model refinement [39], [40], [41]. By examining the splits in the tree and the associated feature importance scores, practitioners can identify the most influential features in the prediction process. This information can guide data preprocessing efforts and inform decisions about feature inclusion or exclusion in the model, leading to more efficient and interpretable models.
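For example, the impurity-based importance scores exposed by a fitted tree can be inspected directly; the snippet below is a small illustration using scikit-learn (the attribute name feature_importances_ is scikit-learn's, not terminology from the surveyed algorithms).

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(data.data, data.target)

# Rank features by their total impurity reduction across the tree's splits.
ranked = sorted(zip(data.feature_names, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")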
III. DECISION TREE ALGORITHMS
A. ITERATIVE DICHOTOMISER 3
The ID3 decision tree was first introduced in 1986 by Quinlan [18]. It is particularly noted for its simplicity and effectiveness in solving classification problems. The algorithm follows a top-down, greedy search approach through the given dataset to construct a decision tree. It begins with the entire dataset and divides it into subsets based on the attribute that maximizes the Information Gain (Equation 3), intending to efficiently classify the instances at each node of the tree. The ID3 is described in Algorithm 1.

Algorithm 1 ID3 Decision Tree Algorithm
Require: Training data set D = {(x1, y1), (x2, y2), ..., (xm, ym)}
Ensure: Decision tree T.
1: function ID3(D)
2:   if D is empty then return a terminal node with default class c_default
3:   end if
4:   if all instances in D have the same class label y then return a terminal node with class y
5:   end if
6:   if the attribute set J is empty then return a terminal node with the prevalent class in D
7:   end if
8:   Select the feature f that best splits the data using information gain.
9:   Create a decision node for f.
10:  for each value b_i of f do
11:    Create a branch for b_i.
12:    Let D_i be the subset of D where feature f takes the value b_i.
13:    Recursively build the subtree for D_i.
14:    Attach the subtree to the branch for b_i.
15:  end for
16:  return the decision node.
17: end function

The algorithm iterates through every unused attribute and calculates the Information Gain for a dataset split by the attribute's possible values. The attribute with the highest Information Gain is chosen to make the decision at the node, and the dataset is partitioned accordingly. This process is repeated recursively for each partitioned subset until one of the stopping criteria is met, such as when no further information can be gained, all instances in a subset belong to the same class, or there are no more attributes left to consider. Lastly, the ID3's limitations include its inability to directly handle continuous variables and overfitting.
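A compact Python sketch of the recursion in Algorithm 1 is given below; it assumes categorical features stored in a pandas DataFrame and a toy dataset invented for illustration, and it omits refinements of the original algorithm such as handling unseen attribute values.

import numpy as np
import pandas as pd

def entropy(labels):
    p = labels.value_counts(normalize=True)
    return -np.sum(p * np.log2(p))

def info_gain(df, feature, target):
    before = entropy(df[target])
    after = sum((len(sub) / len(df)) * entropy(sub[target])
                for _, sub in df.groupby(feature))
    return before - after

def id3(df, features, target):
    """Return a nested-dict decision tree built with the ID3 rules."""
    labels = df[target]
    if labels.nunique() == 1:          # pure node: all one class
        return labels.iloc[0]
    if not features:                   # no attributes left: majority class
        return labels.mode().iloc[0]
    best = max(features, key=lambda f: info_gain(df, f, target))
    tree = {best: {}}
    for value, subset in df.groupby(best):   # one branch per attribute value
        remaining = [f for f in features if f != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree

# Hypothetical toy example.
data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rain", "rain"],
    "windy":   ["no", "yes", "no", "no", "yes"],
    "play":    ["no", "no", "yes", "yes", "no"],
})
print(id3(data, ["outlook", "windy"], "play"))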
B. C4.5 AND C5.0
Quinlan [19] proposed the C4.5 in 1993 as an extension of the ID3 algorithm; it is designed to handle both continuous and discrete attributes. It introduces the concept of information gain ratio, described in Equation 4, to select the best attribute to split the dataset at each node, aiming to overcome the bias towards attributes with more levels found in the original Information Gain criterion used by ID3.
C5.0 is an improvement over C4.5, also proposed by Quinlan [42], designed to be faster and more memory efficient. It introduces several enhancements, such as advanced pruning methods and the ability to handle more complex types of data. C5.0 maintains the use of the information gain ratio for selecting attributes but optimises the algorithm's execution and the resulting decision tree's size.

C. CLASSIFICATION AND REGRESSION TREES
The CART decision tree was proposed in 1984 by Breiman [43]. Unlike C4.5, CART creates binary trees irrespective of the type of target variables. It uses different splitting criteria for classification and regression tasks. For classification tasks, it uses the Gini index (Equation 1) as a measure to create splits [44], [45]. Meanwhile, it employs variance as the splitting criterion in regression tasks [46], [47]. The variance reduction for a set S when split on attribute A is calculated as:

VR = V(S) - [ (|S_left| / |S|) V(S_left) + (|S_right| / |S|) V(S_right) ]    (6)

where V(S) is the variance of the target variable in set S, and S_left and S_right are the subsets of S after the split on attribute A. In both cases, the goal is to choose the split that maximizes the respective measure (Gini impurity reduction for classification and variance reduction for regression), leading to the most homogenous subsets possible. The CART algorithm is described in Algorithm 2.

Algorithm 2 CART Algorithm
Require: D = {(x1, y1), (x2, y2), ..., (xm, ym)}.
Ensure: Decision tree T.
1: function CART(D)
2:   if D is empty then return a terminal node with default value or class c_default
3:   end if
4:   if all instances in D have the same class label y then return a terminal node with class y
5:   end if
6:   if the feature set F is empty then return a leaf node with the average value of y in D
7:   end if
8:   Select the best feature f and split point s that minimize the cost function.
9:   Create a decision node for f and s.
10:  Partition the data set D into two subsets D1 and D2 based on the split.
11:  Recursively build the subtree for D1 and D2.
12:  Attach the subtrees to the decision node.
13:  return the decision node.
14: end function
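The regression case of Equation (6) can be sketched as follows (an illustration assuming NumPy): for a numeric attribute, every midpoint between consecutive sorted values is tried and the split with the largest variance reduction is kept.

import numpy as np

def variance_reduction(y, y_left, y_right):
    """VR = V(S) - [ |S_l|/|S| * V(S_l) + |S_r|/|S| * V(S_r) ]."""
    n = len(y)
    return np.var(y) - (len(y_left) / n) * np.var(y_left) \
                     - (len(y_right) / n) * np.var(y_right)

def best_split(x, y):
    """Best binary split of numeric attribute x for regression target y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_vr, best_threshold = -np.inf, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        threshold = (x[i] + x[i - 1]) / 2.0
        vr = variance_reduction(y, y[:i], y[i:])
        if vr > best_vr:
            best_vr, best_threshold = vr, threshold
    return best_threshold, best_vr

x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]
print(best_split(x, y))   # a threshold near 6.5 separates the two regimes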
D. CHI-SQUARED AUTOMATIC INTERACTION DETECTION
The CHAID algorithm, developed by Kass [48], performs multi-level splits when computing classification trees. It is particularly robust in the detection of interaction between variables. CHAID can handle more than two categories for each variable, and it uses the Chi-Square (χ^2) test of independence as its splitting criterion [49], [50]. This statistical test is applied to assess the relationship between categorical variables. For a given attribute A with different categories and a target class C, the χ^2 statistic is computed as:

χ^2 = \sum_{i=1}^{r} \sum_{j=1}^{k} (O_{ij} - E_{ij})^2 / E_{ij}    (7)

where r is the number of categories of the attribute A, k is the number of different classes in the target variable C, O_{ij} is the observed frequency in the ith category of attribute A and the jth class of C, and E_{ij} is the expected frequency in the same cell under the null hypothesis of independence, calculated as E_{ij} = (row_total_i × column_total_j) / total_samples. The attribute with the highest χ^2 statistic is selected for splitting at each node. A higher χ^2 value indicates a stronger association between the attribute and the target variable, suggesting that the attribute is a good predictor for splitting the dataset. Algorithm 3 details the working process of the CHAID algorithm.

Algorithm 3 CHAID Algorithm
Require: D = {(x1, y1), (x2, y2), ..., (xm, ym)}.
Ensure: Decision tree T.
1: function CHAID(D)
2:   if D is empty then return a terminal node with default class c_default
3:   end if
4:   if all instances in D have the same class label y then return a terminal node with class y
5:   end if
6:   if the feature set F is empty then return a terminal node with the most prevalent class in D
7:   end if
8:   Calculate the chi-squared statistic for each feature and its possible values.
9:   Select the feature and value with the highest chi-squared value.
10:  Create a decision node for the selected feature and value.
11:  Partition the data set D based on the selected feature and value.
12:  for each subset D_i of D do
13:    Recursively build the subtree for D_i.
14:    Attach the subtree to the decision node.
15:  end for
16:  return the decision node.
17: end function
E. CONDITIONAL INFERENCE TREES
The conditional inference trees, developed by Hothorn et al. [51], is a non-parametric class of decision trees that use statistical tests to determine splits, reducing bias and variance and providing a more statistically sound approach. It is mostly useful when solving complex, non-linear relationships that exist between the predictor variables and the response variable [52], [53]. Assume S is a node in the tree, with m examples and d features. Let X_s be the subset of d features at node S, and Y_s be the corresponding response values. Let X_j be the j-th feature in X_s. Then, the algorithm can be defined as:
1) For each feature X_j in X_s, calculate the p-value of a statistical test for the null hypothesis that there is no relationship between X_j and Y_s.
2) Choose the feature X_k and split point t_k that maximize the statistical significance, based on the p-values of the tests.
3) Split the node into two child nodes S_1 and S_2, where S_1 contains examples with X_k ≤ t_k and S_2 contains examples with X_k > t_k.
4) Recursively repeat steps 1-3 for every child node until a stopping criterion is reached.
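A loose sketch of the split-selection step is shown below; it uses a two-sample t-test from SciPy as the association test and a binary response, whereas the conditional inference framework of [51] relies on permutation tests, so this should be read as an analogy rather than the original procedure.

import numpy as np
from scipy.stats import ttest_ind

def select_split(X, y):
    """Pick the feature most significantly associated with a binary response.

    Returns (feature index, split point, p-value). The split point here is
    simply the midpoint of the class-conditional means of the chosen feature.
    In a full implementation, a node would not be split at all if the best
    p-value exceeds a significance threshold (the stopping rule).
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    best_j, best_p = None, 1.0
    for j in range(X.shape[1]):
        a, b = X[y == 0, j], X[y == 1, j]
        p = ttest_ind(a, b, equal_var=False).pvalue
        if p < best_p:
            best_j, best_p = j, p
    threshold = (X[y == 0, best_j].mean() + X[y == 1, best_j].mean()) / 2.0
    return best_j, threshold, best_p

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 1] > 0).astype(int)     # only feature 1 carries signal
print(select_split(X, y))         # expected to choose feature 1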
F. RANDOM FOREST
The random forest, described in Algorithm 4, is an ensemble of decision trees [54], [55]. It improves upon the basic decision tree algorithm by reducing overfitting. Each tree in the forest is built from a sample drawn with replacement (i.e., bootstrap sample) from the input data [56]. The basic idea behind this algorithm is to generate a set of trees using different subsets of the input samples and features and then combine their outputs to obtain a final prediction. The Random Forest algorithm uses two main techniques to reduce overfitting and improve accuracy:
• Bootstrap Sampling: By sampling the data with replacement, the algorithm generates multiple training sets that are slightly different from each other. This type of sampling ensures reduced variance and prevents overfitting.
• Feature Randomization: Randomly selecting a subset of features for each tree ensures the algorithm decorrelates the trees and reduces the chance of selecting the same "best" feature for every tree. This improves the diversity and accuracy of the trees.

Algorithm 4 Random Forest Algorithm
1: for t = 1 to T do    ▷ Generate T trees
2:   Randomly sample n instances from D with replacement
3:   Randomly select m attributes from the total p attributes (where m ≪ p)
4:   Build a decision tree h_t based on the sampled instances and attributes
5: end for
6: To make predictions for a new instance x:
7: if classification task then
8:   f(x) = argmax_c (1/T) \sum_{t=1}^{T} I{h_t(x) = c}    ▷ Majority vote across trees
9: else if regression task then
10:  f(x) = (1/T) \sum_{t=1}^{T} h_t(x)    ▷ Average of tree predictions
11: end if
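A condensed illustration of Algorithm 4 in Python is given below; it assumes scikit-learn's decision tree as the base learner and draws one feature subset per tree, whereas scikit-learn's own RandomForestClassifier applies feature randomization at every split.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = max(1, int(np.sqrt(p)))                      # features per tree (m << p)
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)            # bootstrap sample (with replacement)
        cols = rng.choice(p, size=m, replace=False)  # random feature subset
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])
    # Majority vote across the T trees for each instance.
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

X = np.random.default_rng(1).normal(size=(200, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
forest = fit_forest(X, y)
print((predict_forest(forest, X) == y).mean())       # training accuracy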

G. GRADIENT BOOSTED DECISION TREES
Gradient Boosted Decision Trees (GBDT) is an ensemble learning method that combines multiple decision trees to create a powerful predictive model [57]. Unlike Random Forest, which builds independent trees in parallel, GBDT uses a sequential approach to build trees that correct the errors of the previous trees [58], [59]. It uses gradient descent to minimize errors. Assuming T is the number of trees, h_t(x) is the prediction of the t-th tree, F_{t-1}(x) is the current model's predictions for x, and L(y, F_{t-1}(x)) is the loss function, the GBDT algorithm works as follows:
1) Initialize the model with a constant value (e.g., the mean of the target variable).
2) For t = 1 to T:
   a) Compute the negative gradient of the loss function with respect to the current model's predictions for each instance in the training data.
   b) Fit a decision tree to the negative gradient values, using the input data as features and the negative gradient values as target variables.
   c) Update the model by adding the new tree, weighted by a learning rate η, to the current model.
3) Make a prediction for a new instance by summing the predictions from the various trees:
   a) For a regression task, the final prediction is the sum of the predictions of all the trees, i.e., f(x) is given by:

      f(x) = \sum_{t=1}^{T} η h_t(x)    (8)

      where η is the learning rate.
   b) For a classification task, the final prediction is the probability of the positive class, computed by applying a sigmoid function to the sum of the predictions of all the trees:

      f(x) = 1 / (1 + e^{- \sum_{t=1}^{T} η h_t(x)})    (9)

      where η is the learning rate and e is Euler's number.
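These steps can be illustrated for the regression case with squared loss, where the negative gradient is simply the residual y - F_{t-1}(x). The sketch below assumes scikit-learn's DecisionTreeRegressor as the base learner; libraries such as XGBoost or scikit-learn's GradientBoostingRegressor implement the full algorithm.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_trees=100, eta=0.1, max_depth=3):
    f0 = float(np.mean(y))               # step 1: constant initial model
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):             # step 2: sequentially fit trees
        residuals = y - pred             # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred = pred + eta * tree.predict(X)   # F_t = F_{t-1} + eta * h_t
        trees.append(tree)
    return f0, trees

def predict_gbdt(model, X, eta=0.1):
    # Initial constant plus the weighted sum of tree predictions (cf. Eq. (8)).
    f0, trees = model
    return f0 + eta * np.sum([t.predict(X) for t in trees], axis=0)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)
model = fit_gbdt(X, y)
print(np.mean((predict_gbdt(model, X) - y) ** 2))   # small training error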
H. ROTATION FOREST
Rotation forest is a type of decision tree ensemble where each tree is trained on the principal components of a randomly selected subset of features [60], [61]. The core idea behind this algorithm is to train each classifier in the ensemble on a version of the training data that has been transformed to maintain the correlation between the features and introduce diversity among the classifiers. This is achieved through the following steps:
1) For each classifier to be trained, partition the set of features F into k subsets. The partitioning can be random but is done in such a way that each subset contains a different part of the features.
2) For each subset of features, apply PCA to obtain the principal components. This step transforms the original feature space into a new space that captures the variance in the data more effectively.
3) Combine the principal components from all subsets to form a new set of features for training the classifier. This effectively rotates the axis of the feature space, hence the name Rotation Forest.
4) Train each base classifier on the transformed dataset. Different classifiers can be used, but decision trees are commonly applied.
Given a dataset D with n features, the algorithm partitions the feature set F into k non-overlapping subsets F_1, F_2, ..., F_k. For each subset F_i, PCA is applied to derive a set of principal components PC_i, capturing the main variance directions of the features in F_i. The transformation for a subset F_i can be represented as:

T_i = PCA(F_i)    (10)

where T_i is the transformation matrix obtained from PCA on subset F_i. The new feature set for training the jth classifier, D_j, is obtained by applying the transformation T_i to each subset F_i and concatenating the results:

D_j = \bigoplus_{i=1}^{k} T_i(F_i)    (11)

where \bigoplus denotes the concatenation of the transformed feature subsets. The ensemble's final output is typically the majority vote (for classification tasks) of the predictions from all base classifiers.
feature subsets. The ensemble’s final output is typically the on the Cleveland heart disease dataset obtained from the

A summary of the different tree-based algorithms is tabulated in Table 1, including their advantages and disadvantages.

TABLE 1. Summary of decision tree algorithms.

IV. DECISION TREE APPLICATIONS IN RECENT LITERATURE
Decision trees have gained significant attention in recent literature. This section discusses some popular applications of decision trees in fields such as healthcare and finance.

A. MEDICAL DIAGNOSIS
Healthcare is one of the prominent areas where decision trees have found extensive use. Researchers have utilized decision trees to predict disease diagnosis, treatment outcomes, and patient prognosis. Decision trees are effective in identifying patterns and relationships in medical data, leading to more accurate diagnoses and personalized treatment plans. For example, decision trees have been used to predict the likelihood of a patient developing a specific disease based on their medical history and lifestyle factors [11], [62], [63]. This information can then be used to implement preventive measures and interventions, ultimately improving patient outcomes and reducing healthcare costs.
Pathak and Arul Valan [64] proposed a heart disease prediction model using a decision tree. The model was built using a fuzzy rule-based technique combined with a decision tree, achieving an accuracy of 88% when trained on the Cleveland heart disease dataset obtained from the University of California Irvine (UCI) machine learning repository. Similarly, Maji and Arora [65] conducted a study on heart disease prediction using a different dataset from the UCI machine learning repository. The study employed the C4.5 decision tree and a hybrid decision tree made of C4.5 and artificial neural network (ANN), where the former achieved an accuracy of 76.66% and the latter 78.14%. The study demonstrated the robustness of hybridising decision trees with neural networks.
Ahmad et al. [66] studied the performance of several algorithms using different heart disease datasets, including Cleveland, Switzerland, and Long Beach. The algorithms studied include random forest, decision tree, support vector machine (SVM), k-nearest neighbor (KNN), linear discriminant analysis, and gradient boosting classifier. The study employed sequential feature selection (SFS) to obtain the most significant features, which were then used to train the models. The study concluded that the random forest-SFS and decision tree-SFS achieved the best accuracy. For the Cleveland dataset, the random forest and decision tree obtained accuracies of 100%.
In [67], the authors identified the C4.5 and random forest as potentially robust algorithms for detecting chronic kidney disease (CKD) stages. The study employed a CKD dataset from the UCI machine learning repository, comprising 25 features and 400 samples. The results indicated that the C4.5 achieved an accuracy of 85.5%, outperforming the random forest, which achieved an accuracy of 78.25%.
Decision tree-based methods have also been employed to diagnose COVID-19. Ahmad et al. [66] proposed a deep learning-based decision tree model to detect COVID-19 using chest X-ray images. The approach consists of three decision trees trained using deep learning architectures, including a convolutional neural network (CNN). One tree classifies the images as normal or abnormal, another tree detects tuberculosis indicators in the abnormal images, and the last detects COVID-19. The approach achieved an average accuracy of 95%. Ghiasi and Zendehboudi [68] proposed a decision tree-based ensemble classifier for detecting breast cancer. The study used the well-known Wisconsin Breast Cancer dataset and aimed to build a robust breast cancer detection framework using the random forest and extra trees classifier (ET). The approach resulted in an accuracy of 100%.
Mienye and Sun [69] studied the performance of ML algorithms for heart disease prediction. The study utilized the following algorithms: decision tree, XGBoost, random forest, logistic regression, and naive Bayes. Firstly, the authors employed the Synthetic Minority Oversampling Technique-Edited Nearest Neighbor (SMOTE-ENN) to resample the data and solve the imbalance class problem. Also, the recursive feature elimination technique was employed to identify the most significant attributes to further enhance the classification performance of the models. The results showed that the decision tree, random forest, and XGBoost achieved an accuracy of 87.7%, 93%, and 95.6%, respectively, with the XGBoost obtaining the highest accuracy.
Meanwhile, Adler et al. [70] developed a Glaucoma detection method using the random forest ensemble classifier. The study evaluated the performance of ensemble pruning on the imbalanced glaucoma dataset. The ensemble pruning techniques include pruning by prediction accuracy (using the Brier Score strategy), pruning by uncertainty-weighted accuracy (UWA), and pruning by diversity (using the Double-Fault measure). The experimental results indicated that the RF model reached an area under the receiver operating characteristic curve (AUC) of 0.98 for the Brier and double-fault pruning techniques.
Additionally, Mienye et al. [71] employed decision tree, SVM, and logistic regression for CKD detection. The selected algorithms were also used as the base learners in the AdaBoost ensemble. The study reported accuracies of 94% and 100% for the decision tree and the AdaBoost classifier that used a decision tree as a base learner. The study demonstrated the robustness of using a decision tree in the AdaBoost over the SVM and logistic regression. Furthermore, Mienye and Sun [72] studied the impact of cost-sensitive ML in medical diagnosis using the following algorithms: decision tree, random forest, and XGBoost. Cost-sensitive learning involves modifying the algorithm to focus on the minority class samples, thereby enhancing the model's performance on the minority class, which in most applications is of higher importance than the majority class. When applied for detecting cervical cancer, the cost-sensitive random forest obtained the highest classification accuracy of 98.8%, outperforming the other cost-sensitive and standard algorithms.
Furthermore, Khan et al. [73] proposed an ensemble approach called optimal trees ensemble (OTE) and applied it to diverse classification problems, including hepatitis and Parkinson's disease detection, achieving error rates of 0.1230 and 0.0861, respectively. The error rates, which translate to 87.7% and 91.4% accuracy, imply the proposed OTE outperformed other baseline models, including KNN, LDA, and random forest. Table 2 summarizes the discussed studies on medical diagnosis, indicating how decision trees have been employed in the medical domain, achieving excellent classification performance.

TABLE 2. Summary of the medical diagnosis studies.
B. FINANCE
Decision trees have also been widely employed in the field of finance. By analysing historical data and identifying relevant variables, decision trees can accurately predict the creditworthiness of individuals. This information is crucial for banks and lending institutions in determining the risk associated with granting loans [74], [75]. Furthermore, decision trees have been used to detect fraudulent activities in financial transactions by examining transactional data and identifying suspicious patterns, helping to prevent financial losses.
Yao et al. [76] studied credit risk within an enterprise setting. The study proposed a decision tree-based ensemble classifier that uses the SMOTE and AdaBoost algorithms. The proposed model was aimed at identifying enterprise credit risk by incorporating supply chain information. Other benchmark models were built using KNN, logistic regression, SVM, and random forest. The study indicated that the proposed decision tree ensemble achieved the best and most stable performance, obtaining an AUC of 0.902.
Liu et al. [77] developed an approach for financial institutions to effectively predict credit risk and enhance profitability. The proposed approach uses the gradient-boosting decision tree. While the GBDT was efficient in predicting the credit risk, it lacked sufficient interpretability. Therefore, the study introduced an enhanced method called tree-based augmented GBDT, which uses a step-wise feature augmentation framework. The proposed approach achieved a classification accuracy of 93.78%, outperforming the standard GBDT and displaying robust interpretability.
Alam et al. [78] studied the imbalance class problem in credit risk prediction. The study employed different credit risk datasets, including the German credit approval dataset, the Taiwan dataset, and the European credit card clients dataset. The gradient-boosted decision tree model combined with the k-means SMOTE technique achieved accuracies of 84.6%, 89%, and 87.1% on the German, Taiwan, and European clients datasets, respectively.
Hancock and Khoshgoftaar [79] employed gradient-boosted decision tree-based algorithms for detecting health insurance fraud. This is an important ML application as healthcare fraud is capable of denying patients the needed medical attention. In this study, the authors employed claims data to train the various classifiers, including categorical boosting (CatBoost), achieving an AUC of 0.775, outperforming other ML algorithms. The study went further to demonstrate the model's performance after introducing a new variable called Healthcare provider state, leading to the CatBoost obtaining an AUC of 0.882.
Wong et al. [80] conducted a comparative study of ML algorithms for credit risk prediction. The study focused on decision tree, random forest, KNN, logistic regression, and naive Bayes classifiers. The aim of the study was to assess which classifier would achieve the highest performance in terms of accuracy and other metrics. The experimental results indicated that the decision tree and random forest achieved an accuracy of 92.11% and 94.57%, with the random forest outperforming the other classifiers, demonstrating the robustness of tree-based ensemble classifiers.
Seera et al. [81] employed a decision tree for credit card fraud detection, using credit card transaction records in Malaysia, obtaining a classification accuracy of 99.96%. Rawat et al. [82] studied the performance of four classifiers on credit card fraud detection. The classifiers include logistic regression, RF, KNN, and AdaBoost. The various models achieved classification accuracies of 99%. Similarly, Adhegaonkar et al. [83] employed decision tree, random forest, logistic regression, and SVM for credit card fraud detection. The experimental results showed that the decision tree obtained an accuracy of 84.9%. However, the random forest obtained the best performance with an accuracy of 85.2%. A summary of the reviewed papers is tabulated in Table 3.

TABLE 3. Summary of the credit risk and fraud detection studies.
V. DISCUSSIONS AND FUTURE RESEARCH DIRECTIONS
Decision trees have proven to be effective in various domains, including healthcare and finance. However, like any other algorithm, decision trees have their limitations and areas for improvement. In this section, we will explore some potential future research directions in decision trees that can enhance their performance and address their limitations.
Firstly, the handling of missing data is a crucial area of potential improvement for decision trees. Currently, decision trees either ignore instances with missing values or use surrogate splits to make predictions [86], [87]. However, these approaches may not always be optimal and can lead to biased or inaccurate results. Future research could focus on developing more sophisticated methods to handle missing data in decision trees, such as advanced imputation techniques or incorporating uncertainty estimation.
Another future research direction will be enhancing the ability of decision trees to handle high-dimensional data [88], [89], [90]. Decision trees can struggle when faced with datasets that have a large number of features, as the tree structure becomes complex and prone to overfitting. Future research could explore techniques to improve the scalability and efficiency of decision trees in high-dimensional settings, such as feature selection methods or dimensionality reduction techniques.
Furthermore, while decision trees are known for their interpretability compared to other machine learning algorithms, they can still be difficult to understand and explain, especially when they become large and complex. Future research could investigate methods to simplify decision trees and make them more understandable to non-experts, such as rule extraction algorithms or visualisation techniques. Additionally, decision trees are sensitive to outliers and can easily be influenced by noisy data, leading to inaccurate predictions [91]. It might be worth examining the robustness of decision trees to outliers and noisy data and exploring methods to make decision trees more robust to outliers and noise, such as outlier detection techniques or robust splitting criteria.
Lastly, the application of decision trees in emerging fields and domains is a potential future research direction. Decision trees have been extensively studied and applied in traditional domains such as healthcare, finance, and marketing. However, there are numerous emerging fields where decision trees can potentially make a significant impact. For example, decision trees could be applied in the field of autonomous vehicles to aid in decision-making processes or in the field of natural language processing to improve sentiment analysis and text classification tasks. Future research could explore the potential applications of decision trees in these emerging fields and investigate their effectiveness in solving complex problems.

VI. CONCLUSION
Decision trees have shown great potential and effectiveness in various fields. Their ability to analyse complex data and identify patterns and relationships makes them valuable in the field of machine learning. This paper presented an overview of decision trees, including their early development to the recent high-performing tree-based ensemble methods. The article covers the main decision tree algorithms, such as CART, ID3, C4.5, C5.0, CHAID, and conditional inference trees. Their applications in medical diagnosis, credit risk, and fraud detection were reviewed. This study will be beneficial to ML practitioners and researchers trying to understand decision trees and the widely used tree-based algorithms.

REFERENCES
[1] J. G. Richens, C. M. Lee, and S. Johri, "Improving the accuracy of medical diagnosis with causal machine learning," Nature Commun., vol. 11, no. 1, Aug. 2020, Art. no. 3923.
[2] G. Obaido, F. J. Agbo, C. Alvarado, and S. S. Oyelere, "Analysis of attrition studies within the computer sciences," IEEE Access, vol. 11, pp. 53736-53748, 2023.
[3] S. Ahmed, M. M. Alshater, A. E. Ammari, and H. Hammami, "Artificial intelligence and machine learning in finance: A bibliometric review," Res. Int. Bus. Finance, vol. 61, Oct. 2022, Art. no. 101646.
[4] G. Obaido, B. Ogbuokiri, C. W. Chukwu, F. J. Osaye, O. F. Egbelowo, M. I. Uzochukwu, I. D. Mienye, K. Aruleba, M. Primus, and O. Achilonu, "An improved ensemble method for predicting hyperchloremia in adults with diabetic ketoacidosis," IEEE Access, vol. 12, pp. 9536-9549, 2024.
[5] C. Wang, J. Xu, S. Tan, and L. Yin, "Secure decision tree classification with decentralized authorization and access control," Comput. Standards Interfaces, vol. 89, Apr. 2024, Art. no. 103818.
[6] M. M. Rahman and S. A. Nisher, "Predicting average localization error of underwater wireless sensors via decision tree regression and gradient boosted regression," in Proc. Int. Conf. Inf. Commun. Technol. Develop. Singapore: Springer, 2023, pp. 29-41.
[7] T. O'Halloran, G. Obaido, B. Otegbade, and I. D. Mienye, "A deep learning approach for maize lethal necrosis and maize streak virus disease detection," Mach. Learn. Appl., vol. 16, Jun. 2024, Art. no. 100556.
[8] R. Rivera-Lopez, J. Canul-Reich, E. Mezura-Montes, and M. A. Cruz-Chávez, "Induction of decision trees as classification models through metaheuristics," Swarm Evol. Comput., vol. 69, Mar. 2022, Art. no. 101006.
[9] O. Sagi and L. Rokach, "Explainable decision forest: Transforming a decision forest into an interpretable tree," Inf. Fusion, vol. 61, pp. 124-138, Sep. 2020.
[10] L.-A. Dong, X. Ye, and G. Yang, "Two-stage rule extraction method based on tree ensemble model for interpretable loan evaluation," Inf. Sci., vol. 573, pp. 46-64, Sep. 2021.
[11] D. Che, Q. Liu, K. Rasheed, and X. Tao, "Decision tree and ensemble learning algorithms with their applications in bioinformatics," in Advances in Experimental Medicine and Biology. New York, NY, USA: Springer, 2011, pp. 191-199.
[12] L. Cañete-Sifuentes, R. Monroy, and M. A. Medina-Pérez, "A review and experimental comparison of multivariate decision trees," IEEE Access, vol. 9, pp. 110451-110479, 2021.
[13] A. Dhull and G. Gupta, "A self explanatory review of decision tree classifiers," in Proc. Int. Conf. Recent Adv. Innov. Eng. (ICRAIE), May 2014, pp. 1-7.
[14] V. G. Costa and C. E. Pedreira, "Recent advances in decision trees: An updated survey," Artif. Intell. Rev., vol. 56, no. 5, pp. 4765-4800, May 2023.
[15] C. Gupta and A. Ramdas, "Distribution-free calibration guarantees for histogram binning without sample splitting," in Proc. Int. Conf. Mach. Learn., 2021, pp. 3942-3952.
[16] F. Mazurek, A. Tschand, Y. Wang, M. Pajic, and D. Sorin, "Rigorous evaluation of computer processors with statistical model checking," in Proc. 56th Annu. IEEE/ACM Int. Symp. Microarchitecture, Oct. 2023, pp. 1242-1254.
[17] L. Breiman, "Bagging predictors," Mach. Learn., vol. 24, no. 2, pp. 123-140, Aug. 1996.
[18] J. R. Quinlan, "Induction of decision trees," Mach. Learn., vol. 1, no. 1, pp. 81-106, Mar. 1986.
[19] J. R. Quinlan, C4.5: Programs for Machine Learning. Amsterdam, The Netherlands: Elsevier, 2014.
[20] I. D. Mienye, Y. Sun, and Z. Wang, "Prediction performance of improved decision tree-based algorithms: A review," Proc. Manuf., vol. 35, pp. 698-703, Jan. 2019.
[21] S. Piramuthu, "Input data for decision trees," Expert Syst. Appl., vol. 34, no. 2, pp. 1220-1226, Feb. 2008.
[22] S. Hwang, H. G. Yeo, and J.-S. Hong, "A new splitting criterion for better interpretable trees," IEEE Access, vol. 8, pp. 62762-62774, 2020.
[23] J.-S. Hong, J. Lee, and M. K. Sim, "Concise rule induction algorithm based on one-sided maximum decision tree approach," Expert Syst. Appl., vol. 237, Mar. 2024, Art. no. 121365.
[24] D. Bertsimas and J. Dunn, "Optimal classification trees," Mach. Learn., vol. 106, no. 7, pp. 1039-1082, Jul. 2017.
[25] L. Rutkowski, M. Jaworski, L. Pietruczuk, and P. Duda, "The CART decision tree for mining data streams," Inf. Sci., vol. 266, pp. 1-15, May 2014.
[26] C. J. Mantas, J. Abellán, and J. G. Castellano, "Analysis of credal-C4.5 for classification in noisy domains," Expert Syst. Appl., vol. 61, pp. 314-326, Nov. 2016.
[27] G. S. Reddy and S. Chittineni, "Entropy based C4.5-SHO algorithm with information gain optimization in data mining," PeerJ Comput. Sci., vol. 7, p. e424, Apr. 2021.
[28] N. Peker and C. Kubat, "Application of chi-square discretization algorithms to ensemble classification methods," Expert Syst. Appl., vol. 185, Dec. 2021, Art. no. 115540.
[29] L. A. Badulescu, "A chi-square based splitting criterion better for the decision tree algorithms," in Proc. 25th Int. Conf. Syst. Theory, Control Comput. (ICSTCC), Oct. 2021, pp. 530-534.
[30] F. Mahan, M. Mohammadzad, S. M. Rozekhani, and W. Pedrycz, "Chi-MFlexDT: Chi-square-based multi flexible fuzzy decision tree for data stream classification," Appl. Soft Comput., vol. 105, Jul. 2021, Art. no. 107301.
[31] F. M. J. M. Shamrat, S. Chakraborty, M. M. Billah, P. Das, J. N. Muna, and R. Ranjan, "A comprehensive study on pre-pruning and post-pruning methods of decision tree classification algorithm," in Proc. 5th Int. Conf. Trends Electron. Informat. (ICOEI), Jun. 2021, pp. 1339-1345.
[32] Y. Manzali and Pr. M. E. Far, "A new decision tree pre-pruning method based on nodes probabilities," in Proc. Int. Conf. Intell. Syst. Comput. Vis. (ISCV), May 2022, pp. 1-5.
[33] S. Trabelsi, Z. Elouedi, and K. Mellouli, "Pruning belief decision tree methods in averaging and conjunctive approaches," Int. J. Approx. Reasoning, vol. 46, no. 3, pp. 568-595, Dec. 2007.
[34] T. Lazebnik and S. Bunimovich-Mendrazitsky, "Decision tree post-pruning without loss of accuracy using the SAT-PP algorithm with an empirical evaluation on clinical data," Data Knowl. Eng., vol. 145, May 2023, Art. no. 102173.
[35] E. Frantar and D. Alistarh, "SparseGPT: Massive language models can be accurately pruned in one-shot," in Proc. 40th Int. Conf. Mach. Learn., vol. 202, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., Jul. 2023, pp. 10323-10337.
[36] B. Mahbooba, M. Timilsina, R. Sahal, and M. Serrano, "Explainable artificial intelligence (XAI) to enhance trust management in intrusion detection systems using decision tree model," Complexity, vol. 2021, pp. 1-11, Jan. 2021.
[37] S. J. Oh, B. Schiele, and M. Fritz, "Towards reverse-engineering black-box neural networks," in Explainable AI: Interpreting, Explaining and Visualizing Deep Learning (Lecture Notes in Computer Science), vol. 11700, W. Samek, G. Montavon, A. Vedaldi, L. Hansen, and K. R. Müller, Eds. Cham, Switzerland: Springer, 2019, pp. 121-144, doi: 10.1007/978-3-030-28954-6_7.
[38] E. Zihni, V. I. Madai, M. Livne, I. Galinovic, A. A. Khalil, J. B. Fiebach, and D. Frey, "Opening the black box of artificial intelligence for clinical decision support: A study predicting stroke outcome," PLoS ONE, vol. 15, no. 4, Apr. 2020, Art. no. e0231166.
[39] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinf., vol. 9, no. 1, pp. 1-11, Dec. 2008.
[40] S. M. F. D. S. Mustapha, "Predictive analysis of students' learning performance using data mining techniques: A comparative study of feature selection methods," Appl. Syst. Innov., vol. 6, no. 5, p. 86, Sep. 2023.
[41] S. Ben Jabeur, N. Stef, and P. Carmona, "Bankruptcy prediction using the XGBoost algorithm and variable importance feature engineering," Comput. Econ., vol. 61, no. 2, pp. 715-741, Feb. 2023.
[42] J. R. Quinlan. (2004). Data Mining Tools See5 and C5.0. [Online]. Available: http://www.rulequest.com/see5-info.html
[43] L. Breiman, Classification and Regression Trees. Evanston, IL, USA: Routledge, 2017.
[44] M.-M. Chen and M.-C. Chen, "Modeling road accident severity with comparisons of logistic regression, decision tree and random forest," Information, vol. 11, no. 5, p. 270, May 2020.
[45] D.-H. Lee, S.-H. Kim, and K.-J. Kim, "Multistage MR-CART: Multiresponse optimization in a multistage process using a classification and regression tree method," Comput. Ind. Eng., vol. 159, Sep. 2021, Art. no. 107513.
[46] E. Belli and S. Vantini, "Measure inducing classification and regression trees for functional data," Stat. Anal. Data Mining, ASA Data Sci. J., vol. 15, no. 5, pp. 553-569, Oct. 2022.
[47] H. Ishwaran, "The effect of splitting on random forests," Mach. Learn., vol. 99, no. 1, pp. 75-118, Apr. 2015.
[48] G. V. Kass, "An exploratory technique for investigating large quantities of categorical data," J. Roy. Stat. Soc. C, Appl. Statist., vol. 29, no. 2, pp. 119-127, 1980.
[49] S. Kushiro, S. Fukui, A. Inui, D. Kobayashi, M. Saita, and T. Naito, "Clinical prediction rule for bacterial arthritis: Chi-squared automatic interaction detector decision tree analysis model," SAGE Open Med., vol. 11, Jan. 2023, Art. no. 205031212311609.
[50] H. Prasetyono, A. Abdillah, T. Anita, A. Nurfarkhana, and A. Sefudin, "Identification of the decline in learning outcomes in statistics courses using the chi-squared automatic interaction detection method," J. Phys., Conf. Ser., vol. 1490, no. 1, Mar. 2020, Art. no. 012072.
[51] T. Hothorn, K. Hornik, and A. Zeileis, "Unbiased recursive partitioning: A conditional inference framework," J. Comput. Graph. Statist., vol. 15, no. 3, pp. 651-674, Sep. 2006.
[52] N. Levshina, "Conditional inference trees and random forests," in A Practical Handbook of Corpus Linguistics. Cham, Switzerland: Springer, 2020, pp. 611-643.
[53] B. Schivinski, "Eliciting brand-related social media engagement: A conditional inference tree framework," J. Bus. Res., vol. 130, pp. 594-602, Jun. 2021.
[54] N. Younas, A. Ali, H. Hina, M. Hamraz, Z. Khan, and S. Aldahmani, "Optimal causal decision trees ensemble for improved prediction and causal inference," IEEE Access, vol. 10, pp. 13000-13011, 2022.
[55] Z. Khan, A. Gul, O. Mahmoud, M. Miftahuddin, A. Perperoglou, W. Adler, and B. Lausen, "An ensemble of optimal trees for class membership probability estimation," in Analysis of Large and Complex Data. Cham, Switzerland: Springer, 2016, pp. 395-409.
[56] I. D. Mienye and Y. Sun, "A survey of ensemble learning: Concepts, algorithms, applications, and prospects," IEEE Access, vol. 10, pp. 99129-99149, 2022.
[57] Z. Zhang and C. Jung, "GBDT-MO: Gradient-boosted decision trees for multiple outputs," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 7, pp. 3156-3167, Jul. 2021.
[58] M.-J. Jun, "A comparison of a gradient boosting decision tree, random forests, and artificial neural networks to model urban land use changes: The case of the Seoul metropolitan area," Int. J. Geographical Inf. Sci., vol. 35, no. 11, pp. 2149-2167, Nov. 2021.

86726 VOLUME 12, 2024


I. D. Mienye, N. Jere: Survey of Decision Trees: Concepts, Algorithms, and Applications

IBOMOIYE DOMOR MIENYE (Member, IEEE) received the B.Eng. degree in electrical and electronic engineering and the M.Sc. degree (cum laude) in computer systems engineering from the
NOBERT JERE received the M.Sc. and Ph.D. degrees in computer science from the University of Fort Hare, South Africa, in 2009 and 2013, respectively. He is currently an Associate Professor with the Department of Information Technology, Walter Sisulu University, South Africa. He has authored or coauthored numerous peer-reviewed journal articles and conference proceedings. His main research interests include ICT for sustainable development. He serves as a reviewer for numerous reputable journals. He has chaired/co-chaired international conferences.