Review Article
Random Forest Algorithm Overview
1 Computer Science Department, Arts, Sciences and Technology University, Beirut, Lebanon.
2 Computer and Communication Engineering Department, School of Engineering, Lebanese International University, Beirut, Lebanon.
3 Department of Computer Systems and Networks, Faculty of Information Engineering, Tishreen University, Latakia, Syria.
1. INTRODUCTION
Random Forest (RF) is a popular machine learning technique in the field of data mining [1]. It is a supervised ensemble method that has received significant recognition. Data mining can be categorized into two primary types: descriptive and predictive. Descriptive data mining is primarily concerned with describing and summarizing data and with uncovering patterns and relationships within it. Predictive data mining, on the other hand, studies historical data to identify patterns and trends that can be used to make predictions about future behavior. Predictive models are constructed by analyzing the characteristics of the predictor variables; the results are hypotheses that can be empirically examined and that assist in making future decisions. The precision of such models relies on error-estimation techniques. Descriptive data mining often employs unsupervised machine learning methods, whereas predictive data mining uses supervised machine learning methods. Random forests are created by generating several decision trees: random samples of the data are drawn as bootstrap samples, and input features are selected at random. Each tree in the forest is a simple decision tree [2]. One advantage of random forests is their high accuracy compared to related approaches such as bagging and boosting. They also perform well on huge databases and can accommodate many variables, allowing thousands of input variables to be analyzed without the need to delete any of them. They provide an unbiased estimate of the generalization error, which can be used to balance the class error in unbalanced data sets and to assess the importance of variables. The forest demonstrates its power and efficiency through these advantages. Randomization is a widely used technique in machine learning and data mining due to its effectiveness in several practical applications [2].
2. MACHINE LEARNING
Supervised learning involves a target or outcome variable, also known as the dependent variable, which is to be predicted from a specific collection of predictors, also known as independent variables. With this set of variables, we construct a function that maps inputs to the intended output. The training procedure persists until the model attains the desired degree of accuracy on the training data. Commonly used supervised learning techniques include decision trees, random forests, k-nearest neighbors (KNN), and logistic regression [4].
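As a minimal illustration of this workflow (not drawn from the cited references), the following Python sketch uses scikit-learn with a synthetic dataset: the predictors X are mapped to the target y by a logistic regression model, and accuracy is checked on held-out data. The dataset and model choice are illustrative assumptions only.

```python
# Minimal sketch: supervised learning maps predictors to a target variable.
# Uses a synthetic dataset; the model and data here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic data: X holds the predictors, y the dependent (target) variable.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a mapping from inputs to the desired output, then check its accuracy.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```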
Unsupervised learning is used when data are available only in input form and there is no corresponding output variable. These algorithms use statistical models to analyze the fundamental patterns within the data in order to gain a deeper understanding of its characteristics.
Clustering is a prominent category of unsupervised algorithms. This technique involves the identification of intrinsic groupings within the data, which are then used to predict the output for unseen inputs. A notable example of this methodology is forecasting customer purchasing patterns [3].
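A hedged sketch of this idea, assuming scikit-learn and simulated customer data (both are illustrative assumptions, not taken from the cited study), is shown below: k-means discovers two customer segments without any labels, and the segment of a new customer can then be predicted.

```python
# Minimal sketch: clustering finds intrinsic groupings without any labels.
# Customer-like data are simulated here purely for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two illustrative "customer" features: spend per visit and visits per month.
customers = np.vstack([
    rng.normal([20, 2], 3, size=(100, 2)),   # low-spend, infrequent shoppers
    rng.normal([80, 10], 5, size=(100, 2)),  # high-spend, frequent shoppers
])

# Group customers into two segments; the segment of a new customer can then
# be predicted even though no output variable was ever provided.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.predict([[75, 9]]))  # segment of a previously unseen customer
```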
Unsupervised learning lacks both a target variable and a predicted or estimated outcome. It is used to partition the population into distinct groups, a practice commonly employed to segment consumers for targeted services. Examples of unsupervised learning include the Apriori method [4].
Reinforcement learning is utilized when the objective is to make a sequence of choices that lead to a final reward. Throughout the learning process, the artificial agent is given either rewards or penalties based on the actions it carries out, and the objective is to maximize the overall reward. Examples include learning to play computer games or carrying out tasks with robots [3].
By employing this method, the machine undergoes training to make precise decisions. For instance, the agent is placed in an environment where it consistently improves its performance through repeated trials and adjustments; it uses this experiential learning to acquire the knowledge needed to make precise business judgments. An illustration from reinforcement learning is the Markov Decision Process [4], a mathematical framework used to model decision-making in situations where outcomes are uncertain and influenced by previous actions.
3. DECISION TREE
Decision trees are a technique utilized in the fields of statistics, data mining, and machine learning, and they fall within the category of supervised machine learning. This data analysis technology categorizes data into different entities that are potentially associated with a specific procedure; there are two categories of such entities, decision nodes and leaves. The supervised learning approach employs the decision tree as a prediction model: the observations about an item are examined in the branches, and the target value of the item is deduced in the leaves. The decision nodes represent the splitting of the data, while the leaves represent the outcomes.
A decision tree typically embodies reasoning that resembles human thinking, which facilitates informed decision-making; decision trees are therefore highly comprehensible. Furthermore, the hierarchical structure of a decision tree greatly facilitates the comprehension of the underlying reasoning. For clarity, the main terms are described below.
The root node: the starting point of the decision tree. The root node represents the complete dataset, which is partitioned into two or more groups.
Leaf node: the leaf nodes correspond to the final outcomes. The algorithm is unable to divide the tree any further once it reaches a leaf node.
Splitting refers to the process of separating the root node or decision node into distinct sub-nodes based on specific
parameters.
Pruning is the act of removing superfluous branches from a tree. By eliminating irrelevant branches, one can reach a
conclusion much more quickly.
The parent node is the root node of the tree, while the other nodes are its children. A subtree is created when the primary tree is divided, resulting in new subtrees and branches. Machine learning encompasses two primary categories of decision trees, which are distinguished by the target variable; a small illustrative example follows the list below.
1. Classification trees
2. Regression trees
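To make the terminology above concrete, the following minimal sketch fits a small classification tree with scikit-learn (whose tree implementation is based on CART); the dataset, depth limit, and pruning value are illustrative assumptions only, not prescriptions from the works cited here.

```python
# Minimal sketch of a classification tree: the root node splits the full
# dataset, internal nodes keep splitting, and the leaves hold the outcomes.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# max_depth and ccp_alpha act as simple pruning controls, removing
# branches that add little value so conclusions are reached faster.
tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01, random_state=0)
tree.fit(X, y)

# Print the learned hierarchy: the first line is the root split,
# "class: ..." lines are leaf nodes.
print(export_text(tree, feature_names=load_iris().feature_names))
```

For a regression tree, the same structure applies with a numerical target and DecisionTreeRegressor in place of the classifier.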
The CART (Classification and Regression Trees) methodology has several notable properties:
4. CART is not affected by outliers, collinearity, heteroscedasticity, or distributional error structures that impact parametric procedures. Outliers are data points that are distinct from the rest and do not influence the splitting of the data; in contrast to parametric modeling, CART isolates them in separate node(s).
5. CART has the ability to detect and reveal interactions within the dataset.
6. CART is invariant under monotone transformations of the independent variables. This means that transforming the explanatory variables into logarithms, squares, or square roots does not have any impact on the resulting tree.
CART handles high dimensionality efficiently, meaning it can generate valuable findings from a huge number of input variables while focusing on only the few variables that are immediately relevant.
A significant drawback of CART is its lack of an underlying probabilistic model. Predictions made with a CART tree for new data do not have an associated probability level or confidence interval. The level of trust an analyst can place in the results generated by a given model, such as a tree, is determined exclusively by its historical accuracy, that is, the extent to which the model accurately predicted the desired outcome in previous situations [7].
1. The charging-quantity range is divided into intervals of equal width. This guarantees that the data are organized in a consistent manner, making it easier to identify trends and patterns.
2. The intervals are divided according to the density of the charging-quantity values. In this scenario, the interval widths vary and depend on how concentrated the data are in specific regions of the range. This sort of segmentation concentrates attention on the most crucial sections of the data, which contain more intricate and statistically meaningful information.
Data partitioning has several advantages. One of the key benefits is the reduction of interference: by splitting the data into particular intervals, interference between the data is minimized, which ultimately leads to improved forecasting accuracy.
Data segmentation enhances the efficacy of the algorithms employed in data analysis and RF forecasting, thereby boosting their overall performance. Analysis also becomes easier when data are adequately segmented, as patterns and trends are easier to recognize. By applying these concepts to divide the intervals, the algorithms can be enhanced, leading to more precise and effective predictions and, in turn, a more accurate understanding and prediction of the RF outputs.
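As a rough illustration of these two partitioning schemes (assuming pandas and a synthetic charging-quantity series; the variable names and bin counts are illustrative only), the sketch below contrasts equal-width intervals with density-based, quantile intervals.

```python
# Minimal sketch of the two partitioning schemes described above, using
# pandas on a synthetic "charging quantity" series (illustrative data only).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
load = pd.Series(rng.gamma(shape=2.0, scale=15.0, size=1000), name="load")

# 1) Equal-width intervals: the range is cut into bins of the same size.
equal_width = pd.cut(load, bins=5)

# 2) Density-based intervals: quantile cuts place bin edges where the data
#    are concentrated, so each bin holds roughly the same number of points.
equal_freq = pd.qcut(load, q=5)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```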
1. Feature selection
Feature selection is used in order to enhance the precision of classification. Typically, machine learning feature selection strategies can be categorized into three primary groups: the filter method, the wrapper method, and the embedded (integration) method [13].
The filter method uses statistical analysis to assign weights to the individual features, which are then used to rank them. By applying a rule and setting a threshold, features with weights above the threshold are kept, while those below it are removed. Feature selection in the filter approach is conducted based on the characteristics of the dataset, regardless of the particular classification algorithm being used. Several widely used filtering criteria are the Fisher ratio, information gain, Relief, the t-test, and variance analysis. The following provides a concise introduction to variance analysis [12].
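The following minimal sketch illustrates the filter idea, assuming scikit-learn: features are scored with the ANOVA F-test (a form of variance analysis), ranked, and only the top-scoring ones are retained, independently of any downstream classifier. The dataset and the choice of k are illustrative assumptions.

```python
# Minimal sketch of filter-style feature selection: features are scored with
# a statistic (here the ANOVA F-test, i.e. variance analysis), ranked, and
# only the top-k are kept, independently of any classifier.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("scores:", selector.scores_.round(1))
print("kept feature indices:", selector.get_support(indices=True))

X_reduced = selector.transform(X)  # dataset restricted to the kept features
```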
2. Out-of-bag (OOB) estimates
Random Forest (RF) is a collection of regression or classification trees that was initially proposed by Breiman. One of the two stochastic elements in a random forest pertains to the selection of the variables used for splitting. For each split in a tree, the optimal splitting variable is chosen from a random subset of mtry predictors. If mtry is too small, it is possible that none of the variables in the subset are informative, so insignificant variables are frequently chosen for a split, and the resulting trees exhibit little predictive capacity. If the subset contains a large number of predictors, it is probable that the same variables, specifically those with the greatest impact, are picked repeatedly for splits, whereas variables with lesser impacts have little chance of being selected. Hence, it is imperative to regard mtry as a tuning parameter [13].
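In scikit-learn, mtry corresponds to the max_features parameter, and treating it as a tuning parameter can be sketched roughly as follows; the candidate values, dataset, and cross-validation setup are illustrative assumptions, not recommendations from [13].

```python
# Minimal sketch: treating mtry (max_features in scikit-learn) as a tuning
# parameter and selecting it by cross-validation on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=6, random_state=0)

# Candidate subset sizes: too small risks uninformative splits,
# too large makes the trees overly similar to one another.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid={"max_features": [2, 5, "sqrt", 15, None]},
    cv=5,
)
grid.fit(X, y)
print("best mtry setting:", grid.best_params_, "CV accuracy:", grid.best_score_)
```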
Another stochastic element in Random Forest pertains to the selection of the training data for each tree. Every tree in a random forest is constructed from a randomly chosen portion of the data, typically a bootstrap sample or a subsample of size 0.632 times the size of the original sample (n). Hence, only selected observations are utilized in the construction of a particular tree. Observations that are not used in the construction of a tree are referred to as out-of-bag (OOB) observations. Within a Random Forest, every tree is constructed using a distinct subset of the original data, so certain observations are excluded from some of the trees. The prediction for an observation can then be obtained by utilizing only those trees that were not constructed using that observation.
By following this method, a prediction is assigned to each observation, and the error rate can be determined from these predictions. The resulting error rate is commonly known as the out-of-bag (OOB) error. This OOB error estimation, initially described by Breiman, has become a well-established technique for estimating errors in RF [13].
Out-of-bag data play the role of validation or test data: random forests do not require a distinct testing dataset for result validation. The calculation is performed internally, during the algorithm's execution, as follows. Since the forest is constructed from training data, each tree is evaluated on the roughly 36.8% of the samples that were not used to build that specific tree, which is comparable to a validation data set.
The out-of-bag error estimate, often referred to as the internal error estimate, is a measure of the error in a random forest
model while it is being built [14].
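A minimal sketch of obtaining this internal estimate, assuming scikit-learn (its oob_score option) and a synthetic dataset, is shown below; the complement of the reported OOB accuracy is the OOB error discussed above.

```python
# Minimal sketch: the OOB error is computed internally from the ~36.8% of
# samples left out of each tree's bootstrap sample, so no separate test set
# is required for this estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            bootstrap=True, random_state=0)
rf.fit(X, y)

# oob_score_ is the accuracy on out-of-bag samples; its complement is the
# out-of-bag (internal) error estimate discussed above.
print("OOB error estimate:", 1.0 - rf.oob_score_)
```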
3. Variable importance measure (VIM) of Random Forest
The random forest model often uses the variable importance measure to choose features across different categories. Díaz-Uriarte and de Andrés studied the use of random forests to identify a small set of informative genes. The study showed that the random forest model has a level of prediction accuracy similar to the k-NN, support vector machine (SVM), and Diagonal Linear Discriminant Analysis (DLDA) models. The authors showed the value of random forest variable importance in discerning useful variables: the method could be employed to identify a limited set of genes while maintaining prediction accuracy.
Through both simulation and empirical data, the model exhibits strong resilience with respect to its main parameters: the number of features available for each split, the minimum number of samples a node must contain before splitting stops (node size), and the number of trees in the forest. The study demonstrated that increasing the number of trees marginally enhanced the stability of the variable importance measurements. When the ratio of useful variables to the total number of variables is small, increasing mtry results in a slight improvement in prediction accuracy. Finally, it was observed that the node size parameter, which determines the minimum size of the terminal nodes, had limited impact on prediction accuracy. The study also delves into the problem of identifying predictors, specifically the multiplicity problem. This problem arises when there are multiple subsets of predictor variables that yield the same level of prediction accuracy. Multiplicity is a frequently encountered difficulty when the number of variables p is significantly greater than the number of observed cases n, a topic extensively studied in the statistical literature. According to the author, the multiplicity problem can be challenging in both small and large settings. One possible approach to address it is to employ a range of methodologies and see whether there is a specific group of variables that is consistently chosen by the majority of the models [14].
Random forests can be employed to assess the importance of variables in regression or classification problems using two importance measures. The first, known as Mean Decrease Impurity (MDI), calculates the overall reduction in node impurity obtained by splitting on a particular variable, averaged over all trees. The second, known as Mean Decrease Accuracy (MDA), measures the drop in prediction accuracy when the values of a variable are randomly permuted [14].
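The following hedged sketch contrasts the two measures, assuming scikit-learn: feature_importances_ corresponds to MDI, while permutation importance on held-out data serves as an MDA-style measure. The dataset and the number of permutation repeats are illustrative assumptions.

```python
# Minimal sketch of the two importance measures: Mean Decrease Impurity
# (feature_importances_ in scikit-learn) and Mean Decrease Accuracy
# (permutation importance on held-out data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# MDI: impurity reduction per feature, averaged over all trees.
print("MDI:", rf.feature_importances_.round(3))

# MDA: drop in accuracy when a feature's values are randomly permuted.
mda = permutation_importance(rf, X_test, y_test, n_repeats=20, random_state=0)
print("MDA:", mda.importances_mean.round(3))
```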
4. Proximity measures
Accessing the original data in large population epidemiological studies may be challenging due to privacy concerns associated with genetic data. Nevertheless, a study's summary statistics may be openly accessible and can be utilized by external entities for additional studies without any privacy worries. The summary data may consist of individual association coefficients (regression betas) and probability values (p-values) indicating the relationship between genetic factors and phenotypic traits. By comparing the association patterns across numerous variables for one genetic variant and another, we may potentially gain new insights into the functional similarities of these two variants. If two variants exhibit a comparable association pattern with a given collection of phenotypes, it is probable that both genes are functionally interconnected in some manner. Conversely, if their association patterns differ significantly, it is likely that they are functionally independent [15].
Various approaches have been proposed; several employ machine learning techniques, including kernel methods, graph-based methods, Markov random fields, and others.
Random Forests have been utilized in various genetic analysis projects in recent years due to their exceptional effectiveness in high-dimensional analysis. In addition to the classification model, Random Forests can provide several further measures, including the proximity matrix, feature importance values, and the local importance matrix. After training the Random Forest, the proximity matrix reflects the similarity between the samples in the Out-of-Bag (OOB) set; the OOB set is an internal validation set of the RF algorithm used to obtain performance measurements and the proximity matrix. The proximity between two samples is determined by counting the occurrences in which these two instances end up in the same terminal node of the same tree in the Random Forest, and then dividing that count by the total number of trees in the forest [15].
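A rough sketch of this proximity computation, assuming scikit-learn, is given below; for simplicity it counts co-occurrence in the same leaf over all samples rather than restricting to OOB pairs per tree, so it is only an approximation of the OOB-based proximity described in [15].

```python
# Minimal sketch of the proximity measure: count how often two samples fall
# into the same terminal node of the same tree, divided by the number of
# trees. (For simplicity this version uses all samples, not only OOB pairs.)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# apply() returns, for every sample, the index of its leaf in each tree.
leaves = rf.apply(X)                      # shape: (n_samples, n_trees)
same_leaf = (leaves[:, None, :] == leaves[None, :, :])
proximity = same_leaf.mean(axis=2)        # (n_samples, n_samples) in [0, 1]

print(proximity.shape, proximity[0, :5].round(2))
```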
5. Missing data
Random Forest is an ensemble method that combines many trees constructed from bootstrap samples of the original data. Random Forest is used for both classification and regression and provides many advantages, such as high accuracy, calculating a generalization error, determining the important variables and outliers, performing supervised and unsupervised learning, and imputing missing values with an algorithm based on the proximity matrix [16].
Random forest (RF) missing-data algorithms are an attractive approach for dealing with missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big-data settings. Currently, there are many different RF imputation algorithms but relatively little guidance about their efficacy, which motivated the study of their performance. Using a large, diverse collection of data sets, the performance of various RF algorithms was assessed under different missing-data mechanisms. The algorithms included proximity imputation.
Approaches for imputing missing values:
The first approach: the RF missing-value imputation algorithm based on the proximity matrix. RF calculates the (n × n) proximity matrix to evaluate the similarity of observations; the off-diagonal elements of the matrix give the similarity of two different observations [11]. Based on these proximity values, RF performs an iterative imputation process as follows: first, an initial forest is created after mean imputation and a proximity matrix is computed; the new imputed values are then calculated as a weighted average based on the proximities; with this updated data set, a new forest is created, yielding new proximities and new imputed values. It was found that 5 or 6 repetitions of this procedure are usually sufficient.
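A rough sketch of this iterative, proximity-based imputation loop is given below, assuming scikit-learn and a synthetic dataset; it is a simplified reading of the procedure (mean fill, forest, proximity-weighted update, repeated a few times) rather than Breiman's exact algorithm, and the missingness rate, forest size, and iteration count are illustrative assumptions.

```python
# Rough sketch of proximity-based imputation: start from a mean fill, fit a
# forest, then replace each missing entry by the proximity-weighted average
# of the observed values, repeating a few (about 5) times.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_miss = X.copy()
mask = rng.random(X.shape) < 0.1          # knock out ~10% of entries
X_miss[mask] = np.nan

# Step 1: initial mean imputation.
col_means = np.nanmean(X_miss, axis=0)
X_imp = np.where(np.isnan(X_miss), col_means, X_miss)

for _ in range(5):                        # 5-6 iterations are usually enough
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_imp, y)
    leaves = rf.apply(X_imp)
    prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    np.fill_diagonal(prox, 0.0)           # do not let a sample impute itself
    for j in range(X.shape[1]):
        missing = np.isnan(X_miss[:, j])
        if not missing.any():
            continue
        w = prox[np.ix_(missing, ~missing)]          # weights to observed rows
        w_sum = w.sum(axis=1, keepdims=True)
        w = np.divide(w, w_sum, out=np.full_like(w, 1.0 / w.shape[1]),
                      where=w_sum > 0)
        X_imp[missing, j] = w @ X_imp[~missing, j]   # proximity-weighted mean

print("mean absolute imputation error:", np.abs(X_imp[mask] - X[mask]).mean())
```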
The second approach: the k-nearest neighbor (KNN) imputation method, applied to the data set before fitting the RF. In the KNN imputation method [15]:
First, the neighbors are determined by calculating distance measures between observations; these measures are obtained through the Minkowski, Manhattan, or Euclidean functions.
Because the Euclidean distance is the most popular of these, it was used, and imputations were done based on weighted mean values of the k nearest neighbors.
The weights are inversely proportional to the distance measures. Not only different distance functions but also different variants of the KNN algorithm exist; some of them do not permit the neighbor values themselves to contain missing values, which might cause the method to give less efficient results [16].
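A minimal sketch of this second approach, assuming scikit-learn's KNNImputer (which uses a NaN-aware Euclidean distance and can weight neighbors inversely by distance), is shown below; the tiny matrix and the choice of k are illustrative only.

```python
# Minimal sketch of KNN imputation before fitting the RF: Euclidean-type
# distances and a distance-weighted mean of the k nearest neighbors.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# metric="nan_euclidean" ignores missing coordinates when measuring distance;
# weights="distance" makes closer neighbors count more in the weighted mean.
imputer = KNNImputer(n_neighbors=2, weights="distance", metric="nan_euclidean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```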
6. Using the random forest algorithm for classification and regression tasks
1. Ensemble learning: A random forest is an ensemble learning method, which means that it builds multiple decision trees during training and combines them to obtain a more accurate and stable prediction.
2. Decision trees: Each decision tree in the random forest is trained on a different subset of the training data and selects the best split at each node from a random subset of features. Such randomness helps to decorrelate the trees and prevent overfitting.
3. Bootstrap aggregation (bagging): A random forest uses a technique called bootstrap aggregation, or bagging, in which each tree is trained on a bootstrap sample (a subset randomly sampled with replacement) of the training data. This adds more randomness to the model and helps to improve generalization.
4. Voting or averaging: For classification tasks, a random forest combines the predictions of the individual trees by majority voting. For regression tasks, it averages the predictions of the individual trees to obtain the final prediction.
5. Feature importance: A random forest provides a measure of feature importance based on how much each feature reduces impurity across all decision trees. This can be useful for feature selection and for understanding the underlying patterns in the data.
6. Robustness and scalability: Random forests are known for their robustness to noisy data and outliers. They also handle high-dimensional data well and are relatively insensitive to hyperparameters, making them easy to use and less prone to overfitting compared to single decision trees.
7. Applications: Random forests are widely used in various fields, including, but not limited to:
Classification and regression problems in finance, such as credit scoring and stock price forecasting.
Healthcare for diagnosing disease and predicting patient outcomes.
Remote sensing for land cover classification and vegetation mapping.
Marketing, for segmenting customers and predicting customer behavior.
Natural language processing for text classification and sentiment analysis.
Random forest is a versatile and efficient algorithm that can produce high-quality predictions across a wide range of tasks
and datasets [12].
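To tie the steps above together, the following hedged sketch, assuming scikit-learn and synthetic data, trains a random forest classifier (bootstrap samples, random feature subsets, majority voting) and a random forest regressor (averaged predictions); all parameter values are illustrative assumptions.

```python
# Minimal sketch of the workflow above: bootstrap-sampled trees with random
# feature subsets, combined by majority vote (classification) or by
# averaging (regression). Data are synthetic and purely illustrative.
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: individual trees vote, the majority class is returned.
Xc, yc = make_classification(n_samples=800, n_features=15, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                             bootstrap=True, random_state=0).fit(Xc_tr, yc_tr)
print("classification accuracy:", accuracy_score(yc_te, clf.predict(Xc_te)))

# Regression: predictions of the individual trees are averaged.
Xr, yr = make_regression(n_samples=800, n_features=15, noise=5.0, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(Xr_tr, yr_tr)
print("regression MSE:", mean_squared_error(yr_te, reg.predict(Xr_te)))
```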
7. RESULTS
Random forests offer a set of benefits and features that make them a popular choice in various fields of machine learning and predictive analysis. Here are some of the main consequences of using random forests:
1. High accuracy: Random forests are powerful and reliable algorithms for many tasks, as they provide high accuracy in classification and forecasting.
2. Reduced variance: Thanks to the ensemble learning technique, random forests can reduce variance and increase the stability of predictions compared to individual classification models.
3. Reduced overfitting: Random forests build trees on different, independent subsamples, which reduces the problem of overfitting and improves the ability to generalize.
4. Strong performance on diverse data: Random forests work well on a variety of large and complex data sets and show a good ability to deal with missing data and noise.
5. Feature importance estimation: Random forests provide an estimate of the importance of features, enabling users to identify the most important features of the model and understand the relative impact of each feature on the results.
6. Ease of use and integration: Random forests can be easily implemented using software libraries such as scikit-learn in Python; they are compatible with most software environments and allow seamless integration with other machine learning technologies.
In general, random forests offer a set of significant benefits and positive results that make them a popular choice for solving
a wide range of problems in the field of predictive analysis and machine learning.
8. CONCLUSIONS
Random forests are a highly efficient tool for making accurate predictions. As a consequence of the law of large numbers, they do not overfit. By including appropriate levels of randomness, they become precise classifiers and regressors. Furthermore, the framework elucidates the predictive ability of a random forest in terms of the strength of the individual predictors and the correlations between them [13].
The random forest method is a user-friendly and adaptable machine learning technique. Ensemble learning is utilized to address regression and classification problems within businesses. The technique is well suited for developers because it effectively addresses the problem of overfitting. This method is highly valuable for generating precise forecasts that are essential for strategic decision-making in companies [14].
Aggregation approaches seek to enhance the accuracy of classification by combining the predictions of many classifiers. The greater the diversity and the weaker the correlations between the base classifiers, the higher the accuracy of the ensemble will be.
The random forest method employs:
1) the selection of a subset of examples/cases, as in bagging, which is referred to as subsampling; and
2) the selection of a subset of features, which is called feature selection.
Both of these tactics are employed in random forests to incorporate randomization and attain diversity.
Conflicts Of Interest
No competing financial interests are reported in the author's paper.
Funding
No grant or sponsorship is mentioned in the paper, suggesting that the author received no financial assistance.
Acknowledgment
The author would like to thank the institution for creating an enabling environment that fostered the development of this
research.
References
[1] V. Y. Kulkarni and P. K. Sinha, "Random Forest Classifiers: A Survey and Future Research Directions," Int. J. Adv. Comput., vol. 36, no. 1, 2013.
[2] L. Breiman, "Random Forests," Machine Learning, vol. 45, pp. 5-32, 2001.
[3] S. Sah, "Machine Learning: A Review of Learning Types," doi:10.20944/preprints202007.0230.v1, 2020.
[4] A. Abdi, "Three types of Machine Learning Algorithms," DOI: 10.13140/RG.2.2.26209.10088, 2016.
[5] "Decision Trees in Machine Learning," [Online]. Available: https://pdf.co/blog/decision-trees-in-machine-learning.
[6] R. Timofeev, "Classification and Regression Trees (CART) Theory and Applications," Master's Thesis, Center of
Applied Statistics and Economics, Humboldt University, Berlin, 2004.
[7] Y. Yohannes and J. Hoddinott, "Classification and Regression Trees: An Introduction," Int. Food Policy Res. Inst., USA, 2006.
[8] A. Cutler, D. R. Cutler, and J. R. Stevens, "Random Forests," in Ensemble Machine Learning, Springer, DOI: 10.1007/978-1-4419-9326-7_5, 2011.
[9] Y. Lu et al., "The Application of Improved Random Forest Algorithm on the Prediction of Electric Vehicle Charging
Load," Energies, doi:10.3390/en11113207, 2018.
[10] "An Introduction to Random Forest Algorithm for Beginners," [Online]. Available:
https://www.analyticsvidhya.com/blog/2021/10/an-introduction-to-random-forest-algorithm-for-beginners/.
[11] "Bagging: 25 Questions to Test Your Skills on Random Forest Algorithm," [Online]. Available:
https://www.analyticsvidhya.com/blog/2021/05/bagging-25-questions-to-test-your-skills-on-random-forest-
algorithm/.
[12] G. Biau, "A Random Forest Guided Tour," Sorbonne Universités, UPMC Univ Paris 06, F-75005, Paris, France &
Institut Universitaire de France, 2010.
[13] S. Janitza and R. Hornung, "On the overestimation of random forest's out-of-bag error," PLoS ONE, 2018.
[14] A. Hjerpe, "Computing Random Forests Variable Importance Measures (VIM) on Mixed Continuous and Categorical
Data," Kth Royal Institute of Technology, School of Computer Science and Communication, Stockholm, Sweden,
2016.
[15] J. A. Seoane, I. N. M. Day, et al., "A Random Forest proximity matrix as a new measure for gene annotation,"
European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges,
Belgium, i6doc.com publ., ISBN 978-287419095-7, 2014.
[16] H. Ozen and C. Bal, "A Study on Missing Data Problem in Random Forest," Osmangazi J. Med., doi:
10.20515/otd.496524, 2019.
[17] "Optimizing a Random Forest," [Online]. Available: https://medium.datadriveninvestor.com/optimizing-a-random-
forest.