A Comparative Analysis Study of Stock Prediction Based on Random Forest and Decision Tree
Abstract: As financial markets become increasingly complex and volatile, the need for accurate and efficient stock forecasting models grows. This study investigates the effectiveness and accuracy of machine learning methods in stock price forecasting. The performance of the random forest and decision tree algorithms in reducing stock investment risk is systematically evaluated and compared, using MSE, MAE and the R2 score as evaluation indicators under cross-validation. On this basis, the paper further examines the performance differences between the two algorithms in stock forecasting through hyperparameter tuning and feature importance analysis. The study found that random forest was better than decision tree in prediction accuracy: its MSE and MAE are lower, 2.605 and 1.152 respectively, and its R2 is closer to 1 (0.983). In contrast, the decision tree model is inferior, with a higher MSE and MAE, 5.124 and 1.644, and a lower R2 (0.966). This result highlights the ability of random forests to process complex data. Although this paper is limited by the size of the dataset and the extent of model optimization, the random forest algorithm still shows greater advantages over the decision tree algorithm in stock prediction. This study provides a reference for stock investors choosing a more accurate stock price prediction model, which can help reduce investment risks and increase returns.

Keywords: Random forests, decision trees, stock price predictions, predictive models, fintech

I. INTRODUCTION

With the global financial boom, the stock market has gradually become one of the largest capital markets, owing to its high yields and returns. However, the high volatility and uncertainty of the stock market also bring great risks and challenges to investors. Accurate prediction of stock price volatility therefore helps investors make more objective and rational investment decisions and, more importantly, helps maintain the stability and prosperity of a large capital market.

The financial equity sector is a typically complex nonlinear field. Stock prices are affected by a variety of factors such as macroeconomic conditions, government policy changes, company financial conditions and operating results, and market sentiment. It is therefore often difficult for traditional linear regression models to capture the complex relationships among multiple factors, and the resulting predictions are unsatisfactory. The emergence and widespread use of machine learning can gradually remedy these deficiencies.

Machine learning methods are data-driven: they can uncover underlying patterns in massive amounts of information, providing powerful support for decision-making and prediction under complexity and variability. For prediction in particular, machine learning algorithms are efficient and adaptive and have strong predictive power, and they have attracted the attention of many researchers [1]. In stock prediction, machine learning algorithms can be trained on large amounts of historical stock price data to capture the complex regularities of the stock market and thereby estimate future trends more accurately.

At present, research on population prediction based on different algorithms at home and abroad has gradually improved, mainly by collecting large amounts of historical population data, training models with various machine learning algorithms, and verifying the predictive ability of the models on a test set.

Recently, the use of machine learning methods such as support vector machines (SVMs), neural networks (NNs) and naive Bayes (NB) in stock prediction has gradually increased. These algorithms have their own characteristics and advantages: SVMs are suitable for high-dimensional data and classification problems, while NNs are better suited to modeling and predicting nonlinear problems. On the other hand, SVMs are sensitive to parameter settings and NNs are prone to overfitting, which motivates researchers to explore other machine learning algorithms.

This is where regression algorithms such as random forests and decision trees are a good choice. They have strong feature selection, nonlinear fitting and generalization capabilities, can adapt to different data distributions and feature types, and can automatically extract useful information from large datasets to produce more accurate predictions. Both algorithms are also easy to implement and tune, which favors their application and adoption in practice.

These methods can take into account a variety of factors such as financial reports, market information, company news and the macroeconomic environment. Meanwhile, researchers at home and abroad continue to explore how to optimize the application of the random forest algorithm in stock forecasting, for example by adjusting model parameters and improving feature selection. Although many studies have explored random forests, decision trees and other algorithms in stock forecasting individually, comparative analyses of these algorithms remain relatively limited, and systematic, in-depth research is lacking [2].

Because stock price prediction involves multiple factors and complex market dynamics, and traditional binary
Authorized licensed use limited to: University of Wollongong. Downloaded on June 02,2025 at 18:09:35 UTC from IEEE Xplore. Restrictions apply.
evaluations [9,10]. Therefore, in order to find the hyperparameter configuration that optimizes the performance of the random forest and decision tree models, a hyperparameter tuning experiment is carried out in this paper.

2) Evaluation: Through training and evaluating the models, this study examines the features and rules the models extract from the data, quantifies model performance, and tests whether it meets the preset standards. This helps to understand the strengths and weaknesses of each model and facilitates subsequent parameter adjustment and optimization. The closer R2 is to 1, the better the capacity of the model to fit the data.

III. RESULTS

A. Data Characteristics and Analysis

After preprocessing, the stock price trend, the trading volume changes, and box plots and histograms of the closing price grouped by month are plotted. These charts show the dynamic changes and distribution characteristics of the data and support the subsequent predictive analysis. The visualizations are shown in Figure 1.

Figure 1 shows that the closing price of the stock first rises and then falls over time, peaking around 2013. Trading volume declined somewhat but remained stable overall. In the monthly distribution of the closing price, the upper and lower limits change little across months; however, the height of the boxes shows that the closing prices go from scattered to concentrated as the months progress. Finally, the histogram shows that closing prices fall most frequently in the range of 140-160.

B. Evaluation Results

TABLE I: EVALUATION RESULTS OF RANDOM FOREST AND DECISION TREE MODELS
Model           MSE      MAE      R2
Random Forest   2.6053   1.1516   0.9829
Decision Tree   5.1241   1.6441   0.9663

The comparison in Table I shows that the random forest model performs better on the test set. Its MSE is 2.6053 and its MAE is 1.1516, both lower than the decision tree's 5.1241 and 1.6441, which indicates that the random forest predicts more accurately than the decision tree. The R2 of the random forest is 0.9829, which is closer to 1 than that of the decision tree, indicating a stronger fitting ability. Although the decision tree has a higher MSE and MAE than the random forest, it still achieves a good R2 of 0.9663. Overall, the random forest model outperforms the decision tree, probably because it combines the predictions of multiple trees and so better captures the complexity and nonlinear dynamics of the stock market; this gives investors a sound basis for more accurate predictions.

C. Cross-validation and Node Splitting Criterion

During training and evaluation, the dataset is repeatedly split into training and testing parts, so that the results do not depend on a single fixed split.

TABLE II: PERFORMANCE DIFFERENCES OF RANDOM FOREST AND DECISION TREE MODELS FOR REGRESSION TASKS IN CROSS-VALIDATION
Model           Cross-validated MSE
Random Forest   1.469 +/- 0.578
Decision Tree   3.195 +/- 1.363

Table II shows that the random forest model achieves an average MSE of 1.469 in cross-validation, a low prediction error. Further, the MSE varies little between folds, with a small standard deviation of 0.578. This demonstrates the random forest model's remarkable stability, generalization ability, and
robustness across various datasets. In contrast, the mean MSE of the decision tree model in cross-validation (3.195) is significantly higher than that of the random forest, and its values fluctuate more between folds (standard deviation 1.363). This shows that the decision tree has a larger prediction error on the regression task and that its performance is more sensitive to changes in the dataset.

Secondly, to assess how the random forest and decision tree models would perform in practical applications, this paper employs k-fold cross-validation with accuracy, recall and precision as the main evaluation metrics, which together give a fuller picture of model performance from several perspectives. The paper also evaluates the Gini index and entropy, two criteria for deciding how to split the data points at each node when a decision tree is constructed.

TABLE III: SELECTION OF NODE SPLITTING CRITERIA FOR RANDOM FOREST AND DECISION TREE MODELS
Criterion   Cv_fold   Accuracy   Recall   Precision
Gini        3         0.985      0.984    0.986
Gini        10        0.985      0.984    0.986
Gini        50        0.985      0.984    0.986
Gini        500       0.985      0.984    0.986
Entropy     3         0.985      0.984    0.986
Entropy     10        0.985      0.984    0.986
Entropy     50        0.985      0.984    0.986
Entropy     500       0.985      0.984    0.986
…           …         …          …        …

This paper uses the Gini index and entropy in turn as the node-splitting criterion for both the random forest and the decision tree, and observes the performance of the models under each criterion. In Table III, the accuracy, recall and precision of both models stay around 0.985 when cv_fold is 3, 10, 50 and 500; that is, there is no significant difference in the performance of the two models under the different node-splitting criteria. This may be because, for our dataset, both criteria are effective in guiding node splitting, and the resulting splits yield high prediction accuracy and stability.

Although the Gini index and entropy perform similarly as node-splitting criteria in these experiments, they differ in theory: the Gini index is simple to compute and suits multi-class classification problems, while entropy better reflects the uncertainty of the data and may be more effective for imbalanced datasets. Since stock price data are usually somewhat imbalanced, for example with uneven proportions of upward and downward movements, using entropy as the splitting criterion may help the model handle this kind of problem. In the experiments of this paper, however, both criteria achieved similar performance, owing to the specific nature of the dataset and the optimization of the model parameters.

D. Hyperparameter Tuning

The configuration of internal parameters can play an important role in prediction performance [11]. Two common methods, grid search and random search, were therefore applied to both models over three key hyperparameters: max_depth, min_samples_split and n_estimators, the last of which applies only to the random forest model. MSE was adopted as the evaluation index for comparing parameter combinations, and the results are shown in Table IV.
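As a rough illustration of the grid search named in this section, the sketch below tries every combination of max_depth and min_samples_split and keeps the one with the lowest validation MSE. It is not the authors' code: the toy one-feature regression tree (fit_tree, predict), the synthetic series and the grid values are all assumptions made so that the example is self-contained, and n_estimators is left out because it applies only to the random forest.

```python
from itertools import product
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def fit_tree(xs, ys, max_depth, min_samples_split, depth=0):
    """Fit a toy 1-D regression tree; a leaf is a number, a node is a dict."""
    values = sorted(set(xs))
    if depth >= max_depth or len(xs) < min_samples_split or len(values) < 2:
        return mean(ys)
    best = None
    for thr in values[1:]:  # every such threshold leaves both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < thr]
        right = [y for x, y in zip(xs, ys) if x >= thr]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr)
    thr = best[1]
    lo = [(x, y) for x, y in zip(xs, ys) if x < thr]
    hi = [(x, y) for x, y in zip(xs, ys) if x >= thr]
    return {
        "thr": thr,
        "left": fit_tree([x for x, _ in lo], [y for _, y in lo],
                         max_depth, min_samples_split, depth + 1),
        "right": fit_tree([x for x, _ in hi], [y for _, y in hi],
                          max_depth, min_samples_split, depth + 1),
    }


def predict(tree, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(tree, dict):
        tree = tree["left"] if x < tree["thr"] else tree["right"]
    return tree


def grid_search(train, valid, grid):
    """Try every parameter combination; return (best_params, best_mse)."""
    (x_tr, y_tr), (x_va, y_va) = train, valid
    best_params, best_mse = None, float("inf")
    for depth, min_split in product(grid["max_depth"], grid["min_samples_split"]):
        tree = fit_tree(x_tr, y_tr, depth, min_split)
        score = mse(y_va, [predict(tree, x) for x in x_va])
        if score < best_mse:
            best_params = {"max_depth": depth, "min_samples_split": min_split}
            best_mse = score
    return best_params, best_mse


# Synthetic, deterministic toy series (an assumption, not the paper's data).
xs = list(range(60))
ys = [0.5 * x + (x % 7) for x in xs]
train = (xs[0::2], ys[0::2])  # even positions are used for fitting
valid = (xs[1::2], ys[1::2])  # odd positions are used for validation
grid = {"max_depth": [1, 3, 5, 10], "min_samples_split": [2, 5, 10]}
best_params, best_mse = grid_search(train, valid, grid)
print(best_params, round(best_mse, 3))
```

Random search follows the same loop but samples a fixed number of combinations instead of enumerating all of them, which is cheaper when the grid is large.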
TABLE IV: HYPERPARAMETER TUNING FOR RANDOM FOREST AND DECISION TREE MODELS
                    Random forest   Decision tree
Best MSE            48.827          36.487
max_depth           10              10
min_samples_split   2               10
n_estimators        50              /
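Table IV lists n_estimators only for the random forest because that parameter sets how many trees the ensemble averages. The sketch below illustrates the averaging with plain bagging of depth-1 regression trees, which is a simplification of a random forest (there is no per-split feature subsampling), and shows that a fixed seed makes the bootstrap draws reproducible. The helper names, the synthetic data and the seed value are assumptions for illustration only.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: (threshold, left_mean, right_mean)."""
    values = sorted(set(xs))
    if len(values) < 2:  # a bootstrap sample may contain a single x value
        m = mean(ys)
        return (values[0], m, m)
    best = None
    for thr in values[1:]:
        left = [y for x, y in zip(xs, ys) if x < thr]
        right = [y for x, y in zip(xs, ys) if x >= thr]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    """Return the mean of the side of the threshold that x falls on."""
    thr, lm, rm = stump
    return lm if x < thr else rm


def fit_bagged(xs, ys, n_estimators, seed):
    """Fit n_estimators stumps, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Average the predictions of all stumps."""
    return mean(predict_stump(s, x) for s in stumps)


# Synthetic, deterministic toy series (an assumption, not the paper's data).
xs = list(range(40))
ys = [0.5 * x + (x % 5) for x in xs]
forest_a = fit_bagged(xs, ys, n_estimators=50, seed=42)
forest_b = fit_bagged(xs, ys, n_estimators=50, seed=42)
print(predict_bagged(forest_a, 12.5), predict_bagged(forest_b, 12.5))
```

Two ensembles built with the same seed are identical here; changing the seed changes the bootstrap samples and therefore, in general, the fitted stumps.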
Hyperparameter tuning shows that an appropriate hyperparameter configuration can significantly improve a model's performance [12]. It is worth noting that, although the tuned decision tree is better than the tuned random forest in terms of MSE in Table IV (36.487 versus 48.827), this result is likely to be affected by several factors. For example, the dataset selected in this paper is small and has few features, so the ensemble effect of the random forest is not pronounced and a single decision tree can fit the data better. In addition, the randomness in the random forest's training process may still affect the result, even when a fixed random seed is used. Model selection should also consider other performance indicators and model robustness in the specific practical application.

E. Analysis of Experimental Results

In Table I, random forest is superior to the decision tree and better captures changes in the stock market, with a lower MSE, a smaller MAE, and a coefficient of determination closer to 1. Tables II and III report the cross-validation results, which show that different node-splitting criteria do not significantly affect the performance of the two models. In Table IV, hyperparameter tuning reveals that well-tuned decision trees are comparable to random forests in some metrics, suggesting that in specific contexts, such as simple hierarchical structure and smaller datasets, decision trees model the data equally well.

After further tuning, this paper uses the random forest model to obtain the importance of different features in stock forecasting, as shown in Figure 2.

Figure 2 Feature importance analysis of the random forest model
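Figure 2 presumably reports the importances computed by the fitted random forest itself. As a model-agnostic way to see what such a ranking measures, the sketch below uses permutation importance instead, which is a different technique: each feature column is shuffled in turn and the resulting increase in MSE is recorded. The stand-in model, its weights, the feature names and the synthetic rows are assumptions, not the paper's data or results.

```python
import random
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def permutation_importance(predict, rows, targets, feature_names, seed=0):
    """Increase in MSE after shuffling each feature column once."""
    rng = random.Random(seed)
    baseline = mse(targets, [predict(r) for r in rows])
    importances = {}
    for j, name in enumerate(feature_names):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances[name] = mse(targets, [predict(r) for r in shuffled]) - baseline
    return importances


# Hypothetical stand-in for a fitted model: uses low and high, ignores volume.
def toy_model(row):
    low, high, volume = row
    return 0.6 * low + 0.4 * high


feature_names = ["low", "high", "volume"]
data_rng = random.Random(1)
# Synthetic rows (low, high, volume); an assumption, not the paper's data.
rows = [(100 + i, 105 + 2 * i, data_rng.randrange(1000)) for i in range(30)]
targets = [toy_model(r) for r in rows]

scores = permutation_importance(toy_model, rows, targets, feature_names)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

In this toy setup the unused volume feature scores exactly zero, because shuffling it cannot change any prediction, while the features the model relies on score above zero.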
The feature importance chart reflects the primary and secondary influencing factors in the experiment. Figure 2 shows the order of importance of the influencing factors. The lowest price has the greatest impact on the model, followed by the highest price. A possible reason is that the lowest price usually reflects pessimism or selling pressure in the market at a specific time, which is helpful for predicting future price movements. The highest price, on the other hand, reflects optimism and buying pressure, which helps to understand changes in market dynamics.

IV. CONCLUSION

By comparing the random forest and the decision tree on a stock prediction problem, this study observes that the random forest is superior to the decision tree in overall performance. Specifically, the MSE of the random forest is 2.6053, much smaller than the decision tree's 5.1241, so its prediction error is smaller. The MAE of the random forest is 1.1516, also lower than that of the decision tree (1.6441), and its R2 of 0.9829 is closer to 1 than the decision tree's 0.9663, which again indicates that the prediction accuracy of the random forest is higher. This finding provides an important reference for financial investors choosing a prediction model for stock prices. At the same time, the decision tree model can also show strong predictive ability under specific circumstances, given careful hyperparameter tuning and feature selection. In practical applications, investors should therefore select an appropriate prediction model and tune it according to their specific needs and data characteristics.

This study has limitations. The dataset was small and, because of its source, offered limited features and labels. As for the model, future work could explore new feature engineering methods, consider heuristic algorithms to optimize the model, try other ensemble learning methods such as gradient boosted decision trees and XGBoost, or tune the random forest parameters more comprehensively to arrive at a better solution.

REFERENCES

[1] Zhan Z, Kim K S. Versatile time-window sliding machine learning techniques for stock market forecasting. Artificial Intelligence Review, 2024, 57(8).
[2] Malti B, Apoorva G, Apoorva C. Stock Market Prediction with High Accuracy using Machine Learning Techniques. Procedia Computer Science, 2022, 215: 247-265.
[3] Bhatta A, Poudyal P, Maharjan K D, et al. Assessing Machine Learning's Accuracy in Stock Price Prediction. International Journal of Computer (IJC), 2023, 49(1): 46-63.
[4] Shilpa S, Millie P, Varuna G. Analysis and prediction of Indian stock market: a machine-learning approach. International Journal of System Assurance Engineering and Management, 2023, 14(4): 1567-1585.
[5] Guan Jun, Zhang Shaopeng, Ren Yue, et al. Spatial and temporal differentiation and influencing factors evolution of agricultural net carbon sinks in China based on random forest model [J/OL]. China Environmental Science: 1-13 [2023-10-19]. https://doi.org/10.19674/j.cnki.issn1000-6923.20230928.002.
[6] Yan Wenxin. Application of machine learning in stock forecasting. Information Systems Engineering, 2024, (04): 40-43.
[7] Liu Yizhe. Research on dynamic remote sensing monitoring of drought in northern Tibet [D]. Chengdu University of Information Science & Technology, 2020. DOI:10.27716/d.cnki.gcdxx.2020.000108.
[8] Song Feiyu, Lu Xiaochun, Li Ke. Research on Location Model of Shanghai Express Outlets Based on Big Data and Machine Learning. Conference Proceedings of the 8th International Symposium on Project Management, China (ISPM2020). Ed. School of Economics and Management, Beijing Jiaotong University; Beijing Logistics Informatics Research Base; Land Space Technology Corporation Ltd; 2020: 1430-1436.
[9] Long Xu, et al. Machine learning method to predict dynamic compressive response of concrete-like material at high strain rates. Defence Technology, 2023, 23: 100-111.
[10] Gao Yuan, Huang Li. Stock Price Index Prediction Based on Deep Learning. Software Engineering, 2024, 27(5): 7-13. DOI:10.19644/j.cnki.issn2096-1472.2024.005.002.
[11] Thakkar A, Chaudhari K. Applicability of genetic algorithms for stock market prediction: A systematic survey of the last decade. Computer Science Review, 2024, 53: 100652.
[12] Das D J, Thulasiram K R, Henry C, et al. Encoder–Decoder Based LSTM and GRU Architectures for Stocks and Cryptocurrency Prediction. Journal of Risk and Financial Management, 2024, 17(5): 200.