A Comparative Analysis Study of Stock Prediction Based on Random Forest and Decision Tree

This study evaluates the effectiveness of random forest and decision tree algorithms for stock price forecasting, finding that random forest outperforms decision tree in terms of accuracy, with lower mean square error (MSE) and mean absolute error (MAE). The research utilizes historical stock data and employs cross-validation to assess the models' predictive capabilities, concluding that random forests better capture the complexities of stock market dynamics. The findings provide insights for investors seeking more reliable stock prediction models to mitigate risks and enhance returns.
2024 International Conference on Electronics and Devices, Computational Science (ICEDCS)

A Comparative Analysis Study of Stock Prediction Based on Random Forest and Decision Tree

979-8-3315-2762-4/24/$31.00 ©2024 IEEE | DOI: 10.1109/ICEDCS64328.2024.00022


Yixiao Gu
The School of Computing Science, University of Glasgow, Glasgow, Scotland, G32, United Kingdom
[email protected]

Abstract—As financial markets become increasingly complex and volatile, so does the need for accurate and efficient stock forecasting models. This study investigates the effectiveness and accuracy of machine learning methods in stock price forecasting. The performance of the random forest and decision tree algorithms in reducing stock investment risk is systematically evaluated and compared using MSE, MAE and R² score as evaluation indicators under cross-validation. On this basis, the paper also examines the performance differences between the two algorithms in stock forecasting through hyperparameter tuning and feature importance analysis. The study found that random forest was better than decision tree in prediction accuracy. The MSE and MAE of random forest are lower, 2.605 and 1.152 respectively, and its R² is closer to 1 (0.983). In contrast, the decision tree model performs worse, with a higher MSE and MAE, 5.124 and 1.644, and a lower R² (0.966). This result highlights the ability of random forests to process complex data. Although this paper has some limitations in the size of the dataset and the optimization of the model, the random forest algorithm still shows greater advantages in the comparative analysis with the decision tree algorithm in stock prediction. This study provides a reference for stock investors choosing a more accurate stock price prediction model, which can help reduce investment risks and increase returns.

Keywords—Random forests, decision trees, stock price predictions, predictive models, fintech

I. INTRODUCTION

With the global financial boom, equity investment has gradually become one of the largest capital markets with its high yields and returns. However, the high volatility and uncertainty of the stock market also bring great risks and challenges to investors. Therefore, accurate prediction of stock price volatility helps investors make more objective and rational investment decisions and, more importantly, helps to maintain the stability and prosperity of a large capital market.

The financial equity sector is a typically complex nonlinear field. Stock prices are affected by a variety of factors such as macroeconomic conditions, government policy changes, company financial conditions and operating results, and market sentiment. Therefore, it is often difficult for traditional linear regression models to capture the complex relationships among multiple factors, and the final prediction results are unsatisfactory. The emergence and widespread use of machine learning can gradually remedy these deficiencies.

Using data-driven machine learning methods, underlying patterns can be found in massive amounts of information, providing powerful support for decision-making and prediction under complexity and variability. Especially in prediction, machine learning algorithms are efficient, adaptive and highly predictive, and have attracted the attention of many researchers [1]. When it comes to stock prediction, machine learning algorithms can learn future trends more accurately by training on large amounts of historical stock price data to understand the complex laws of the stock market.

At present, the research on population prediction based on different algorithms at home and abroad has been gradually improved, mainly by collecting a large amount of historical population data, using various machine learning algorithms for model training, and verifying the prediction ability of the model on a test set.

Recently, the use of machine learning methods in stock prediction, such as support vector machines (SVMs), neural networks (NNs) and naive Bayes (NB), has gradually increased. These algorithms have their own characteristics and advantages: SVMs are suitable for high-dimensional data and classification problems, while NNs are better suited to modeling and predicting nonlinear problems. However, SVMs are sensitive to parameters and NNs overfit easily, which requires researchers to explore other machine learning algorithms.

This is where regression algorithms such as random forests and decision trees are a good choice. They have strong feature selection, nonlinear fitting and generalization capabilities, can adapt to different data distributions and feature types, and automatically extract available information from large datasets to derive more accurate prediction results. At the same time, the two algorithms are easy to implement and tune, which provides a good opportunity for their application and promotion in real life.

These types of methods can take into account a variety of factors such as financial reports, market information, company news and the macroeconomic environment. At the same time, researchers at home and abroad are constantly exploring how to optimize the application of the random forest algorithm in stock forecasting, using methods such as adjusting model parameters and improving feature selection. Although a large number of studies have explored the application of random forests, decision trees and other algorithms in stock forecasting individually, comparative analyses of these algorithms in stock forecasting are still relatively limited, and systematic, in-depth research is lacking [2].

Because stock price prediction involves multiple factors and complex market dynamics, and traditional binary

prediction may ignore the continuity and complexity of stock prices, this paper makes predictions through non-binary prediction methods and cautiously assesses their effectiveness and applicability.

The purpose of this article is to examine the accuracy of two machine learning algorithms (random forest and decision tree) in stock prediction using metrics such as mean square error (MSE), mean absolute error (MAE) and R² score, and then to compare and contrast the two algorithms through hyperparameter tuning and feature importance analysis, so as to provide a reference for investors' future financial investment. The data analysis and graphing in this article were carried out using the Python programming language and libraries such as Pandas, NumPy, Matplotlib, and scikit-learn [3].

II. DATA AND METHOD

A. Data Sources and Preprocessing

The dataset studied in this paper is derived from stock trading data in the financial markets, including IBM's stock price and trading volume information. The dataset is derived from the population prediction dataset published on the whale community in 2022 and is highly reliable and real-time (https://www.heywhale.com/mw/dataset/634d1afda40ed1671d65423b).

This dataset covers daily trading records from 2006 to 2018, and each record contains basic information such as date, opening price, high price, low price, closing price and trading volume, which provides rich data support for the study of price fluctuations and trading behavior in the stock market.

In the data preprocessing stage, the study first cleaned and converted the data, specifically converting the date column to datetime format for time series analysis and extracting time features such as year, month, day, day of the week, and quarter in order to enhance the predictive ability of the model. To eliminate differences in magnitude between indicators, the study then applied a further, more standardized round of processing to the data. After preprocessing, all the time features in the 'DataFrame' are of type int32 and there is no invalid data.

Due to the unbalanced nature of the stock market, the price of stocks can be volatile or relatively stable [4]. To balance the dataset, this paper adopts a resampling technique, down-sampling stocks with large price fluctuations and up-sampling those with small price fluctuations. Finally, the study obtained a balanced dataset containing about 3000 samples.

B. Method

The algorithms used in this paper are random forests and decision trees.

Random forest is an ensemble learning method based on decision trees. It constructs a number of decision trees for prediction, each of which is built by randomly selecting samples and features from the original dataset for training [5]. For example, to improve the accuracy and diversity of the model, it can handle the complex nonlinear relationships of stocks by capturing the relationship between different features and the target price. Random forest models can also effectively reduce overfitting, which improves generalization and allows stock price movements to be predicted more accurately.

Decision trees are another intuitive and easy-to-implement machine learning algorithm that builds a tree structure by recursively dividing a dataset into subsets [6]. Decision trees show decision paths that are easy to understand and interpret.

These two algorithms both have strong feature selection, nonlinear fitting and generalization capabilities, can adapt to different data distributions and feature types, and automatically extract available information from large datasets to derive more accurate prediction results. At the same time, these two algorithms are relatively easy to implement and tune, which facilitates their practice and application in real life.

C. Training and Evaluation

1) Training: Through the training and evaluation of the model, this study understands the features and rules the model extracts from the data, further quantifies the model and tests whether it meets the preset standards. This helps to understand the advantages and disadvantages of the model and facilitates its subsequent parameter adjustment and optimization.

In order to obtain a more reliable model performance estimate, the model is trained and tested on different subsets. This method not only helps guard against overfitting, but also helps to understand the generalization ability of the model under different data distributions. First, the entire dataset is divided into two parts: the training set, which accounts for 80% of the total, and the test set, which accounts for 20%, with 2415 and 604 samples respectively. Each sample in both groups contains 9 features and a target variable that represents the closing price. This split is intended to assess the model's capacity to generalize to unseen data. To handle potential missing values (NaN) in the dataset, this paper adopts mean imputation, which fills missing values with the mean of the corresponding feature in the training set to ensure the integrity of the model input data.

Next, to extract useful features from the available data and improve model performance, this paper employs common feature engineering techniques to compute technical indicators including Moving Averages (MA), Relative Strength Index (RSI), and Bollinger Bands.

In the training phase, the opening price, the highest price, the lowest price, the trading volume, the calculated moving average, the upper and lower limits of the Bollinger Bands and the RSI were chosen as the feature variables, and the closing price of the stock was used as the target variable. On this basis, this study trained two models, random forest and decision tree.

The performance of these two algorithms on the regression task is then evaluated through cross-validation, providing quantitative information about the difference between the predicted and actual values [7,8]. Different hyperparameter configurations can result in significant differences in model performance between the training and test sets across model training and multiple

evaluations [9,10]. Therefore, in order to find the hyperparameter configuration that optimizes the performance of the random forest and decision tree models, a hyperparameter tuning experiment is carried out in this paper.

2) Evaluation: Through the training and evaluation of the model, this study understands the features and rules the model extracts from the data, further quantifies the model and tests whether it meets the preset standards. This helps to understand the advantages and disadvantages of the model and facilitates its subsequent parameter adjustment and optimization. The closer R² is to 1, the better the capacity of the model.

III. RESULTS

A. Data Characteristics and Analysis

After preprocessing, the stock price trend, the trading volume changes, closing price box plots grouped by month and a closing price histogram are plotted. These charts intuitively show the dynamic changes and distribution characteristics of the data and support the subsequent predictive analysis. The visualization results are shown in Figure 1.

Figure 1 Visualization of the research data
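The numbers behind a month-wise box plot and a closing-price histogram can be sketched without any plotting library. The example below is only an illustration under stated assumptions: the prices are a synthetic random walk rather than the paper's IBM data, and the names `records`, `monthly_boxes`, `histogram` and the bin width of 10 are hypothetical choices, not taken from the paper.

```python
import random
import statistics
from collections import defaultdict
from datetime import date, timedelta

random.seed(0)

# Synthetic daily closing prices (a random walk); NOT the paper's IBM data.
start = date(2006, 1, 2)
price = 150.0
records = []
for i in range(730):
    day = start + timedelta(days=i)
    if day.weekday() < 5:  # keep weekdays only
        price = max(1.0, price + random.gauss(0, 1.5))
        records.append((day, price))

# Group closing prices by calendar month, as in a month-wise box plot.
by_month = defaultdict(list)
for day, close in records:
    by_month[day.month].append(close)

# Quartiles give the box edges and the median line for each month.
monthly_boxes = {}
for month in sorted(by_month):
    q1, median, q3 = statistics.quantiles(by_month[month], n=4)
    monthly_boxes[month] = (q1, median, q3)

# Fixed-width bins (an assumed width of 10) give the histogram counts.
bin_width = 10
histogram = defaultdict(int)
for _, close in records:
    histogram[int(close // bin_width) * bin_width] += 1

for month, (q1, median, q3) in monthly_boxes.items():
    print(f"month {month:2d}: Q1={q1:.1f} median={median:.1f} Q3={q3:.1f}")
```

The distance between Q1 and Q3 corresponds to the box height, so a shrinking gap across months is what "from scattered to concentrated" would look like numerically.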

From Figure 1, it can be seen that the closing price of the stock first rises and then falls over time, with the highest point around 2013. In terms of trading volume, although the volume showed a certain decline, it remained stable overall. As for the closing price distribution by month, the upper and lower limits of the closing price have not changed much. However, the height of the boxes shows that the closing price data changes from scattered to concentrated across the months. The closing price histogram shows that, in terms of frequency, the closing price is mostly concentrated in the range of 140-160.

B. Evaluation Results

TABLE I: EVALUATION RESULTS OF RANDOM FOREST AND DECISION TREE MODELS

Model          MSE     MAE     R²
Random Forest  2.6053  1.1516  0.9829
Decision Tree  5.1241  1.6441  0.9663

The comparison in Table I shows that the random forest model performs better on the test set. Its MSE is 2.6053 and its MAE is 1.1516, both lower than the decision tree's 5.1241 and 1.6441, which indicates that the random forest predicts more accurately than the decision tree. At the same time, the R² of the random forest is 0.9829, which is closer to 1 than that of the decision tree, indicating that the random forest model has stronger fitting ability. Although the decision tree model has a higher MSE and MAE than the random forest, it still achieves a good R² of 0.9663. In conclusion, the random forest model is better than the decision tree overall, probably because it integrates the predictions of multiple trees and better captures the complexity and nonlinear dynamics of the stock market, which also lays a good foundation for providing investors with more accurate predictions.

C. Cross-validation and Node Splitting Criterion

The dataset is separated into training and testing parts during the model's training and evaluation rather than being used as a single unified set.

TABLE II: PERFORMANCE DIFFERENCES OF RANDOM FOREST AND DECISION TREE MODELS FOR REGRESSION TASKS IN CROSS-VALIDATION

Model          Cross-validated MSE
Random Forest  1.469 +/- 0.578
Decision Tree  3.195 +/- 1.363

Looking at Table II, the random forest model achieves an average MSE of 1.469 in cross-validation, showing a low prediction error. Further, the MSE values vary little between folds, with a small standard deviation of 0.578. This demonstrates the random forest model's remarkable stability, generalization ability, and

robustness across various datasets. In contrast, the mean MSE of the decision tree model in cross-validation (3.195) is more than twice that of the random forest (1.469). Its values also fluctuate more between folds (standard deviation 1.363), which likewise shows that the decision tree has a larger prediction error on the regression task and that its performance is more likely to be affected by changes in the dataset.

Secondly, to assess how the constructed random forest and decision tree models would perform in practical applications, this paper employs k-fold cross-validation using accuracy, recall and precision as the main assessment measures, which give a complete view of the models' performance from various perspectives. The paper also evaluates the Gini index and entropy, two criteria used for deciding how to partition the data points when constructing the decision tree.

TABLE III: SELECTION OF NODE SPLITTING CRITERIA FOR RANDOM FOREST AND DECISION TREE MODELS

Criterion  Cv_fold  Accuracy  Recall  Precision
Gini       3        0.985     0.984   0.986
Gini       10       0.985     0.984   0.986
Gini       50       0.985     0.984   0.986
Gini       500      0.985     0.984   0.986
Entropy    3        0.985     0.984   0.986
Entropy    10       0.985     0.984   0.986
Entropy    50       0.985     0.984   0.986
Entropy    500      0.985     0.984   0.986
…          …        …         …       …

This paper uses the Gini index and entropy as the node partitioning criteria for random forest and decision tree, respectively, and observes the performance of the models under the different criteria. In Table III, the accuracy, recall and precision of the random forest model and the decision tree model stay around 0.985 when cv_fold is 3, 10, 50, and 500, which shows that there is no significant difference in the performance of the two models under different node partitioning criteria. This may be because in our dataset, both criteria are effective in guiding node splitting, and the resulting splits have high prediction accuracy and stability.

Although the Gini index and entropy perform similarly as node splitting criteria in the experiments of this paper, from a theoretical point of view the Gini index is simple to compute and suitable for multi-category classification problems, while entropy better reflects the uncertainty of the data and may be more effective for unbalanced datasets. In stock price prediction, since the data usually present certain imbalances, such as uneven proportions of upward and downward movements, using entropy as the node splitting criterion may help the model better deal with this kind of problem. However, in the experiments of this paper, both criteria achieved similar performance due to the specific nature of the dataset and the optimization of the model parameters.

D. Hyperparameter Tuning

The configuration of internal parameters can play an important role in prediction performance [11]. Therefore, two common methods, grid search and random search, were applied to the two models, and three key hyperparameters were chosen: max_depth, min_samples_split and n_estimators, the last of which applies only to the random forest model. To evaluate the performance of different parameter combinations, this paper adopted MSE as the evaluation index; the results are displayed in Table IV.
TABLE IV: HYPERPARAMETER TUNING FOR RANDOM FOREST AND DECISION TREE MODELS

Item                                 Random forest  Decision tree
Best MSE                             48.827         36.487
Best parameter: max_depth            10             10
Best parameter: min_samples_split    2              10
Best parameter: n_estimators         50             /
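A tuning experiment of the kind summarized in Table IV can be sketched with scikit-learn's `GridSearchCV`. This is a minimal illustration under stated assumptions, not the paper's actual procedure: the data are synthetic, the candidate values in `tree_grid` and `forest_grid` are hypothetical (the paper does not list its search grid), and 3-fold cross-validation is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and closing-price target.
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical candidate values; the paper does not list its search grid.
tree_grid = {"max_depth": [5, 10], "min_samples_split": [2, 10]}
forest_grid = {**tree_grid, "n_estimators": [10, 50]}


def tune(model, grid):
    # scikit-learn maximizes scores, so MSE is supplied in negated form.
    search = GridSearchCV(model, grid, scoring="neg_mean_squared_error", cv=3)
    search.fit(X, y)
    return search.best_params_, -search.best_score_


tree_params, tree_mse = tune(DecisionTreeRegressor(random_state=0), tree_grid)
forest_params, forest_mse = tune(RandomForestRegressor(random_state=0), forest_grid)

print("decision tree:", tree_params, round(tree_mse, 3))
print("random forest:", forest_params, round(forest_mse, 3))
```

Random search follows the same pattern with `RandomizedSearchCV` from the same module, which samples a fixed number of combinations instead of trying every one.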

Through hyperparameter tuning, it can be found that an appropriate hyperparameter configuration can significantly improve the model's performance [12]. It is worth noting that although the decision tree model is better than the random forest model in terms of MSE here, this result is likely to be affected by many factors. For example, the dataset selected in this paper has a small amount of data or few features, so the ensemble effect of the random forest is not obvious and a single decision tree can instead fit the data better; or the randomness of the random forest training process may play a role, since even if a fixed random seed is used there will still be some random effects. Model selection should also consider other performance indicators and model robustness in specific practical applications.

E. Analysis of Experimental Results

In stock forecasting, as Table I shows, random forest is superior to decision tree and can better capture changes in the stock market, because of its lower MSE, smaller MAE, and higher coefficient of determination, which is closer to 1. From Tables II and III, cross-validation shows that different node partitioning criteria do not significantly affect the performance of the two models. In Table IV, hyperparameter tuning reveals that well-tuned decision trees are comparable to random forests in some metrics, suggesting that specific contexts such as hierarchical simplicity and smaller datasets are equally well modeled using decision trees.

After further tuning, this paper uses the random forest model to obtain the importance of different features in stock forecasting, as shown in Figure 2.

Figure 2 Feature importance analysis of the random forest model
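A feature importance ranking of the kind plotted for a random forest can be sketched with the `feature_importances_` attribute of scikit-learn's `RandomForestRegressor`. The example is a hedged illustration only: the price series is synthetic rather than the paper's IBM data, only four of the paper's features are mimicked, and the hyperparameters simply reuse the random forest values reported in Table IV.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300

# Synthetic price series; NOT the paper's IBM data.
close = 150 + np.cumsum(rng.normal(scale=1.0, size=n))
open_ = close + rng.normal(scale=0.5, size=n)
high = np.maximum(open_, close) + np.abs(rng.normal(scale=0.5, size=n))
low = np.minimum(open_, close) - np.abs(rng.normal(scale=0.5, size=n))
volume = rng.integers(1_000_000, 5_000_000, size=n).astype(float)

feature_names = ["Open", "High", "Low", "Volume"]
X = np.column_stack([open_, high, low, volume])

# Hyperparameters taken from the random forest column of Table IV.
model = RandomForestRegressor(
    n_estimators=50, max_depth=10, min_samples_split=2, random_state=0
)
model.fit(X, close)

# Impurity-based importances: one non-negative value per feature, summing to 1.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name:>6}: {score:.3f}")
```

Impurity-based importances can be split unevenly among strongly correlated features such as the daily high and low, so a ranking like this is best read as indicative rather than exact.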

The feature importance chart reflects the primary and secondary influencing factors in the experiment. Figure 2 shows the order of importance of the influencing factors for this experiment. The lowest price has the greatest impact on the model, followed by the highest price. This paper argues that a possible reason is that the lowest price usually reflects the pessimism or selling pressure in the financial market at a specific time, which is helpful for predicting future price movements. The highest price, on the other hand, reflects optimism and buying pressure, which helps to understand changes in market dynamics.

IV. CONCLUSION

In this study, by comparing the practical applications of random forest and decision tree in stock prediction problems, it is observed that the random forest is superior to the decision tree model in terms of overall performance. Specifically, the MSE of the random forest is 2.6053, much smaller than the decision tree's 5.1241, so the prediction error of the random forest is smaller. The MAE of the random forest is 1.1516, also lower than that of the decision tree (1.6441), and the R² of the random forest (0.9829) is closer to 1 than that of the decision tree (0.9663), which again indicates that the prediction accuracy of the random forest is higher. This finding provides an important reference for financial investors choosing a prediction model for stock prices. At the same time, this paper also notes that the decision tree model can show strong predictive ability under specific circumstances, through fine hyperparameter tuning and feature selection. Therefore, in practical applications, investors should select appropriate prediction models and perform appropriate tuning according to specific needs and data characteristics.

Nevertheless, this study has limitations. In terms of dataset size, the data used in this analysis were small and, owing to the source dataset, offered limited features and labels. In terms of the model, future research can explore new feature engineering methods, consider heuristic algorithms to optimize the model, try other ensemble learning methods such as gradient boosted decision trees and XGBoost, or tune the random forest parameters more comprehensively to arrive at a better solution.

REFERENCES

[1] Zhan Z, Kim K S. Versatile time-window sliding machine learning techniques for stock market forecasting. Artificial Intelligence Review, 2024, 57(8).
[2] Malti B, Apoorva G, Apoorva C. Stock Market Prediction with High Accuracy using Machine Learning Techniques. Procedia Computer Science, 2022, 215: 247-265.
[3] Bhatta A, Poudyal P, Maharjan K D, et al. Assessing Machine Learning's Accuracy in Stock Price Prediction. International Journal of Computer (IJC), 2023, 49(1): 46-63.
[4] Shilpa S, Millie P, Varuna G. Analysis and prediction of Indian stock market: a machine-learning approach. International Journal of System Assurance Engineering and Management, 2023, 14(4): 1567-1585.
[5] Guan Jun, Zhang Shaopeng, Ren Yue, et al. Spatial and temporal differentiation and influencing factors evolution of agricultural net carbon sinks in China based on random forest model [J/OL]. China Environmental Science: 1-13 [2023-10-19]. https://doi.org/10.19674/j.cnki.issn1000-6923.20230928.002.
[6] Yan Wenxin. Application of machine learning in stock forecasting. Information Systems Engineering, 2024, (04): 40-43.
[7] Liu Yizhe. Research on dynamic remote sensing monitoring of drought in northern Tibet [D]. Chengdu University of Information Science & Technology, 2020. DOI: 10.27716/d.cnki.gcdxx.2020.000108.
[8] Song Feiyu, Lu Xiaochun, Li Ke. Research on Location Model of Shanghai Express Outlets Based on Big Data and Machine Learning. Conference Proceedings of the 8th International Symposium on Project Management, China (ISPM2020). Ed. School of Economics and Management, Beijing Jiaotong University; Beijing Logistics Informatics Research Base; Land Space Technology Corporation Ltd; 2020, 1430-1436.
[9] Long Xu, et al. Machine learning method to predict dynamic compressive response of concrete-like material at high strain rates. Defence Technology, 2023, 23: 100-111.
[10] Gao Yuan, Huang Li. Stock Price Index Prediction Based on Deep Learning. Software Engineering, 2024, 27(5): 7-13. DOI: 10.19644/j.cnki.issn2096-1472.2024.005.002.
[11] Thakkar A, Chaudhari K. Applicability of genetic algorithms for stock market prediction: A systematic survey of the last decade. Computer Science Review, 2024, 53: 100652.
[12] Das D J, Thulasiram K R, Henry C, et al. Encoder–Decoder Based LSTM and GRU Architectures for Stocks and Cryptocurrency Prediction. Journal of Risk and Financial Management, 2024, 17(5): 200.
