Improving Deep Reinforcement Learning Agent Trading
Today, due to economic fluctuations, many people have become more willing to invest in the forex market in order to earn more money. To be successful in this market, a trader usually has to spend most of the day sitting behind the system searching for a strategy with high profits. This raises two challenges: first, is it even possible to find a strategy with suitable profits, and second, is it really necessary to spend the whole day behind the system just to trade? Researchers have shown that Deep Reinforcement Learning (DRL) can extract meaningful features from data and make decisions based on them [1]. Also, it has been shown that ...

Lele et al. used a Reinforcement Learning (RL) agent to learn stock market trends and suggest trading decisions that maximize profits. They used the Trust Region Policy Optimization (TRPO) [11], PPO, and Vanilla Policy Gradient (VPG) algorithms to increase the profit from trading the shares of three different companies: Microsoft, Apple, and Nike. They created a trading agent that performs trading actions on the environment and is re-trained according to the received rewards. The results showed that the TRPO algorithm performed better than the other two algorithms in terms of profit [12].
(Corresponding author: Dr. Parviz Rashidi-Khazaee.)
Sahar Arabha is a Master's student at the Information Technology and Computer Engineering Department, Urmia University of Technology, Urmia, Iran (e-mail: [email protected]).
Davoud Sarani is a Master's student at the Information Technology and Computer Engineering Department, Urmia University of Technology, Urmia, Iran (e-mail: [email protected]).
Dr. Parviz Rashidi-Khazaee is an Assistant Professor at the Information Technology and Computer Engineering Department, Urmia University of Technology, 4km Band Road, Urmia, Iran (e-mail: [email protected], [email protected]).
Lin et al. used an end-to-end framework to improve transaction execution quality and reduce per-transaction costs by optimizing the PPO algorithm with two separate neural network architectures, 1) LSTM and 2) Fully Connected Networks (FCNs), in order to perform more efficient transactions. They used a distributed reward signal for their reward function, and their results showed that the Implementation Shortfall (the difference between the decision price and the final price of a transaction), the standard deviation of the profit-loss ratio, and the overall performance of the proposed model outperform the Almgren and Chriss model, Volume-Weighted Average Price, Time-Weighted Average Price, and DQN [13].

Tsai et al. addressed the long-term problem of unstable trends in DL predictions and used an RL method to make transactions. They encoded their data with the Gramian Angular Field, designed their trading models with DQN and PPO, and used them for trading three different currency pairs (AUD/USD, GBP/USD, EUR/USD). The comparison showed that the PPO algorithm was superior for optimizing forex trading strategies [7].

DRL methods try to find and apply the optimal trading strategy by using the reward received from the environment. Therefore, most of the research done on the forex trading market focuses on how to shape the reward function. The reward function plays a pivotal role in increasing the agent's ability to identify and execute profitable transactions and thus improve overall performance. Konstantinos used an RL agent along with an auxiliary task to learn investment strategies. His agent learns from historical data and suggests buying and selling strategies to earn profit from changes in the price of a digital currency. The main model uses LSTM networks for feature extraction and fully connected layers for the output function and an auxiliary regression output. The Sharpe Ratio and Drawdown results indicate that using the auxiliary function gives much better results than the baseline and the Buy-and-Hold method [14].

Tsantekidis et al. addressed the problem of noisy, sparse rewards and proposed an additional reward to increase the return and Sharpe Ratio in forex trading. By adding a trailing reward to the main Profit and Loss (PnL) reward and using a relatively more complex network structure consisting of LSTM and FC layers, they helped increase the efficiency and the Sharpe Ratio of transactions. In addition, they used the Actor-Critic (AC) method to estimate the value of each state and of the action performed in that state, and they used PPO and Double DQN to train the RL agent, designing a separate neural network for each deep reinforcement learning algorithm. Their results indicate that adding the additional reward to the main reward increased the Sharpe Ratio and the return [3].

Lele et al. used TRPO, PPO, and Vanilla Policy Gradient to increase profits on the shares of three different companies [12]. Their reward function is not optimized for a risk measure. They also suggested using a supervised learning model that takes sentiment analysis as input to predict stock trends in order to obtain more informed and efficient results [12]. Lin et al. used a distributed reward function and optimized the PPO algorithm for better execution quality in the stock market, which produced favorable results [13].

In this research, the PPO algorithm has been used to find the optimal policy in the forex market so that the agent can adopt the best policy for making profitable trading decisions. The problem that has remained open so far is that the reward received with this approach is noisy due to the dynamic environment of the forex market; in other words, the learning process is slow and unstable and cannot deliver the intended result [3]. Therefore, the main goal of this study is to build a new PPO-based model by reshaping the reward in such a way that it achieves fast and stable results. The agent learns from various signals in the environment using pre-processed data based on unsupervised learning principles. This approach enables the agent to continuously increase its understanding without relying on direct rewards and thereby expand the scope of its learning capabilities. To do this, an auxiliary task was combined with PPO. The purpose of the auxiliary task is to recognize and classify patterns without human intervention, accelerate training, strengthen the agent, improve its overall performance, reduce the risk of transactions, and ultimately produce a profitable model.

II. BACKGROUND

In this section, DL, RL, and DRL algorithms are first presented briefly, and then the structure of the proposed model is described.

A. Deep Learning

DL [1], [2] is a subset of machine learning (ML), and research has shown that DL algorithms can outperform classical ML algorithms. Several well-known DL algorithms exist in this field, such as RNN, LSTM, CNN, and the Generative Adversarial Network. These algorithms benefit from simulating the function of neurons in the human brain: a single neuron alone cannot perform complex tasks, so the brain stacks billions of neurons in layers that form a network [1]. Based on that fact, the structure of DL algorithms contains many layers and neurons.

B. Reinforcement Learning

RL stands as a distinct domain within the field of ML. Diverging from conventional ML approaches like supervised and unsupervised learning, RL entails agents learning by actively engaging with their environment, navigating through trial and error. RL has evolved rapidly over the past few years, with a wide range of applications from building recommendation systems to self-driving cars. With the advent of new algorithms and libraries, RL stands out as one of the most promising domains within ML. Some key elements of RL are [1]:

Agent: An agent is a learning entity that makes intelligent decisions. For example, a chess player can be considered an agent because the chess player learns to make the best moves (decisions) to win the game.

Environment: The environment is the world in which the agent performs actions. For example, a chessboard is an environment. Complex and high-dimensional environments can be challenging because the learning process may be slow or unstable. A minimal sketch of the agent-environment interaction loop is given below.
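To make the interaction between these two elements concrete, the following minimal sketch runs the standard RL loop on a toy random-walk price environment. It is purely illustrative: the TinyFxEnv class, its reward definition, and the random action policy are simplifications invented for this example and are not the environment or agent used in this paper.

```python
import numpy as np

class TinyFxEnv:
    """Hypothetical toy environment: the 'price' follows a random walk and the
    agent chooses an action in {-1: short, 0: flat, +1: long} at every step."""

    def __init__(self, n_steps=100, seed=0):
        self.n_steps = n_steps
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.price = 1.0
        return np.array([self.price])           # initial observation (state)

    def step(self, action):
        price_change = self.rng.normal(0.0, 0.001)
        self.price += price_change
        self.t += 1
        reward = action * price_change          # profit/loss of the chosen position
        done = self.t >= self.n_steps
        return np.array([self.price]), reward, done

# A "random agent": it only illustrates the interaction loop, it learns nothing.
env = TinyFxEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = np.random.choice([-1, 0, 1])       # the agent's decision
    obs, reward, done = env.step(action)        # the environment's response
    total_reward += reward                      # the feedback an RL agent learns from
print(f"episode return: {total_reward:.5f}")
```

An actual DRL agent replaces the random choice with a learned policy that is updated from the collected rewards.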
These artificial incentives assist in guiding the reinforcement learner towards favorable policies or steering it away from unfavorable ones [18]. For example, the agent can be rewarded for detecting a change in market trends or for identifying a particular pattern in the data. The reward shaping used in this research is based on a new shaping algorithm inspired by DeepMind's article on auxiliary tasks [19].

F. Auxiliary Tasks

Auxiliary tasks are additional cost functions that give the RL agent extra leverage, so that the agent itself can learn, predict, and observe data from the environment without human intervention. In other words, while DL algorithms have made considerable advancements across numerous fields, they demand expensive annotations on extensive datasets. Self-supervised learning (SSL) utilizing unlabeled data has emerged as a viable alternative by eliminating the need for manual annotation. SSL achieves this by crafting feature representations through auxiliary tasks that function without manual annotation. Consequently, models trained on these tasks can extract valuable latent representations, subsequently enhancing downstream tasks like object classification and detection [20]. This means that even in the absence of a strong reward signal, the system can measure and define its loss (error or discrepancy) using alternative or indirect indicators derived from unlabeled data. Unsupervised auxiliary tasks act as additional learning objectives alongside the main RL task. These auxiliary tasks are designed to help the agent learn useful features or representations of the data that can improve the agent's performance on the main task. The main idea of combining RL with unsupervised auxiliary tasks is to exploit the data structure to accelerate and stabilize the learning process [21]. By forcing the agent to learn related features or patterns through auxiliary tasks, it can gain a better understanding of the environment and make more informed decisions in the main task. This approach can lead to faster convergence, improved efficiency, and better generalization when facing new situations. In general, RL with unsupervised auxiliary tasks is a promising way to advance the capabilities of AI agents and enable them to tackle complex real-world problems using reinforcement learning and unsupervised learning techniques.

III. PROPOSED MODEL STRUCTURE

In this research, to correctly identify trends and make profitable trades in the forex market, the performance of the reinforcement learning agent has been improved by using a separate auxiliary task. The main idea is to develop an agent that can learn from different signals in the forex market by recognizing patterns. This idea is derived from the logic of unsupervised learning, where the agent continues to develop its understanding of the environment even without direct reward. This approach expands the agent's learning scope, focuses on predicting and controlling various characteristics of the market environment, and enables the agent to flexibly control and understand its experiences by understanding the environment more deeply.

Our contribution in this research is the use of an auxiliary task to train the RL agent better and more optimally, in such a way that it learns better and makes more profitable trades. The auxiliary task helps the agent better understand the structure of the market environment and learn optimal trading policies. For example, the agent learns that a certain set of situations leads to a suitable trading position and therefore tries to learn them. The proposed auxiliary task examines the input data and, after extracting the golden features, clusters them and creates a suitable label for each input instance. This mechanism helps transform the problem from unsupervised learning to supervised learning using forex time series data (features), and the DRL agent then uses the generated labels to recognize trends and make an optimal deal. To build the proposed model, the following steps are taken:

1) Pre-processing stage: First, the time series of forex data is pre-processed and 5 new features are extracted.

2) Labeling stage: The 5 features extracted over the 16 previous time steps, a total of 80 features, are given to an Auto Encoder (AE) to extract 12 golden features from its latent space. The golden features are then given to the K-Means algorithm, which assigns each input instance to one of 12 separate clusters (a sketch of this stage is given after the feature equations below).

3) Learning the optimal trading policy: The input data, along with the created label, are given to the Actor-Critic network so that the appropriate trading action is performed by the Actor network. The action performed by the Actor is checked by the Critic and the Auxiliary Task and, based on their feedback, the network is modified to learn a more optimal policy.

4) Testing stage: The performance of the trained network is evaluated on new data, the last seven months of the dataset, and the results are presented.

A. Pre-Processing

The dataset has OHLC features, which include the hourly opening price, highest price, lowest price, and closing price of currency pairs. To improve the performance of the model and solve the time lag problem, as in [3], these features are processed and five new features are extracted as follows:

$x_{1t} = \frac{p_c(t) - p_c(t-1)}{p_c(t-1)}$   (5)

$x_{2t} = \frac{p_h(t) - p_h(t-1)}{p_h(t-1)}$   (6)

$x_{3t} = \frac{p_l(t) - p_l(t-1)}{p_l(t-1)}$   (7)

$x_{4t} = \frac{p_h(t) - p_c(t-1)}{p_c(t-1)}$   (8)

$x_{5t} = \frac{p_c(t) - p_l(t-1)}{p_c(t-1)}$   (9)
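As a concrete illustration of equations (5)-(9) and of the 16-step windowing described in the labeling stage, the sketch below computes the five relative-change features from an hourly OHLC table with pandas and stacks them into 80-dimensional input vectors. The column names and the synthetic data are assumptions made for the example; they are not taken from the paper's implementation.

```python
import numpy as np
import pandas as pd

def extract_features(ohlc: pd.DataFrame) -> pd.DataFrame:
    """Five relative-change features of equations (5)-(9).
    Expects hourly columns 'open', 'high', 'low', 'close' (names assumed)."""
    c, h, l = ohlc["close"], ohlc["high"], ohlc["low"]
    feats = pd.DataFrame({
        "x1": (c - c.shift(1)) / c.shift(1),   # close-to-close change, eq. (5)
        "x2": (h - h.shift(1)) / h.shift(1),   # high-to-high change, eq. (6)
        "x3": (l - l.shift(1)) / l.shift(1),   # low-to-low change, eq. (7)
        "x4": (h - c.shift(1)) / c.shift(1),   # high vs. previous close, eq. (8)
        "x5": (c - l.shift(1)) / c.shift(1),   # close vs. previous low, eq. (9)
    })
    return feats.dropna()

def make_windows(feats: pd.DataFrame, window: int = 16) -> np.ndarray:
    """Stack the last `window` feature rows into one flat vector per step
    (16 steps x 5 features = 80 inputs, as in the labeling stage)."""
    values = feats.to_numpy()
    return np.stack([values[i - window:i].ravel()
                     for i in range(window, len(values) + 1)])

# Example with synthetic hourly data (replace with a real OHLC file).
rng = np.random.default_rng(0)
close = 1.10 + rng.normal(0, 0.001, 500).cumsum()
df = pd.DataFrame({"open": close, "high": close + 0.0005,
                   "low": close - 0.0005, "close": close})
windows = make_windows(extract_features(df))
print(windows.shape)   # (484, 80) for this example
```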
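Continuing the pipeline, the labeling stage (step 2 above) and the auxiliary classification target it provides to the Actor-Critic (step 3) could be sketched roughly as follows. This is a hedged illustration, not the authors' code: apart from the 80-input/12-latent autoencoder bottleneck and the 12 K-Means clusters stated in the text, every layer size, the number of actions, the training schedule, and the aux_weight coefficient are assumptions.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# --- Step 2: autoencoder + K-Means labeling (architecture assumed) ----------
class AutoEncoder(nn.Module):
    def __init__(self, n_in=80, n_latent=12):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 32), nn.ReLU(),
                                     nn.Linear(32, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 32), nn.ReLU(),
                                     nn.Linear(32, n_in))

    def forward(self, x):
        z = self.encoder(x)                 # the 12 "golden" features
        return self.decoder(z), z

def make_labels(windows, n_clusters=12, epochs=50):
    x = torch.as_tensor(windows, dtype=torch.float32)
    ae = AutoEncoder()
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
    for _ in range(epochs):                 # plain reconstruction training
        recon, _ = ae(x)
        loss = nn.functional.mse_loss(recon, x)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        _, z = ae(x)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z.numpy())

# labels = make_labels(windows)   # `windows` from the previous sketch

# --- Step 3: auxiliary classification head next to the actor-critic ---------
class ActorCriticWithAux(nn.Module):
    def __init__(self, n_in=80, n_actions=3, n_clusters=12):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_in, 64), nn.ReLU())
        self.actor = nn.Linear(64, n_actions)    # trading action logits (assumed 3 actions)
        self.critic = nn.Linear(64, 1)           # state value
        self.aux = nn.Linear(64, n_clusters)     # predicts the K-Means cluster label

    def forward(self, x):
        h = self.body(x)
        return self.actor(h), self.critic(h), self.aux(h)

# During a PPO update the auxiliary cross-entropy is simply added to the usual
# objective (aux_weight is an assumed hyper-parameter):
# total_loss = ppo_policy_loss + value_loss + aux_weight * ce(aux_logits, labels)
```

Adding the auxiliary cross-entropy to the clipped PPO objective is one common way to attach an unsupervised auxiliary task to an actor-critic learner; the exact weighting and schedule used by the authors are not specified in this excerpt.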
The models are compared using the Sharpe Ratio and the Performance Improvement Percentage (PPI):

$\text{Sharpe Ratio} = \frac{\text{mean of returns}}{\text{std of returns}}$   (12)

Here, std denotes the standard deviation of the returns.

$PPI = \frac{value_{new} - value_{original}}{\lvert value_{original} \rvert} \times 100$   (13)

IV. RESULTS

In this section, after introducing the datasets used, the mechanism for setting the environment parameters, and the tuning of the meta-parameters, the results of the proposed model are presented and compared with the basic PPO model.

Fig. 3. PPO model backtesting performance on DS1.

TABLE I
Backtesting Average Performance on DS1

Criterion        PPO       PPO+AXT    PPI
Overall Return   -25.2%    14.86%     158.8%
Sharpe Ratio     -2.618    0.249      109.2%
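As a quick worked check of equations (12) and (13) against Table I, the snippet below recomputes the performance improvement percentages from the tabulated PPO and PPO+AXT values; the small differences from the reported 158.8% and 109.2% come only from rounding in the table.

```python
import numpy as np

def sharpe_ratio(returns):
    """Equation (12): mean of the returns divided by their standard deviation."""
    returns = np.asarray(returns, dtype=float)
    return returns.mean() / returns.std()

def ppi(value_original, value_new):
    """Equation (13): performance improvement percentage."""
    return (value_new - value_original) / abs(value_original) * 100

print(round(sharpe_ratio([0.01, -0.005, 0.02]), 2))  # ~0.81 on a toy return series
# Table I (DS1): Overall Return, PPO = -25.2%, PPO+AXT = 14.86%
print(round(ppi(-25.2, 14.86), 1))   # ~159.0, reported as 158.8% in Table I
# Table I (DS1): Sharpe Ratio, PPO = -2.618, PPO+AXT = 0.249
print(round(ppi(-2.618, 0.249), 1))  # ~109.5, reported as 109.2% in Table I
```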
TABLE II
Backtesting Average Performance on DS2

TABLE III
Performance of the PPO + Additional Reward Model [3]

By calculating the performance improvement percentage for the model of [3] and for the proposed model of this study on both DS1 and DS2, it can be concluded that the proposed model significantly improves both the return and the Sharpe Ratio criteria. The performance improvement percentage in the Overall Return criterion is 158.85% for DS1, 1891.51% for DS2, and 87.88% for the model of [3], which shows a significant improvement in the performance of the model proposed in this research. In terms of the Sharpe Ratio, the performance improvement percentages are 109.20%, 116.04%, and 102.67%, respectively, which shows that the trading risk has been reduced in the proposed model and the trader can trust it. The presented results demonstrate the strength of the proposed PPO+AXT model.
VI. CONCLUSIONS

To improve the performance of the deep reinforcement learning agent in the forex market, increase the trading profit, and reduce the investment risk, a new model based on the PPO algorithm is proposed. The algorithm uses the Actor-Critic structure to monitor the trading environment and make optimal trades. To better monitor the trading environment, correctly evaluate the inputs, and find suitable trading positions, a new function has been used as an auxiliary task in the proposed model. The new auxiliary task uses the AE and K-Means algorithms to cluster the input data and treats the cluster as the label of the input data. The Actor network adapts to the environmental conditions by predicting the input data label and comparing it with the output of the auxiliary task; in this way, by increasing its knowledge, it improves its trading strategy and its accuracy. The performance of the proposed model with the auxiliary task, compared to the basic model without the auxiliary task and to the model of paper [3], shows a significant improvement, which demonstrates the efficiency of the proposed method. The results of applying the proposed model to two separate datasets confirm the main idea of this research, namely shaping rewards using auxiliary tasks. Therefore, labeling the input data with the unsupervised method makes it possible for the trading agent to gain more knowledge of the environment and, with increased awareness, find a more optimal strategy for trading in the complex and dynamic forex market. As a result, the proposed agent model could be used for trading in this market.

REFERENCES

[1] O. Gym and N. Sanghi, Deep Reinforcement Learning with Python. Springer, 2021.
[2] Y. Matsuo et al., "Deep learning, reinforcement learning, and world models," Neural Networks, vol. 152, pp. 267-275, 2022.
[3] A. Tsantekidis, N. Passalis, A.-S. Toufa, K. Saitas-Zarkias, S. Chairistanidis, and A. Tefas, "Price trailing for financial trading using deep reinforcement learning," IEEE Transactions on Neural Networks and Learning Systems, vol. PP, pp. 1-10, 2020, doi: 10.1109/TNNLS.2020.2997523.
[4] M. A. I. Sunny, M. M. S. Maswood, and A. G. Alharbi, "Deep learning-based stock price prediction using LSTM and bi-directional LSTM model," in 2020 2nd Novel Intelligent and Leading Emerging Sciences Conference (NILES), 2020: IEEE, pp. 87-92.
[5] F. Liu, Y. Li, B. Li, J. Li, and H. Xie, "Bitcoin transaction strategy construction based on deep reinforcement learning," Applied Soft Computing, vol. 113, p. 107952, 2021, doi: 10.1016/j.asoc.2021.107952.
[6] A. Shavandi and M. Khedmati, "A multi-agent deep reinforcement learning framework for algorithmic trading in financial markets," Expert Systems with Applications, vol. 208, p. 118124, 2022.
[7] Y.-C. Tsai, C.-C. Wang, F.-M. Szu, and K.-J. Wang, "Deep reinforcement learning for foreign exchange trading," in Trends in Artificial Intelligence Theory and Applications. Artificial Intelligence Practices: 33rd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2020, Kitakyushu, Japan, September 22-25, 2020, Proceedings 33, 2020: Springer, pp. 387-392.
[8] Q. Kang, H. Zhou, and Y. Kang, "An asynchronous advantage actor-critic reinforcement learning method for stock selection and portfolio management," in Proceedings of the International Conference on Big Data Research, Oct. 2018, doi: 10.1145/3291801.3291831.
[9] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," US Patent, vol. 15, no. 217,758, 2020.
[10] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[11] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015: PMLR, pp. 1889-1897.
[12] S. Lele, K. Gangar, H. Daftary, and D. Dharkar, "Stock market trading agent using on-policy reinforcement learning algorithms," Available at SSRN 3582014, 2020.
[13] S. Lin and P. A. Beling, "An end-to-end optimal trade execution framework based on proximal policy optimization," in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2021, pp. 4548-4554.
[14] C. Konstantinos, "Reinforcement learning with auxiliary tasks in trading," M.Sc. thesis, Aristotle University of Thessaloniki, Thessaloniki, Greece, Sept. 2022.
[15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in International Conference on Machine Learning, 2018: PMLR.