
MANUSCRIPT ID NUMBER

Improving Deep Reinforcement Learning Agent Trading Performance in Forex using Auxiliary Task
Sahar Arabha, Davoud Sarani, and Parviz Rashidi-Khazaee

Sahar Arabha is a Master's program student at the Information Technology and Computer Engineering Department, Urmia University of Technology, Urmia, Iran (e-mail: [email protected]).
Davoud Sarani is a Master's program student at the Information Technology and Computer Engineering Department, Urmia University of Technology, Urmia, Iran (e-mail: [email protected]).
Dr. Parviz Rashidi-Khazaee is an Assistant Professor at the Information Technology and Computer Engineering Department, Urmia University of Technology, 4km Band Road, Urmia, Iran (e-mail: [email protected], [email protected]). (Corresponding author: Dr. Parviz Rashidi-Khazaee.)

Abstract—Advanced algorithms based on Deep Reinforcement Learning (DRL) have become a reliable tool for Forex market traders, providing strategies that maximize profit and reduce trading risk. These tools search for the most profitable strategy by examining past market data. Artificial intelligent agents based on the Proximal Policy Optimization (PPO) algorithm, one of the DRL algorithms, have shown a particular ability to determine a profitable strategy. In this research, an auxiliary task is used to increase profitability and to determine the optimal strategy for PPO. The auxiliary task helps PPO model the reward function better by recognizing and classifying patterns, which yields additional information from the problem's inputs and data. Simulation and backtesting on the EUR/USD currency pair show that the proposed model increases the overall return from -25.25% to 14.86% and the Sharpe ratio from -2.61 to 0.24 on dataset 1, and increases the overall return from 2.12% to 42.22% and the Sharpe ratio from -2.93 to 0.47 on dataset 2. These results indicate better identification of trends, more timely trading, and a substantial reduction in risk together with an increase in profit. The proposed model can be used as an effective tool for automatic trading in the complex forex market.

Index Terms— Forex Trading, Actor-Critic, Deep Reinforcement Learning, PPO algorithm, Auxiliary Task, Pairs Currency Trading

I. INTRODUCTION
Today, due to economic fluctuations, more people are willing to invest in the forex market to earn money. Success in this market has traditionally required spending most of one's time in front of a trading terminal in order to find a suitable, highly profitable strategy. This raises two questions. First, is it possible to find a strategy with suitable profits at all? Second, is it necessary to spend the whole day in front of the terminal just to trade? Researchers have shown that Deep Reinforcement Learning (DRL) extracts meaningful features from data and makes decisions based on them [1]. It has also been shown that artificial intelligent agents based on DRL have helped to solve this problem [2].
Deep Learning (DL) algorithms such as long short-term memory (LSTM) networks have been used to predict and trade in the forex market. Tsantekidis et al. used LSTM networks to design a model for executing optimal trades in the forex market [3]. To improve the performance of LSTM and Bidirectional LSTM (Bi-LSTM) networks, their various parameters must be tuned properly, and the accuracy of their predictions depends on the optimization mechanism used to determine those parameters [4]. Liu et al. designed an automated Bitcoin trading strategy based on Proximal Policy Optimization (PPO), using LSTM as the basis for policy generation. Their experimental results showed that the LSTM improved the performance of PPO [5].
DRL methods have provided more possibilities and allowed better exploitation of time series data [6], [7]. In recent years, DRL algorithms such as Asynchronous Advantage Actor-Critic (A3C) [8], Deep Q-Network (DQN) [9], and PPO [10] have performed well in various fields and have made it possible to execute optimal, profitable, and low-risk transactions in trading markets [6]. If the DRL agent is properly trained on time series data, trends can be predicted correctly, and more profitable transactions can be executed in line with market trends. These algorithms have achieved significant results in trading markets [3].
Lele et al. used a Reinforcement Learning (RL) agent to learn stock market trends and suggest trading decisions that maximize profits. They used Trust Region Policy Optimization (TRPO) [11], PPO, and Vanilla Policy Gradient (VPG) algorithms to increase the profit from trading shares of three companies: Microsoft, Apple, and Nike. Their trading agent performs trading actions on the environment and is retrained according to the received rewards. The results showed that TRPO performed better than the other two algorithms in terms of profit [12].
Lin et al. used an end-to-end framework to improve transaction execution quality and reduce per-transaction costs by optimizing the PPO algorithm with two separate neural network architectures: 1) LSTM and 2) Fully Connected Networks (FCNs). They used a distributed reward signal for their reward function, and their results showed that the proposed model outperforms the Almgren and Chriss model, Volume-Weighted Average Price, Time-Weighted Average Price, and DQN in terms of Implementation Shortfall (the difference between the decision price and the final price of a transaction), standard deviation, and profit-loss ratio [13].
Tsai et al. addressed the long-term problem of unstable trends in DL predictions and used an RL method to make transactions. They encoded their data with the Gramian Angular Field, designed their trading model with DQN and PPO, and applied it to three currency pairs (AUD/USD, GBP/USD, EUR/USD). The comparison showed that PPO was superior for optimizing forex trading strategies [7].
DRL methods try to find and apply the optimal trading strategy by using the reward received from the environment. Therefore, most research on forex trading focuses on how to shape the reward function. The reward function plays a pivotal role in the agent's ability to identify and execute profitable transactions, and thus in overall performance. Konstantinos used an RL agent along with an auxiliary task to learn investment strategies. The agent learns from historical data and suggests buy and sell strategies to profit from changes in the price of a digital currency. The main model uses LSTM networks for feature extraction and fully connected layers for the output function and the auxiliary regression output. The Sharpe Ratio and Drawdown results indicate that the auxiliary function gives much better results than the baseline and the Buy and Hold method [14].
Tsantekidis et al. addressed the problem of noisy, sparse rewards and proposed an additional reward to increase the return and Sharpe Ratio in forex trading. By adding a trailing reward to the main Profit and Loss (PnL) reward and using a relatively more complex network structure consisting of LSTM and FC layers, they increased the efficiency and the Sharpe Ratio of transactions. In addition, they used the Actor-Critic (AC) method to estimate the value of each state and of the action performed in it, and they trained the RL agent with PPO and Double DQN, designing a separate neural network for each DRL algorithm. Their results indicate that adding an additional reward to the main reward increased the Sharpe ratio and return [3].
Lele et al. used TRPO, PPO, and VPG to increase profits on the shares of three companies [12]. Their reward function is not optimized for a risk measure. They also suggested using a supervised learning model that takes sentiment analysis as input to predict stock trends and obtain more informed and efficient results [12]. Lin et al. used a distributed reward function and optimized the PPO algorithm for better execution quality in the stock market, which brought favorable results [13].
In this research, the PPO algorithm is used to find the optimal policy in the forex market so that the agent can adopt the best policy for making profitable trading decisions. The open problem is that the reward received in this setting is noisy because of the dynamic environment of the forex market. As a result, the learning process is slow and unstable and does not produce the intended result [3]. Therefore, the main goal of this study is to build a new PPO-based model by changing the way the reward is shaped so that training yields fast and stable results. The agent learns from various signals in the environment using pre-processed data, based on unsupervised learning principles. This approach enables the agent to keep improving its understanding without relying on direct rewards and thereby expands the scope of its learning capabilities. To do this, an auxiliary task is combined with PPO. The purpose of the auxiliary task is to recognize and classify patterns without human intervention, accelerate training, strengthen the agent, improve its overall performance, reduce the risk of transactions, and ultimately produce a profitable model.

II. BACKGROUNDS
This section first briefly presents DL, RL, and DRL algorithms; the structure of the proposed model is presented afterwards.

A. Deep Learning
DL [1], [2] is a subset of machine learning (ML), and research has shown that DL algorithms can outperform classical ML algorithms. Well-known DL algorithms include RNN, LSTM, CNN, and Generative Adversarial Networks. These algorithms are inspired by the function of neurons in the human brain. A single neuron cannot perform complex tasks on its own; the human brain has billions of neurons that are stacked in layers and form a network [1]. Accordingly, the structure of DL algorithms contains many layers and neurons.

B. Reinforcement learning
RL is a distinct domain within ML. Unlike supervised and unsupervised learning, RL agents learn by actively interacting with their environment through trial and error. RL has evolved rapidly over the past few years, with applications ranging from recommendation systems to self-driving cars. With the advent of new algorithms and libraries, RL stands out as one of the most promising domains within ML. The key elements of RL are as follows [1]:
Agent: An agent is a learner that makes intelligent decisions. For example, a chess player can be considered an agent because the player learns to make the best moves (decisions) to win the game.
Environment: The environment is the world in which the agent performs actions. For example, a chessboard is an environment. Complex and high-dimensional environments can be challenging because the learning process may be slow or unstable.

State and Action: A state is a situation or moment in the environment that the agent can be in. For example, in chess, each position on the chessboard is a state. The agent interacts with the environment and moves from one state to another by performing an action. In chess, the action is the move that the player (agent) makes.
Rewards: The reward is a numerical value received for performing an action, for example +1 for a good action and -1 for a bad action.

C. Deep reinforcement learning algorithms
DRL integrates deep neural networks with an RL framework to better handle high-dimensional input spaces such as images or raw data. These networks are used to approximate the policy (the agent's strategy) or the value function (the cumulative expected reward) for each state-action pair [1]. The goal of RL is to find the optimal policy, that is, the policy that provides the maximum return. There are two families of methods for finding the optimal policy: 1) value-based methods and 2) policy-based methods. In this paper, policy-based methods are used. Most policy-based methods use a stochastic policy, which selects actions according to a probability distribution over the action space; this allows the agent to explore different actions instead of always performing the same one. Several policy-based algorithms are available [15]; in this paper, the PPO algorithm is used.
Before discussing PPO, it is necessary to introduce one of the popular methods for finding the optimal policy, Actor-Critic (AC) [16].

D. Actor-Critic
AC is one of the most popular DRL methods, and PPO is built on it. The AC method combines value-based and policy-based methods [16]. The Actor-Critic framework comprises two distinct networks: 1) the Actor network and 2) the Critic network. The Actor network is tasked with determining the most effective policy, whereas the Critic network evaluates the policy generated by the Actor network. The actor improves its policy using the advantage function, which indicates whether an action is really better than the alternatives or has roughly the same value as all other actions. The critic evaluates the actor's action at each step through the mean squared error (MSE) function.

E. Proximal Policy Optimization
PPO is an improvement on TRPO that is simple to implement. There are two variants of PPO: 1) PPO-clipped and 2) PPO-penalty. In this article, PPO-clipped is used. To ensure that policy updates stay within the trust region (the new policy is not far from the old policy), PPO uses a clip function. The ratio of the new policy to the old policy appears in the objective that is maximized:

$$\underset{\theta}{\text{maximize}}\ \ \mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, A_t\right] \tag{1}$$

Written as an objective function, this is:

$$L(\theta) = \mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, A_t\right] \tag{2}$$

Here, the expectation $\mathbb{E}_t[\ldots]$ denotes the average over a finite batch of samples, $\pi$ denotes the policy, and $\theta$ represents the parameters that must be learned to improve the strategy. The old policy is parameterized by $\theta_{old}$ as $\pi_{\theta_{old}}$ and the new policy by $\theta$ as $\pi_\theta$. Denoting the ratio of the new policy to the old policy by $r_t(\theta)$, the following formula is obtained:

$$L(\theta) = \mathbb{E}_t\left[r_t(\theta)\, A_t\right] \tag{3}$$

PPO maximizes this objective subject to a limit on how far the policy may move. The objective function is as follows:

$$L(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, A_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right] \tag{4}$$

Here $r_t(\theta)$ is the ratio of the new policy to the old one and $A_t$ is the advantage function; the clip operator restricts $r_t(\theta)$ to the range $[1-\epsilon, 1+\epsilon]$. The reason for clipping can be explained by considering two cases: when the advantage is positive and when it is negative.

1) When the advantage is positive:
When $A_t > 0$, the associated action is better than the average of the alternative actions, so $r_t(\theta)$ can be increased for that action to give it a higher chance of being selected. However, $r_t(\theta)$ should not be increased so much that the new policy deviates far from the old one, so it is clipped at $1+\epsilon$.

2) When the advantage is negative:
When $A_t < 0$, the associated action is worse than the average of the alternative actions, so $r_t(\theta)$ can be reduced for that action to give it a lower chance of being selected. However, $r_t(\theta)$ should not be reduced so much that the new policy moves far from the old one, so it is clipped at $1-\epsilon$.

The open problem is that the reward received by the PPO method is noisy because of the dynamic environment of the forex market. As a result, the learning process is slow and unstable and does not produce the intended result. The way the reward is shaped was therefore changed so that training yields fast and stable results.
Reward shaping is a technique in DRL that modifies the reward function to guide agent learning toward a desired behavior [17]. A reward shaping model gives the agent additional information about the environment in the form of reward. Since a higher received reward corresponds to a higher profit for the proposed model, the learning agent achieves better results by directly maximizing the cumulative reward. To enhance learning efficiency, researchers have investigated the use of shaping rewards. Essentially, shaping augments the reward signal to communicate existing knowledge to a standard RL system.
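As a hedged illustration of the clipped objective in (4), the following minimal sketch evaluates it for a small batch in plain Python. The function name `clipped_surrogate`, the use of log-probabilities to form the ratio, and the sample numbers are assumptions made for illustration; they are not taken from the authors' implementation. The default ε = 0.2 is the value the paper reports for its experiments.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Batch average of min(r*A, clip(r, 1-eps, 1+eps)*A), following (4).

    logp_new / logp_old are log pi(a_t | s_t) under the new and old policy
    (an assumed input format; any way of obtaining r_t would do).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                 # r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)   # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Positive advantage with r_t = 1.5: the term is capped at (1 + eps) * A_t.
positive_case = clipped_surrogate([math.log(1.5)], [0.0], [2.0])
# Negative advantage with r_t = 0.5: the term is held at (1 - eps) * A_t.
negative_case = clipped_surrogate([math.log(0.5)], [0.0], [-2.0])
print(positive_case, negative_case)
```

In both cases the clipped term is the one selected by the `min`, so pushing the ratio further away from 1 brings no additional gain in the objective. This is the mechanism described in cases 1) and 2) of the PPO subsection.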

Shaping rewards help guide the reinforcement learner toward favorable policies and away from unfavorable ones [18]. For example, the agent can be rewarded for detecting a change in market trends or for identifying a particular pattern in the data. The reward shaping used in this research is based on a shaping algorithm from DeepMind's article on auxiliary tasks [19].

F. Auxiliary Tasks
Auxiliary tasks are additional cost functions that give the RL agent extra leverage, so that the agent itself can learn from, predict, and observe data from the environment without human intervention. DL algorithms have made considerable advances across numerous fields, but they demand expensive annotations on extensive datasets. Self-supervised learning (SSL) on unlabeled data has emerged as a viable alternative that eliminates the need for manual annotation. SSL achieves this by learning feature representations through auxiliary tasks that require no manual annotation. Consequently, models trained on these tasks can extract valuable latent representations, which subsequently enhance downstream tasks such as object classification and detection [20]. This means that even in the absence of a strong reward signal, a loss (error or discrepancy) can be defined and measured using alternative or indirect indicators derived from unlabeled data. Unsupervised auxiliary tasks act as additional learning objectives alongside the main RL task. They are designed to help the agent learn useful features or representations of the data that improve its performance on the main task. The main idea of combining RL with unsupervised auxiliary tasks is to exploit the structure of the data to accelerate and stabilize the learning process [21]. By being forced to learn related features or patterns through auxiliary tasks, the agent can gain a better understanding of the environment and make more informed decisions in the main task. This approach can lead to faster convergence, improved efficiency, and better generalization to new situations. In general, RL with unsupervised auxiliary tasks is a promising way to advance the capabilities of AI agents and enable them to tackle complex real-world problems.

III. PROPOSED MODEL STRUCTURE
In this research, to correctly identify trends and make profitable trades in the forex market, the performance of the reinforcement learning agent is improved by using a separate auxiliary task. The main idea is to develop an agent that can learn from different signals in the forex market by recognizing patterns. This idea is derived from the logic of unsupervised learning, where the agent continues to develop its understanding of the environment even without direct reward. This approach expands the agent's learning scope, focuses on predicting and controlling various characteristics of the market environment, and lets the agent interpret its experiences more flexibly through a better understanding of the environment.
Our contribution in this research is the use of an auxiliary task to train the RL agent more effectively, so that it learns better and makes more profitable trades. The auxiliary task helps the agent better understand the structure of the market environment and learn the optimal trading policies. For example, the agent learns that a certain set of situations leads to a suitable trading position and therefore tries to learn them. The proposed auxiliary task examines the input data and, after extracting the golden features, clusters them and creates a suitable label for each input sample. This mechanism helps transform the problem from unsupervised learning to supervised learning using forex time series data (features); the DRL agent then uses the generated labels to recognize trends and make an optimal trade. To build the proposed model, the following steps are taken:
1) Pre-processing stage
First, the forex time series is pre-processed and 5 new features are extracted.
2) Labeling stage
The 5 features extracted over the 16 previous time steps, a total of 80 features, are given to an Auto-Encoder (AE) to extract 12 golden features from its latent layer. The golden features are then given to the K-Means algorithm, which assigns each input instance to one of 12 clusters.
3) Learning the optimal trading policy
The input data along with the created label are given to the Actor-Critic network so that the Actor network performs the appropriate trading action. The action performed by the actor is checked by the Critic and the Auxiliary Task, and based on their feedback, the network is updated to learn a better policy.
4) Testing stage
The performance of the trained network is evaluated on new data, the last seven months of the dataset, and the results are presented.

A. Pre-Processing
The dataset has OHLC features, which include the hourly opening, highest, lowest, and closing prices of the currency pair. To improve the performance of the model and solve the time lag problem, as in [3], these features are processed and five new features are extracted as follows:

$$x_{1,t} = \frac{p_c(t) - p_c(t-1)}{p_c(t-1)} \tag{5}$$

$$x_{2,t} = \frac{p_h(t) - p_h(t-1)}{p_h(t-1)} \tag{6}$$

$$x_{3,t} = \frac{p_l(t) - p_l(t-1)}{p_l(t-1)} \tag{7}$$

$$x_{4,t} = \frac{p_h(t) - p_c(t-1)}{p_c(t-1)} \tag{8}$$

$$x_{5,t} = \frac{p_c(t) - p_l(t-1)}{p_c(t-1)} \tag{9}$$
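The five features in (5)-(9) and the 16-step input window of the labeling stage can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' code: the function names `extract_features` and `window_vectors` are hypothetical, the formulas are transcribed from (5)-(9) as printed, and the assumption that each window ends at (and includes) the current step is ours, since the text only says "16 previous time steps".

```python
def extract_features(high, low, close):
    """Return one [x1, ..., x5] row per time step t >= 1, following (5)-(9)."""
    rows = []
    for t in range(1, len(close)):
        pc_prev, ph_prev, pl_prev = close[t - 1], high[t - 1], low[t - 1]
        rows.append([
            (close[t] - pc_prev) / pc_prev,   # x1, equation (5)
            (high[t] - ph_prev) / ph_prev,    # x2, equation (6)
            (low[t] - pl_prev) / pl_prev,     # x3, equation (7)
            (high[t] - pc_prev) / pc_prev,    # x4, equation (8)
            (close[t] - pl_prev) / pc_prev,   # x5, equation (9)
        ])
    return rows

def window_vectors(rows, steps=16):
    """Flatten every run of `steps` consecutive rows into one vector.

    With 5 features per row and steps=16 this gives the 80 inputs
    that the labeling stage feeds to the Auto-Encoder.
    """
    return [
        [value for row in rows[end - steps:end] for value in row]
        for end in range(steps, len(rows) + 1)
    ]

# Synthetic hourly prices, used only to show the shapes involved.
close = [1.10 + 0.001 * i for i in range(20)]
high = [c + 0.002 for c in close]
low = [c - 0.002 for c in close]
vectors = window_vectors(extract_features(high, low, close))
print(len(vectors), len(vectors[0]))
```

Twenty price bars give 19 feature rows, which in turn give four windows of 80 values each.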

In (5)-(9), $p_c(t)$ is the closing price, $p_h(t)$ is the highest price, and $p_l(t)$ is the lowest price of the currency pair at time step $t$, and $t-1$ denotes the previous time step.
The closing price is used to calculate the step-by-step return as follows:

$$z_t = \frac{p_c(t) - p_c(t-1)}{p_c(t-1)} \tag{10}$$

B. Proposed Auxiliary Task Labeling Mechanism
One of the best ways to use the input data and improve the performance of the model is to select or create optimal features from a set of features. Among the various methods, PCA is an effective technique for extracting new features, and it has been used successfully to extract superior features and improve the effectiveness of stock market data clustering [22]. Another method is the Auto-Encoder (AE), which has special characteristics [23]. Initial evaluations showed good performance for the AE, so it was used in this research; in addition to extracting new and optimal features, it also reduces computing costs [23]. In this step, the 5 features generated in the pre-processing step over the previous 16 time steps, a total of 80 features, are fed into the AE network. The Encoder part of the AE network has 4 layers with 80, 128, 64, and 32 neurons, respectively. The AE network processes the inputs, and 12 new features, called golden features, are generated from the output of the Encoder. The golden features are then given to the K-Means algorithm, which clusters and categorizes the trading data to produce a suitable label. The proposed structure for creating an appropriate label for the input data, shown in Fig. 1, is named AXT.

Fig. 1. The proposed AXT model structure.

C. The Proposed Structure of Trading Agent
Figure 2 shows the architecture of the proposed trading agent. In this structure, the input data is given to the AXT network to produce suitable labels; these labels are then given to the AC network as a target, along with the inputs, to make a suitable trading decision. The AC network consists of four layers: 1 LSTM layer with 128 neurons and 3 FC layers with 32, 64, and 64 neurons, respectively. The output of the fourth layer is provided to the Actor and the Critic to make the appropriate trading decision and predict the label of the input data. In the proposed structure, the Actor has 2 heads: one head makes the trading decision (buy, sell, or do nothing), and the other head belongs to the auxiliary task, which checks the labels of the input data and corrects the action. The Critic network predicts the expected reward and evaluates the Actor's performance to help improve the Actor's decisions. The base DRL algorithm used in this research is PPO, whose performance is improved by using auxiliary tasks. Therefore, the proposed structure is called PPO+AXT.

Fig. 2. Proposed trading agent structure.

D. Evaluation Metrics
The following three criteria are used to evaluate the performance of the models.
1) The Amount of Cumulative Profit
As mentioned earlier, the agent is rewarded by the environment for acting at each time step. In this problem, the main reward is called PnL, which is calculated at time step $t$ using:

$$\text{Total Return} = \sum_t r_t^{PnL}, \qquad r_t^{PnL} = \{-1, 0, +1\} \cdot z_t \tag{11}$$

Here, $\{-1, 0, +1\}$ are the trading actions, which indicate selling, doing nothing (staying out of the market), and buying, respectively, and $z_t$ is the return from (10). The resulting cumulative profit is the sum of the profits of all time steps.
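A minimal sketch of (10) and (11) in plain Python is given below. The function names `step_returns` and `total_return` and the three-bar price series are illustrative assumptions. The sketch also assumes no transaction costs or spread, since (11) does not include any.

```python
def step_returns(close):
    """z_t from (10) for t = 1 .. len(close) - 1."""
    return [(close[t] - close[t - 1]) / close[t - 1] for t in range(1, len(close))]

def total_return(actions, returns):
    """Sum of the per-step PnL rewards action_t * z_t, following (11).

    Each action must be -1 (sell), 0 (stay out of the market) or +1 (buy).
    """
    if len(actions) != len(returns):
        raise ValueError("need exactly one action per return")
    if any(a not in (-1, 0, 1) for a in actions):
        raise ValueError("actions must be -1, 0 or +1")
    return sum(a * z for a, z in zip(actions, returns))

# Price rises 10% and then falls 10%; buying first and selling second
# earns on both moves, while staying out earns nothing.
close = [100.0, 110.0, 99.0]
z = step_returns(close)
print(total_return([1, -1], z), total_return([0, 0], z))
```

Because the reward is simply the signed return, an agent that holds the wrong position loses exactly what the right position would have gained, which is one reason the raw PnL signal is noisy.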
2) Sharpe Ratio
This criterion is used to evaluate the risk of trading and is calculated as follows:

$$\text{Sharpe Ratio} = \frac{\text{mean}(\text{returns})}{\text{std}(\text{returns})} \tag{12}$$

Here, std denotes the standard deviation of the returns.

3) The Percentage of Performance Improvement
Since the conditions and training environment of our work are not the same as those of the base article [3], the above metrics cannot be used to compare the two works directly. To compare the proposed model with the model of the base article, the percentage of performance improvement (PPI) criterion is used, which is calculated as follows:

$$PPI = \frac{value_{new} - value_{original}}{|value_{original}|} \times 100 \tag{13}$$

IV. RESULTS
This section introduces the datasets used, describes how the environment parameters were set and the hyperparameters tuned, and then presents the results of the proposed model and compares them with the basic PPO model.

A. Datasets
In this article, two separate datasets are used. The first dataset (DS1) is similar to that of [3], and the second dataset (DS2) is drawn from the data of recent years. Two separate datasets are used because DS1 was collected during a golden period of the European and American economies, whereas DS2 was collected during the war in Ukraine, the coronavirus pandemic, and the economic crisis in Europe and the United States. This allows the model's performance to be compared in two completely different market periods.
DS1: This dataset contains forex market data for the EUR/USD currency pair from the beginning of 2009 to 07/31/2017 in 1-hour candles, extracted using the MetaTrader 5 terminal. The data from 2009 to the end of 2016 were used for training, and the data from the first 7 months of 2017 were used for backtesting.
DS2: This dataset contains forex market data for the EUR/USD currency pair from 03/03/2013 to 08/01/2023 in 1-hour candles, extracted using the MetaTrader 5 terminal. The data from 2013 to the end of 2022 were used for training, and the data from the first 7 months of 2023 were used for backtesting.

B. Parameters and Hyper Parameters Tuning
In this research, the OPTUNA framework [24] was used to obtain the optimal hyperparameter values of the AE and K-Means algorithms. Using OPTUNA for the AE, the best batch size, learning rate, and encoding dimension are 32, 0.0000879678, and 12, respectively, and 12 clusters are suggested for the K-Means algorithm.
In the environment of this problem, every 600 time steps are considered an episode, and 1,000,000 time steps are used for DRL agent training. The value of epsilon (ε) in the proposed algorithm is set to 0.2.

TABLE I
Backtesting Average Performance on DS1

                  PPO       PPO+AXT    PPI %
Overall Return    -25.2%    14.86%     158.8%
Sharpe Ratio      -2.618    0.249      109.2%

C. Model Backtesting Performance
To evaluate the proposed model and establish confidence in the results, the model was trained with four different seeds (30, 50, 70, and 99), and the average result of the 4 runs was taken as the model's overall performance. The number after the series name in Figs. 3 to 6 indicates the seed. Table I shows the average performance of the basic PPO model and the proposed model on DS1 at time step 1,000,000. The results show that PPO+AXT obtains promising results compared with the agent without an auxiliary task, which supports the assumption that an auxiliary task improves learning in the reinforcement learning agent. The proposed model shows a better average performance than the base PPO model.

Fig. 3. PPO model backtesting performance on DS1.

Figure 3 shows the return performance of the PPO agent in the 7-month backtesting period on DS1 for 4 different seeds. The basic PPO agent with a seed value of 70 did not have a good return and it was called, and with seed values of 30 and 50 the agent did not earn a good return either, but the run with a seed value of 99 obtained a good return.
Figure 4 shows the performance of the PPO+AXT agent in the 7-month backtesting period on DS1 for 4 different seeds. The PPO+AXT agent with a seed value of 30 did not perform well, and with a seed value of 50 it worked well at first, but then its efficiency decreased. The proposed model with seed values of 99 and 70 achieved good results.
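The Sharpe ratio in (12) and the PPI in (13) can be sketched with the Python standard library as follows. The function names are illustrative, and two choices are assumptions because the text does not specify them: the population standard deviation (`statistics.pstdev`) is used, and the ratio is not annualized. The example inputs are the DS1 overall returns quoted in the abstract (-25.25% and 14.86%) and the returns the paper reports for the model of [3] (3.3% and 6.2%).

```python
import statistics

def sharpe_ratio(returns):
    """Mean of the returns divided by their standard deviation, as in (12)."""
    return statistics.mean(returns) / statistics.pstdev(returns)

def ppi(value_new, value_original):
    """Percentage of performance improvement, as in (13)."""
    return (value_new - value_original) / abs(value_original) * 100.0

# PPI of the overall return on DS1 (PPO to PPO+AXT).
print(round(ppi(14.86, -25.25), 2))
# PPI of the return reported for the model of [3] (DQN to PPO+Trail).
print(round(ppi(6.2, 3.3), 2))
# Sharpe ratio of a toy return series, used only to show the call.
print(sharpe_ratio([0.01, 0.03]))
```

The absolute value in the denominator of (13) is what keeps the PPI positive when the baseline value is negative, as it is for the PPO return and Sharpe ratio on DS1.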

TABLE II
Backtesting Average Performance on DS2

PPO PPO+AXT PPI %


Overall Return 2.123% 42.228% 1891.51%

Sharp-Ratio -2.933 0.473 116.4 %

Fig. 4. PPO+AXT Backtesting Performance on DS1

Fig. 6. PPO+AXT Backtesting Performance on DS2

TABLE III
Performance of PPO + Additional reward model [3]

               DQN [3]   PPO+Trail [3]   PPI %
Return         3.3%      6.2%            87.88%
Sharpe Ratio   0.753     1.525           102.67%

Fig. 5. PPO model Backtesting Performance on DS2

TABLE IV
PPI% of the Proposed Model Vs. Model [3]

               PPO+AXT   PPO+Trail [3]
Return         158.85%   87.88%
Sharpe Ratio   109.20%   102.67%

Table II shows the backtesting results of the basic model and
the proposed model on DS2 after 1,000,000 training time steps.
The proposed PPO+AXT model performs much better than the
basic PPO model: the overall return rises from 2.123% to
42.228% and the Sharpe ratio from -2.933 to 0.473.
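The PPI% columns in the tables are consistent with a relative-improvement calculation. The sketch below assumes PPI% = (improved − baseline) / |baseline| × 100; this formula is inferred from the reported numbers, and the function name `ppi` is illustrative rather than taken from the paper.

```python
def ppi(baseline: float, improved: float) -> float:
    """Percentage of performance improvement relative to the baseline.

    Assumed formula: (improved - baseline) / |baseline| * 100.
    The absolute value keeps the sign meaningful for negative baselines,
    such as a negative Sharpe ratio.
    """
    return (improved - baseline) / abs(baseline) * 100.0


if __name__ == "__main__":
    # Return values reported in Table III for DQN [3] and PPO+Trail [3].
    print(round(ppi(3.3, 6.2), 2))  # 87.88, matching Table III
```

Applied to the rounded values printed in the tables, the same formula gives results close to, but not always identical with, the other reported PPI% entries, which may be due to rounding of the tabulated inputs.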
Figure 5 shows the return performance of the PPO agent in
the 7-month backtesting period on DS2 for 4 different seeds.
The PPO agent with a seed value of 70 did not have a good
return and it was called. The run with a seed value of 99 had
a good return at first, but its return then decreased. The
runs with seed values of 30 and 50 achieved good results.
Figure 6 shows the performance of the PPO+AXT agent in
the 7-month backtesting period on DS2 for 4 different seeds.
The PPO+AXT agent with a seed value of 30 did not have a good
return, but it achieved a good return for the other 3 seeds.

V. DISCUSSION
We compare the results of this article, which uses auxiliary
tasks, with the results of the base article [3], which uses
additional rewards. In [3], the model without additional
reward is trained with the DQN algorithm, and the model with
additional reward is trained with the PPO algorithm. Table III
shows that the PPO+Trail agent performs better than the DQN
agent and that the additional reward improves the performance
of the PPO agent [3].
Table IV compares the PPI% of the model in [3] with the PPI%
of the proposed model on DS1. In both criteria the proposed
model shows the larger improvement (158.85% vs. 87.88% in
return and 109.20% vs. 102.67% in Sharpe ratio), which
indicates that combining the PPO algorithm with an auxiliary
task works better than combining it with an additional reward.
In this research, to provide reproducible results, the model
was trained and evaluated with four different seeds: 30, 50,
70, and 99. The seed initializes the random number generator,
which determines the state from which the agent starts
learning; if training is started again with the same seed, it
starts from the same state, which makes the results
reproducible.
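The role of the seed can be illustrated with a minimal sketch that uses only Python's standard `random` module. The seed values come from the text; the helper `initial_draws` is hypothetical and merely stands in for whatever random initialization the agent and environment perform.

```python
import random

# Seed values used in the paper for training and evaluation.
SEEDS = [30, 50, 70, 99]


def initial_draws(seed: int, n: int = 3) -> list[float]:
    """Return the first n pseudo-random numbers produced from a given seed.

    Stand-in for seed-dependent initialization such as network weights or
    the first sampled actions.
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


if __name__ == "__main__":
    for seed in SEEDS:
        first_run = initial_draws(seed)
        second_run = initial_draws(seed)
        # Restarting with the same seed reproduces the same sequence.
        print(seed, first_run == second_run)
```

A full training run would typically also need to seed every other source of randomness it uses (for example the numerical and deep learning libraries) to be repeatable end to end.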

Comparing the performance improvement percentage (PPI%) of
the model in [3] with that of the proposed model on both DS1
and DS2 shows that the proposed model achieves a larger
improvement in both the return and Sharpe ratio criteria. The
PPI% in the Overall Return criterion is 158.85% and 1891.51%
for the proposed model on DS1 and DS2, respectively, and
87.88% for the model in [3]. In terms of Sharpe ratio, the
PPI% is 109.20%, 116.04%, and 102.67%, respectively, which
indicates that the trading risk is reduced in the proposed
model and that the trader can place more trust in it. These
results demonstrate the strength of the proposed PPO+AXT
model.

VI. CONCLUSIONS
To improve the performance of the deep reinforcement
learning agent in the forex market, increase the trading
profit, and reduce the investment risk, a new model using the
PPO algorithm is proposed. This algorithm uses the
Actor-Critic structure to monitor the trading environment and
make optimal trades. To better monitor the trading
environment, correctly evaluate inputs, and find suitable
trading positions, the proposed model uses a new function as
an auxiliary task. The auxiliary task uses the AE and K-Means
algorithms to cluster the input data and treats the cluster
as the label of the input data. The Actor network adapts to
the environmental conditions by predicting the input data
label and comparing its prediction with the output of the
auxiliary task; by increasing its knowledge in this way, it
improves its trading strategy. Compared to the basic model
without the auxiliary task and to the model of paper [3], the
proposed model with the auxiliary task shows a significant
improvement, which demonstrates the efficiency of the proposed
method. The results of applying the proposed model to two
separate datasets confirm the main idea of this research
about shaping rewards using auxiliary tasks. Labeling the
input data with an unsupervised method therefore allows the
trading agent to gain more knowledge of the environment and,
with this increased awareness, to find a better strategy for
trading in the complex and dynamic forex market. As a result,
the proposed agent model could be used for trading in this
market.

REFERENCES
[1] O. Gym and N. Sanghi, Deep reinforcement learning
with python. Springer, 2021.
[2] Y. Matsuo et al., "Deep learning, reinforcement
learning, and world models," Neural Networks, vol.
152, pp. 267-275, 2022.
[3] A. Tsantekidis, N. Passalis, A.-S. Toufa, K.
Saitas-Zarkias, S. Chairistanidis, and A. Tefas, "Price
Trailing for Financial Trading Using Deep
Reinforcement Learning," IEEE Transactions on
Neural Networks and Learning Systems, vol. PP, pp.
1-10, 06/09 2020, doi:
10.1109/TNNLS.2020.2997523.
[4] M. A. I. Sunny, M. M. S. Maswood, and A. G. Alharbi,
"Deep learning-based stock price prediction using
LSTM and bi-directional LSTM model," in 2020 2nd
novel intelligent and leading emerging sciences
conference (NILES), 2020: IEEE, pp. 87-92.
[5] F. Liu, Y. Li, B. Li, J. Li, and H. Xie, "Bitcoin
transaction strategy construction based on deep
reinforcement learning," Applied Soft Computing, vol.
113, p. 107952, 2021, doi:
https://doi.org/10.1016/j.asoc.2021.107952.
[6] A. Shavandi and M. Khedmati, "A multi-agent deep
reinforcement learning framework for algorithmic
trading in financial markets," Expert Systems with
Applications, vol. 208, p. 118124, 2022.
[7] Y.-C. Tsai, C.-C. Wang, F.-M. Szu, and K.-J. Wang,
"Deep Reinforcement Learning for Foreign Exchange
Trading," in Trends in Artificial Intelligence Theory
and Applications. Artificial Intelligence Practices:
33rd International Conference on Industrial,
Engineering and Other Applications of Applied
Intelligent Systems, IEA/AIE 2020, Kitakyushu, Japan,
September 22-25, 2020, Proceedings 33, 2020:
Springer, pp. 387-392.
[8] Q. Kang, H. Zhou, and Y. Kang, "An Asynchronous
Advantage Actor-Critic Reinforcement Learning
Method for Stock Selection and Portfolio
Management," Computer Science
International Conference on Big Data Research,
October 2018, doi:
https://doi.org/10.1145/3291801.3291831.
[9] T. P. Lillicrap et al., "Continuous control with deep
reinforcement learning," US Patent, vol. 15, no.
217,758, 2020.
[10] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and
O. Klimov, "Proximal policy optimization
algorithms," arXiv preprint arXiv:1707.06347, 2017.
[11] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P.
Moritz, "Trust region policy optimization," in
International conference on machine learning, 2015:
PMLR, pp. 1889-1897.
[12] S. Lele, K. Gangar, H. Daftary, and D. Dharkar, "Stock
market trading agent using on-policy reinforcement
learning algorithms," Available at SSRN 3582014,
2020.
[13] S. Lin and P. A. Beling, "An end-to-end optimal trade
execution framework based on proximal policy
optimization," in Proceedings of the Twenty-Ninth
International Conference on International Joint
Conferences on Artificial Intelligence, 2021, pp. 4548-
4554.
[14] C. Konstantinos, "Reinforcement Learning with
Auxiliary Tasks in Trading," ARISTOTLE
UNIVERSITY OF THESSALONIKI, submitted in
partial fulfillment of the requirements
for the Master Degree of Artificial Intelligence, Sept 2022.
[15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft
actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor," in
International conference on machine learning, 2018:
PMLR, pp. 1861-1870.
[16] V. Konda and J. Tsitsiklis, "Actor-critic algorithms,"
Advances in neural information processing systems,
vol. 12, pp. 1008-1014, 1999,
doi: 10.1137/S0363012901385691.
[17] Y. Hu et al., "Learning to utilize shaping rewards: A
new approach of reward shaping," Advances in Neural
Information Processing Systems, vol. 33, pp. 15931-
15941, 2020.
[18] A. Laud and G. DeJong, "The influence of reward on
the speed of reinforcement learning: An analysis of
shaping," in Proceedings of the 20th International
Conference on Machine Learning (ICML-03), 2003,
pp. 440-447.
[19] M. Jaderberg et al., "Reinforcement learning with
unsupervised auxiliary tasks," arXiv preprint
arXiv:1611.05397, 2016.
[20] S. Albelwi, "Survey on self-supervised learning:
auxiliary pretext tasks and contrastive learning
methods in imaging," Entropy, vol. 24, no. 4, p. 551,
2022.
[21] J. Ye, D. Batra, E. Wijmans, and A. Das, "Auxiliary Tasks
Speed Up Learning Point Goal Navigation," in
Conference on Robot Learning, 2021: PMLR, pp. 498-
516, doi:
https://doi.org/10.48550/arXiv.2007.04561.
[22] C. Han, Z. He, and A. J. W. Toh, "Pairs trading via
unsupervised learning," European Journal of
Operational Research, vol. 307, no. 2, pp. 929-947,
2023.
[23] S. Ladjal, A. Newson, and C.-H. Pham, "A PCA-like
autoencoder," arXiv preprint arXiv:1904.01277, 2019.
[24] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M.
Koyama, "Optuna: A next-generation hyperparameter
optimization framework," in Proceedings of the 25th
ACM SIGKDD international conference on knowledge
discovery & data mining, 2019, pp. 2623-2631.
