0327_AaronGerardDaniel
SUBMITTED BY
Name of Candidate: Aaron Gerard Daniel
Registration Number: 19-300-3-01-0327
College Roll Number: 0327
College Room Number: 12
Department: B.Com(Hons.) – Morning
SUPERVISED BY:
Name of the Supervisor: Dr. Soheli Ghose
Name of College: St. Xavier’s College (Autonomous), Kolkata
Signature: ....................................
Acknowledgement
Surpassing milestones towards a mission sometimes gives us such a degree of jubilance that we tend to forget the precious guidance and help extended by the people to whom the success of the mission is dedicated. A project depends on contributions from a wide range of people for its success. I would like to take this opportunity to acknowledge the many people who have contributed a great deal of their time and expertise to the development of this project. Firstly, I would like to express my sincere gratitude to the college and to the Father Principal and Vice Principal for giving us the opportunity to write a dissertation project. I would also like to express my sincere thanks to my project guide, Dr. Soheli Ghose, for getting me started and for guiding me throughout the project. Without her knowledge, guidance and experience, this project would not have come so far. She has been a source of constant inspiration, stimulating me to learn and pick up the finer points of the topic, making my learning process a worthy experience. She has given a sense of completeness to this project and ensured its soundness by closely monitoring my work. I would also like to thank my family, friends and relatives for their continuous support throughout the period of my project. Above all, I want to thank God for giving me the confidence and patience to complete the project successfully.
INTRODUCTION
A Deep Neural Network (DNN) automatically learns successive lower-dimensional representations from high-dimensional input data. The essential feature of a DNN is its hierarchical architecture, which gives Deep Learning a strong ability to perceive and extract features. Its main flaw is that it lacks the ability to make decisions. Reinforcement Learning, on the other hand, can be used to make decisions, but it has limits when it comes to fully representing observations. This has prompted researchers to combine Deep Learning and Reinforcement Learning, because each method complements the other. The integrated approach can provide a framework for developing a complex cognitive decision-making system.
The stock market is characterised by fast fluctuation, a plethora of interfering factors, and a lack of timely data. Stock trading is a game with imperfect information, and sequential decision problems are difficult to handle with a single-objective supervised learning model. Reinforcement Learning is among the most effective methods for resolving these problems. Traditional quantitative investment is frequently based on technical indicators, which have limited self-adaptability and a short lifespan. This study aims to demonstrate the application of a Deep Reinforcement Learning model to the financial sector, which can deal with large amounts of data in the financial market, improve data processing and feature extraction from transaction signals, and improve trading capability. Furthermore, this research applies Deep Learning and Reinforcement Learning theory from computer science to the realm of finance, demonstrating the ability of Neural Networks to capture and evaluate information from large amounts of data. Stock trading, for example, is a sequential decision-making task, and the final aim in Reinforcement Learning is to learn multi-stage behaviour patterns. The approach can determine the optimal price in a given state in order to reduce transaction costs. As a result, it is highly practical in the investment industry.
LITERATURE REVIEW
The main point of view on early Deep Reinforcement Learning is that it uses neural networks to reduce the dimensionality of high-dimensional input in order to speed up data processing. Shibata et al. and, later, Lange et al. proposed applying deep autoencoders to visual learning control, devising "visual motion learning" to train an agent to have human-like perception and decision-making capacity. Abtahi et al. introduced Deep Belief Networks (DBN) into Reinforcement Learning, where the DBN replaces the original value-function approximator during model development; the model was successfully applied to the character-segmentation problem for licence plate images. Lange et al. then introduced Deep Q-Learning, which used Reinforcement Learning to control a car automatically from visual input. Koutnik et al. applied the Neural Evolution (NE) approach together with Reinforcement Learning to the popular car-racing game TORCS and were able to achieve fully automated driving.
The DL part automatically observes the current market environment and learns its features, while the RL part interacts with the environment on the basis of this deep representation and makes trading decisions in order to accumulate the final return from the continually updated environment.
The pioneers of DRL were Mnih et al. To solve the decision-making problem of Atari games, the pixel values of the game screen are used as the input state (S), while the front, rear, left and right directions of the game joystick are used as actions (A). In 2015 they demonstrated that the performance of a Deep Q-Network agent could outperform all other methods. Later on, many researchers enhanced DQN. Double DQN was proposed by Van Hasselt et al., in which one of the Q-networks decides the action and the other Q-network evaluates it; the two networks collaborate to overcome the overestimation problem that exists in a single DQN. To speed up the training process, Silver et al. developed a replay mechanism based on the original Double DQN and added camouflage samples in 2016. The Dueling Network, proposed by Wang et al., is a DQN-based approach that splits the network into a scalar state-value output V(s) and an advantage output for each action, and combines the two streams to produce the Q-values. Deterministic Policy Gradient algorithms (DPG) were introduced by Silver et al., and the DDPG paper from Google combined DQN and DPG to extend DRL to continuous action-space control. According to research from Berkeley University, the method's reliability in simulation and the stability of the DRL model are crucial.
Gabriel et al. established the concept of action embedding, which embeds a discrete real-world action into a continuous space so that the reinforcement learning method can be applied to large-scale learning problems. The preceding findings show that deep reinforcement learning algorithms are constantly being developed and optimised in order to adapt to more realistic scenarios. Reinforcement learning is able to examine the world without supervision, actively explore and experiment, and summarise useful experience on its own. Although the active learning paradigm that combines deep learning with reinforcement learning is still in its early stages, it has been shown to be effective in learning a variety of video games.
Krollner et al. examine a variety of machine-learning-based stock market forecasting articles, including neural-network-based models, evolutionary and optimisation mechanisms, and multiple and hybrid approaches. Researchers frequently use Artificial Neural Networks (ANN) to forecast stock market trends. Guresen et al., for example, employ the Dynamic Artificial Neural Network (DANN) and Multi-Layer Perceptron (MLP) models to predict the NASDAQ stock index. To address the portfolio-selection problem, Hu et al. integrated a reinforcement learning algorithm with a cointegration pairs-trading strategy. Using the Sortino ratio as the return index, adaptive dynamic adjustment of the model parameters is realised, and both the return rate and the Sortino ratio are considerably improved; the maximum drawdown decreases, as does the transaction frequency. On the other hand, the study covers fewer securities, smaller data sets, and fewer state indicators. Vanstone and colleagues developed an MLP-based trading system for detecting trade signals in the Australian stock market. Hybrid machine learning (HML) models have been used to resolve financial trading points that depend on time sequences, owing to the constraints of any single model. As a result, HML models have become a standard for financial analysis. A hybrid Support Vector Regression model proposed by J. Wang et al. can anticipate trading prices by combining Principal Component Analysis and Brain Storm Optimisation. Mabu et al. employed a rule-based evolutionary algorithm in conjunction with an MLP to discover stock market trading points.
Financial market data are uncertain and time-dependent because of the reciprocal influence of a large number of complicated components. Their analysis is a difficult, nonlinear, and inherently unstable problem. In financial forecasting and sequential decision making, traditional statistical models and mass data-mining models are ineffective. Technical indicators and evaluation criteria are frequently used in traditional quantitative investment algorithms; these techniques suffer from a short lifespan and poor self-adaptation. With machine learning algorithms, investment strategies can be greatly improved: the machine's processing speed can greatly increase strategy adaptability and the capacity to derive market features from real-time trading signals.
BACKGROUND
Types of Learning:
This study is primarily based on the application of Deep Reinforcement Learning to Algorithmic Trading, but before we come to that, let's first understand the different types of learning:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
How can supervised learning be used for Algorithmic Trading?
We can build a model based on supervised learning, but first we must analyse all outcomes and scenarios. Initially, a large number of scenarios have to be fed to the machine in order for it to act. A large number of real-time scenarios and use-cases are used to build the model. The machine will then know what to do in a specific scenario and will be able to act on it, provided that scenario has been fed into the system. Because all scenarios, data and use cases must be fed in, the system becomes dependent.
The reliance here is on supervision. In a sense, all types of learning are supervised by a loss function, a function that tells the machine what is good and what is bad. Since childhood, we humans have been shaped by this type of learning.
The brain takes each situation and tries to act on it in a specific way: it first analyses all types of scenarios, tries to predict the best possible outcome or step to take, and then acts on it. The brain executes that step in the hope that it will turn out as expected, and following execution it receives a success or failure result.
Reinforcement Learning, then, is a technique for teaching machines to solve a task without supervision, with a reward received based on the outcome.
When a task or scenario is assigned, a series of steps must be completed in order for the final result to be achieved. The scenario changes at each step, and a new "state" is presented, on the basis of which the next decision is made. Finally, a reward is given based on the outcome of the steps taken, which defines success or failure. Because Reinforcement Learning learns through experience, it is self-sufficient. This distinguishes it from supervised learning algorithms, in which the computer must be explicitly told what behaviour is acceptable.
Reinforcement learning algorithms are especially well suited to tasks that are difficult to specify explicitly, such as playing chess or driving a car. In these cases, telling the computer exactly what to do to achieve the desired result is difficult or impossible. Reinforcement learning algorithms, by contrast, can learn how to perform a task by trial and error, given feedback (e.g., rewards or punishments) for specific behaviours.
When we perform a specific action, we move from one state to another; essentially, we are transitioning between states. And because the transition is driven by an action, what exactly motivates a specific action? This is referred to as the policy. The policy is incorporated into the system and drives our actions; it could be based on whatever will ultimately benefit us.
APPLICATION TO TRADING:
The question here is how we apply DRL to the stock market. When the machine enters the market, it is essentially entering the environment of a specific stock. When the machine attempts to understand a chart, multiple factors and cases are presented to it. The various market parameters, such as data from candlestick charts, trend analysis, or sentiment analysis, form the state and give the machine an understanding on which it can act, based on the policies it has been given. The policies in this case could be trading strategies or various types of analysis. The machine's actions are limited to buy, sell or hold. If the machine decides to buy a particular stock at a specific step, it reaches the point where it owns the stock. Following that, if it decides to hold while the stock price rises or falls, it reaches another state, and so on. Finally, it receives a reward if it sells the stock at a profit. The system aims to maximise the reward over the short or long term, depending on the strategy or policy fed to it, as the sketch below illustrates.
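The following is a minimal, hedged sketch in Python of how this setting can be framed as a reinforcement-learning environment. The state features, the reward definition, the class name and the price_series input are illustrative assumptions, not the exact design used later in the study.

# A minimal sketch of a trading environment: state = recent price changes
# plus the current position; actions = buy / sell / hold; reward = realised
# profit or loss when a position is closed.
import numpy as np

BUY, SELL, HOLD = 0, 1, 2

class SimpleTradingEnv:
    def __init__(self, price_series: np.ndarray, window: int = 10):
        self.prices = price_series
        self.window = window

    def reset(self):
        self.t = self.window          # current time step
        self.position = 0             # 0 = flat, 1 = long
        self.entry_price = 0.0
        return self._state()

    def _state(self):
        # recent price changes plus the current position flag
        recent = np.diff(self.prices[self.t - self.window:self.t + 1])
        return np.append(recent, self.position)

    def step(self, action: int):
        price = self.prices[self.t]
        reward = 0.0
        if action == BUY and self.position == 0:
            self.position, self.entry_price = 1, price
        elif action == SELL and self.position == 1:
            reward = price - self.entry_price    # realised profit or loss
            self.position = 0
        self.t += 1
        done = self.t >= len(self.prices) - 1
        return self._state(), reward, done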
THE PROBLEM:
To put it simply, the problem is determining when to act so that the shift to another state can occur. In the normal course of events, humans often know the possible scenarios by judging and drawing conclusions from past statistical data, and so we may know the best time to buy, sell or hold. Knowing when to buy, sell or hold comes from experience, and DRL learns through experience, so let us use an example to illustrate the issue.
The machine enters a trade in which it shorts an option at price 'x'. This option has a high level of volatility. The price fluctuates as 'x+10', 'x+5', 'x-2', 'x-7', and finally falls to 'x-10', at which point the machine closes the trade. Now let us look at the entire scenario. When the price was rising initially, it might not have continued to rise indefinitely, so the machine could have waited, speculating that the price would fall; but in doing so it could have incurred a huge loss had the fall never occurred. Another possibility is that the price would have continued to fall even after 'x-10', yielding a larger profit in the long run. There could be an infinite number of possibilities, both in the short and the long term.
As a result, we can conclude that the problem here is to help the machine make decisions in all possible scenarios, and to help it deal with the concept of 'delayed gratification' by labelling the different decisions and actions accordingly.
To deal with the problem of 'delayed gratification' explained above, we label each trade with a particular profit/loss value. Consider, for instance, a case in which the machine opens a long position at $1,200 and exits the trade at $1,300, making a profit of $100. Here the system has valued every hold decision at zero; it skips the earlier exit opportunities purely on the speculation that it might receive a higher profit at a later stage, and it finally closes the trade at $1,300 on the grounds that it is making a profit of $100, whereas an earlier exit would have yielded a smaller profit.
BELLMAN’S EQUATION
The Bellman equation is a mathematical formula used in dynamic programming and reinforcement learning to compute the optimal value function for a given Markov decision process (MDP). The equation is named after its creator, Richard Bellman. It states that the optimal value of a state equals the expected immediate reward plus the discounted optimal value of the successor states; in other words, the optimal value function is the expected value of the discounted sum of all future rewards.
The Bellman equation can be used to solve for the optimal policy of an MDP, the policy that maximises the expected discounted sum of future rewards. It can also be used to compute the value function for a given policy, i.e. the expected discounted sum of future rewards obtained by following that policy. The Bellman equation is a key component of dynamic programming, a method of solving optimisation problems by breaking them down into smaller subproblems, and it can be applied to a wide range of problems involving decision making under uncertainty, such as those arising in economics and finance. It is a recursive equation: it defines the current optimal value in terms of future optimal values, and so tells us how to find the best future decision given our current situation.
Here we use the Bellman equation to label each action. Every hold decision, which in the earlier example was simply valued at zero, is now given a label. Through the Bellman equation we assign to each hold option a value indicating how much holding the trade would be worth by the end. The values attached to these hold decisions can be thought of as a profit-and-loss figure for each of those points; it is not exactly a profit or loss but a quasi profit-and-loss value. This gives an idea of how much the decision to hold is worth compared with the decision to sell at different points during the trade. If the machine concludes that holding is worth more than selling, it will continue to hold the trade, and this cycle continues until it reaches a point at which it determines it can realise a profit.
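For reference, the standard form of the Bellman optimality equation assumed in the discussion that follows is

Q(s, a) = r + γ · max over a' of Q(s', a')

where Q(s, a) is the value of taking action a in state s, r is the immediate reward, s' is the next state reached, and γ is the discount factor.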
Here the value used for γ is 0.9.
• s' is the future state in which we end up after performing an action.
• The equation works backwards: we start from the future state and evaluate the previous state.
• γ is the discount factor; it weights the future state relative to the previous one.
• The larger the value of γ, the higher the weight given to future outcomes. Depending on this value, the reward for holding the position can be higher than that for closing it.
Therefore, by executing trades based on the Bellman equation, we are not merely speculating, acting out of greed, or considering only the on-the-spot return; we also account for the value of continuing to hold the option, with the conviction that we will be able to receive higher returns in the future. Reinforcement Learning thus helps us make decisions based on expected future outcomes rather than on immediate results alone. A small sketch of this labelling procedure follows.
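Below is a minimal Python sketch of how hold decisions can be labelled with Bellman-style discounted values as described above; the list of per-step exit values and the function name are hypothetical illustrations, not the exact procedure used in the study.

# Label each hold decision with the discounted value of the best future exit.
def label_hold_values(step_rewards, gamma=0.9):
    """Work backwards from the end of the trade, discounting future value."""
    values = [0.0] * len(step_rewards)
    future_value = 0.0
    for t in reversed(range(len(step_rewards))):
        # value of holding at t: the better of closing now or holding on,
        # with the future discounted by gamma
        future_value = max(step_rewards[t], gamma * future_value)
        values[t] = future_value
    return values

# Example: a trade whose exit value rises towards the end
print(label_hold_values([10, 5, -2, -7, 100]))   # [65.61, 72.9, 81.0, 90.0, 100]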
There are a few key constraints to using deep reinforcement learning for stock trading.
1. The deep reinforcement learning algorithm must be able to accurately predict future stock
prices.
2. The deep reinforcement learning algorithm must be able to make quick and accurate
decisions in order to take advantage of stock price fluctuations.
3. The deep reinforcement learning algorithm must be able to handle large amounts of data in
order to make informed decisions.
An algorithmic trading strategy is a set of rules that determine when to buy and sell a
security. These rules can be based on a number of factors, including price, volume, and
technical indicators. Algorithmic traders use these rules to create and execute trading orders
automatically. This allows them to execute trades quickly and efficiently, without the need
for human intervention. Algorithmic trading strategies can be used for a variety of purposes,
including market making, arbitrage, and trend following. Algorithmic trading strategies can
be used by individual investors, or by professional money managers. Many institutional
investors use algorithmic trading strategies to minimize the costs of trading, and to improve
the speed and execution of their orders.
There are a variety of different types of algorithmic trading strategies, including trend-
following, breakout, and mean-reversion. Each of these strategies has its own strengths and
weaknesses, and may be more or less appropriate for a particular type of security or market
condition. Trend following is a strategy that attempts to capture long-term price trends. It
looks for stocks that are moving in a particular direction and attempts to ride the trend for as
long as possible. Breakout trading is a strategy that looks for stocks that have recently broken
out of a trading range. The trader buys the stock when it breaks out and sells it when it returns
to the range. Mean reversion is a strategy built on the assumption that prices tend to revert to their long-run average. The trader buys a stock when its price has moved well below that average and sells when the price returns towards the average (and can do the opposite when the price has risen well above it). A brief sketch of two such signal rules is given below.
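The following is a hedged sketch of two of the strategy families described above, expressed as signal functions on a pandas price series; the column, parameter and function names are illustrative assumptions.

# Simple long/flat signal rules for trend following and mean reversion.
import pandas as pd

def trend_following_signal(close: pd.Series, fast: int = 20, slow: int = 50) -> pd.Series:
    """1 (long) when the fast moving average is above the slow one, else 0."""
    fast_ma = close.rolling(fast).mean()
    slow_ma = close.rolling(slow).mean()
    return (fast_ma > slow_ma).astype(int)

def mean_reversion_signal(close: pd.Series, window: int = 20, z: float = 1.0) -> pd.Series:
    """1 (buy) when price is more than z standard deviations below its rolling mean."""
    mean = close.rolling(window).mean()
    std = close.rolling(window).std()
    return (close < mean - z * std).astype(int)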
It is important to remember that no single algorithmic trading strategy is guaranteed to be
successful in all market conditions. It is important to carefully test any algorithmic trading
strategy before using it in a live market. There are many different types of algorithmic trading
strategies, but all share a common goal: to make money by buying and selling securities at
the right time.
In algorithmic trading, time is often discretized into small time-steps in order to make the
calculations involved in trading more manageable. This discretization can be done in a
number of ways, each with its own set of benefits and drawbacks.
One way to discretize time is to use a fixed time-step size. This approach is simple to implement, and it is easy to work out the step size needed to achieve a desired level of accuracy. However, it can lead to inaccuracies if the time-step is too large, and to unnecessary computation if it is too small. Another way to discretize time is to use a variable time-step size. This approach can be more accurate than a fixed step, but it is more complex to implement, and it can be harder to determine the step sizes needed to achieve a desired level of accuracy.
More generally, a time series can be discretized into a sequence of points in time. This sequence can be used to approximate the original series and can be fed into algorithms for time-series analysis and trading. The discretization can be performed in different ways, and different choices can be better or worse for different purposes; a short sketch of a fixed-step discretization follows.
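Below is a hedged sketch of fixed time-step discretization of tick data using pandas; the DataFrame ticks with a datetime index and hypothetical price and volume columns is an assumption for illustration.

# Aggregate irregular ticks into fixed time-step OHLC bars.
import pandas as pd

def discretize(ticks: pd.DataFrame, step: str = "5min") -> pd.DataFrame:
    """Resample tick data into fixed-width bars of length `step`."""
    bars = ticks["price"].resample(step).ohlc()      # open/high/low/close per step
    bars["volume"] = ticks["volume"].resample(step).sum()
    return bars.dropna()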
For reinforcement learning to be applied effectively to such data, the environment and the learner should satisfy a few conditions:
1. The environment should be able to provide feedback to the learner about the success of its actions. This feedback can be in the form of rewards (positive feedback) or punishments (negative feedback).
2. The learner should be able to determine which actions are most likely to result in positive
outcomes, and then select those actions more often.
3. The learner should also be able to adapt its behavior over time, based on the results of its
previous actions.
4. The environment should be able to provide enough information to the learner so that it can
make informed decisions.
5. The environment should be stable, so that the learner can make reliable predictions about
the outcomes of its actions.
The deep Q-network (DQN) algorithm is a neural-network-based algorithm for training artificial intelligence agents, specifically those used in reinforcement learning. The DQN algorithm was proposed by DeepMind in 2013 as an extension of the Q-learning algorithm. It replaces the table of Q-values used in classical Q-learning with a deep neural network that both learns a representation of the environment's state from raw inputs and estimates the value (the Q-value) of each possible action in that state, from which the best action can be chosen.
The DQN algorithm has been shown to be more effective than traditional reinforcement
learning algorithms, such as Q-learning and SARSA. In particular, it has been shown to be
able to learn more effectively from experience and generalize better to new environments.
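As a rough illustration, the following is a hedged PyTorch sketch of the two ingredients just described: a small Q-network mapping a state vector to one Q-value per action, and the standard DQN target. The layer sizes, the number of state features and the function names are assumptions made for illustration.

# Minimal Q-network and DQN target: r + gamma * max_a' Q_target(s', a').
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int = 3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),   # one Q-value per action (buy/sell/hold)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.layers(state)

def dqn_target(reward, next_state, done, target_net, gamma=0.99):
    """Standard DQN target; `done` is a 0/1 float tensor marking episode ends."""
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1).values
    return reward + gamma * (1.0 - done) * next_q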
Double DQN
Double DQN is a reinforcement learning algorithm that combines deep Q-learning with the double Q-learning idea. A standard DQN uses the same network both to select the next action and to evaluate it, which tends to overestimate action values. Double DQN decouples these two roles: the online network chooses the action with the highest estimated value, while a separate target network evaluates the value of that chosen action. As noted in the literature review, the two networks collaborate to overcome the overestimation problem of a single DQN, which generally leads to more stable training and helps in tasks with many rounds or states, where a plain deep Q-learning agent can struggle.
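A minimal sketch of the Double DQN target computation is given below; online_net and target_net stand for the two Q-networks described above, and the function signature is an illustrative assumption.

# Double DQN target: the online network selects the action, the target
# network evaluates it.
import torch

def double_dqn_target(reward, next_state, done, online_net, target_net, gamma=0.99):
    with torch.no_grad():
        # action selection with the online network
        best_actions = online_net(next_state).argmax(dim=1, keepdim=True)
        # action evaluation with the target network
        next_q = target_net(next_state).gather(1, best_actions).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q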
ADAM optimiser:
Adam (Adaptive Moment Estimation) is a stochastic gradient-based optimiser that is widely used to train deep neural networks, such as the Q-networks discussed above. For every parameter it maintains an exponentially decaying average of past gradients (the first moment) and of past squared gradients (the second moment), and it uses these estimates to adapt the effective learning rate of each parameter individually. In practice, Adam usually converges faster than plain stochastic gradient descent and requires relatively little tuning of its hyperparameters: the learning rate, the decay rates β1 and β2, and the small constant ε used for numerical stability.
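A hedged sketch of how Adam is typically used in PyTorch is shown below; the tiny stand-in network and the learning rate are illustrative assumptions.

# One training step with the Adam optimiser.
import torch

q_net = torch.nn.Linear(32, 3)                       # stand-in Q-network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def training_step(loss: torch.Tensor) -> None:
    optimizer.zero_grad()   # clear gradients from the previous step
    loss.backward()         # backpropagate the loss (e.g. the TD error)
    optimizer.step()        # Adam update of all network parameters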
Huber loss
In robust statistics, the Huber loss is a measure of estimation error that is less sensitive to outliers than the squared error. It is named after Peter J. Huber, who introduced it in 1964. For an error e = y − x, it is defined as

L(e) = ½ e² if |e| ≤ ε, and L(e) = ε (|e| − ½ ε) otherwise,

where "x" is the true value of a random variable, "y" is the estimate of "x" produced by an estimator, and "ε" is a positive threshold. The loss is quadratic for small errors and only linear for large ones, so a few estimates that lie very far from the true value do not dominate the total loss; the estimated values therefore stay clustered around the bulk of the data rather than being dragged about by extreme observations.
The Huber loss is a popular choice of loss function because it is relatively insensitive to outliers. This makes it a good choice for estimating quantities that are affected by occasional extreme values, such as the weights of patients in a medical study or sudden jumps in financial price series.
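A minimal sketch of the Huber loss as defined above follows; epsilon is the threshold at which the loss switches from quadratic to linear.

# Element-wise Huber loss.
import numpy as np

def huber_loss(y_true: np.ndarray, y_pred: np.ndarray, epsilon: float = 1.0) -> np.ndarray:
    error = y_pred - y_true
    small = np.abs(error) <= epsilon
    quadratic = 0.5 * error ** 2
    linear = epsilon * (np.abs(error) - 0.5 * epsilon)
    return np.where(small, quadratic, linear)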
Gradient clipping
Gradient clipping limits the size of the gradients that are propagated back through the network during training. When an update would produce an unusually large gradient, for example because of a single extreme temporal-difference error, the gradient is rescaled so that its norm does not exceed a chosen threshold (norm clipping), or each component is capped at a maximum value (value clipping). This prevents exploding gradients and keeps training of the Q-network stable.
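A hedged sketch of gradient-norm clipping during a training step is shown below; the stand-in network and optimiser mirror the earlier Adam example.

# Clip the global gradient norm before the optimiser step.
import torch

q_net = torch.nn.Linear(32, 3)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def training_step_with_clipping(loss: torch.Tensor, max_norm: float = 1.0) -> None:
    optimizer.zero_grad()
    loss.backward()
    # rescale gradients so that their global norm does not exceed max_norm
    torch.nn.utils.clip_grad_norm_(q_net.parameters(), max_norm)
    optimizer.step()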
Xavier initialisation
Xavier initialisation is a technique used to improve the training of deep learning networks. It was proposed by Xavier Glorot and Yoshua Bengio in 2010. The scheme sets the initial weights of each layer according to the number of units feeding into and out of that layer, so that the variance of the activations, and of the gradients flowing backwards, stays roughly constant from layer to layer. This helps to avoid activations that saturate or vanish in deep networks and generally leads to faster and more reliable training.
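The following is a minimal sketch of Xavier (Glorot) uniform initialisation applied to the linear layers of a small network like the Q-networks discussed earlier; the network itself is a hypothetical stand-in.

# Apply Xavier uniform initialisation to every linear layer.
import torch.nn as nn

net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 3))

def init_xavier(module: nn.Module) -> None:
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)  # bound ~ sqrt(6 / (fan_in + fan_out))
        nn.init.zeros_(module.bias)

net.apply(init_xavier)  # applies init_xavier to every sub-module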
Batch normalisation layers are a type of layer used in deep learning networks. They improve the accuracy and speed of training by normalising the activations across each batch of examples: every feature is rescaled to have roughly zero mean and unit variance, followed by a learned scale and shift. This stabilises the distribution of inputs seen by each layer, often allows larger learning rates, and has a mild regularising effect. Batch normalisation layers are typically inserted between the hidden layers of a deep network, most commonly after a fully-connected or convolutional layer and before its activation function.
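A hedged sketch of where batch normalisation might sit in a Q-network is shown below; the layer sizes mirror the earlier hypothetical examples.

# Fully-connected network with batch normalisation between hidden layers.
import torch.nn as nn

q_net_bn = nn.Sequential(
    nn.Linear(32, 64),
    nn.BatchNorm1d(64),   # normalise activations across the batch
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 3),     # Q-values for buy / sell / hold
)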
Regularisation techniques
In the context of training neural networks, regularisation refers to techniques that reduce overfitting, i.e. that stop the model from memorising the training data at the expense of its performance on unseen data. The most common are:
1. L2 regularisation (weight decay), which adds a penalty proportional to the squared magnitude of the weights to the loss and therefore discourages very large weights.
2. L1 regularisation, which penalises the absolute magnitude of the weights and tends to drive many of them to exactly zero, producing sparser models.
3. Dropout, which randomly deactivates a fraction of the neurons at each training step so that the network cannot rely too heavily on any single unit.
4. Early stopping, which halts training when performance on a held-out validation set stops improving.
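Below is a hedged sketch combining two of these techniques, dropout inside the network and L2 weight decay via the optimiser, reusing names from the earlier hypothetical examples.

# Dropout layers plus L2 weight decay on the optimiser.
import torch
import torch.nn as nn

q_net_reg = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),        # randomly zero 20% of activations while training
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(64, 3),
)

# weight_decay adds an L2 penalty on the weights to every update
optimizer_reg = torch.optim.Adam(q_net_reg.parameters(), lr=1e-4, weight_decay=1e-5)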
Preprocessing
The data was preprocessed and normalised using the following steps:
1. The data was split into a training set and a test set.
There are a few different ways to add data to your training set in order to improve your
machine learning models.
Synthetic data is data that is artificially generated, usually using a computer. This can be done
in a number of ways, including:
Sampling from a distribution: this involves randomly selecting values from within a given
distribution. For example, you could use this technique to generate new data points that are
similar to those in your training set. This can help improve the accuracy of your models by
increasing the diversity of the data they are trained on.
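The following is a hedged sketch of the preprocessing steps described above: a chronological train/test split followed by simple synthetic sampling from a distribution fitted to the training data. The column name, file path and sample size are illustrative assumptions.

# Train/test split and synthetic data generation for returns.
import numpy as np
import pandas as pd

prices = pd.read_csv("stock_prices.csv")             # hypothetical input file
returns = prices["close"].pct_change().dropna()

# 1. chronological train/test split (no shuffling for time series)
split = int(len(returns) * 0.8)
train, test = returns.iloc[:split], returns.iloc[split:]

# 2. generate synthetic returns by sampling from a normal distribution fitted
#    to the training data, to increase the diversity of training examples
rng = np.random.default_rng(seed=0)
synthetic = rng.normal(loc=train.mean(), scale=train.std(), size=500)
augmented_train = np.concatenate([train.to_numpy(), synthetic])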
DATA RESULTS
Figure: DQN NETWORK: TRAIN & TEST DATA
Figure: DOUBLE DQN NETWORK: TRAIN & TEST DATA
Figure: DOUBLE DQN NETWORK: TRAIN & TEST – REWARD & LOSS
Figure: DUELING DOUBLE DQN NETWORK: TRAIN & TEST DATA
Figure: DUELING DOUBLE DQN NETWORK: TRAIN & TEST – REWARD & LOSS