Deep Reinforcement Learning: Guide to Deep Q-Learning
In this article, we discuss two important topics in reinforcement
learning: Q-learning and deep Q-learning.
By Peter Foy
In this article, we cover two important concepts in the field of reinforcement learning:
Q-learning and deep Q-learning.
This article is based on the Deep Reinforcement Learning 2.0 course and is organized
as follows:
4. Q-Learning Intuition
5. Temporal Difference
7. Experience Replay
Reinforcement learning involves two key components:
An environment: This could be something like a maze, a video game, the stock market, and so on.
An agent: This is the AI that learns how to operate and succeed in a given environment.
The way the agent learns how to operate in the environment is through an iterative feedback loop. The agent first takes an action; as a result, the state changes and the agent receives a reward.
Here's a visual representation of this iterative feedback loop of actions, states, and
rewards from Sutton and Barto's RL textbook:
By taking actions and receiving rewards from the environment, the agent can learn which actions lead to the most reward over time.
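As an illustration, here is a minimal sketch of this feedback loop in Python; the `env` and `agent` objects and their methods are hypothetical, used only to show the shape of the loop:

```python
# Minimal sketch of the agent-environment feedback loop.
# `env` and `agent` are hypothetical objects used only for illustration.

def run_episode(env, agent, max_steps=100):
    state = env.reset()                      # initial state of the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.choose_action(state)             # agent picks an action
        next_state, reward, done = env.step(action)     # environment responds
        agent.learn(state, action, reward, next_state)  # agent updates from feedback
        total_reward += reward
        state = next_state
        if done:                             # episode ends (e.g. goal or mine reached)
            break
    return total_reward
```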
If you want more of an introduction to this topic, check out our other articles on reinforcement learning.
In the Bellman equation, we'll use the following notation:
s - State
a - Action
R - Reward
γ - Discount factor
The Bellman Equation was introduced by Dr. Richard Bellman, who is known as the father of Dynamic Programming.
To build intuition, consider a maze example: how do we train a robot to reach the end goal via the shortest path without stepping on a mine?
Source: freeCodeCamp
-1 point at each step. This is to encourage the agent to reach the goal in the
shortest path.
+1 point for landing on a ⚡.
A Q-table is a lookup table that calculates the expected future rewards for each
action in each state. This lets the agent choose the best action in each state.
In this example, our agent has 4 actions (up, down, left, right) and 5 possible states.
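As a minimal sketch (indexing the 5 states and 4 actions as integers is an illustrative choice, not something from the article), the Q-table can be represented as a simple 2D array:

```python
import numpy as np

n_states = 5    # e.g. the 5 possible states in the maze example
n_actions = 4   # up, down, left, right

# Q-table: expected future reward for each (state, action) pair,
# initialized to zero before any learning has happened.
Q = np.zeros((n_states, n_actions))

# The greedy choice in a given state is the action with the highest Q-value.
state = 0
best_action = int(np.argmax(Q[state]))
```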
So the question is: how do we calculate the maximum expected future reward for each action in each state?
We learn the values in the Q-table through an iterative process using the Q-learning algorithm.
V(s) = max_a(R(s, a) + γV(s′))
Here's a summary of the equation from our earlier Guide to Reinforcement Learning:
If you're familiar with finance, the discount factor can be thought of as analogous to the time value of money.
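As a small illustrative calculation (the numbers here are made up), a reward several steps in the future is worth less today:

```python
# Illustrative example: a +100 reward that is 3 steps away,
# discounted with gamma = 0.9, is worth 0.9**3 * 100 today.
gamma = 0.9
future_reward = 100
steps_away = 3

present_value = (gamma ** steps_away) * future_reward
print(present_value)  # ≈ 72.9
```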
First let's review the difference between deterministic and non-deterministic search.
Deterministic Search
Deterministic search means that if the agent tries to go up (in our maze
example), then with 100% probability it will in fact go up.
Non-Deterministic Search
Non-deterministic search means that if our agent wants to go up, there could be an 80% chance it actually goes up, a 10% chance it goes left, and a 10% chance it goes right, for example.
So there is an element of randomness in the environment that we need to account for.
This is where two new concepts come in: Markov processes and Markov decision
processes.
To simplify this: in an environment with the Markov property, what happens in the future doesn't depend on the past, only on the current state.
A Markov decision process is the framework the agent uses to make decisions in such an environment. Recall the deterministic Bellman equation:
V(s) = max_a(R(s, a) + γV(s′))
Now that we have some randomness we don't actually know what s′ we'll end up in.
To do this, we weight each of the three possible next states by its probability. The expected value of the next state becomes:
0.8·V(s′_1) + 0.1·V(s′_2) + 0.1·V(s′_3)
Replacing the single next-state value with this expectation gives the Bellman equation for stochastic (non-deterministic) environments:
V(s) = max_a(R(s, a) + γ ∑_{s′} P(s, a, s′)V(s′))
This equation is what we'll be dealing with going forward since most realistic
environments are stochastic in nature.
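Here's a small illustrative calculation of this expected value and a single Bellman backup; the 0.8/0.1/0.1 transition probabilities and −1 step reward come from the running example, while the next-state values are made up for illustration:

```python
gamma = 0.9

# Hypothetical values of the three states the agent could end up in
# after trying to go "up", with transition probabilities 0.8, 0.1, 0.1.
transition_probs = [0.8, 0.1, 0.1]
next_state_values = [10.0, 2.0, 4.0]

# Expected value of the next state: sum over s' of P(s, a, s') * V(s')
expected_next_value = sum(p * v for p, v in zip(transition_probs, next_state_values))

# One Bellman backup for this action, with a -1 per-step reward
reward = -1
value_for_up = reward + gamma * expected_next_value
print(expected_next_value, value_for_up)
```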
4. Q-Learning Intuition
Now that we understand the Bellman equation and how a Markov decision process accounts for the probability of the next state given an action, let's move on to Q-learning.
So far we've been dealing with the value of being in a given state, and we know we
want to make an optimal decision about where to go next given our current state.
Now instead of looking at the value of each state V(s), we're going to look at the value of each action in each state: Q(s, a).
Let's look at how we actually derive the value of Q(s, a) by comparing it to V(s).
V(s) = max_a(R(s, a) + γ ∑_{s′} P(s, a, s′)V(s′))
For Q(s, a) we fix the action a rather than taking the maximum over actions: we take the immediate reward R(s, a), and because the agent can end up in several possible next states s′, we add the expected (probability-weighted) value of the next state:
Q(s, a) = R(s, a) + γ ∑_{s′} P(s, a, s′)V(s′)
What you will notice looking at this equation is that Q(s, a) is exactly the expression inside the max operator of V(s); in other words, V(s) = max_a Q(s, a).
Let's take this equation a step further and get rid of V entirely: since V(s′) = max_{a′} Q(s′, a′), Q becomes a recursive function of itself.
Q(s, a) = R(s, a) + γ ∑_{s′} P(s, a, s′) max_{a′} Q(s′, a′)
5. Temporal Difference
Temporal difference is an important concept at the heart of the Q-learning
algorithm.
One thing we haven't mentioned yet about non-deterministic search is that it makes the equations more cumbersome to work with.
For now, we're just going to use the deterministic Bellman equation for simplicity, which to recap is: Q(s, a) = R(s, a) + γ max_{a′} Q(s′, a′), but the same reasoning carries over to stochastic environments.
So we know that before an agent takes an action it has a Q-value: Q(s, a).
After the action is taken, we know what reward the agent actually got and what the value of the new state is: R(s, a) + γ max_{a′} Q(s′, a′). The temporal difference is the difference between these two quantities:
TD(a, s) = R(s, a) + γ max_{a′} Q(s′, a′) − Q_{t−1}(s, a)
The first element is what we get after taking an action and the second element is the
previous Q-value.
Ideally, these two values should be the same: if our Q-value is accurate, the value we observe after taking the action should match the value we predicted before taking it.
But these values may not be the same because of the randomness that exists in the
environment.
The reason it's called temporal difference is because of time: we have Q_{t−1}(s, a), which was computed before taking the action, and the new estimate computed after taking it.
So the question is: has there been a difference between these values in time?
Now that we have our temporal difference, here's how we use it:
Q_t(s, a) = Q_{t−1}(s, a) + α·TD(a, s)
where α is the learning rate.
So we take our previous Q_{t−1}(s, a) and add on the temporal difference times the learning rate to get our new Q_t(s, a).
Let's now plug the TD(a, s) equation into our new Q-learning equation:
Q_t(s, a) = Q_{t−1}(s, a) + α(R(s, a) + γ max_{a′} Q(s′, a′) − Q_{t−1}(s, a))
We now have the full Q-learning equation, so let's move on to deep Q-learning.
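To tie the pieces together, here's a minimal sketch of tabular Q-learning in Python. The `env` object (with `reset()` and `step()` methods returning the next state, reward, and a done flag) and the ε-greedy exploration are illustrative assumptions, not something specified in the article:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q_t(s, a) = Q_{t-1}(s, a) + alpha * TD(a, s)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration (action selection policies are discussed later)
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, done = env.step(action)

            # Target: observed reward plus discounted best next Q-value
            # (no future term if the episode has ended)
            target = reward if done else reward + gamma * np.max(Q[next_state])

            # Temporal difference and Q-learning update
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```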
6. Deep Q-Learning
In terms of the neural network, we feed in the state, pass it through several hidden layers (the exact number depends on the architecture), and then output the Q-values, one for each possible action.
Source: Analytics Vidhya
You may be wondering why we need to introduce deep learning to the Q-learning
equation.
Q-learning works well when we have a relatively simple environment to solve, but
when the number of states and actions we can take gets more complex we use
deep learning as a function approximator.
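As a rough sketch of what such a function approximator might look like, here's a small Q-network in PyTorch (the framework, layer sizes, and names are illustrative choices, not from the article):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),  # one output per action
        )

    def forward(self, state):
        return self.net(state)

# Example: a 2D state (e.g. the agent's x, y position) and 4 actions.
q_net = QNetwork(state_dim=2, n_actions=4)
q_values = q_net(torch.tensor([[0.0, 1.0]]))  # shape: (1, 4)
```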
Recall the temporal difference:
TD(a, s) = R(s, a) + γ max_{a′} Q(s′, a′) − Q_{t−1}(s, a)
In the maze example, the neural network will predict 4 values: up, right, left, or down.
We then take these 4 values and compare them to the values that were previously predicted, which are stored in memory.
Recall that neural networks learn by updating their weights, so we need to adapt our Q-learning update into a loss function that the network can minimize.
So what we're going to do is calculate a loss by taking the sum of the squared
differences of the Q-values and their targets:
L = ∑(Q_target − Q)²
We then take this loss, backpropagate it through the network, and update the weights using stochastic gradient descent.
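Continuing the illustrative PyTorch sketch above (the `q_net` model and the tensor shapes are assumptions carried over from it), one training step might look like this:

```python
import torch

optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.9

def update_step(state, action, reward, next_state, done):
    # state and next_state are float tensors of shape (1, state_dim)
    # Predicted Q-value for the action that was actually taken
    q_pred = q_net(state)[0, action]

    # Target: observed reward plus discounted value of the best next action
    with torch.no_grad():
        q_next = q_net(next_state).max().item()
    q_target = reward + (0.0 if done else gamma * q_next)

    # Squared-difference loss between the target and the prediction
    loss = (torch.tensor(q_target) - q_pred) ** 2

    optimizer.zero_grad()
    loss.backward()   # backpropagate the loss through the network
    optimizer.step()  # update the weights
    return loss.item()
```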
This is the learning part, now let's move on to how the agent selects the best action
to take.
To choose which action is the best, we use the Q-values that we have and pass
them through a softmax function.
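Here's a minimal sketch of softmax action selection over the Q-values; the temperature parameter is an illustrative addition that controls how strongly the agent favours high Q-values:

```python
import numpy as np

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    q = np.asarray(q_values, dtype=float)
    q = q - q.max()                      # subtract the max for numerical stability
    probs = np.exp(q / temperature)
    probs /= probs.sum()
    action = np.random.choice(len(q), p=probs)
    return action, probs

action, probs = softmax_action([1.0, 2.0, 0.5, 1.5])
```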
7. Experience Replay
Now that we've discussed how to apply neural networks to Q-learning, let's review
another important concept in deep Q-learning: experience replay.
One thing that can lead to our agent misunderstanding the environment is learning from a long sequence of very similar, highly correlated experiences.
For example, if we're teaching a self-driving car how to drive, and the first part of the
road is just a straight line, the agent might not learn how to deal with any curves in
the road.
From our self-driving car example, what happens with experience replay is that the
initial experiences of driving in a straight line don't get put through the neural
network right away.
Once the stored experiences reach a certain threshold, we tell the agent to learn from them.
So the agent is now learning from a batch of experiences. From these experiences,
the agent randomly selects a uniformly distributed sample from this batch and
learns from that.
Each experience is characterized by the state it was in, the action it took, the reward it received, and the state it ended up in.
By randomly sampling from the experiences, this breaks the bias that may have
come from the sequential nature of a particular environment, for example driving in
a straight line.
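As a minimal sketch, an experience replay buffer can be implemented with a deque and uniform random sampling; the (state, action, reward, next_state, done) tuple and the capacity are illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores experiences and returns uniformly random batches of them."""
    def __init__(self, capacity=10_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between
        # consecutive experiences (e.g. a long stretch of straight road).
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

# Usage: only start learning once enough experiences have been collected.
buffer = ReplayBuffer()
# buffer.push(state, action, reward, next_state, done)
# if len(buffer) > 1000:
#     batch = buffer.sample(64)
```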
If you want to learn more about experience replay, check out this paper from
DeepMind called Prioritized Experience Replay.
So the question is: once we have the Q-values, how do we decide which one to use?
Recall that in simple Q-learning we just choose the action with the highest Q-value.
In reality, the action selection policy doesn't need to be softmax; there are other commonly used policies, including (a minimal ϵ-greedy sketch is shown after this list):
ϵ-greedy
ϵ-soft
Softmax
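For example, here's a minimal sketch of the ϵ-greedy policy, which explores with probability ϵ and otherwise exploits the highest Q-value:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1):
    """With probability epsilon explore; otherwise exploit the best known action."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: random action
    return int(np.argmax(q_values))               # exploit: highest Q-value

action = epsilon_greedy_action([1.0, 2.0, 0.5, 1.5], epsilon=0.1)
```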
The reason that we don't just use the highest Q-value comes down to an important
concept in reinforcement learning: the exploration vs. exploitation dilemma.
If it takes advantage of what it already knows it could gain more rewards, but if it explores it may discover actions with even higher rewards that it doesn't yet know about.
To summarize, in this article we covered the intuition behind the Bellman equation, Markov decision processes, and Q-learning. We then took this information a step further and applied deep learning to Q-learning, which gives us deep Q-learning.
We saw that with deep Q-learning we take advantage of experience replay, which is
when an agent learns from a batch of experience. In particular, the agent randomly
selects a uniformly distributed sample from this batch and learns from this.
Finally, we looked at several action selection policies, which are used to tackle the exploration vs. exploitation dilemma.