Kaushik Balakrishnan
BIRMINGHAM - MUMBAI
TensorFlow Reinforcement
Learning Quick Start Guide
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles
or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the
information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing
or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by
this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this
book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78953-358-3
www.packtpub.com
To Sally, my dearest.
– Kaushik Balakrishnan
mapt.io
Mapt is an online digital library that gives you full access to over 5,000
books and videos, as well as industry leading tools to help you plan your
personal development and advance your career. For more information,
please visit our website.
Why subscribe?
Spend less time learning and more time coding with practical eBooks
and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
At www.packt.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters, and receive exclusive discounts and offers
on Packt books and eBooks.
Contributors
About the author
Kaushik Balakrishnan works for BMW in Silicon Valley, and applies
reinforcement learning, machine learning, and computer vision to solve
problems in autonomous driving. Previously, he also worked at Ford Motor
Company and NASA Jet Propulsion Laboratory. His primary expertise is in
machine learning, computer vision, and high-performance computing, and he
has worked on several projects involving both research and industrial
applications. He has also worked on numerical simulations of rocket
landings on planetary surfaces, and for this he developed several high-
fidelity models that run efficiently on supercomputers. He holds a PhD in
aerospace engineering from the Georgia Institute of Technology in Atlanta,
Georgia.
About the reviewer
Narotam Singh recently took voluntary retirement from his post of
meteorologist with the Indian Meteorological Department, Ministry of Earth
Sciences, to pursue his dream of learning and helping society. He has been
actively involved with various technical programs and the training of GOI
officers in the field of IT and communication. He did his masters in the field
of electronics, having graduated with a degree in physics. He also holds a
diploma and a postgraduate diploma in the field of computer engineering.
Presently, he works as a freelancer. He has many research publications to his
name and has also served as a technical reviewer for numerous books. His
present research interests involve AI, ML, DL, robotics, and spirituality.
Packt is searching for authors like
you
If you're interested in becoming an author for Packt, please visit authors.packtpub.com
and apply today. We have worked with thousands of developers and
tech professionals, just like you, to help them share their insight with the
global tech community. You can make a general application, apply for a
specific hot topic that we are recruiting an author for, or submit your own
idea.
Table of Contents
Title Page
Copyright and Credits
TensorFlow Reinforcement Learning Quick Start Guide
Dedication
About Packt
Why subscribe?
Packt.com
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
1. Up and Running with Reinforcement Learning
Why RL?
Formulating the RL problem
The relationship between an agent and its environment
Defining the states of the agent
Defining the actions of the agent
Understanding policy, value, and advantage functions
Identifying episodes
Identifying reward functions and the concept of discounted rewards
Rewards
Learning the Markov decision process 
Defining the Bellman equation
On-policy versus off-policy learning
On-policy method
Off-policy method
Model-free and model-based training
Algorithms covered in this book
Summary
Questions
Further reading
2. Temporal Difference, SARSA, and Q-Learning
Technical requirements
Understanding TD learning
Relation between the value functions and state
Understanding SARSA and Q-Learning 
Learning SARSA 
Understanding Q-learning
Cliff walking and grid world problems
Cliff walking with SARSA
Cliff walking with Q-learning
Grid world with SARSA
Summary
Further reading
3. Deep Q-Network
Technical requirements
Learning the theory behind a DQN
Understanding target networks
Learning about replay buffer
Getting introduced to the Atari environment
Summary of Atari games
Pong
Breakout
Space Invaders
LunarLander
The Arcade Learning Environment 
Coding a DQN in TensorFlow
Using the model.py file
Using the funcs.py file
Using the dqn.py file
Evaluating the performance of the DQN on Atari Breakout
Summary
Questions
Further reading
4. Double DQN, Dueling Architectures, and Rainbow
Technical requirements
Understanding Double DQN 
Updating the Bellman equation
Coding DDQN and training to play Atari Breakout
Evaluating the performance of DDQN on Atari Breakout
Understanding dueling network architectures
Coding dueling network architecture and training it to play Atari
Breakout
Combining V and A to obtain Q
Evaluating the performance of dueling architectures on Atari Breakout
Understanding Rainbow networks
DQN improvements
Prioritized experience replay 
Multi-step learning
Distributional RL
Noisy nets
Running a Rainbow network on Dopamine
Rainbow using Dopamine
Summary
Questions
Further reading
5. Deep Deterministic Policy Gradient
Technical requirements
Actor-Critic algorithms and policy gradients
Policy gradient
Deep Deterministic Policy Gradient
Coding ddpg.py
Coding AandC.py
Coding TrainOrTest.py
Coding replay_buffer.py
Training and testing the DDPG on Pendulum-v0
Summary
Questions
Further reading
6. Asynchronous Methods - A3C and A2C
Technical requirements
The A3C algorithm
Loss functions
CartPole and LunarLander
CartPole
LunarLander
The A3C algorithm applied to CartPole
Coding cartpole.py
Coding a3c.py
The AC class
The Worker() class
Coding utils.py
Training on CartPole
The A3C algorithm applied to LunarLander
Coding lunar.py
Training on LunarLander
The A2C algorithm
Summary
Questions
Further reading
7. Trust Region Policy Optimization and Proximal Policy Optimization
Technical requirements
Learning TRPO
TRPO equations
Learning PPO
PPO loss functions
Using PPO to solve the MountainCar problem
Coding the class_ppo.py file
Coding train_test.py file
Evaluating the performance
Full throttle
Random throttle
Summary
Questions
Further reading
8. Deep RL Applied to Autonomous Driving
Technical requirements
Car driving simulators
Learning to use TORCS
State space
Support files
Training a DDPG agent to learn to drive
Coding ddpg.py
Coding AandC.py
Coding TrainOrTest.py
Training a PPO agent
Summary
Questions
Further reading
Assessment
Chapter 1
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Other Books You May Enjoy
Leave a review - let other readers know what you think
Preface
This book provides a summary of several different reinforcement learning
(RL) algorithms, including the theory involved in the algorithms as well as
coding them using Python and TensorFlow. Specifically, the algorithms
covered in this book are Q-learning, SARSA, DQN, DDPG, A3C, TRPO,
and PPO. The applications of these RL algorithms include computer games
from OpenAI Gym and autonomous driving using the TORCS racing car
simulator.
Who this book is for
This book is designed for machine learning (ML) practitioners interested in
learning RL. It will help ML engineers, data scientists, and graduate students.
A basic knowledge of ML, and experience of coding in Python and
TensorFlow, is expected of the reader in order to be able to complete this
book successfully.
What this book covers
Chapter 1, Up and Running with Reinforcement Learning, provides an
overview of the basic concepts of RL, such as an agent, an environment, and
the relationship between them. It also covers topics such as reward functions,
discounted rewards, and value and advantage functions. The reader will also
get familiar with the Bellman equation, on-policy and off-policy algorithms,
as well as model-free and model-based RL algorithms.
Chapter 3, Deep Q-Network, covers the first deep RL algorithm of the book, DQN. It will also discuss how to code this in Python
and TensorFlow. The code will then be used to train an RL agent to play
Atari Breakout.
Chapter 8, Deep RL Applied to Autonomous Driving, introduces the TORCS racing car simulator, and covers coding the DDPG algorithm for training an
agent to drive a car autonomously. The code files for this chapter also
include the PPO algorithm for the same TORCS problem, which is provided as
an exercise for the reader.
To get the most out of this book
The reader is expected to have a good knowledge of ML algorithms, such as
deep neural networks, convolutional neural networks, stochastic gradient
descent, and Adam optimization. The reader is also expected to have hands-
on coding experience in Python and TensorFlow.
Download the example code files
You can download the example code files for this book from your account at
www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/
Once the file is downloaded, please make sure that you unzip or extract the
folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/TensorFlow-Reinforcement-Learning-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos
available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the
screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781789533583_ColorImages.pdf.
Conventions used
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names,
filenames, file extensions, pathnames, dummy URLs, user input, and Twitter
handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk
image file as another disk in your system."
When we wish to draw your attention to a particular part of a code block, the
relevant lines or items are set in bold:
def random_action():
    # a = 0 : top/north
    # a = 1 : right/east
    # a = 2 : bottom/south
    # a = 3 : left/west
    a = np.random.randint(nact)
    return a
Bold: Indicates a new term, an important word, or words that you see on
screen. For example, words in menus or dialog boxes appear in the text like
this. Here is an example: "Select System info from the Administration panel."
Warnings or important notes appear like this.
General feedback: If you have questions about any aspect of this book,
mention the book title in the subject of your message and email us at
[email protected].
Errata: Although we have taken every care to ensure the accuracy of our
content, mistakes do happen. If you have found a mistake in this book, we
would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the
internet, we would be grateful if you would provide us with the location
address or website name. Please contact us at [email protected] with a link to
the material.
In this chapter, we will delve deep into the basic concepts of RL. We will
learn the meaning of the RL jargon, the mathematical relationships between
them, and also how to use them in an RL setting to train an agent. These
concepts will lay the foundations for us to learn RL algorithms in later
chapters, along with how to apply them to train agents. Happy learning!
Some of the main topics that will be covered in this chapter are as follows:
Why RL?
Formulating the RL problem
Understanding policy, value, and advantage functions
Identifying episodes, reward functions, and the concept of discounted rewards
Learning the Markov decision process and the Bellman equation
On-policy versus off-policy learning
Model-free and model-based training
While RL has been around for over three decades, the field has gained a new
resurgence in recent years with the successful demonstration of the use of
deep learning in RL to solve real-world tasks, wherein deep neural networks
are used to make decisions. The coupling of RL with deep learning is
typically referred to as deep RL, and is the main topic of discussion of this
book.
Now, let's delve into the formulation of the RL problem. We will see how
RL is similar in spirit to a child learning to walk.
Formulating the RL problem
The basic problem that RL solves is training an agent to perform
some pre-defined task well without any labeled data. This is accomplished by a
trial-and-error approach, akin to a baby learning to walk for the first time. A
baby, curious to explore the world around them, first crawls out of their crib
not knowing where to go nor what to do. Initially, they take small steps, make
mistakes, keep falling on the floor, and cry. But, after many such episodes,
they start to stand on their feet on their own, much to the delight of their
parents. Then, with a giant leap of faith, they start to take slightly longer
steps, slowly and cautiously. They still make mistakes, albeit fewer than
before.
After many more such tries—and failures—they gain more confidence that
enables them to take even longer steps. With time, these steps get much
longer and faster, until eventually, they start to run. And that's how they grow
up into a child. Was any labeled data provided to them that they used to learn
to walk? No. They learned by trial and error, making mistakes along the way,
learning from them, and getting better at it with incremental gains made with
every attempt. This is how RL works: learning by trial and error.
Consider, for example, an industrial mobile robot operating in a factory.
The robot has certain pre-defined goals, for example, to move goods from
one side of the factory to the other without colliding with obstacles such as
walls and/or other robots. The environment is the region available for the
robot to navigate and includes all the places the robot can go to, including the
obstacles that the robot could crash into. So the primary task of the robot, or
more precisely, the agent, is to explore the environment, understand how the
actions it takes affect its rewards, be cognizant of the obstacles that can
cause catastrophic crashes or failures, and then master the art of maximizing
its rewards and improving its performance over time.
In this process, the agent inevitably interacts with the environment, which can
be good for the agent regarding certain tasks, but could be bad for the agent
regarding other tasks. So, the agent must learn how the environment will
respond to the actions that are taken. This is a trial-and-error learning
approach, and only after numerous such trials can the agent learn how the
environment will respond to its decisions.
Let's now come to understand what the state space of an agent is, and the
actions that the agent performs to explore the environment.
Defining the states of the agent
In RL parlance, states represent the current situation of the agent. For
example, in the previous industrial mobile robot agent case, the state at a
given time instant is the location of the robot inside the factory – that is,
where it is located, its orientation, or more precisely, the pose of the robot.
For a robot that has joints and effectors, the state can also include the precise
location of the joints and effectors in a three-dimensional space. For an
autonomous car, its state can represent its speed, location on a map, distance
to other obstacles, torques on its wheels, the rpm of the engine, and so on.
States are usually deduced from sensors in the real world; for instance, the
measurement from odometers, LIDARs, radars, and cameras. States can be a
one-dimensional vector of real numbers or integers, or two-dimensional
camera images, or even higher-dimensional, for instance, three-dimensional
voxels. There are really no precise limitations on states, and the state just
represents the current situation of the agent.
A schematic of the agent and its interaction with the environment is shown in
the following diagram:
Figure 1: Schematic showing the agent and its interaction with the environment
Now that we know what an agent is, we will look at the policies that the
agent learns, what value and advantage functions are, and how these quantities
are used in RL.
Understanding policy, value, and
advantage functions
A policy defines the guidelines for an agent's behavior at a given state. In
mathematical terms, a policy is a mapping from a state of the agent to the
action to be taken at that state. It is like a stimulus-response rule that the agent
follows as it learns to explore the environment. In RL literature, it is usually
denoted as π(a_t|s_t) – that is, it is a conditional probability distribution of
taking an action a_t in a given state s_t. Policies can be deterministic, wherein
the exact value of a_t is known at s_t, or can be stochastic, where a_t is sampled
from a distribution – typically this is a Gaussian distribution, but it can also
be any other probability distribution.
In RL, value functions are used to define how good a state of an agent is.
They are typically denoted by V(s) at state s and represent the expected long-
term average rewards for being in that state. V(s) is given by the following
expression where E[.] is an expectation over samples:
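In standard notation, with the expectation taken over trajectories generated by following a policy π from state s (this is the standard textbook definition, written here in a form consistent with the rest of the chapter), this is

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right]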
Note that V(s) does not care about the optimal actions that an agent needs to
take at the state s. Instead, it is a measure of how good a state is. So, how can
an agent figure out the optimal action a_t to take in a given state s_t at time
instant t? For this, you can also define an action-value function, given by the
following expression:
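The standard definition of the action-value function, in the same notation, is

Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right]

The advantage function referred to in this section's title is then A(s,a) = Q(s,a) - V(s).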
Note that Q(s,a) is a measure of how good it is to take action a in state s and
follow the same policy thereafter. So, it is different from V(s), which is a
measure of how good a given state is. We will see in the following chapters
how the value function is used to train the agent under the RL setting.
A happy, or good, ending can be when the agent accomplishes its pre-defined
goal, which could be successfully navigating to a final destination for a
mobile robot, or successfully picking up a peg and placing it in a hole for an
industrial robot arm, and so on. Episodes can also have a sad ending, where
the agent crashes into obstacles or gets trapped in a maze, unable to get out of
it, and so on.
We will next find out what a reward function is and why we need to discount
future rewards. This reward function is the key, as it is the signal for the
agent to learn.
Identifying reward functions and
the concept of discounted rewards
Rewards in RL are no different from real world rewards – we all receive
good rewards for doing well, and bad rewards (aka penalties) for inferior
performance. Reward functions are provided by the environment to guide an
agent to learn as it explores the environment. Specifically, it is a measure of
how well the agent is performing.
The reward function defines what the good and bad things are that can
happen to the agent. For instance, a mobile robot that reaches its goal is
rewarded, but is penalized for crashing into obstacles. Likewise, an
industrial robot arm is rewarded for putting a peg into a hole, but is
penalized for being in undesired poses that can be catastrophic by causing
ruptures or crashes. Reward functions are the signal to the agent regarding
what is optimum and what isn't. The agent's long-term goal is to maximize
rewards and minimize penalties.
Rewards
In RL literature, the reward at a time instant t is typically denoted as r_t. Thus,
the total reward earned in an episode is given by R = r_1 + r_2 + ... + r_T,
where T is the length of the episode (which can be finite or infinite).
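When a discount factor γ is applied, the discounted return is usually written as

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}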
Since 0 ≤ γ ≤ 1, rewards in the distant future are valued much less than
the rewards that the agent can earn in the immediate future. This helps the
agent to not waste time and to prioritize its actions. In practice, a value of γ
between 0.9 and 0.99 is typically used in most RL problems.
Learning the Markov decision
process
The Markov property is widely used in RL, and it states that the environment's
response at time t+1 depends only on the state and action at time t. In other
words, the immediate future only depends on the present and not on the past.
This is a useful property that simplifies the math considerably, and is widely
used in many fields such as RL and robotics.
Note that the probability of being in state s_{t+1} depends only on s_t and a_t, and
not on the past. An environment whose state transition probability and reward
function satisfy this Markov property is said to be a Markov Decision
Process (MDP):
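In standard notation, the Markov property for the state transition, and the corresponding expected reward, can be written as

P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0)

R^{a}_{s,s'} = \mathbb{E}\left[ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \right]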
Let's now define the very foundation of RL: the Bellman equation. This
equation will help in providing an iterative solution to obtaining value
functions.
Defining the Bellman equation
The Bellman equation, named after the great computer scientist and applied
mathematician Richard E. Bellman, is an optimality condition associated
with dynamic programming. It is widely used in RL to update the policy of an
agent.
The first quantity, P^a_{s,s'}, is the transition probability from state s to the new
state s' under action a. The second quantity, R^a_{s,s'}, is the expected reward the agent receives
from state s, taking action a, and moving to the new state s'. Note that we
have assumed the MDP property, that is, the transition to the state at time t+1
only depends on the state and action at time t. Stated in these terms, the
Bellman equation is a recursive relationship, and is given by the following
equations for the value function and the action-value function, respectively:
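In standard form, consistent with the quantities just defined, these are

V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P^{a}_{s,s'} \left[ R^{a}_{s,s'} + \gamma V^{\pi}(s') \right]

Q^{\pi}(s,a) = \sum_{s'} P^{a}_{s,s'} \left[ R^{a}_{s,s'} + \gamma \sum_{a'} \pi(a' \mid s') Q^{\pi}(s',a') \right]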
Note that the Bellman equations express the value function V at a state as a
function of the value function at other states, and similarly express the action-value
function Q in terms of Q at other state-action pairs.
On-policy versus off-policy learning
RL algorithms can be classified as on-policy or off-policy. We will now
learn about both of these classes and how to classify a given RL algorithm
as one or the other.
On-policy method
On-policy methods use the same policy to evaluate performance as was used to make the
decisions on actions. On-policy algorithms generally do not have a replay
buffer; the experience encountered is used to train the model in situ. The
same policy that was used to move the agent from the state at time t to the state at
time t+1 is used to evaluate whether the performance was good or bad. For
example, if a robot exploring the world uses its
current policy to ascertain whether the actions it took in the current state
were good or bad, then it is an on-policy algorithm, as the current policy is
also used to evaluate its actions. SARSA, A3C, TRPO, and PPO are on-
policy algorithms that we will be covering in this book.
Off-policy method
Off-policy methods, on the other hand, use different policies to make action
decisions and to evaluate the performance. For instance, many off-policy
algorithms use a replay buffer to store the experiences, and sample data from
this buffer to train the model. During the training step, a mini-batch of
experience data is randomly sampled and used to train the policy and value
functions. Coming back to the previous robot example, in an off-policy
setting, the robot will not use the current policy to evaluate its performance,
but rather use a different policy for exploring and for evaluation. If a replay
buffer is used to sample a mini-batch of experience data and then train the
agent, then it is off-policy learning, as the current policy of the robot (which
was used to obtain the immediate actions) is different from the policy that
was used to obtain the samples in the mini-batch of experience used to train
the agent (as the policy has changed from an earlier time instant when the
data was collected, to the current time instant). DQN, DDQN, and DDPG are
off-policy algorithms that we'll look at in later chapters of this book.
Model-free and model-based
training
RL algorithms that do not learn a model of how the environment works are
called model-free algorithms. By contrast, if a model of the environment is
constructed, then the algorithm is called model-based. In general, if value (V)
or action-value (Q) functions are used to evaluate the performance, they are
called model-free algorithms as no specific model of the environment is
used. On the other hand, if you build a model of how the environment
transitions from one state to another, or of how many rewards the
agent will receive from the environment, then the algorithms are called
model-based.
Some of the topics that will be covered in this chapter are as follows:
Understanding TD learning
Learning SARSA
Understanding Q-learning
Cliff walking with SARSA and Q-learning
Grid world with SARSA
Technical requirements
Knowledge of the following will help you to better understand the concepts
presented in this chapter:
Python (version 2 or 3)
NumPy
TensorFlow (version 1.4 or higher)
Understanding TD learning
We will first learn about TD learning. This is a very fundamental concept in
RL. In TD learning, the learning of the agent is attained by experience.
Several trial episodes are undertaken of the environment, and the rewards
accrued are used to update the value functions. Specifically, the agent will
keep updating the state-action value functions as it experiences new
states and actions. The Bellman equation is used to update this state-action value
function, and the goal is to minimize the TD error. This essentially means the
agent is reducing its uncertainty of which action is the optimal action in a
given state; it gains confidence on the optimal action in a given state by
lowering the TD error.
Relation between the value
functions and state
The value function is an agent's estimate of how good a given state is. For
instance, if a robot is near the edge of a cliff and may fall, that state is bad
and must have a low value. On the other hand, if the robot/agent is near its
final goal, that state is a good state to be in, as the rewards it will soon
receive are high, and so that state will have a higher value.
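The value function is updated from experience using the TD(0) rule; in its standard form (written here in the notation of this chapter), it is

V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

where the bracketed term is the TD error and α is the learning rate.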
Note that in some reference papers or books, the preceding formula will have
r_t instead of r_{t+1}. This is just a difference in convention and is not an error;
r_{t+1} here denotes the reward received on leaving state s_t and transitioning to s_{t+1}.
There is also another TD learning variant called TD(λ) that uses eligibility
traces, e(s), which are a record of visits to a state. More formally, we perform
a TD(λ) update as follows:
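In a standard accumulating-trace formulation, the eligibility trace and the TD(λ) update can be written as

e_t(s) = \gamma \lambda \, e_{t-1}(s) + \mathbf{1}(s_t = s)

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

V(s) \leftarrow V(s) + \alpha \, \delta_t \, e_t(s) \quad \text{for all } s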
Here, from a given state s_t, we take an action a_t, receive a reward r_{t+1},
transition to a new state s_{t+1}, and thereafter take an action a_{t+1}, and so on.
This quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}) gives the algorithm
its name: SARSA. SARSA is an on-policy algorithm, as the same policy that is
used to estimate Q is also the policy being updated. For exploration, you can use, say, ε-
greedy.
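For reference, the SARSA update that this quintuple feeds into has the standard form

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]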
Understanding Q-learning
Q-learning is an off-policy algorithm that was first proposed by Christopher
Watkins in 1989, and is a widely used RL algorithm. Q-learning, like
SARSA, keeps an estimate of the state-action value function for each state-
action pair, and recursively updates it using the Bellman equation of dynamic
programming as new experiences are collected. Note that it is an off-policy
algorithm, as it uses the state-action value function evaluated at the action
that maximizes the value, rather than at the action actually taken. Q-learning is used for problems where the
actions are discrete – for example, if we have the actions move north, move
south, move east, move west, and we are to decide the optimum action in a
given state, then Q-learning is applicable in such settings.
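The Q-learning update rule, in its standard form (consistent with the SARSA notation above), is

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]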
Here, α is the learning rate, which is a hyperparameter that the user can
specify.
Before we code the algorithms in Python, let's find out what kind of problems
will be considered.
Cliff walking and grid world
problems
Let's consider cliff walking and grid world problems. First, we will
introduce these problems to you, then we will proceed on to the coding part.
For both problems, we consider a rectangular grid with nrows (number of
rows) and ncols (number of columns). We start from one cell to the south of
the bottom left cell, and the goal is to reach the destination, which is one cell
to the south of the bottom right cell.
Note that the start and destination cells are not part of the nrows x ncols grid of
cells. For the cliff walking problem, the cells to the south of the bottom row
of cells, except for the start and destination cells, form a cliff where, if the
agent enters, the episode ends with a catastrophic fall into the cliff. Likewise,
if the agent tries to leave the left, top, or right boundaries of the grid of cells,
it is placed back in the same cell, that is, it is equivalent to taking no action.
For the grid world problem, we do not have a cliff, but we have obstacles
inside the grid world. If the agent tries to enter any of these obstacle cells, it
is bounced back to the same cell from which it came. In both these problems,
the goal is to find the optimum path from the start to the destination.
We will now summarize the code involved in solving the cliff walking problem.
In a Terminal, use your favorite editor (for example, gedit, emacs, or vi) to
code the following:
import numpy as np
import sys
import matplotlib.pyplot as plt
We will use a 3 x 12 grid for the cliff walking problem, that is, 3 rows and 12
columns. We also have 4 actions to take at any cell. You can go north, east,
south, or west:
nrows = 3
ncols = 12
nact = 4
The learning rate, α, is chosen as 0.1, and the discount factor γ = 0.95 is used,
which are typical values for this problem:
nepisodes = 100000
epsilon = 0.1
alpha = 0.1
gamma = 0.95
We will next assign values for the rewards. For any normal action that does
not fall into the cliff, the reward is -1; if the agent falls down the cliff, the
reward is -100; for reaching the destination, the reward is also -1. Feel free to
explore other values for these rewards later, and investigate their effect on the
final Q values and the path taken from start to destination:
reward_normal = -1
reward_cliff = -100
reward_destination = -1
The Q values for state-action pairs are initialized to zero. We will use a
NumPy array for Q, which is nrows x ncols x nact, that is, we have a nact
number of entries for each cell, and we have nrows x ncols total number of
cells:
Q = np.zeros((nrows,ncols,nact),dtype=np.float)
We will define a function to make the agent go to the start location, which has
(x, y) co-ordinates of (x=0, y=nrows):
def go_to_start():
    # start coordinates
    y = nrows
    x = 0
    return x, y
We will also define a function to sample a random action, called random_action(), as follows:
def random_action():
    # a = 0 : top/north
    # a = 1 : right/east
    # a = 2 : bottom/south
    # a = 3 : left/west
    a = np.random.randint(nact)
    return a
We will now define the move function, which will take a given (x, y) location
of the agent and the current action, a, and then will perform the action. It will
return the new location of the agent after taking the action, (x1, y1), as well
as the state of the agent, which we define as state = 0 for the agent to be OK
after taking the action; state = 1 for reaching the destination; and state = 2 for
falling into the cliff. If the agent leaves the domain through the left, top, or
right, it is sent back to the same grid (equivalent to taking no action):
def move(x,y,a):
    # state = 0: OK
    # state = 1: reached destination
    # state = 2: fell into cliff
    state = 0
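The remainder of move() is not reproduced here. A minimal sketch of the missing part, consistent with the coordinate convention above (y increases southward, and the start, cliff, and destination sit in the row y = nrows) and with the grid world version of move() shown later in this chapter, could look as follows (the exact code may differ):

    # candidate new location for the chosen action
    if (a == 0):        # north
        x1, y1 = x, y - 1
    elif (a == 1):      # east
        x1, y1 = x + 1, y
    elif (a == 2):      # south
        x1, y1 = x, y + 1
    else:               # west
        x1, y1 = x - 1, y
    # destination: one cell south of the bottom-right cell
    if (x1 == ncols-1 and y1 == nrows):
        state = 1
        return x1, y1, state
    # cliff: the remaining cells south of the bottom row
    if (y1 == nrows and x1 > 0 and x1 < ncols-1):
        state = 2
        return x1, y1, state
    # leaving the grid through the left, top, or right boundary:
    # bounce back to the same cell (equivalent to taking no action)
    if (x1 < 0 or x1 > ncols-1 or y1 < 0 or y1 > nrows):
        x1, y1 = x, y
    return x1, y1, state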
We will next define the exploit function, which will take the (x, y) location of
the agent and take the greedy action based on the Q values, that is, it will take
the action a that has the highest Q value at that (x, y) location. We will do this
using NumPy's np.argmax(). If we are at the start location, we go north (a = 0);
if we are one step away from the destination, we go south (a = 2):
def exploit(x,y,Q):
    # start location
    if (x == 0 and y == nrows):
        a = 0
        return a
    # destination location
    if (x == ncols-1 and y == nrows-1):
        a = 2
        return a
    # interior location
    if (x < 0 or x > ncols-1 or y < 0 or y > nrows-1):
        print("error ", x, y)
        sys.exit()
    a = np.argmax(Q[y,x,:])
    return a
Next, we will perform the Bellman update using the following bellman()
function:
def bellman(x,y,a,reward,Qs1a1,Q):
    if (y == nrows and x == 0):
        # at start location; no Bellman update possible
        return Q
    # tabular update: Q(s,a) += alpha * (reward + gamma * Q(s',a') - Q(s,a))
    Q[y,x,a] += alpha * (reward + gamma * Qs1a1 - Q[y,x,a])
    return Q
We will also need the explore_exploit() function, which chooses between exploration and exploitation using an ε-greedy strategy:
def explore_exploit(x,y,Q):
    # draw a uniform random number to decide between exploring and exploiting
    r = np.random.uniform()
    if (r < epsilon):
        # explore
        a = random_action()
    else:
        # exploit
        a = exploit(x,y,Q)
    return a
We now have all the functions required for the cliff walking problem. So, we
will loop over the episodes and for each episode we start at the starting
location, then explore or exploit, then we move the agent one step depending
on the action. Here is the Python code for this:
for n in range(nepisodes+1):
    # start
    x, y = go_to_start()
    # explore or exploit
    a = explore_exploit(x,y,Q)
    while(True):
        # move one step
        x1, y1, state = move(x,y,a)
We perform the Bellman update based on the rewards received; note that this
is based on the equations presented earlier in this chapter in the theory
section. We stop the episode if we reach the destination or fall down the
cliff; if not, we continue the exploration or exploitation strategy for one more
step, and this goes on and on. The state variable in the following code takes
the 1 value for reaching the destination, takes the value 2 for falling down the
cliff, and is 0 otherwise:
        # Bellman update
        if (state == 1):
            reward = reward_destination
            Qs1a1 = 0.0
            Q = bellman(x,y,a,reward,Qs1a1,Q)
            break
        elif (state == 2):
            reward = reward_cliff
            Qs1a1 = 0.0
            Q = bellman(x,y,a,reward,Qs1a1,Q)
            break
        elif (state == 0):
            reward = reward_normal
            # SARSA
            a1 = explore_exploit(x1,y1,Q)
            if (x1 == 0 and y1 == nrows):
                # start location
                Qs1a1 = 0.0
            else:
                Qs1a1 = Q[y1,x1,a1]
            Q = bellman(x,y,a,reward,Qs1a1,Q)
            x = x1
            y = y1
            a = a1
The preceding code will complete all the episodes, and we now have the
converged values of Q. We will now plot this using matplotlib for each of the
actions:
for i in range(nact):
    plt.subplot(nact,1,i+1)
    plt.imshow(Q[:,:,i])
    plt.axis('off')
    plt.colorbar()
    if (i == 0):
        plt.title('Q-north')
    elif (i == 1):
        plt.title('Q-east')
    elif (i == 2):
        plt.title('Q-south')
    elif (i == 3):
        plt.title('Q-west')
plt.savefig('Q_sarsa.png')
plt.clf()
plt.close()
path = np.zeros((nrows,ncols),dtype=np.float)
x, y = go_to_start()
while(True):
    a = exploit(x,y,Q)
    print(x,y,a)
    x1, y1, state = move(x,y,a)
    if (state == 1 or state == 2):
        print("breaking ", state)
        break
    elif (state == 0):
        x = x1
        y = y1
        if (x >= 0 and x <= ncols-1 and y >= 0 and y <= nrows-1):
            path[y,x] = 100.0
plt.imshow(path)
plt.savefig('path_sarsa.png')
That's it. We have completed the coding required for the cliff walking
problem with SARSA. We will now view the results. In the following
screenshot, we present the Q values for each of the actions (going north, east,
south, or west) at each of the locations in the grid. As per the legend, yellow
represents high Q values and violet represents low Q values.
SARSA clearly tries to avoid the cliff by choosing to not go south when the
agent is just one step to the north of the cliff, as is evident from the large
negative Q values for the south action:
Figure 1: Q values for the cliff walking problem using SARSA
We will next plot the path taken by the agent from start to finish in the
following screenshot:
Figure 2: Path traced by the agent for the cliff walking problem using SARSA
We will compute the Q value at the new state using the max_Q() function
defined previously:
Qs1a1 = max_Q(x1,y1,Q)
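The definition of max_Q() is not included in this extract; a minimal sketch consistent with how it is called here (returning the maximum Q value over actions at the new location, with the start cell handled as in the SARSA code) would be:

def max_Q(x,y,Q):
    # value used for the Q-learning target: max over actions at the new location
    if (x == 0 and y == nrows):
        # start location, treated as having zero value, as in the SARSA code
        return 0.0
    return np.max(Q[y,x,:])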
That's it for coding Q-learning. We will now apply this to solve the cliff
walking problem and present the Q values for each of the actions, and the
path traced by the agent to go from start to finish, which are shown in the
following screenshots:
Figure 3: Q values for the cliff walking problem using Q-learning
As evident, the path traced is now different for Q-learning vis-à-vis SARSA.
Since Q-learning is a greedy strategy, the agent now takes a path close to the
cliff at the bottom of the following screenshot (Figure 4), as it is the shortest
path:
Figure 4: Path traced by the agent for the cliff walking problem using Q-learning
On the other hand, SARSA is more far-sighted, and so it chooses the safe
but longer path along the top row of cells (see Figure 2).
Our next problem is the grid world problem, where we must navigate a grid.
We will code this in SARSA.
Grid world with SARSA
We will next consider the grid world problem, and we will use SARSA to
solve it. We will introduce obstacles in place of a cliff. The goal of the agent
is to navigate the grid world from start to destination by avoiding the
obstacles. We will store the coordinates of the obstacle cells in the
obstacle_cells list, where each entry is the (x, y) coordinate of the obstacle
cell.
1. Most of the code is the same as previously used; the differences will be
summarized here
2. Place obstacles in the grid
3. The move() function has to also look for obstacles in the grid
4. Plot Q values and the path traced by the agent
nrows = 3
ncols = 12
nact = 4
nepisodes = 100000
epsilon = 0.1
alpha = 0.1
gamma = 0.95
reward_normal = -1
reward_destination = -1
# obstacles
obstacle_cells = [(4,1), (4,2), (8,0), (8,1)]
The move() function will now change, as we have to also look for obstacles. If
the agent ends up in one of the obstacle cells it is pushed back to where it
came from, as shown in the following code snippet:
def move(x,y,a):
    # state = 0: OK
    # state = 1: reached destination
    state = 0
    if (a == 0):
        x1 = x
        y1 = y - 1
    elif (a == 1):
        x1 = x + 1
        y1 = y
    elif (a == 2):
        x1 = x
        y1 = y + 1
    elif (a == 3):
        x1 = x - 1
        y1 = y
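The rest of move() is not shown here. A minimal sketch of the remaining boundary, obstacle, and destination handling, consistent with the variables defined above (the exact code may differ), might be:

    # destination: one cell south of the bottom-right cell
    if (x1 == ncols-1 and y1 == nrows):
        state = 1
        return x1, y1, state
    # leaving the grid (other than via the start or destination cells):
    # bounce back to the same cell
    if (x1 < 0 or x1 > ncols-1 or y1 < 0 or (y1 >= nrows and x1 != 0)):
        x1, y1 = x, y
    # obstacle cells: bounce the agent back to the cell it came from
    if ((x1, y1) in obstacle_cells):
        x1, y1 = x, y
    return x1, y1, state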
Figure 5: Q-values for each of the actions for the grid world problem using SARSA
As we can see in the following diagram, the agent navigates around the
obstacles to reach its destination:
Figure 6: Path traced by the agent for the grid world problem using SARSA
In the next chapter, we will look at the use of deep neural networks in RL that
gives rise to deep RL. We will see a variant of Q-learning called Deep Q-
Networks (DQNs) that will use a neural network instead of a tabular state-
action value function, which we saw in this chapter. Note that only problems
with a small number of states and actions are suited to Q-learning and SARSA.
When we have a large number of states and/or actions, we encounter what is
called the curse of dimensionality, where a tabular approach becomes
infeasible due to excessive memory use; in such problems, a DQN is better
suited, and it will be the crux of the next chapter.
Further reading
Reinforcement Learning: An Introduction by Richard S. Sutton and
Andrew G. Barto, 2018
Deep Q-Network
Deep Q-Networks (DQNs) revolutionized the field of reinforcement
learning (RL). I am sure you have heard of Google DeepMind, which used
to be a British company called DeepMind Technologies until Google
acquired it in 2014. DeepMind published a paper in 2013 titled Playing
Atari with Deep Reinforcement Learning, where they used Deep Neural Networks (DNNs) in the
context of RL – referred to as Deep Q-Networks (DQNs) – an idea that proved
seminal to the field. This paper revolutionized the field of deep RL, and the
rest is history! Later, in 2015, they published a second paper, titled Human-Level
Control Through Deep Reinforcement Learning, in Nature, where they introduced more interesting
ideas that further improved on the former paper. Together, the two papers led to
a Cambrian explosion in the field of deep RL, with several new algorithms
that have improved the training of agents using neural networks, and have
also pushed the limits of applying deep RL to interesting real-world
problems.
In this chapter, we will investigate a DQN and also code it using Python and
TensorFlow. This will be our first use of deep neural networks in RL. It will
also be our first effort in this book to use deep RL to solve real-world
control problems.
4. We then train the neural network on the DQN by minimizing this loss
function L(θ) using optimization algorithms, such as gradient descent,
RMSprop, and Adam.
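In its standard DQN form, with \theta^{-} denoting the target network parameters introduced in the next section, this loss is

L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]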
We used the least squares loss previously for the DQN loss function, also
referred to as the L2 loss. You can also consider other losses, such as the
Huber loss, which combines the L1 and L2 losses, with the L2 loss in the
vicinity of zero and L1 in regions far away. The Huber loss is less sensitive
to outliers than the L2 loss.
We will now look at the use of target networks. This is a very important
concept, required to stabilize training.
Understanding target networks
An interesting feature of a DQN is the utilization of a second network during
the training procedure, which is referred to as the target network. This
second network is used for generating the target-Q values that are used to
compute the loss function during training. Why not just use one network
for both estimations, that is, for choosing the action a to take, as well as
updating the Q-network? The issue is that, at every step of training, the Q-
network's values change, and if we use a constantly changing set of values to
update our network, then the estimations can easily become unstable – the
network can fall into feedback loops between the target and estimated Q-
values. In order to mitigate this instability, the target network's weights are
kept fixed and only periodically (or slowly) updated to the primary Q-network's values. This leads
to training that is far more stable and practical.
Let's now learn about the use of a replay buffer in off-policy algorithms.
Learning about replay buffer
We need the tuple (s, a, r, s', done) for updating the DQN, where s and a are
respectively the state and action at time t; s' is the new state at time t+1; and
done is a Boolean value that is True if the episode has ended and False otherwise,
also referred to as the terminal value in the
literature. This Boolean done or terminal variable is used so that, in the
Bellman update, the last terminal state of an episode is properly handled
(since we cannot do an r + γ max Q(s',a') for the terminal state). One
problem in DQNs is that if we use contiguous samples of the (s, a, r, s', done)
tuple, they are correlated, and so the training can overfit.
To mitigate this issue, a replay buffer is used, where the tuple (s, a, r, s', done)
is stored from experience, and a mini-batch of such experiences are
randomly sampled from the replay buffer and used for training. This ensures
that the samples drawn for each mini-batch are independent and identically
distributed (IID). Usually, a large-size replay buffer is used, say, 500,000 to
1 million samples. At the beginning of the training, the replay buffer is filled
to a sufficient number of samples and populated with new experiences. Once
the replay buffer is filled to a maximum number of samples, the older
samples are discarded one by one. This is because the older samples were
generated from an inferior policy, and are not desired for training at a later
stage as the agent has advanced in its learning.
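As a minimal illustration of the idea (not the implementation used later in this chapter, which stores experiences in a plain Python list; the ReplayBuffer class name here is only for illustration), a replay buffer can be written as a bounded deque that drops the oldest samples automatically:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, max_size=500000):
        # old experiences are discarded automatically once max_size is reached
        self.buffer = deque(maxlen=max_size)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)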
We will now look into the Atari environment. If you like playing video
games, you will love this section!
Getting introduced to the Atari
environment
The Atari 2600 game suite was originally released in the 1970s, and was a
big hit at that time. It involves several games that are played by users using
the keyboard to enter actions. These games were a big hit back in the day,
and inspired many computer game players of the 1970s and 1980s, but are
considered too primitive by today's video game players' standards. However,
they are popular today in the RL community as a suite of games on which
RL agents can be trained.
Summary of Atari games
Here is a summary of a select few games from Atari (we won't present
screenshots of the games for copyright reasons, but will provide links to
them).
Pong
Our first example is a table tennis game called Pong, which allows the user to
move a paddle up or down to hit the ball back to an opponent, which is the
computer. The first player to score 21 points is the winner of the game. A
screenshot of the Pong game from Atari can be found at https://gym.openai.com/envs/Pong-v0/.
Breakout
In another game, called Breakout, the user must move a paddle to the left or
right to hit a ball that then bounces off a set of blocks at the top of the screen.
The higher the number of blocks hit, the more points or rewards the player
can accrue. There are a total of five lives per game, and if the player misses
the ball, it results in the loss of a life. A screenshot of the Breakout
game from Atari can be found at https://gym.openai.com/envs/Breakout-v0/.
Space Invaders
If you like shooting space aliens, then Space Invaders is the game for you. In
this game, wave after wave of space aliens descend from the top, and the
goal is to shoot them using a laser beam, accruing points. The link to this can
be found at https://gym.openai.com/envs/SpaceInvaders-v0/.
LunarLander
Or, if you are fascinated by space travel, then LunarLander is about landing a
spacecraft (which resembles the Apollo 11 Eagle) on the surface of the
moon. For each level, the surface of the landing zone changes and the goal is
to guide the spacecraft to land on the lunar surface between two flags. A
screenshot of LunarLander (which is an OpenAI Gym Box2D environment rather than an Atari game) can be found at https://gym.openai.com/envs/LunarLander-v2/.
The Arcade Learning Environment
Over 50 such games exist for the Atari 2600. They are now part of the Arcade
Learning Environment (ALE), which is an object-oriented framework built
on top of an Atari 2600 emulator. OpenAI Gym is typically used to invoke Atari games these days
so that RL agents can be trained to play them. For instance, you can
import gym in Python and play them as follows.
The reset() function resets the game environment, and render() renders the
screenshot of the game:
import gym
env = gym.make('SpaceInvaders-v0')
env.reset()
env.render()
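As a quick smoke test (assuming the classic Gym API used throughout this book, where step() returns four values), you can then let the environment run with random actions:

# take a few random actions to verify that the environment runs
for _ in range(100):
    env.render()
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
env.close()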
dqn.py: This file will have the main loop, where we explore the
environment and call the update functions
model.py: This file will have the class for the DQN agent, where we will
have the neural network and the functions we require to train it
funcs.py: This file will contain some utility functions—for example, to
process the image frames, or to populate the replay buffer
Using the model.py file
Let's first code the model.py file. The steps involved in this are as follows:
3. Choose the loss function (L2 loss or the Huber loss): For the Q-
learning loss function, we can use either the L2 loss or the Huber loss.
Both options will be used in the code. We will choose huber for now:
LOSS = 'huber' # 'L2'
5. Define the QNetwork() class: We will then define the QNetwork() class
as follows. It will have an __init__() constructor and the _build_model(),
predict(), and update() functions. The __init__ constructor is shown as
follows:
class QNetwork():
    def __init__(self, scope="QNet", VALID_ACTIONS=[0, 1, 2, 3]):
        self.scope = scope
        self.VALID_ACTIONS = VALID_ACTIONS
        with tf.variable_scope(scope):
            self._build_model()
# smaller net
# 2 conv layers
conv1 = tf.contrib.layers.conv2d(X, 16, 8, 4, padding='VALID', activation_fn=tf.nn.relu, weights_initializer=winit)
conv2 = tf.contrib.layers.conv2d(conv1, 32, 4, 2, padding='VALID', activation_fn=tf.nn.relu, weights_initializer=winit)

# fully connected layers
flattened = tf.contrib.layers.flatten(conv2)
fc1 = tf.contrib.layers.fully_connected(flattened, 256, activation_fn=tf.nn.relu, weights_initializer=winit)

# Q(s,a)
self.predictions = tf.contrib.layers.fully_connected(fc1, len(self.VALID_ACTIONS), activation_fn=None, weights_initializer=winit)

action_one_hot = tf.one_hot(self.tf_actions, tf.shape(self.predictions)[1], 1.0, 0.0, name='action_one_hot')
self.action_predictions = tf.reduce_sum(self.predictions * action_one_hot, reduction_indices=1, name='act_pred')
9. Computing loss for training the Q-network: We compute the loss for
training the Q-network, stored in self.loss, using either the L2 loss or the
Huber loss, which is determined using the LOSS variable. For L2 loss, we
use the tf.squared_difference() function; for the Huber loss, we use
huber_loss(), which we will soon define. The loss is averaged over many
samples, and for this we use the tf.reduce_mean() function. Note that we
will compute the loss between the tf_y placeholder that we defined
earlier and the action_predictions variable that we obtained in the
previous step:
if (LOSS == 'L2'):
    # L2 loss
    self.loss = tf.reduce_mean(tf.squared_difference(self.tf_y, self.action_predictions), name='loss')
elif (LOSS == 'huber'):
    # Huber loss
    self.loss = tf.reduce_mean(huber_loss(self.tf_y - self.action_predictions), name='loss')
10. Using the optimizer: We use either the RMSprop or Adam optimizer,
and store it in self.optimizer. Our learning objective is to minimize
self.loss, and so we use self.optimizer.minimize(). This is stored in
self.train_op:
# optimizer
#self.optimizer = tf.train.RMSPropOptimizer(learning_rate=0.00025, momentum=0.95, epsilon=0.01)
self.optimizer = tf.train.AdamOptimizer(learning_rate=2e-5)
self.train_op = self.optimizer.minimize(self.loss, global_step=tf.contrib.framework.get_global_step())
11. Define the predict() function for the class: In the predict() function, we
run the self.predictions function defined earlier using TensorFlow's
sess.run(), where sess is the tf.Session() object that is passed to this
function. The states are passed as an argument to this function in
the s variable, which is passed on to the TensorFlow placeholder, tf_X:
def predict(self, sess, s):
    return sess.run(self.predictions, { self.tf_X: s })
12. Define the update() function for the class: Finally, in the update()
function, we run the train_op and loss objects, and feed a dictionary
to the placeholders involved in performing these operations, which we
call feed_dict. The states are stored in s, the actions in a, and the targets
in y:
def update(self, sess, s, a, y):
    feed_dict = { self.tf_X: s, self.tf_y: y, self.tf_actions: a }
    _, loss = sess.run([self.train_op, self.loss], feed_dict)
    return loss
13. Define the huber_loss() function outside the class: The last thing needed to
complete model.py is the definition of the Huber loss function, which is a
blend of the L1 and L2 losses. Whenever the absolute value of the input is less
than 1.0, the L2 loss is used, and the L1 loss otherwise:
# huber loss
def huber_loss(x):
    condition = tf.abs(x) < 1.0
    output1 = 0.5 * tf.square(x)
    output2 = tf.abs(x) - 0.5
    return tf.where(condition, output1, output2)
Using the funcs.py file
We will next code funcs.py by completing the following steps:
3. Copy model parameters from one network to another: The next step
is to write a function called copy_model_parameters(), which will take as
arguments the tf.Session() object sess, and two networks (in this case, the
Q-network and the target network). Let's call them qnet1 and qnet2. The
function will copy the parameter values from qnet1 to qnet2:
# copy params from qnet1 to qnet2
def copy_model_parameters(sess, qnet1, qnet2):
    q1_params = [t for t in tf.trainable_variables() if t.name.startswith(qnet1.scope)]
    q1_params = sorted(q1_params, key=lambda v: v.name)
    q2_params = [t for t in tf.trainable_variables() if t.name.startswith(qnet2.scope)]
    q2_params = sorted(q2_params, key=lambda v: v.name)
    update_ops = []
    for q1_v, q2_v in zip(q1_params, q2_params):
        op = q2_v.assign(q1_v)
        update_ops.append(op)
    sess.run(update_ops)
replay_memory = []
env.render()
next_state, reward, done, _ = env.step(VALID_ACTIONS[action])
if done:
    state = env.reset()
    state = state_processor.process(sess, state)
    state = np.stack([state] * 4, axis=2)
else:
    state = next_state
return replay_memory
2. Set the game and choose the valid actions: We will then set the game.
Let's choose the BreakoutDeterministic-v4 game for now, which is a later
version of Breakout-v0. This game has four actions, numbered zero to
three, and they represent 0: no-operation (noop), 1: fire, 2: move right, and
3: move left:
3. Set the mode (train/test) and the start iterations: We will then set the
mode in the train_or_test variable. Let's start with train to begin with
(you can later set it to test to evaluate the model after the training is
complete). We will also train from scratch from the 0 iteration:
# set parameters for running
train_or_test = 'train' #'test' #'train'
train_from_scratch = True
start_iter = 0
start_episode = 0
epsilon_start = 1.0
4. Create environment: We will create the environment env object, which
will create the GAME game. env.action_space.n will print the number of
actions in this game. env.reset() will reset the game and output the initial
state/observation (note that state and observation in RL parlance are the
same and are interchangeable). observation.shape will print the shape of
the state space:
env = gym.envs.make(GAME)
print("Action space size: {}".format(env.action_space.n))
observation = env.reset()
print("Observation space shape: {}".format(observation.shape)
if not os.path.exists(checkpoint_dir):
    os.makedirs(checkpoint_dir)
# policy
policy = epsilon_greedy_policy(q_net, len(VALID_ACTIONS))
7. Populate the replay memory with experiences encountered with
initial random actions: Then, we populate the replay memory with the
initial samples:
# populate replay memory
if (train_or_test == 'train'):
    print("populating replay memory")
    replay_memory = populate_replay_mem(sess, env, state_processor, replay_memory_init_size, policy, epsilon_start, epsilon_end[0], epsilon_decay_steps[0], VALID_ACTIONS, Transition)
8. Set the epsilon values: Next, we will set the epsilon values. Note that
we have a double linear schedule, which will decrease the value of
epsilon, first from 1 to 0.1, and then from 0.1 to 0.01, over the number of
steps specified in epsilon_decay_steps:
# epsilon start
if (train_or_test == 'train'):
    delta_epsilon1 = (epsilon_start - epsilon_end[0]) / float(epsilon_decay_steps[0])
    delta_epsilon2 = (epsilon_end[0] - epsilon_end[1]) / float(epsilon_decay_steps[1])
    if (train_from_scratch == True):
        epsilon = epsilon_start
    else:
        if (start_iter <= epsilon_decay_steps[0]):
            epsilon = max(epsilon_start - float(start_iter) * delta_epsilon1, epsilon_end[0])
        elif (start_iter > epsilon_decay_steps[0] and start_iter < epsilon_decay_steps[0] + epsilon_decay_steps[1]):
            epsilon = max(epsilon_end[0] - float(start_iter) * delta_epsilon2, epsilon_end[1])
        else:
            epsilon = epsilon_end[1]
elif (train_or_test == 'test'):
    epsilon = epsilon_end[1]
10. Then, the main loop starts over the episodes from the start to the total
number of episodes. We reset the episode, process the first frame, and
stack it up 4 times. Then, we will initialize loss, time_steps, and
episode_rewards to 0. The total number of lives per episode for Breakout is
5, and so we keep count of it in the ale_lives variable. The total number
of time steps in this life of the agent is initialized to a large number:
for ep in range(start_episode, num_episodes):
    # save ckpt
    saver.save(tf.get_default_session(), checkpoint_path)
    # env reset
    state = env.reset()
    state = state_processor.process(sess, state)
    state = np.stack([state] * 4, axis=2)
    loss = 0.0
    time_steps = 0
    episode_rewards = 0.0
    ale_lives = 5
    info_ale_lives = ale_lives
    steps_in_this_life = 1000000
    num_no_ops_this_life = 0
11. Keeping track of time steps: We will use an inner while loop to keep
track of the time steps in a given episode (note: the outer for loop is
over episodes, and this inner while loop is over time steps in the current
episode). We will decrease epsilon accordingly, depending on whether it
is in the 0.1 to 1 range or in the 0.01 to 0.1 range, both of which have
different delta_epsilon values:
while True:
    if (train_or_test == 'train'):
        #epsilon = max(epsilon - delta_epsilon, epsilon_end)
        if (total_t <= epsilon_decay_steps[0]):
            epsilon = max(epsilon - delta_epsilon1, epsilon_end[0])
        elif (total_t >= epsilon_decay_steps[0] and total_t <= epsilon_decay_steps[0] + epsilon_decay_steps[1]):
            epsilon = epsilon_end[0] - (epsilon_end[0] - epsilon_end[1]) / float(epsilon_decay_steps[1]) * float(total_t - epsilon_decay_steps[0])
            epsilon = max(epsilon, epsilon_end[1])
        else:
            epsilon = epsilon_end[1]
12. Updating the target network: We update the target network if the total
number of time steps so far is a multiple of update_target_net_every, which
is a user-defined parameter. This is accomplished by calling
the copy_model_parameters() function:
# update target net
if total_t % update_target_net_every == 0:
copy_model_parameters(sess, q_net, target_net)
print("\n copied params from Q net to target net ")
13. At the start of every new life of the agent, we undertake a no-op
(corresponding to action probabilities [1, 0, 0, 0]) a random number of
times, between zero and seven, to make the episode different from past
episodes, so that the agent sees more variation as it explores and learns
the environment. This was also done in the original DeepMind paper, and it
ensures that the agent learns better, since the added randomness exposes it
to more diverse situations. Once we are outside this initial randomness
cycle, the actions are taken as per the policy() function.
14. Note that we still need to take one fire operation (action probabilities
[0, 1, 0, 0]) for one time step at the start of every new life to kick-start
the agent. This is a requirement for the ALE framework, without which
the frames will freeze. Thus, each life starts with one fire operation,
followed by a random number (between zero and seven) of no-ops, after which
the agent uses the policy function:
time_to_fire = False
if (time_steps == 0 or ale_lives != info_ale_lives):
# new game or new life
steps_in_this_life = 0
num_no_ops_this_life = np.random.randint(low=0,high=7)
action_probs = [0.0, 1.0, 0.0, 0.0] # fire
time_to_fire = True
if (ale_lives != info_ale_lives):
ale_lives = info_ale_lives
else:
action_probs = policy(sess, state, epsilon)
steps_in_this_life += 1
if (steps_in_this_life < num_no_ops_this_life and not time_to_fire):
# no-op
action_probs = [1.0, 0.0, 0.0, 0.0] # no-op
15. We will then take the action using NumPy's random.choice, which will use
the action_probs probabilities. Then, we render the environment and take
one step. info['ale.lives'] will let us know the number of lives remaining
for the agent, from which we can ascertain whether the agent lost a life
in the current time step. In the DeepMind paper, the rewards were set to
+1 or -1 depending on the sign of the reward, so as to be able to compare
the different games. This is accomplished using np.sign(reward), which we
will not use for now. We will then process next_state_img to convert to
grayscale of the desired size, which is then appended to the next_state
vector, which maintains a sequence of four contiguous frames. The
rewards obtained are used to increment episode_rewards, and we also
increment time_steps:
action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
env.render()
next_state_img, reward, done, info = env.step(VALID_ACTIONS[action])
info_ale_lives = int(info['ale.lives'])
episode_rewards += reward
time_steps += 1
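Once the replay memory has enough samples, a mini-batch is sampled from it and the Bellman targets are computed with the target network before the Q-network update shown next. A minimal sketch of this sampling and target computation follows; the predict() method on the networks, the use of random.sample(), and the field order of the Transition tuple are assumptions, not the book's exact code:
import random  # assumed to be imported at the top of dqn.py
# sample a mini-batch of transitions from the replay memory
samples = random.sample(replay_memory, batch_size)
states_batch, action_batch, reward_batch, next_states_batch, done_batch = map(np.array, zip(*samples))
# targets: r + gamma * max_a' Q_target(s', a'), with the bootstrap term
# zeroed for terminal transitions via np.invert(done_batch)
q_values_next = target_net.predict(sess, next_states_batch)
greedy_q = np.amax(q_values_next, axis=1)
targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * gamma * greedy_q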
# update net
if (total_t % 4 == 0):
states_batch = np.array(states_batch)
loss = q_net.update(sess, states_batch, action_batch,
targets_batch)
18. We exit the inner while loop if done is True; otherwise, we proceed to the
next time step, where the state is set to next_state from the previous
time step. We also print to the screen the episode number, the number of time
steps in the episode, the total rewards earned in the episode, the
current epsilon, and the replay buffer size at the end of the episode.
These values are also useful for later analysis, so we store them in a
text file called performance.txt:
if done:
#print("done: ", done)
break
state = next_state
total_t += 1
if (train_or_test == 'train'):
print('\n Episode: ', ep, '| time steps: ', time_steps, '| total
episode reward: ', episode_rewards, '| total_t: ', total_t, '| epsilon: ',
epsilon, '| replay mem size: ', len(replay_memory))
elif (train_or_test == 'test'):
print('\n Episode: ', ep, '| time steps: ', time_steps, '| total
episode reward: ', episode_rewards, '| total_t: ', total_t, '| epsilon: ',
epsilon)
if (train_or_test == 'train'):
f = open("experiments/" + str(env.spec.id) + "/performance.txt", "a+")
f.write(str(ep) + " " + str(time_steps) + " " + str(episode_rewards) +
" " + str(total_t) + " " + str(epsilon) + '\n')
f.close()
19. The next few lines of code will complete dqn.py. To begin, we reset the
TensorFlow graph using tf.reset_default_graph(). Then, we
create two instances of the QNetwork class, the q_net and target_net objects.
We create a state_processor object of the ImageProcess class and also create
the TensorFlow saver object:
tf.reset_default_graph()
# state processor
state_processor = ImageProcess()
# tf saver
saver = tf.train.Saver()
# run
deep_q_learning(sess, env, q_net=q_net, target_net=target_net,
state_processor=state_processor, num_episodes=25000,
train_or_test=train_or_test,train_from_scratch=train_from_scratch,
start_iter=start_iter, start_episode=start_episode,
replay_memory_size=300000, replay_memory_init_size=5000,
update_target_net_every=10000, gamma=0.99, epsilon_start=epsilon_start,
epsilon_end=[0.1,0.01], epsilon_decay_steps=[1e6,1e6], batch_size=32)
2. Plot episode reward versus time step number: In the following graph,
we plot the total episode reward versus the time step number for Atari
Breakout using the DQN algorithm. As we can see, the peak episode
rewards are close to 400 (blue curve), and the exponentially weighted
moving average is approximately 160 to 180 toward the end of the
training. We used a replay memory size of 300,000, which is fairly
small by modern standards, due to RAM limitations. If a bigger replay
memory size were used, a higher average episode reward could be
obtained. This is left for experimentation by the reader:
Figure 2: Total episode reward versus time step number for Atari Breakout using the DQN
This chapter has laid the foundation for us to delve deeper into deep RL (no
pun intended!). In the next chapter, we will look at other DQN extensions,
such as DDQN, dueling network architectures, and rainbow networks.
Questions
1. Why is a replay buffer used in a DQN?
2. Why do we use target networks?
3. Why do we stack four frames into one state? Will one frame alone
suffice to represent one state?
4. Why is the Huber loss sometimes preferred over L2 loss?
5. We converted the RGB input image into grayscale. Can we instead use
the RGB image as input to the network? What are the pros and cons of
using RGB images instead of grayscale?
Further reading
Playing Atari with Deep Reinforcement Learning, by Volodymyr Mnih,
Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou,
Daan Wierstra, and Martin Riedmiller, arXiv:1312.5602: https://arxiv.
org/abs/1312.5602
Human-level control through deep reinforcement learning by
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu,
Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller,
Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles
Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan
Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis, Nature,
2015: https://www.nature.com/articles/nature14236
Double DQN, Dueling
Architectures, and Rainbow
We discussed the Deep Q-Network (DQN) algorithm in the previous
chapter, coded it in Python and TensorFlow, and trained it to play Atari
Breakout. In DQN, the same Q-network was used to select and evaluate an
action. This, unfortunately, is known to overestimate the Q values, which
results in over-optimistic estimates for the values. To mitigate this,
DeepMind released another paper where it proposed the decoupling of the
action selection and action evaluation. This is the crux of the Double DQN
(DDQN) architectures, which we will investigate in this chapter.
Even later, DeepMind released another paper where they proposed the Q-
network architecture with two output values, one representing the value, V(s),
and the other the advantage of taking an action at the given state, A(s,a).
DeepMind then combined these two to compute the Q(s,a) action-value,
instead of directly determining it as done in DQN and DDQN. These Q-
network architectures are referred to as the dueling network architectures, as
the neural network now has dual output values, V(s) and A(s,a), which are
later combined to obtain Q(s,a). We will also see these dueling networks in
this chapter.
The technical requirements for successfully completing this chapter are as follows:
Python (2 or 3)
NumPy
Matplotlib
TensorFlow (version 1.4 or higher)
Dopamine (we will discuss this in more detail later)
Understanding Double DQN
DDQN is an extension of DQN in which the action selection and action
evaluation in the Bellman update are decoupled: the primary Q-network selects
the greedy action, and the target network evaluates it. First, we will look at
the vanilla DQN target for the Bellman equation update step; then, we will
extend it to the DDQN target for the same update step, which is the crux of
the DDQN algorithm. We will then code DDQN in TensorFlow to play Atari
Breakout. Finally, we will compare and contrast the two algorithms: DQN and
DDQN.
Updating the Bellman equation
In vanilla DQN, the target for the Bellman update is this:
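With θ denoting the primary Q-network parameters and θ⁻ the target network parameters (the formulas below are reconstructions consistent with the descriptions in this chapter), the DQN target is:
$$y^{DQN} = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$$
whereas the DDQN target, which we will use later in this chapter, decouples selection and evaluation:
$$y^{DDQN} = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta);\, \theta^{-}\big)$$
In DQN, the target network both selects and evaluates the greedy action; in DDQN, the primary network selects the action and the target network evaluates it. The DDQN implementation involves the following code files: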
funcs.py
model.py
ddqn.py
funcs.py and model.py are the same as used before for DQN in Chapter 3, Deep
Q-Network (DQN). The ddqn.py file is the only code where we need to make
changes to implement DDQN. We will use the same dqn.py file from the
previous chapter and make changes to it to code DDQN. So, let's first copy
the dqn.py file from before and rename it ddqn.py.
We will summarize the changes we will make to ddqn.py, which are actually
quite minimal. We will not delete the DQN-related lines of code in the
file; instead, we will use if statements to choose between the two algorithms.
This lets us use a single code base for both algorithms, which is a better way
to code. First, we create a variable called ALGO, which will store one of two
strings, DQN or DDQN, and is where we specify which of the two algorithms to
use.
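A minimal sketch of this setting is as follows (the default value chosen here is illustrative):
ALGO = 'DDQN' # set to 'DQN' or 'DDQN'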
Then, in the lines of code where we evaluate the targets for the mini-batch,
we use an if statement to decide whether the algorithm to use is DQN or DDQN
and compute the targets accordingly. Note that, in DQN, the greedy_q
variable stores the Q value corresponding to taking the greedy action, that
is, the largest Q value in the target network, which is computed using
np.amax() and then used to compute the target variable, targets_batch.
In DDQN, on the other hand, we compute the action corresponding to the
maximum Q value in the primary Q-network, which we store in greedy_q and
evaluate using np.argmax(). Then, we use greedy_q (which now represents an
action) to index into the target network's Q values. Note that, for terminal
time steps (done = True), we should not consider the next state, whereas for
non-terminal steps (done = False), we do. This is easily
accomplished using np.invert().astype(np.float32) on done_batch. The following
lines of code show this:
# calculate q values and targets
if (ALGO == 'DQN'):
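    # DQN: greedy max over the target network's Q values, as described above
    # (a sketch; the predict() method names and the batch variable names are
    # assumptions, not the book's exact code)
    q_values_next = target_net.predict(sess, next_states_batch)
    greedy_q = np.amax(q_values_next, axis=1)
    targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * gamma * greedy_q
elif (ALGO == 'DDQN'):
    # DDQN: select the greedy action with the primary Q-network (np.argmax),
    # then evaluate that action with the target network
    q_values_next = q_net.predict(sess, next_states_batch)
    greedy_q = np.argmax(q_values_next, axis=1)
    q_values_next_target = target_net.predict(sess, next_states_batch)
    targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * \
        gamma * q_values_next_target[np.arange(len(greedy_q)), greedy_q]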
In the following screenshot, we present the number of time steps per episode
on Atari Breakout using DDQN and its exponentially weighted moving
average. As evident, the peak number of time steps is ~2,000 for many
episodes toward the end of the training, with one episode where it exceeded
even 3,000 time steps! The moving average is approximately 1,500 time
steps toward the end of the training:
Figure 1: Number of time steps per episode for Atari Breakout using DDQN
In the following screenshot, we show the total rewards received per episode
versus the global time step number. The peak episode reward is
over 350, with the moving average near 150. Interestingly, the moving
average (in orange) is still increasing toward the end, which means you can
run the training even longer to see further gains. This is left to the interested
reader:
Figure 2: Total episode reward versus time step for Atari Breakout using DDQN
Note that, due to RAM constraints (16 GB), we used a replay buffer size of only
300,000. If the user has access to more RAM, a bigger replay buffer size can be
used—for example, 500,000 to 1,000,000, which can result in even better scores.
As we can see, the DDQN agent is learning to play Atari Breakout well. The
moving average of the episode rewards is constantly going up, which means
you can train longer to obtain even higher rewards. This upward trend in the
episode reward demonstrates the efficacy of the DDQN algorithm for such
problems.
Understanding dueling network
architectures
We will now understand the use of dueling network architectures. In DQN
and DDQN, and other DQN variants in the literature, the focus was primarily
on algorithms, that is, how to efficiently and stably update the value function
neural networks. While this is crucial for developing robust RL algorithms, a
parallel but complementary direction to advance the field is to also innovate
and develop novel neural network architectures that are well suited for
model-free RL. This is precisely the concept behind dueling network
architectures, another contribution from DeepMind.
For instance, if the agent is a car driving on a straight road with no traffic, no
action is necessary and so V(s) alone will suffice in these states. On the other
hand, if the road suddenly curves or other cars come into the vicinity of the
agent, then the agent needs to take actions and so, in these states, the advantage
function comes into play to find the incremental returns a given action can
provide over the state value function. This is the intuition behind separating
the estimation of V(s) and A(s,a) in the same network by using two different
branches, and later combining them.
Figure 3: Schematic of the standard DQN network (top) and the dueling network architecture (bottom)
However, naively summing the two streams as Q(s,a) = V(s) + A(s,a) is not
unique: an amount, δ, can be over-predicted in V(s) and the same amount, δ,
under-predicted in A(s,a), leaving Q(s,a) unchanged. This makes the
neural network predictions unidentifiable. To circumvent this problem, the
authors of the dueling network paper recommend a different way to
combine V(s) and A(s,a).
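A reconstruction of their combination (the symbols are explained next) is:
$$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \Big(A(s,a;\theta,\alpha) - \frac{1}{|A|}\sum_{a'}A(s,a';\theta,\alpha)\Big)$$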
Here, |A| represents the number of actions and θ denotes the neural network
parameters that are shared between the V(s) and A(s,a) streams; in addition, α
and β denote the neural network parameters in the two different streams,
that is, in the A(s,a) and V(s) streams, respectively. Essentially, in the
preceding equation, we subtract the average advantage function from the
advantage function and add the result to the state value function to obtain
Q(s,a).
model.py
funcs.py
dueling.py
We reuse funcs.py, which was used earlier for DDQN, without any changes. The
dueling.py code is also identical to ddqn.py (used earlier), so we simply
rename and reuse it. The only changes to be made are in model.py. We copy
the same model.py file from DDQN and summarize here the changes to be
made for the dueling network architecture. The steps involved are as
follows.
We will write the code with an if statement so that if the DUELING variable is
False, we use the earlier code from DDQN, and if it is True, we use the
dueling network. We will use the flattened object, which is the flattened
version of the output of the convolutional layers, to create two sub-network
streams. We send flattened separately into two different fully connected
layers with 512 neurons each, using the relu activation function and the winit
weights initializer defined earlier; the output values of these fully
connected layers are called valuestream and advantagestream, respectively:
if (not DUELING):
# Q(s,a)
self.predictions = tf.contrib.layers.fully_connected(fc1,
len(self.VALID_ACTIONS), activation_fn=None, weights_initializer=winit)
else:
# Dueling network
# branch out into two streams using flattened (i.e., ignore fc1 for Dueling DQN)
Recall that the relu activation will set all negative values to zero. Finally,
we combine the advantage and value streams using tf.subtract() to subtract the
mean of the advantage function from the advantage function; the mean is
computed using tf.reduce_mean() on the advantage function:
# A(s,a)
self.advantage = tf.contrib.layers.fully_connected(advantagestream,
len(self.VALID_ACTIONS), activation_fn=None, weights_initializer=winit)
# V(s)
self.value = tf.contrib.layers.fully_connected(valuestream, 1, activation_fn=None,
weights_initializer=winit)
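The two streams are then combined as described above; a minimal sketch of this combination is as follows (the attribute name self.predictions mirrors the non-dueling branch shown earlier):
# Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a))
self.predictions = self.value + tf.subtract(self.advantage,
    tf.reduce_mean(self.advantage, axis=1, keep_dims=True))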
That's it for coding dueling network architectures. We will train an agent with
the dueling network architecture and evaluate its performance on Atari
Breakout. Note that we can use the dueling architecture in conjunction with
either DQN or DDQN. That is to say that we only changed the neural
network architecture, not the actual Bellman update, and so the dueling
architecture works with both DQN and DDQN.
Evaluating the performance of
dueling architectures on Atari
Breakout
We will now evaluate the performance of dueling architectures on Atari
Breakout. Here, we will plot the performance of our dueling network
architecture with DDQN on Atari Breakout using the performance.txt file that
we wrote during the training of the agent. We will use matplotlib to plot two
graphs as explained in the following.
In the following screenshot, we present the number of time steps per episode
on Atari Breakout using the dueling architecture with DDQN (in blue) and its
exponentially weighted moving average (in orange). As evident, the peak number
of time steps is ~2,000 for many episodes toward the end of the training, with
a few episodes even exceeding 4,000 time steps! The moving average is
approximately 1,500 time steps toward the end of the training:
Figure 4: Number of time steps per episode on Atari Breakout using dueling network architecture and DDQN
In the following screenshot, we show the total rewards received per episode
versus the global time step number. The peak episode reward is
over 400, with the moving average near 220. We also note that the moving
average (in orange) is still increasing toward the end, which means you can
run the training even longer to obtain further gains. Overall, the average
rewards are higher with the dueling network architecture vis-a-vis the non-
dueling counterparts, and so it is strongly recommended to use these dueling
architectures:
Figure 5: Total episode reward received versus global time step number for Atari Breakout using dueling network architecture and DDQN
Note that, due to RAM constraints (16 GB), we used a replay buffer size of only
300,000. If the user has access to more RAM, a bigger replay buffer size can
be used—for example, 500,000 to 1,000,000, which can result in even better scores.
Understanding Rainbow networks
We will now move on to Rainbow networks, which are a confluence of
several different DQN improvements. Since the original DQN paper, several
different improvements have been proposed with notable success. This motivated
DeepMind to combine several different improvements into an integrated
agent, which they refer to as the Rainbow DQN. Specifically, six different
DQN improvements are combined into one integrated Rainbow DQN agent.
These six improvements are summarized as follows:
DDQN
Dueling network architecture
Prioritized experience replay
Multi-step learning
Distributional RL
Noisy nets
DQN improvements
We have already seen DDQN and dueling network architectures and have
coded them in TensorFlow. The rest of the improvements are described in the
following sections.
Prioritized experience replay
We used a replay buffer where all of the samples have an equal probability
of being sampled. This, however, is not very efficient, as some samples are
more important than others. This is the motivation behind prioritized
experience replay, where samples that have a higher Temporal Difference
(TD) error are sampled with a higher probability than others. The first time a
sample is added to the replay buffer, it is assigned the maximum priority
value, so as to ensure that all samples in the buffer are sampled at least
once. Thereafter, the TD error is used to determine the probability of the
experience being sampled.
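A common form of this probability (following the prioritized experience replay paper) is:
$$P(i) = \frac{p_{i}^{\alpha}}{\sum_{k} p_{k}^{\alpha}}, \qquad p_{i} = |\delta_{i}| + \epsilon$$
where δ_i is the TD error of sample i, α controls how strongly prioritization is applied, and ε is a small constant that keeps every priority non-zero.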
Multi-step learning
In multi-step learning, the n-step return, r_t^(n), is used in the Bellman
update instead of the one-step reward, and is known to lead to faster
learning.
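In standard form, the n-step return is the discounted sum of the next n rewards:
$$r_{t}^{(n)} = \sum_{k=0}^{n-1}\gamma^{k}\, r_{t+k+1}$$
and the corresponding Bellman target becomes $r_{t}^{(n)} + \gamma^{n}\max_{a'}Q(s_{t+n}, a')$.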
Distributional RL
In distributional RL, we learn to approximate the distribution of returns
instead of the expected return. This is mathematically complicated, is beyond
the scope of this book, and is not discussed further.
Noisy nets
In some games (such as Montezuma's Revenge), ε-greedy does not work well,
as many actions need to be executed before the first reward is received.
In this setting, the use of a noisy linear layer that combines a
deterministic and a noisy stream is recommended.
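One way of writing such a layer (a reconstruction consistent with the description that follows) is:
$$y = (b + Wx) + \big(b^{noisy}\odot \epsilon^{b} + (W^{noisy}\odot \epsilon^{W})\,x\big)$$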
Here, x is the input, y is the output, and b and W are the biases and weights in
the deterministic stream; bnoisy and Wnoisy are the biases and weights,
respectively, in the noisy stream; and εb and εW are random variables and are
applied as element-wise product to the biases and weights, respectively, in
the noisy stream. The network may choose to ignore the noisy stream in some
regions of the state space and may use them otherwise, as required. This
allows for a state-determined exploration strategy.
Easy experimentation
Flexible development
Compact and reliable
Reproducible
OK
You should see OK at the end to confirm that everything went well with the
download.
Rainbow using Dopamine
To run Rainbow DQN, type the following command into a Terminal:
python -um dopamine.atari.train --agent_name=rainbow --base_dir=/tmp/dopamine --
gin_files='dopamine/agents/rainbow/configs/rainbow.gin'
That's it. Dopamine will start training Rainbow DQN and print out training
statistics on the screen, as well as save checkpoint files. The configuration
file is stored in the following path:
dopamine/dopamine/agents/rainbow/configs/rainbow.gin
It looks like the following code. game_name is set to Pong by default; feel
free to try other Atari games. The number of agent steps for training is set in
training_steps, and for evaluation in evaluation_steps. In addition, it introduces
stochasticity to the training by using the concept of sticky actions, where the
most recent action is repeated multiple times with a probability of 0.25. That
is, if a uniform random number (computed using NumPy's np.random.rand()) is <
0.25, the most recent action is repeated; otherwise, a new action is taken
from the policy.
RainbowAgent.num_atoms = 51
RainbowAgent.vmax = 10.
RainbowAgent.gamma = 0.99
RainbowAgent.update_horizon = 3
RainbowAgent.min_replay_history = 20000 # agent steps
RainbowAgent.update_period = 4
RainbowAgent.target_update_period = 8000 # agent steps
RainbowAgent.epsilon_train = 0.01
RainbowAgent.epsilon_eval = 0.001
RainbowAgent.epsilon_decay_period = 250000 # agent steps
RainbowAgent.replay_scheme = 'prioritized'
RainbowAgent.tf_device = '/gpu:0' # use '/cpu:*' for non-GPU version
RainbowAgent.optimizer = @tf.train.AdamOptimizer()
Runner.game_name = 'Pong'
# Sticky actions with probability 0.25, as suggested by (Machado et al., 2017).
Runner.sticky_actions = True
Runner.num_iterations = 200
Runner.training_steps = 250000 # agent steps
Runner.evaluation_steps = 125000 # agent steps
Runner.max_steps_per_episode = 27000 # agent steps
WrappedPrioritizedReplayBuffer.replay_capacity = 1000000
WrappedPrioritizedReplayBuffer.batch_size = 32
Feel free to experiment with the hyperparameters and see how the learning is
affected. This is a very nice way to ascertain the sensitivity of the RL
agent's learning to the different hyperparameters.
Summary
In this chapter, we were introduced to DDQN, dueling network architectures,
and the Rainbow DQN. We extended our previous DQN code to DDQN and
dueling architectures and tried them out on Atari Breakout. We can clearly see
that the average episode rewards are higher with these improvements, and so
these improvements are a natural choice to use. Next, we also saw Google's
Dopamine and used it to train a Rainbow DQN agent. Dopamine has several
other RL algorithms, and the user is encouraged to dig deeper and try out
these other RL algorithms as well.
This chapter was a good deep dive into the DQN variants, and we covered a lot
of ground as far as coding RL algorithms is concerned. In the
next chapter, we will learn about our next RL algorithm called Deep
Deterministic Policy Gradient (DDPG), which is our first Actor-Critic RL
algorithm and our first continuous action space RL algorithm.
Questions
1. Why does DDQN perform better than DQN?
2. How does the dueling network architecture help in the training?
3. Why does prioritized experience replay speed up the training?
4. How do sticky actions help in the training?
Further reading
The DDQN paper, Deep Reinforcement Learning with Double Q-
learning, by Hado van Hasselt, Arthur Guez, and David Silver can be
obtained from the following link, and the interested reader is
recommended to read it: https://arxiv.org/abs/1509.06461
Rainbow: Combining Improvements in Deep Reinforcement Learning,
Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg
Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and
David Silver, arXiv:1710.02298 (the Rainbow DQN): https://arxiv.org/a
bs/1710.02298
We will have two neural networks—one referred to as the actor, and the
other as the critic
The actor is like the student, as we described previously, and takes an
action at a given state
The critic is like the teacher, as we described previously, and provides
feedback for the actor to learn
Unlike a teacher in a school, the critic network should also be trained
from scratch, which makes the problem challenging
The policy gradient is used to train the actor
The L2 norm on the Bellman update is used to train the critic
Policy gradient
The policy gradient is defined as follows:
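In sampled form (a reconstruction following the original DDPG paper, with μ(s|θ^μ) the actor's deterministic policy and Q(s,a|θ^Q) the critic), it can be written as:
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_{i},\, a=\mu(s_{i})} \; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_{i}}$$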
The ddpg.py file is the main file from which we start the training and testing. It
will call the training or testing functions, which are present in TrainOrTest.py.
The AandC.py file has the TensorFlow code for the actor and the critic
networks. Finally, replay_buffer.py stores the samples in a replay buffer by
using a deque data structure. We will train the DDPG to learn to hold an
inverted pendulum vertically, using OpenAI Gym's Pendulum-v0, which has a
three-dimensional state and a one-dimensional continuous action, namely the
torque applied to hold the pendulum vertically inverted.
Coding ddpg.py
We will first code the ddpg.py file. The steps that are involved are as follows.
import tensorflow as tf
import numpy as np
import gym
import argparse
import pprint as pp
import sys
2. Defining the train() function: We will define the train() function. This
takes the argument parser object, args. We create a TensorFlow session
as sess. The name of the environment is used to make a Gym
environment, stored in the env object. We also set the random number
seeds and the maximum number of steps per episode of the
environment. We then set the state and action dimensions in state_dim and
action_dim, which take the values of 3 and 1, respectively, for the
Pendulum-v0 problem. We then create actor and critic objects, which
are instances of the ActorNetwork class and the CriticNetwork class,
respectively, which will be described later, in the AandC.py file. We then
call the trainDDPG() function, which will start the training of the RL agent.
def train(args):
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
action_bound = env.action_space.high
saver = tf.train.Saver()
saver.save(sess, "ckpt/model")
print("saved model ")
3. Defining the test() function: The test() function is defined next. This
will be used once we have finished the training and want to test how
well our agent is performing. The code is as follows for the test()
function and is very similar to train(). We will restore the saved model
from train() by using tf.train.Saver() and saver.restore(). We call the
testDDPG() function to test the model:
def test(args):
env = gym.make(args['env'])
np.random.seed(int(args['random_seed']))
tf.set_random_seed(int(args['random_seed']))
env.seed(int(args['random_seed']))
env._max_episode_steps = int(args['max_episode_len'])
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
action_bound = env.action_space.high
# agent parameters
parser.add_argument('--actor-lr', help='actor network learning rate',
default=0.0001)
parser.add_argument('--critic-lr', help='critic network learning rate',
default=0.001)
parser.add_argument('--gamma', help='discount factor for Bellman
updates', default=0.99)
parser.add_argument('--tau', help='target update parameter',
default=0.001)
parser.add_argument('--buffer-size', help='max size of the replay
buffer', default=1000000)
parser.add_argument('--minibatch-size', help='size of minibatch',
default=64)
# run parameters
parser.add_argument('--env', help='gym env', default='Pendulum-v0')
parser.add_argument('--random-seed', help='random seed', default=258)
parser.add_argument('--max-episodes', help='max num of episodes',
default=250)
parser.add_argument('--max-episode-len', help='max length of each
episode', default=1000)
parser.add_argument('--render-env', help='render gym env',
action='store_true')
parser.add_argument('--mode', help='train/test', default='train')
args = vars(parser.parse_args())
pp.pprint(args)
if (args['mode'] == 'train'):
train(args)
elif (args['mode'] == 'test'):
test(args)
2. Define initializers for the weights and biases: Next, we define the
weights and biases initializers:
winit = tf.contrib.layers.xavier_initializer()
binit = tf.constant_initializer(0.01)
rand_unif = tf.keras.initializers.RandomUniform(minval=-3e-3,maxval=3e-3)
regularizer = tf.contrib.layers.l2_regularizer(scale=0.0)
class ActorNetwork(object):
# actor
self.state, self.out, self.scaled_out =
self.create_actor_network(scope='actor')
# actor params
self.network_params = tf.trainable_variables()
# target network
self.target_state, self.target_out, self.target_scaled_out =
self.create_actor_network(scope='act_target')
self.target_network_params = tf.trainable_variables()[len(self.network_params):]
# update target network params using tau and 1 - tau as weights
self.update_target_network_params = \
[self.target_network_params[i].assign(tf.multiply(self.network_params[i],
self.tau) + tf.multiply(self.target_network_params[i], 1. - self.tau))
for i in range(len(self.target_network_params))]
# actor gradients
self.unnormalized_actor_gradients = tf.gradients(
self.scaled_out, self.network_params, -self.action_gradient)
self.actor_gradients = list(map(lambda x: tf.div(x, self.batch_size),
self.unnormalized_actor_gradients))
def update_target_network(self):
self.sess.run(self.update_target_network_params)
def get_num_trainable_vars(self):
return self.num_trainable_vars
# critic params
self.network_params = tf.trainable_variables()[num_actor_vars:]
# target Network
self.target_state, self.target_action, self.target_out =
self.create_critic_network(scope='crit_target')
9. Critic target network: Similar to the actor's target, the critic's target
network is also updated by using weighted averaging. We then create a
TensorFlow placeholder called predicted_q_value, which is the target
value. We then define the L2 norm in self.loss, which is the quadratic
error on the Bellman residual. Note that self.out is the Q(s,a) that we
saw earlier, and predicted_q_value is the r + γQ(s',a') in the Bellman
equation. Again, we use the Adam optimizer to minimize this L2 loss
function. We then evaluate the gradient of Q(s,a) with respect to the
actions by calling tf.gradients(), and we store this in self.action_grads.
This gradient is used later in the computation of the policy gradients:
# update target using tau and 1 - tau as weights
self.update_target_network_params = \
[self.target_network_params[i].assign(tf.multiply(self.network_params[i],
self.tau) \
+ tf.multiply(self.target_network_params[i], 1. - self.tau))
for i in range(len(self.target_network_params))]
1. Import packages and functions: The TrainOrTest.py file starts with the
importing of the packages and other Python files:
import tensorflow as tf
import numpy as np
import gym
from gym import wrappers
import argparse
import pprint as pp
import sys
The actor's policy is sampled to obtain the action for the current state.
We feed this action into env.step(), which takes one time step of this
action and, in the process, moves to the next state, s2. The environment
also gives this a reward, r, and information on whether the episode is
terminated is stored in the Boolean variable terminal. We add the tuple
(state, action, reward, terminal, new state) to the replay buffer for sampling
later and for training:
def trainDDPG(sess, env, args, actor, critic):
sess.run(tf.global_variables_initializer())
s = env.reset()
ep_reward = 0
ep_ave_max_q = 0
for j in range(int(args['max_episode_len'])):
if args['render_env']:
env.render()
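# Sample the actor's policy for the current state, step the environment, and
# store the transition (a sketch of the steps described above; the
# actor.predict() method, the s_dim/a_dim attributes, and the omission of
# exploration noise are assumptions)
a = actor.predict(np.reshape(s, (1, actor.s_dim)))
s2, r, terminal, info = env.step(a[0])
# add the (s, a, r, terminal, s2) tuple to the replay buffer
replay_buffer.add(np.reshape(s, (actor.s_dim,)), np.reshape(a, (actor.a_dim,)),
    r, terminal, np.reshape(s2, (actor.s_dim,)))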
3. Sample a mini-batch of data from the replay buffer: Once we have more
than the mini-batch size of samples in the replay buffer, we sample a mini-
batch of data from the buffer. For the subsequent state, s2, we use the critic's
target network to compute the target Q value and store it in target_q. Note
the use of the critic's target and not the critic—this is done for stability
reasons. We then use the Bellman equation to evaluate the target, y_i, which
is computed as r + γ Q for non-Terminal time steps and as r for Terminal
steps:
# sample from replay buffer
if replay_buffer.size() > int(args['minibatch_size']):
s_batch, a_batch, r_batch, t_batch, s2_batch = \
replay_buffer.sample_batch(int(args['minibatch_size']))
# Calculate target q
target_q = critic.predict_target(s2_batch,
actor.predict_target(s2_batch))
y_i = []
for k in range(int(args['minibatch_size'])):
if t_batch[k]:
y_i.append(r_batch[k])
else:
y_i.append(r_batch[k] + critic.gamma *
target_q[k])
4. Use the preceding to train the actor and critic: We then train the critic
for one step on the mini-batch by calling critic.train(). Then, we compute
the gradient of Q with respect to the action by calling
critic.action_gradients() and we store it in grads; note that this action gradient
is used to compute the policy gradient, as we mentioned previously. We
then train the actor for one step by calling actor.train() and passing grads as
an argument, along with the state that we sampled from the replay buffer.
Finally, we update the actor and critic target networks by calling the
appropriate functions for the actor and critic objects:
# Update critic
predicted_q_value, _ = critic.train(s_batch, a_batch,
np.reshape(y_i, (int(args['minibatch_size']), 1)))
ep_ave_max_q += np.amax(predicted_q_value)
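# Update the actor policy using the sampled action gradient, as described
# above (a sketch; actor.predict() is an assumption, while
# critic.action_gradients(), actor.train(), and update_target_network()
# are the methods referenced in the text)
a_outs = actor.predict(s_batch)
grads = critic.action_gradients(s_batch, a_outs)
actor.train(s_batch, grads[0])
# Update the actor and critic target networks
actor.update_target_network()
critic.update_target_network()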
The new state, s2, is assigned to the current state, s, as we proceed to the
next time step. If the episode has terminated, we print the episode
reward and other observations on the screen, and we write them into a
text file called pendulum.txt for later analysis. We also break out of the
inner for loop, as the episode has terminated:
s = s2
ep_reward += r
if terminal:
print('| Episode: {:d} | Reward: {:d} | Qmax: {:.4f}'.format(i,
int(ep_reward), (ep_ave_max_q / float(j))))
f = open("pendulum.txt", "a+")
f.write(str(i) + " " + str(int(ep_reward)) + " " +
str(ep_ave_max_q / float(j)) + '\n')
break
s = env.reset()
ep_reward = 0
ep_ave_max_q = 0
for j in range(int(args['max_episode_len'])):
if args['render_env']:
env.render()
s = s2
ep_reward += r
if terminal:
print('| Episode: {:d} | Reward: {:d} |'.format(i,
int(ep_reward)))
break
class ReplayBuffer(object):
3. Define the add and size functions: We then define the add() function to
add the experience as a tuple (state, action, reward, terminal, new state).
The self.count variable keeps a count of the number of samples we have
in the replay buffer. If this count is less than the replay buffer size
(self.buffer_size), we append the current experience to the buffer and
increment the count. On the other hand, if the count is equal to (or
greater than) the buffer size, we discard the old samples from the buffer
by calling popleft(), which is a built-in function of deque. Then, we add
the experience to the replay buffer; the count need not be incremented,
as we discarded one old data sample in the replay buffer and replaced it
with the new data sample or experience, so the total number of samples
in the buffer remains the same. We also define the size() function to
obtain the current size of the replay buffer:
def add(self, s, a, r, t, s2):
experience = (s, a, r, t, s2)
if self.count < self.buffer_size:
self.buffer.append(experience)
self.count += 1
else:
self.buffer.popleft()
self.buffer.append(experience)
def size(self):
return self.count
def clear(self):
self.buffer.clear()
self.count = 0
That concludes the code for the DDPG. We will now test it.
Training and testing the DDPG on
Pendulum-v0
We will now train the preceding DDPG code on Pendulum-v0. To train the
DDPG agent, simply type the following in the command line at the same level
as the rest of the code:
python ddpg.py
{'actor_lr': 0.0001,
'buffer_size': 1000000,
'critic_lr': 0.001,
'env': 'Pendulum-v0',
'gamma': 0.99,
'max_episode_len': 1000,
'max_episodes': 250,
'minibatch_size': 64,
'mode': 'train',
'random_seed': 258,
'render_env': False,
'tau': 0.001}
.
.
.
2019-03-03 17:23:10.529725: I
tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to
match DSO: 384.130.0
| Episode: 0 | Reward: -7981 | Qmax: -6.4859
| Episode: 1 | Reward: -7466 | Qmax: -10.1758
| Episode: 2 | Reward: -7497 | Qmax: -14.0578
Once the training is complete, you can also test the trained DDPG agent, as
follows:
python ddpg.py --mode test
We can also plot the episodic rewards during training by using the following
code:
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('pendulum.txt')
plt.plot(data[:,0], data[:,1])
plt.xlabel('episode number', fontsize=12)
plt.ylabel('episode reward', fontsize=12)
#plt.show()
plt.savefig("ddpg_pendulum.png")
Figure 1: Plot showing the episode rewards during training for the Pendulum-v0 problem, using the DDPG
As you can see, the DDPG agent has learned the problem very well. The
maximum rewards are slightly negative, which is the best that can be achieved
for this problem.
Summary
In this chapter, we were introduced to our first continuous-action RL
algorithm, DDPG, which also happens to be the first Actor-Critic algorithm
in this book. DDPG is an off-policy algorithm, as it uses a replay buffer. We
also covered the use of policy gradients to update the actor, and the use of the
L2 norm to update the critic. Thus, we have two different neural networks.
The actor learns the policy and the critic learns to evaluate the actor's policy,
thereby providing a learning signal to the actor. You saw how to compute the
gradient of the state-action value, Q(s,a), with respect to the action, and also
the gradient of the policy, both of which are combined to evaluate the policy
gradient, which is then used to update the actor. We trained the DDPG on the
inverted pendulum problem, and the agent learned it very well.
We have come a long way in this chapter. You have learned about Actor-
Critic algorithms and how to code your first continuous control RL algorithm.
In the next chapter, you will learn about the A3C algorithm, which is an on-
policy deep RL algorithm.
Questions
1. Is the DDPG an on-policy or off-policy algorithm?
2. We used the same neural network architectures for both the actor and the
critic. Is this required, or can we choose different neural network
architectures for the actor and the critic?
3. Can we use the DDPG for Atari Breakout?
4. Why are the biases of the neural networks initialized to small positive
values?
5. This is left as an exercise: Can you modify the code in this chapter to
train an agent to learn InvertedDoublePendulum-v2, which is more
challenging than the Pendulum-v0 that you saw in this chapter?
6. Here is another exercise: Vary the neural network architecture and
check whether the agent can learn the Pendulum-v0 problem. For
instance, keep decreasing the number of neurons in the first hidden layer
with the values 400, 100, 25, 10, 5, and 1, and check how the agent
performs for the different number of neurons in the first hidden layer. If
the number of neurons is too small, it can lead to information
bottlenecks, where the input of the network is not sufficiently
represented; that is, the information is lost as we go deeper into the
neural network. Do you observe this effect?
Further reading
Continuous control with deep reinforcement learning, by Timothy P.
Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom
Erez, Yuval Tassa, David Silver, and Daan Wierstra, original DDPG
paper from DeepMind, arXiv:1509.02971: https://arxiv.org/abs/1509.02971
Asynchronous Methods - A3C and
A2C
We looked at the DDPG algorithm in the previous chapter. One of the main
drawbacks of the DDPG algorithm (as well as the DQN algorithm that we
saw earlier) is the use of a replay buffer to obtain independent and
identically distributed samples of data for training. Using a replay buffer
consumes a lot of memory, which is not desirable for robust RL applications.
To overcome this problem, researchers at Google DeepMind came up with
an on-policy algorithm called Asynchronous Advantage Actor Critic
(A3C). A3C does not use a replay buffer; instead, it uses parallel worker
processors, where different instances of the environment are created and the
experience samples are collected. Once a finite and fixed number of samples
are collected, they are used to compute the policy gradients, which are
asynchronously sent to a central processor that updates the policy. This
updated policy is then sent back to the worker processors. The use of
parallel processors to experience different scenarios of the environment
gives rise to independent and identically distributed samples that can be used
to train the policy. This chapter will cover A3C, and will also briefly touch
upon a variant of it called the Advantage Actor Critic (A2C).
In this chapter, you will learn about the A3C and A2C algorithms, as well as
how to code them using Python and TensorFlow. We will also apply the A3C
algorithm to solving two OpenAI Gym problems: CartPole and LunarLander.
Technical requirements
To successfully complete this chapter, some knowledge of the following will
be of great help:
L is the total loss, which has to be minimized. Note that we would like to
maximize the advantage function, so we have a minus sign in Lp, as we are
minimizing L. Likewise, we would like to maximize the entropy term, and
since we are minimizing L, we have a minus sign in the term -0.005 Le in L.
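Putting this together (a reconstruction based on the weights quoted above and on the loss terms defined in the code later in this chapter), the total loss can be written as:
$$L = L_{v} + L_{p} - 0.005\,L_{e}$$
where $L_{v} = \tfrac{1}{2}\sum_{t}\big(V^{target}_{t} - V(s_{t})\big)^{2}$ is the value loss, $L_{p} = -\sum_{t}\log \pi(a_{t}\mid s_{t})\,A(s_{t}, a_{t})$ is the policy loss, and $L_{e} = -\sum_{t}\sum_{a}\pi(a\mid s_{t})\log \pi(a\mid s_{t})$ is the entropy term.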
CartPole and LunarLander
In this section, we will apply A3C to OpenAI Gym's CartPole and
LunarLander.
CartPole
CartPole consists of a vertical pole on a cart that needs to be balanced by
moving the cart either to the left or to the right. The state dimension is four
and the action dimension is two for CartPole.
Check out the following link for more details on CartPole: https://gym.openai.c
om/envs/CartPole-v0/.
LunarLander
LunarLander, as the name suggests, involves the landing of a lander on the
lunar surface. For example, when Apollo 11's Eagle lander touched down on
the moon's surface in 1969, the astronauts Neil Armstrong and Buzz Aldrin
had to control the rocket thrusters during the final phase of the descent and
safely land the spacecraft on the surface. After this, of course, Armstrong
walked on the moon and remarked the now famous sentence: "One small step
for a man, one giant leap for mankind". In LunarLander, there are two
yellow flags on the lunar surface, and the goal is to land the spacecraft
between these flags. Fuel in the lander is infinite, unlike the case in Apollo
11's Eagle lander. The state dimension is eight and the action dimension is
four for LunarLander, with the four actions being do nothing, fire the left
thruster, fire the main thruster, or fire the right thruster.
Check out the following link for a schematic of the environment: https://gym.op
enai.com/envs/LunarLander-v2/.
The A3C algorithm applied to
CartPole
Here, we will code A3C in TensorFlow and apply it so that we can train an
agent to learn the CartPole problem. The following code files will be
required:
2. Next, we set the parameters for the problem. We only need to train for
200 episodes (yes, CartPole is an easy problem!). We set the discount
factor gamma to 0.99. The state and action dimensions are 4 and 2,
respectively, for CartPole. If you want to load a pre-trained model and
resume training, set load_model to True; for fresh training from scratch, set
this to False. We will also set the model_path:
max_episode_steps = 200
gamma = 0.99
s_size = 4
a_size = 2
load_model = False
model_path = './model'
3. We reset the TensorFlow graph and also create a directory for storing
our model. We will refer to the master processor as CPU 0. Worker
threads have non-zero CPU numbers. The master processor will
undertake the following: first, it will create a count of global variables
in the global_episodes object. The total number of worker threads will be
stored in num_workers, and we can use Python's multiprocessing library to
obtain the number of available processors in our system by calling
cpu_count(). We will use the Adam optimizer and store it in an object
called trainer, along with an appropriate learning rate. We will later
define an actor critic class called AC, so we must first create a master
network object of the type AC class, called master_network. We will also
pass the appropriate arguments to the class' constructor. Then, for each
worker thread, we will create a separate instance of the CartPole
environment and an instance of a Worker class, which will soon be
defined. Finally, for saving the model, we will also create a
TensorFlow saver:
tf.reset_default_graph()
if not os.path.exists(model_path):
os.makedirs(model_path)
with tf.device("/cpu:0"):
# Adam optimizer
trainer = tf.train.AdamOptimizer(learning_rate=2e-4, use_locking=True)
# global network
master_network = AC(s_size,a_size,'global',None)
workers = []
for i in range(num_workers):
env = gym.make('CartPole-v0')
workers.append(Worker(env,i,s_size,a_size,trainer,model_path,global_episodes
))
# tf saver
saver = tf.train.Saver(max_to_keep=5)
if load_model == True:
print ('Loading Model...')
ckpt = tf.train.get_checkpoint_state(model_path)
saver.restore(sess,ckpt.model_checkpoint_path)
else:
sess.run(tf.global_variables_initializer())
Then, we need to set the initializers for the weights and biases;
specifically, we use the Xavier initializer for the weights and a small
constant initializer for the biases. For the last output layer of the network,
the weights are uniform random numbers within a specified range:
xavier = tf.contrib.layers.xavier_initializer()
bias_const = tf.constant_initializer(0.05)
rand_unif = tf.keras.initializers.RandomUniform(minval=-3e-3,maxval=3e-3)
regularizer = tf.contrib.layers.l2_regularizer(scale=5e-4)
The AC class
We will now describe the AC class, which is also part of a3c.py. We define the
constructor of the AC class with an input placeholder, two fully connected
hidden layers with 256 and 128 neurons, respectively, and the elu activation
function. This is followed by the policy network with the softmax activation,
since our actions are discrete for CartPole. In addition, we also have a value
network with no activation function. Note that we share the same hidden
layers for both the policy and value, unlike in past examples:
class AC():
def __init__(self,s_size,a_size,scope,trainer):
with tf.variable_scope(scope):
self.inputs = tf.placeholder(shape=[None,s_size],dtype=tf.float32)
# 2 FC layers
net = tf.layers.dense(self.inputs, 256, activation=tf.nn.elu,
kernel_initializer=xavier, bias_initializer=bias_const,
kernel_regularizer=regularizer)
net = tf.layers.dense(net, 128, activation=tf.nn.elu,
kernel_initializer=xavier, bias_initializer=bias_const,
kernel_regularizer=regularizer)
# policy
self.policy = tf.layers.dense(net, a_size, activation=tf.nn.softmax,
kernel_initializer=xavier, bias_initializer=bias_const)
# value
self.value = tf.layers.dense(net, 1, activation=None,
kernel_initializer=rand_unif, bias_initializer=bias_const)
For worker threads, we need to define the loss functions. Thus, when the
TensorFlow scope is not global, we define an actions placeholder, as well as
its one-hot representation; we also define placeholders for the target value
and advantage functions. We then compute the product of the policy distribution
and the one-hot actions, sum them, and store them in the policy_times_a object.
Then, we combine these terms to construct the loss functions, as we
mentioned previously. We compute the value loss as the sum over the batch of
the L2 loss; the Shannon entropy as the policy distribution multiplied by its
logarithm, with a minus sign; and the policy loss as the product of the
logarithm of the policy distribution and the advantage function, summed over
the batch of samples, with a minus sign. Finally, we use the appropriate
weights to combine these losses to
compute the total loss, which is stored in self.loss:
# only workers need tf operations for loss functions and gradient updating
if scope != 'global':
self.actions = tf.placeholder(shape=[None],dtype=tf.int32)
self.actions_onehot =
tf.one_hot(self.actions,a_size,dtype=tf.float32)
self.target_v = tf.placeholder(shape=[None],dtype=tf.float32)
self.advantages = tf.placeholder(shape=[None],dtype=tf.float32)
self.policy_times_a = tf.reduce_sum(self.policy *
self.actions_onehot, [1])
# loss
self.value_loss = 0.5 * tf.reduce_sum(tf.square(self.target_v -
tf.reshape(self.value,[-1])))
self.entropy = - tf.reduce_sum(self.policy * tf.log(self.policy +
1.0e-8))
self.policy_loss = -tf.reduce_sum(tf.log(self.policy_times_a + 1.0e-8) *
self.advantages)
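# Combine the three terms into the total loss stored in self.loss (a
# plausible combination: the value loss above already carries the 0.5
# factor, and 0.005 is the entropy coefficient quoted earlier in this
# chapter)
self.loss = self.value_loss + self.policy_loss - 0.005 * self.entropy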
We then compute the gradients of the loss with respect to the local network's
trainable variables and clip them; these gradients are later applied to the
variables of the global network (obtained using tf.get_collection() with a
scope of global) via the Adam optimizer's apply_gradients(). This will
compute the policy gradients:
# get gradients from local networks using local losses; clip them to avoid exploding
gradients
local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
self.gradients = tf.gradients(self.loss,local_vars)
self.var_norms = tf.global_norm(local_vars)
grads,self.grad_norms = tf.clip_by_global_norm(self.gradients,40.0)
We also create the local instance of the AC class, with the appropriate
arguments passed in. We then create a TensorFlow operation to copy the
model parameters from global to local. We also create a NumPy identity
matrix with ones on the diagonal, as well as an environment object:
# local copy of the AC network
self.local_AC = AC(s_size,a_size,self.name,trainer)
self.actions = np.identity(a_size,dtype=bool).tolist()
self.env = env
Next, we create the train() function, which is the most important part of the
Worker class. The states, actions, rewards, next states, or observations and
values are obtained from the experience list that's received as an argument by
the function. We computed the discounted sum over the rewards by using a
utility function called discount(), which we will define soon. Similarly, the
advantage function is also discounted:
# train function
def train(self,experience,sess,gamma,bootstrap_value):
experience = np.array(experience)
observations = experience[:,0]
actions = experience[:,1]
rewards = experience[:,2]
next_observations = experience[:,3]
values = experience[:,5]
# discounted rewards
self.rewards_plus = np.asarray(rewards.tolist() + [bootstrap_value])
discounted_rewards = discount(self.rewards_plus,gamma)[:-1]
# value
self.value_plus = np.asarray(values.tolist() + [bootstrap_value])
# advantage function
advantages = rewards + gamma * self.value_plus[1:] - self.value_plus[:-1]
advantages = discount(advantages,gamma)
A lock is acquired using lock.acquire() so that only one worker thread at a
time performs the update, after which it will release the lock using
lock.release(). Finally, we return the losses from the function:
# lock for updating global params
lock = Lock()
lock.acquire()
# release lock
lock.release()
return value_loss / len(experience), policy_loss / len(experience), entropy /
len(experience)
Now, we need to define the workers' work() function. We first obtain the
global episode count and set total_steps to zero. Then, inside a TensorFlow
session, while the threads are still coordinated, we copy the global
parameters to the local network using self.update_local_ops. We then start an
episode. Since the episode hasn't been terminated, we obtain the policy
distribution and store it in a_dist. We sample an action from this distribution
using NumPy's random.choice() function. This action, a, is fed into the
environment's step() function to obtain the new state, the reward, and the
Terminal Boolean. We can shape the reward by dividing it by 100.0.
The experience is stored in the local buffer, called episode_buffer. We also add
the reward to episode_reward, and we increment the total_steps count, as well as
episode_step_count:
episode_reward = 0
episode_step_count = 0
d = False
s = self.env.reset()
episode_frames.append(s)
while not d:
# step
s1, r, d, info = self.env.step(a)
# normalize reward
r = r/100.0
if d == False:
episode_frames.append(s1)
else:
s1 = s
episode_values.append(v[0,0])
episode_reward += r
s = s1
total_steps += 1
episode_step_count += 1
If we have 25 entries in the buffer, it's time for an update. First, the value is
computed and stored in v1, which is then passed to the train() function, which
will output the three loss values: value, policy, and entropy. After this, the
episode_buffer is reset. If the episode has terminated, we break from the loop.
Finally, we print the episode count and reward on the screen. Note that we
have used 25 entries as the time to do the update. Feel free to vary this and
see how the training is affected by this hyperparameter:
# if buffer has 25 entries, time for an update
if len(episode_buffer) == 25 and d != True and episode_step_count !=
max_episode_steps - 1:
v1 = sess.run(self.local_AC.value, feed_dict={self.local_AC.inputs:[s]})[0,0]
value_loss, policy_loss, entropy = self.train(episode_buffer,sess,gamma,v1)
episode_buffer = []
sess.run(self.update_local_ops)
# idiot check to ensure we did not miss an update for some unforeseen reason
if (len(episode_buffer) > 30):
print(self.name, "buffer full ", len(episode_buffer))
sys.exit()
if d == True:
break
print("episode: ", episode_count, "| worker: ", self.name, "| episode reward: ",
episode_reward, "| step count: ", episode_step_count)
After exiting the episode loop, we use the remaining samples in the buffer to
train the network. worker_0 contains the global or master network, which we
can save by using saver.save. We can also call the self.increment operation to
increment the global episode count by one:
# Update the network using the episode buffer at the end of the episode
if len(episode_buffer) != 0:
value_loss, policy_loss, entropy = self.train(episode_buffer,sess,gamma,0.0)
if self.name == 'worker_0':
sess.run(self.increment)
episode_count += 1
copy_ops = []
for from_param,to_param in zip(from_params,to_params):
copy_ops.append(to_param.assign(from_param))
return copy_ops
The other utility function that we need is the discount() function. It runs
through the input list, x, backwards and accumulates a sum weighted by gamma,
the discount factor. The discounted values are then returned from the function.
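A minimal sketch of such a helper (scipy.signal.lfilter is one common implementation choice; an explicit backward loop over x would work equally well) is:
import scipy.signal

def discount(x, gamma):
    # y[n] = x[n] + gamma * y[n+1], computed by filtering the reversed sequence
    return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]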
The code stores the episode rewards in the performance.txt file. A plot of the
episode rewards during training is shown in the following screenshot:
Figure 1: Episode rewards for CartPole, which was trained using A3C
Note that since we have shaped the reward, the episode reward that you can
see in the preceding screenshot is different from the values that are typically
reported by other researchers in papers and/or blogs.
The A3C algorithm applied to
LunarLander
We will extend the same code to train an agent on the LunarLander problem,
which is harder than CartPole. Most of the code is the same as before, so we
will only describe the changes that need to be made to the preceding code.
First, the reward shaping is different for the LunarLander problem. So, we
will include a function called reward_shaping() in the a3c.py file. It will check if
the lander has crashed on the lunar surface; if so, the episode will be
terminated and there will be a -1.0 penalty. If the lander is not moving, the
episode will be terminated and a -0.5 penalty will be paid:
def reward_shaping(r, s, s1):
# check if y-coord < 0; implies lander crashed
if (s1[1] < 0.0):
print('-----lander crashed!----- ')
d = True
r -= 1.0
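    # check whether the lander is essentially stationary (a sketch of the
    # second condition described above; the velocity indices s1[2], s1[3] and
    # the threshold are assumptions, and d is assumed to be initialized to
    # False at the top of the function)
    if (np.abs(s1[2]) < 0.001 and np.abs(s1[3]) < 0.001):
        print('-----lander not moving!----- ')
        d = True
        r -= 0.5
    # the shaped reward and the done flag are assumed to be returned to the caller
    return r, d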
This will train the agent and store the episode rewards in the performance.txt
file, which we can plot as follows:
As you can see, the agent has learned to land the spacecraft on the lunar
surface. Happy landings! Again, note that the episode reward is different
from the values that have been reported in papers and blogs by other RL
practitioners, since we have scaled the rewards.
The A2C algorithm
The difference between A2C and A3C is that A2C performs synchronous
updates. Here, all the workers will wait until they have completed the
collection of experiences and computed the gradients. Only after this are the
global (or master) network's parameters updated. This is different from A3C,
where the update is performed asynchronously, that is, where the worker
threads do not wait for the others to finish. A2C is easier to code than A3C, but we do not implement it here. If you are interested, you are encouraged to take the preceding A3C code and convert it to A2C, after which the performance of the two algorithms can be compared.
Summary
In this chapter, we introduced the A3C algorithm, which is an on-policy
algorithm that's applicable to both discrete and continuous action problems.
You saw how three different loss terms are combined into one and optimized.
Python's threading library is useful for running multiple threads, with a copy
of the policy network in each thread. These different workers compute the
policy gradients and pass them on to the master to update the neural network
parameters. We applied A3C to train agents for the CartPole and the
LunarLander problems, and the agents learned them very well. A3C is a very
robust algorithm and does not require a replay buffer, although it does require a local buffer for collecting a small number of experiences, after which these experiences are used to update the networks. Lastly, a synchronous version of the algorithm, called A2C, was also introduced.
This chapter should have deepened your understanding of yet another deep RL algorithm. In the next chapter, we will study the last two RL algorithms covered in this book: TRPO and PPO.
Questions
1. Is A3C an on-policy or off-policy algorithm?
2. Why is the Shannon entropy term used?
3. What are the problems with using a large number of worker threads?
4. Why is softmax used in the policy neural network?
5. Why do we need an advantage function?
6. This is left as an exercise: For the LunarLander problem, repeat the
training without reward shaping and see if the agent learns faster/slower
than what we saw in this chapter.
Further reading
Asynchronous Methods for Deep Reinforcement Learning, by
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex
Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray
Kavukcuoglu, A3C paper from DeepMind arXiv:1602.01783: https://ar
xiv.org/abs/1602.01783
Deep Reinforcement Learning Hands-On, by Maxim Lapan, Packt
Publishing: https://www.packtpub.com/big-data-and-business-intelligence/deep-re
inforcement-learning-hands
Trust Region Policy Optimization
and Proximal Policy Optimization
In the last chapter, we covered A3C and A2C, with the former being asynchronous and the latter synchronous. In this chapter, we will see more on-policy reinforcement learning (RL) algorithms; two algorithms, to be precise, which share much of their mathematics but differ in how they are solved. We will be introduced to Trust Region Policy Optimization (TRPO), which was introduced in 2015 by researchers at OpenAI and the University of California, Berkeley (the latter is, incidentally, my former employer!). TRPO is mathematically involved to solve, as it relies on the conjugate gradient algorithm; note that first-order optimization methods, such as the well-established Adam and Stochastic Gradient Descent (SGD), cannot be used to solve the TRPO equations. We will then see how the policy optimization objective and its constraint can be combined into a single loss, resulting in the Proximal Policy Optimization (PPO) algorithm, which can be optimized with first-order algorithms such as Adam or SGD. The following topics will be covered in this chapter:
Learning TRPO
Learning PPO
Using PPO to solve the MountainCar problem
Evaluating the performance
Technical requirements
To successfully complete this chapter, the following software is required:
The first equation here is the policy objective, and the second equation is an additional constraint that ensures that the policy update is gradual and does not take the policy to regions that are very far away in parameter space.
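In the notation of the TRPO paper, with $A_t$ denoting the advantage, a standard way to write such a surrogate objective with a trust-region constraint is (reproduced here in its generic form for reference):

$$\max_{\theta} \; \mathbb{E}_t\left[\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, A_t\right] \quad \text{subject to} \quad \mathbb{E}_t\left[ D_{KL}\left(\pi_{\theta_{old}}(\cdot \mid s_t) \,\Vert\, \pi_{\theta}(\cdot \mid s_t)\right)\right] \le \delta$$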
The clip() function bounds the probability ratio to the range [1-ε, 1+ε]. The min() function ensures that the final objective is a lower bound on the unclipped objective.
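Written out in the notation of the PPO paper, with $r_t(\theta)$ denoting the probability ratio between the new and old policies, the clipped surrogate objective is:

$$L^{clip}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, A_t,\; \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right], \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$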
The second loss function is the L2 norm of the state value function:
The third loss is the Shannon entropy of the policy distribution, which comes
from information theory:
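For reference, these two terms take their standard forms, where $V_t^{target}$ is the discounted-return target for the value function:

$$L^{V} = \mathbb{E}_t\left[\left(V_{\theta}(s_t) - V_t^{target}\right)^2\right], \qquad L^{entropy} = \mathbb{E}_t\left[-\sum_{a} \pi_{\theta}(a \mid s_t) \log \pi_{\theta}(a \mid s_t)\right]$$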
We will now combine the three losses. Note that we need to maximize Lclip
and Lentropy, but minimize LV. So, we define our total PPO loss function as in
the following equation, where c1 and c2 are positive constants used to scale
the terms:
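In the form used in the PPO paper, this combined objective, which is to be maximized, is:

$$L^{PPO}(\theta) = L^{clip}(\theta) - c_1 L^{V}(\theta) + c_2 L^{entropy}(\theta)$$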
Note that, if we share the neural network parameters between the policy and
the value networks, then the preceding LPPO loss function alone can be
maximized. On the other hand, if we have separate neural networks for the
policy and the value, then we can have separate loss functions as in the
following equation, where Lpolicy is maximized and Lvalue is minimized:
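Under the assumption of separate policy and value networks with parameters $\theta$ and $\phi$, these two losses can be written as:

$$L^{policy}(\theta) = L^{clip}(\theta) + c_2 L^{entropy}(\theta), \qquad L^{value}(\phi) = L^{V}(\phi)$$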
Notice that the constant c1 is not required in this latter setting, where we have separate neural networks for the policy and the value. The neural network parameters are updated over multiple iteration steps over a batch of data points, where the number of update steps is specified by the user as a hyperparameter.
Using PPO to solve the
MountainCar problem
We will solve the MountainCar problem using PPO. MountainCar involves a
car trapped in the valley of a mountain. It has to apply throttle to accelerate
against gravity and try to drive out of the valley up steep mountain walls to
reach a desired flag point on the top of the mountain. You can see a schematic
of the MountainCar problem from OpenAI Gym at https://gym.openai.com/envs/Mo
untainCar-v0/.
This problem is very challenging: the agent cannot simply apply full throttle from the base of the mountain and drive straight to the flag, because the mountain walls are steep and gravity will not allow the car to build up sufficient momentum. The optimal solution is for the car to first go backward and then step on the throttle to pick up enough momentum to overcome gravity and successfully drive out of the valley. We will see that the RL agent actually learns this trick.
We will code the following two files to solve MountainCar using PPO:
class_ppo.py
train_test.py
Coding the class_ppo.py file
We will now code the class_ppo.py file:
2. Set the neural network initializers: Then, we will set the neural
network parameters (we will use two hidden layers) and the initializers
for the weights and biases. As we have also done in past chapters, we
will use the Xavier initializer for the weights and a small positive value
for the initial values of the biases:
nhidden1 = 64
nhidden2 = 64
xavier = tf.contrib.layers.xavier_initializer()
bias_const = tf.constant_initializer(0.05)
rand_unif = tf.keras.initializers.RandomUniform(minval=-3e-3, maxval=3e-3)
regularizer = tf.contrib.layers.l2_regularizer(scale=0.0)
3. Define the PPO class: The PPO() class is now defined. First,
the __init__() constructor is defined using the arguments passed to the
class. Here, sess is the TensorFlow session; S_DIM and A_DIM are the state
and action dimensions, respectively; A_LR and C_LR are the learning rates
for the actor and the critic, respectively; A_UPDATE_STEPS and C_UPDATE_STEPS
are the number of update steps used for the actor and the critic;
CLIP_METHOD stores the epsilon value:
class PPO(object):
5. Define the critic: The critic neural network is defined next. We use the state (st) placeholder, self.tfs, as input to the neural network. Two hidden layers with nhidden1 and nhidden2 neurons (both set to 64 previously) and the relu activation function are used. The output layer has one neuron that outputs the state value function V(st), and so no activation function is used for the output. We then compute the advantage function as the difference between the discounted cumulative rewards, which are stored in the self.tfdc_r placeholder, and the self.v output that we just computed. The critic loss is computed as an L2 norm, and the critic is trained using the Adam optimizer with the objective of minimizing this L2 loss. Note that this loss is the same as Lvalue, mentioned earlier in this chapter in the theory section:
# critic
with tf.variable_scope('critic'):
    l1 = tf.layers.dense(self.tfs, nhidden1, activation=None,
                         kernel_initializer=xavier, bias_initializer=bias_const,
                         kernel_regularizer=regularizer)
    l1 = tf.nn.relu(l1)
    l2 = tf.layers.dense(l1, nhidden2, activation=None,
                         kernel_initializer=xavier, bias_initializer=bias_const,
                         kernel_regularizer=regularizer)
    l2 = tf.nn.relu(l2)
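The remainder of this critic block, as described above, might look like the following sketch; the names self.v, self.advantage, and self.ctrain_op follow those used elsewhere in this chapter, while the loss variable name and the output-layer initializer are illustrative assumptions:

    # one-neuron output head for the state value V(st)
    self.v = tf.layers.dense(l2, 1, activation=None,
                             kernel_initializer=rand_unif, bias_initializer=bias_const)
    # advantage = discounted cumulative reward - predicted state value
    self.advantage = self.tfdc_r - self.v
    # L2 loss, minimized with Adam
    self.closs = tf.reduce_mean(tf.square(self.advantage))
    with tf.variable_scope('ctrain'):
        self.ctrain_op = tf.train.AdamOptimizer(self.C_LR).minimize(self.closs)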
self.pi_mean = self.pi.mean()
self.pi_sigma = self.pi.stddev()

# entropy
entropy = -tf.reduce_sum(self.pi.prob(self.tfa) *
                         tf.log(tf.clip_by_value(self.pi.prob(self.tfa), 1e-10, 1.0)), axis=1)
entropy = tf.reduce_mean(entropy, axis=0)
self.aloss -= 0.0  # 0.01 * entropy

with tf.variable_scope('atrain'):
    self.atrain_op = tf.train.AdamOptimizer(self.A_LR).minimize(self.aloss)
11. Define the update function: The update() function is defined next, which takes the state s, the action a, and the reward r as arguments. It first runs a TensorFlow session on the self.update_oldpi_op operation to copy the current policy network parameters into the old policy network. Then, the advantage is computed, which, along with the state and action, is used to update the actor for A_UPDATE_STEPS iterations. The critic is then updated for C_UPDATE_STEPS iterations by running a TensorFlow session on the critic training operation:
def update(self, s, a, r):
    self.sess.run(self.update_oldpi_op)
    adv = self.sess.run(self.advantage, {self.tfs: s, self.tfdc_r: r})
    # update actor
    for _ in range(self.A_UPDATE_STEPS):
        self.sess.run(self.atrain_op, feed_dict={self.tfs: s, self.tfa: a, self.tfadv: adv})
    # update critic
    for _ in range(self.C_UPDATE_STEPS):
        self.sess.run(self.ctrain_op, {self.tfs: s, self.tfdc_r: r})
12. Define the _build_anet function: We will next define the _build_anet() function that was used earlier. It computes the policy distribution, which is treated as a Gaussian (that is, normal). It takes the self.tfs state placeholder as input, has two hidden layers with nhidden1 and nhidden2 neurons, and uses the relu activation function. This is then sent to two output layers, each with A_DIM (the action dimension) outputs, with one representing the mean, mu, and the other the standard deviation, sigma. Note that the means of the actions are bounded, and so the tanh activation function is used, with a small clipping to avoid edge values; for sigma, the softplus activation function is used, shifted by 0.1 to avoid zero sigma values. Once we have the means and standard deviations of the actions, TensorFlow distributions' Normal is used to treat the policy as a Gaussian distribution. We also call tf.get_collection() to obtain the model parameters, and the Normal distribution and the model parameters are returned from the function:
def _build_anet(self, name, trainable):
    with tf.variable_scope(name):
        l1 = tf.layers.dense(self.tfs, nhidden1, activation=None, trainable=trainable,
                             kernel_initializer=xavier, bias_initializer=bias_const,
                             kernel_regularizer=regularizer)
        l1 = tf.nn.relu(l1)
        l2 = tf.layers.dense(l1, nhidden2, activation=None, trainable=trainable,
                             kernel_initializer=xavier, bias_initializer=bias_const,
                             kernel_regularizer=regularizer)
        l2 = tf.nn.relu(l2)
        mu = tf.layers.dense(l2, self.A_DIM, activation=tf.nn.tanh, trainable=trainable,
                             kernel_initializer=rand_unif, bias_initializer=bias_const)
        small = tf.constant(1e-6)
        mu = tf.clip_by_value(mu, -1.0 + small, 1.0 - small)
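The rest of this function, as described above, might look like the following sketch; the names norm_dist and params, as well as the choice of initializers, are illustrative:

        # sigma: softplus activation, shifted by 0.1 to avoid zero values
        sigma = tf.layers.dense(l2, self.A_DIM, activation=tf.nn.softplus, trainable=trainable,
                                kernel_initializer=rand_unif, bias_initializer=bias_const)
        sigma += 0.1
        # treat the policy as a Gaussian with the above mean and standard deviation
        norm_dist = tf.distributions.Normal(loc=mu, scale=sigma)
    params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)
    return norm_dist, params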
14. Define the get_v function: Finally, we also define a get_v() function to
return the state value by running a TensorFlow session on self.v:
def get_v(self, s):
    if s.ndim < 2: s = s[np.newaxis, :]
    vv = self.sess.run(self.v, {self.tfs: s})
    return vv[0,0]
2. Define the reward-shaping function: We then define a function for reward shaping that will give out extra bonus rewards and penalties for good and bad performance, respectively. We do this to encourage the car to climb higher toward the side of the flag on the mountaintop; without this, the learning will be slow:
def reward_shaping(s_):
    r = 0.0
    # bonus and penalty terms based on the car's position and speed are added to r here
    return r
EP_MAX = 1000
GAMMA = 0.9
A_LR = 2e-4
C_LR = 2e-4
BATCH = 32
A_UPDATE_STEPS = 10
C_UPDATE_STEPS = 10
S_DIM = env.observation_space.shape[0]
A_DIM = env.action_space.shape[0]
iter_num = 0
if (irestart == 0):
    iter_num = 0
saver = tf.train.Saver()
print("-"*70)
s = env.reset()
max_pos = -1.0
max_speed = 0.0
done = False
t = 0
Inside the outer loop, we have the inner while loop over time steps. This problem involves short time steps during which the car may not move significantly, so we use sticky actions, where actions are sampled from the policy only once every 8 time steps. The choose_action() function in the PPO class will sample the actions for a given state. A small Gaussian noise is added to the actions for exploration, and the actions are clipped to the -1.0 to 1.0 range, as required for the MountainCarContinuous environment. The action is then fed into the environment's step() function:
env.render()

# sticky actions
#if (t == 0 or np.random.uniform() < 0.125):
if (t % 8 == 0):
    a = ppo.choose_action(s)

# clip
a = np.clip(a, -1.0, 1.0)

# take step
s_, r, done, _ = env.step(a)

# reward shaping
if train_test == 0:
    r += reward_shaping(s_)
6. If we are in training mode, the state, action, and reward are appended to the buffer. The new state is set to the current state and we proceed to the next time step if the episode has not already terminated. The episode total reward, ep_r, and the time step count, t, are also updated:
if (train_test == 0):
    buffer_s.append(s)
    buffer_a.append(a)
    buffer_r.append(r)
s = s_
ep_r += r
t += 1
bs = np.array(np.vstack(buffer_s))
ba = np.array(np.vstack(buffer_a))
br = np.array(discounted_r)[:, np.newaxis]
if (done == True):
    print("values at done: ", s_, a)
    break

print("episode: ", ep, "| episode reward: ", round(ep_r,4), "| time steps: ", t)
print("max_pos: ", max_pos, "| max_speed:", max_speed)

if (train_test == 0):
    with open("performance.txt", "a") as myfile:
        myfile.write(str(ep) + " " + str(round(ep_r,4)) + " " + str(round(max_pos,4)) + " " + str(round(max_speed,4)) + "\n")
This concludes the coding of PPO. We will next evaluate its performance on
MountainCarContinuous.
Evaluating the performance
The PPO agent is trained by the following command:
python train_test.py
Once the training is complete, we can test the agent by setting the following:
train_test = 1
As a simple baseline, we will now set the action to 1.0, that is, full throttle:
import sys
import numpy as np
import gym

env = gym.make('MountainCarContinuous-v0')

for _ in range(100):
    s = env.reset()
    done = False
    max_pos = -1.0
    max_speed = 0.0
    ep_reward = 0.0
    print("ep_reward: ", ep_reward, "| max_pos: ", max_pos, "| max_speed: ", max_speed)
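The per-episode loop body between the initializations and the final print statement is not shown above; a minimal sketch of it, assuming the observation is [position, velocity] and the action space is one-dimensional, is:

    while not done:
        env.render()
        a = np.array([1.0])                      # full throttle
        s_, r, done, _ = env.step(a)
        ep_reward += r
        max_pos = max(max_pos, s_[0])            # track the highest position reached
        max_speed = max(max_speed, abs(s_[1]))   # track the highest speed reached
        s = s_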
As is evident from the video generated during the training, the car is unable
to escape the inexorable pull of gravity, and remains stuck at the base of the
mountain valley.
Random throttle
What if we try random throttle values? We will code
mountaincar_random_throttle.py with random actions in the -1.0 to 1.0 range:
import sys
import numpy as np
import gym
env = gym.make('MountainCarContinuous-v0')
for _ in range(100):
    s = env.reset()
    done = False
    max_pos = -1.0
    max_speed = 0.0
    ep_reward = 0.0
    print("ep_reward: ", ep_reward, "| max_pos: ", max_pos, "| max_speed: ", max_speed)
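The only substantive difference from the full-throttle script is the action line inside the episode loop, which might be sampled as follows (a sketch, matching the one-dimensional action space):

        a = np.random.uniform(low=-1.0, high=1.0, size=1)   # random throttle in [-1, 1]
        s_, r, done, _ = env.step(a)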
Here too, the car fails to escape gravity and remains stuck at the base. So, the
RL agent is required to figure out that the optimum policy here is to first go
backward, and then step on the throttle to escape gravity and reach the flag on
the mountain top.
In this chapter, we will use The Open Racing Car Simulator (TORCS)
simulator to train an RL agent to learn to successfully drive on a racetrack.
While the CARLA simulator is more robust and has realistic rendering,
TORCS is easier to use and so is a good first option. The interested reader is
encouraged to try out training RL agents on the CARLA simulator after
completing this book.
Python (version 2 or 3)
NumPy
Matplotlib
TensorFlow (version 1.4 or higher)
TORCS racing car simulator
Car driving simulators
Applying RL in autonomous driving necessitates the use of robust car-driving
simulators, as the RL agent cannot be trained on the road directly. To this
end, several open source car-driving simulators have been developed by the
research community, with each having its own pros and cons. Some of the
open source car driving simulators are:
CARLA
http://vladlen.info/papers/carla.pdf
Developed at Intel labs
Suited to urban driving
TORCS
http://torcs.sourceforge.net/
Racing car
DeepTraffic
https://selfdrivingcars.mit.edu/deeptraffic/
Developed at MIT
Suited to highway driving
Learning to use TORCS
We will first learn how to use the TORCS racing car simulator, which is an open source simulator. You can obtain the download instructions from http://torcs.sourceforge.net/index.php?name=Sections&op=viewarticle&artid=3. The salient components of the TORCS state vector, with the number of values each contributes shown in parentheses, are as follows:
angle: Angle between the car direction and the track (1)
track: This gives the distance to the edge of the track, measured every 10 degrees from -90 to +90 degrees; it has 19 real values, counting the end values (19)
trackPos: Distance between the car and the track axis (1)
speedX: Speed of the car in the longitudinal direction (1)
speedY: Speed of the car in the transverse direction (1)
speedZ: Speed of the car in the Z-direction; we don't need this actually,
but we retain it for now (1)
wheelSpinVel: The rotational speed of the four wheels of the car (4)
rpm: The car engine's rpm (1)
gym_torcs.py
snakeoil3_gym.py
autostart.sh
These files are included in the code files for the chapter (https://github.com/Pac
ktPublishing/TensorFlow-Reinforcement-Learning-Quick-Start-Guide), but can also be
The reward function is then set as follows. Note that we give rewards for
higher longitudinal speed along the track (the cosine of the angle term), and
penalize lateral speed (the sine of the angle term). Track position is also
penalized. Ideally, if this were zero, we would be at the center of the track,
and values of +1 or -1 imply that we are at the edges of the track, which is
not desired and hence penalized:
progress = sp*np.cos(obs['angle']) - np.abs(sp*np.sin(obs['angle'])) - sp*np.abs(obs['trackPos'])
reward = progress
We terminate the episode if the car is out of the track and/or the progress of
the agent is stuck using the following code:
# episode is terminated if the car is out of the track
if (abs(track.any()) > 1 or abs(trackPos) > 1):
    print("Out of track ")
    reward = -100  #-200
    episode_terminate = True
    client.R.d['meta'] = True
with 400 and 300 neurons, respectively. Also, the output consists of three actions: steering, acceleration, and brake. Since steering is in the [-1,1] range, the tanh activation function is used; acceleration and brake are in the [0,1] range, and so the sigmoid activation function is used. We then concatenate them along axis 1, and this is the output of our actor's policy:
def create_actor_network(self, scope):
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        state = tf.placeholder(name='a_states', dtype=tf.float32, shape=[None, self.s_dim])
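The rest of this function, as described above, might look like the following sketch; the layer initializers and the exact return values are illustrative assumptions, and the full code in the repository may differ:

        # two hidden layers with 400 and 300 neurons
        net = tf.layers.dense(state, 400, activation=tf.nn.relu)
        net = tf.layers.dense(net, 300, activation=tf.nn.relu)
        # steering in [-1, 1] -> tanh; acceleration and brake in [0, 1] -> sigmoid
        steering = tf.layers.dense(net, 1, activation=tf.nn.tanh)
        acceleration = tf.layers.dense(net, 1, activation=tf.nn.sigmoid)
        brake = tf.layers.dense(net, 1, activation=tf.nn.sigmoid)
        # concatenate the three actions along axis 1 to form the policy output
        action = tf.concat([steering, acceleration, brake], axis=1)
        return state, action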
4. Stack up state space: Use the following command to stack up the 29-
dimension space:
s = np.hstack((ob.angle, ob.track, ob.trackPos, ob.speedX, ob.speedY, ob.speedZ, ob.wheelSpinVel/100.0, ob.rpm))
6. Full throttle: For the first 10 episodes, we apply full throttle to warm up the neural network parameters. Only after that do we start using the actor's policy. Note that the TORCS agent typically learns in about 1,500-2,000 episodes, so the first 10 episodes will not really have much influence later on in the learning. Apply full throttle to warm up the neural network parameters as follows:
# first few episodes step on gas!
if (i < 10):
    a[0][0] = 0.0
    a[0][1] = 1.0
    a[0][2] = 0.0
That's it for the changes that need to be made to the code for the DDPG to
play TORCS. The rest of the code is the same as that covered in Chapter 5,
Deep Deterministic Policy Gradients (DDPG). We can train the agent using
the following command:
python ddpg.py
Enter 1 for training; 0 is for testing a pretrained agent. Training can take about
2–5 days depending on the speed of the computer used. But this is a fun
problem and is worth the effort. The number of steps experienced per
episode, as well as the rewards, are stored in analysis_file.txt, which we can
plot. The number of time steps per episode is plotted as follows:
Figure 1: Number of time steps per episode of TORCS (training mode)
We can see that the car has learned to drive reasonably well after ~600 episodes, with more efficient driving after ~1,500 episodes. Roughly 300 time steps correspond to one lap of the racetrack. Thus, the agent is able to drive seven to eight laps or more without terminating toward the
end of the training. For a cool video of the DDPG agent driving, see the
following YouTube link: https://www.youtube.com/watch?v=ajomz08hSIE.
Training a PPO agent
We saw previously how to train a DDPG agent to drive a car on TORCS.
How to use a PPO agent is left as an exercise for the interested reader. This
is a nice challenge to complete. The PPO code from Chapter 7, Trust Region
Policy Optimization and Proximal Policy Optimization, can be reused, with
the necessary changes made to the TORCS environment. The PPO code for
TORCS is also supplied in the code repository (https://github.com/PacktPublishi
ng/TensorFlow-Reinforcement-Learning-Quick-Start-Guide), and the interested reader
can peruse it. A cool video of a PPO agent driving a car in TORCS is available on YouTube at: https://youtu.be/uE8QaJQ7zDI
Another challenge for the interested reader is to use Trust Region Policy
Optimization (TRPO) for the TORCS racing car problem. Try this too, if
interested! This is one way to master RL algorithms.
Summary
In this chapter, we saw how to apply RL algorithms to train an agent to learn
to drive a car autonomously. We installed the TORCS racing-car simulator
and also learned how to interface it with Python, so that we can train RL
agents. We also did a deep dive into the state space for TORCS and the
meaning of each of these terms. The DDPG algorithm was then used to train
an agent to learn to drive successfully in TORCS. The video rendering in
TORCS is really cool! The trained agent was able to drive more than seven
to eight laps around the racetrack successfully. Finally, the use of PPO for the
same problem of driving a car autonomously was also explored and left as an
exercise for the interested reader; code for this is supplied in the book's
repository.
This concludes this chapter as well as the book. Feel free to read up on more material online on the application of RL to autonomous driving and robotics. This is now a very hot area of both academic and industrial research, and is well funded, with several job openings in these areas. Wishing you the best!
Questions
1. Why can you not use DQN for the TORCS problem?
2. We used the Xavier weights initializer for the neural network weights.
What other weight initializers are you aware of, and how well will the
trained agent perform with them?
3. Why is the abs() function used in the reward function, and why is it used
for the last two terms but not for the first term?
4. How can you ensure smoother driving than what was observed in the
video?
5. Why is a replay buffer used in DDPG but not in PPO?
Further reading
Continuous control with deep reinforcement learning, Timothy P.
Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom
Erez, Yuval Tassa, David Silver, Daan
Wierstra, arXiv:1509.02971 (DDPG paper): https://arxiv.org/abs/1509.029
71
Proximal Policy Optimization Algorithms, John Schulman, Filip
Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov,
arXiv:1707.06347 (PPO paper): https://arxiv.org/abs/1707.06347
TORCS: http://torcs.sourceforge.net/
Deep Reinforcement Learning Hands-On, by Maxim Lapan, Packt
Publishing: https://www.packtpub.com/big-data-and-business-intelligence/deep-re
inforcement-learning-hands
Assessment
Chapter 1
1. A replay buffer is required for off-policy RL algorithms. We sample a mini-batch of experiences from the replay buffer and use it to train the Q(s,a) action-value function in DQN and the actor and critic networks in DDPG.
2. We discount rewards, as there is more uncertainty about the long-term
performance of the agent. So, immediate rewards have a higher weight,
a reward earned in the next time step has a relatively lower weight, a
reward earned in the subsequent time step has an even lower weight,
and so on.
3. The training of the agent will not be stable if γ > 1. The agent will fail
to learn an optimal policy.
4. A model-based RL agent has the potential to perform well, but there is
no guarantee that it will perform better than a model-free RL agent, as
the model of the environment we are constructing need not always be a
good one. It is also very hard to build an accurate enough model of the
environment.
5. In deep RL, deep neural networks are used for Q(s,a) and for the actor's policy (the latter in an actor-critic setting). In traditional RL algorithms, a tabular Q(s,a) is used, but this is not feasible when the number of states is very large, as is usually the case in most problems.
Chapter 3
1. A replay buffer is used in DQN in order to store past experiences,
sample a mini-batch of data from it, and use it to train the agent.
2. Target networks help in the stability of the training. This is achieved by
keeping an additional neural network whose weights are updated using
an exponential moving average of the weights of the main neural
network. Alternatively, another approach that is also widely used is to
copy the weights of the main neural network to the target network once
every few thousand steps or so.
3. One frame as the state will not help in the Atari Breakout problem. This
is because no temporal information is deductible from one frame only.
For instance, in one frame alone, the direction of motion of the ball
cannot be obtained. If, however, we stack up multiple frames, the
velocity and acceleration of the ball can be ascertained.
4. The L2 loss is known to be sensitive to outliers. Hence, the Huber loss is preferred, as it combines the benefits of the L2 and L1 losses. See Wikipedia: https://en.wikipedia.org/wiki/Huber_loss.
5. RGB images can also be used. However, we will need extra weights for
the first hidden layer of the neural network, as we now have three
channels in each of the four frames in the state stack. This much finer
detail for the state space is not required for Atari. However, RGB
images can help in other applications, for example, in autonomous
driving and/or robotics.
Chapter 4
1. DQN is known to overestimate the state-action value function, Q(s,a).
To overcome this, DDQN was introduced. DDQN has fewer problems
than DQN regarding the overestimation of Q(s,a).
2. Dueling network architecture has separate streams for the advantage
function and the state-value function. These are then combined to obtain
Q(s,a). This branching out and then combining is observed to result in a
more stable training of the RL agent.