
Improving Sample Efficiency in Model-Free Reinforcement Learning from Images

Denis Yarats 1 2   Amy Zhang 3 4 2   Ilya Kostrikov 1   Brandon Amos 2   Joelle Pineau 3 4 2   Rob Fergus 1 2

arXiv:1910.01741v3 [cs.LG] 9 Jul 2020

1 New York University  2 Facebook AI Research  3 McGill University  4 MILA. Correspondence to: Denis Yarats <[email protected]>.

Abstract

Training an agent to solve control tasks directly from high-dimensional images with model-free reinforcement learning (RL) has proven difficult. A promising approach is to learn a latent representation together with the control policy. However, fitting a high-capacity encoder using a scarce reward signal is sample inefficient and leads to poor performance. Prior work has shown that auxiliary losses, such as image reconstruction, can aid efficient representation learning. However, incorporating reconstruction loss into an off-policy learning algorithm often leads to training instability. We explore the underlying reasons and identify variational autoencoders, used by previous investigations, as the cause of the divergence. Following these findings, we propose effective techniques to improve training stability. This results in a simple approach capable of matching state-of-the-art model-free and model-based algorithms on MuJoCo control tasks. Furthermore, our approach demonstrates robustness to observational noise, surpassing existing approaches in this setting. Code, results, and videos are anonymously available at https://sites.google.com/view/sac-ae/home.

1. Introduction

Cameras are a convenient and inexpensive way to acquire state information, especially in complex, unstructured environments, where effective control requires access to the proprioceptive state of the underlying dynamics. Thus, having effective RL approaches that can utilize pixels as input would potentially enable solutions for a wide range of real world applications, for example robotics.

The challenge is to efficiently learn a mapping from pixels to an appropriate representation for control using only a sparse reward signal. Although deep convolutional encoders can learn good representations (upon which a policy can be trained), they require large amounts of training data. As existing reinforcement learning approaches already have poor sample complexity, this makes direct use of pixel-based inputs prohibitively slow. For example, model-free methods on Atari (Bellemare et al., 2013) and DeepMind Control (DMC) (Tassa et al., 2018) take tens of millions of steps (Mnih et al., 2013; Barth-Maron et al., 2018), which is impractical in many applications, especially robotics.

Some natural solutions to improve sample efficiency are i) to use off-policy methods and ii) to add an auxiliary task with an unsupervised objective. Off-policy methods enable more efficient sample re-use, while the simplest auxiliary task is an autoencoder with a pixel reconstruction objective. Prior work has attempted to learn state representations from pixels with autoencoders, utilizing a two-step training procedure, where the representation is first trained via the autoencoder, and then either a policy is learned on top of the fixed representation (Lange & Riedmiller, 2010; Munk et al., 2016; Higgins et al., 2017b; Zhang et al., 2018a; Nair et al., 2018; Dwibedi et al., 2018), or planning is performed on top of it (Mattner et al., 2012; Finn et al., 2015). This allows for additional stability in optimization by circumventing dueling training objectives, but leads to suboptimal policies. Other work utilizes continual model-free learning with an auxiliary reconstruction signal in an on-policy manner (Jaderberg et al., 2017; Shelhamer et al., 2016). However, these methods either do not report learning representations and a policy jointly in the off-policy setting, or note that it performs poorly (Shelhamer et al., 2016).

We revisit the concept of adding an autoencoder to model-free RL approaches, with a focus on off-policy algorithms. We perform a sequence of careful experiments to understand why previous approaches did not work well. We confirm that a pixel reconstruction loss is vital for learning a good representation, specifically when trained jointly, but requires careful design choices to succeed. Based on these findings, we recommend a simple and effective autoencoder-based off-policy method that can be trained
end-to-end. We believe this to be the first model-free off-policy approach to train the latent state representation and policy jointly and match performance with state-of-the-art model-based methods¹ (Hafner et al., 2018; Lee et al., 2019) on many challenging control tasks. In addition, we demonstrate robustness to observational noise and outperform prior methods in this more practical setup.

This paper makes three main contributions: (i) a methodical study of the issues involved with combining autoencoders with model-free RL in the off-policy setting that advises a successful variant we call SAC+AE; (ii) a demonstration of the robustness of our model-free approach over model-based methods on tasks with noisy observations; and (iii) an open-source PyTorch implementation of our simple and effective algorithm for researchers and practitioners to build upon.

¹ We define model-based methods as those that train a dynamics model. By this definition, SLAC (Lee et al., 2019) is a model-based method.

2. Related Work

Efficient learning from high-dimensional pixel observations has been a problem of paramount importance for model-free RL. While some impressive progress has been made applying model-free RL to domains with simple dynamics and discrete action spaces (Mnih et al., 2013), attempts to scale these approaches to complex continuous control environments have largely been unsuccessful, both in simulation and the real world. A glaring issue is that the RL signal is much sparser than in supervised learning, which leads to sample inefficiency, and higher dimensional observation spaces such as pixels worsen this problem.

One approach to alleviate this problem is training with auxiliary losses. Early work (Lange & Riedmiller, 2010) explores using deep autoencoders to learn feature spaces in visual reinforcement learning; crucially, Lange & Riedmiller (2010) propose to recompute features for all collected experiences after each update of the autoencoder, rendering this approach impractical to scale to more complicated domains. Moreover, this method has only been demonstrated on toy problems. Alternatively, Finn et al. (2015) apply deep autoencoder pretraining to real world robots without requiring iterative re-training, improving upon the computational complexity of earlier methods. However, in this work the linear policy is trained separately from the autoencoder, which we find to not perform as well as end-to-end methods.

Shelhamer et al. (2016) employ auxiliary losses to enhance performance of A3C (Mnih et al., 2016) on Atari. They recommend a multi-task setting and learning dynamics and reward to find a good representation, which relies on the assumption that the dynamics in the task are easy to learn and useful for learning a good policy. To prevent instabilities in learning, Shelhamer et al. (2016) pre-train the agent on randomly collected transitions and then perform joint optimization of the policy and auxiliary losses. Importantly, the learning is done completely on-policy: the policy loss is computed from rollouts while the auxiliary losses use samples from a small replay buffer. Yet, even with these precautions, the authors are unable to leverage reconstruction by a VAE (Kingma & Welling, 2013) and report its damaging effect on learning.

Similarly, Jaderberg et al. (2017) propose to use unsupervised auxiliary tasks, both observation and reward based, and show improvements in Atari, again in an on-policy regime², which is much more stable for learning. Of all the auxiliary tasks considered by Jaderberg et al. (2017), reconstruction-based Pixel Control is the most effective. However, in maximizing changes in local patches, it imposes strong inductive biases that assume that dramatically changing pixel values and textures are correlated with good exploration and reward. Unfortunately, such a highly task-specific auxiliary loss is unlikely to scale to real world applications.

² Jaderberg et al. (2017) make use of a replay buffer that only stores the most recent 2K transitions, a small fraction of the 25M transitions experienced in training.

Generic pixel reconstruction is explored in Higgins et al. (2017b); Nair et al. (2018), where the authors use a beta variational autoencoder (β-VAE) (Kingma & Welling, 2013; Higgins et al., 2017a) and attempt to perform joint representation learning, but find it hard to train, thus receding to the alternating training procedure (Lange & Riedmiller, 2010; Finn et al., 2015).

There has been more success in using model learning methods on images, such as Hafner et al. (2018); Lee et al. (2019). These methods use a world model approach (Ha & Schmidhuber, 2018), learning a representation space using a latent dynamics loss and pixel decoder loss to ground on the original observation space. These model-based reinforcement learning methods often show improved sample efficiency, but with the additional complexity of balancing various auxiliary losses, such as a dynamics loss, reward loss, and decoder loss, in addition to the original policy and value optimizations. These proposed methods are correspondingly brittle to hyperparameter settings, and difficult to reproduce, as they balance multiple training objectives.
Task name            Number of Episodes   SAC:pixel   PlaNet     SLAC       SAC:state
finger spin          1000                 645 ± 37    659 ± 45   900 ± 39   945 ± 19
walker walk          1000                 33 ± 2      949 ± 9    864 ± 35   974 ± 1
ball in cup catch    2000                 593 ± 84    861 ± 80   932 ± 14   981 ± 1
cartpole swingup     2000                 758 ± 58    802 ± 19   -          860 ± 8
reacher easy         2500                 121 ± 28    949 ± 25   -          953 ± 11
cheetah run          3000                 366 ± 68    701 ± 6    830 ± 32   836 ± 105

Table 1. A comparison of current methods: SAC from pixels, PlaNet, SLAC, and SAC from proprioceptive states (representing an upper bound). The large performance gap between SAC:pixel and SAC:state motivates us to address the representation learning bottleneck in model-free off-policy RL.

3. Background

3.1. Markov Decision Process

A fully observable Markov decision process (MDP) can be described as M = ⟨S, A, P, R, γ⟩, where S is the state space, A is the action space, P(st+1|st, at) is the transition probability distribution, R(st, at) is the reward function, and γ is the discount factor (Bellman, 1957). An agent starts in an initial state s1 sampled from a fixed distribution p(s1); then at each timestep t it takes an action at ∈ A from a state st ∈ S and moves to a next state st+1 ∼ P(·|st, at). After each action the agent receives a reward rt = R(st, at). We consider episodic environments with the length fixed to T. The goal of standard RL is to learn a policy π(at|st) that maximizes the agent's expected cumulative reward Σ_{t=1}^T E_{(st,at)∼ρπ}[rt], where ρπ is the discounted state-action visitation of π, also known as occupancy. An important modification (Ziebart et al., 2008) augments this objective with an entropy term H(π(·|st)) to encourage exploration and robustness to noise. The resulting maximum entropy objective is then defined as

    π* = arg max_π Σ_{t=1}^T E_{(st,at)∼ρπ}[rt + αH(π(·|st))],    (1)

where α is a temperature that balances between optimizing for the reward and for the stochasticity of the policy.

3.2. Soft Actor-Critic

Soft Actor-Critic (SAC) (Haarnoja et al., 2018) is an off-policy actor-critic method that uses the maximum entropy framework to derive soft policy iteration. At each iteration SAC performs soft policy evaluation and improvement steps. The policy evaluation step fits a parametric Q-function Q(st, at) using transitions sampled from the replay buffer D by minimizing the soft Bellman residual

    J(Q) = E_{(st,at,rt,st+1)∼D}[(Q(st, at) − rt − γV̄(st+1))²].    (2)

The target value function V̄ is approximated via a Monte-Carlo estimate of the following expectation

    V̄(st) = E_{at∼π}[Q̄(st, at) − α log π(at|st)],    (3)

where Q̄ is the target Q-function, parametrized by a weight vector obtained from an exponentially moving average of the Q-function weights to stabilize training. The policy improvement step then attempts to project a parametric policy π(at|st) by minimizing the KL divergence between the policy and a Boltzmann distribution induced by the Q-function, using the following objective

    J(π) = E_{st∼D}[DKL(π(·|st) || Q(st, ·))],    (4)

where Q(st, ·) ∝ exp{(1/α) Q(st, ·)}.
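To make Equations (2) and (3) concrete, the following is a minimal PyTorch sketch of the soft policy evaluation step, not the authors' released implementation: critic, critic_target and actor are assumed user-supplied nn.Module objects (the actor returning a sampled action and its log-probability), and the batch layout, discount and temperature values are illustrative.

import torch
import torch.nn.functional as F

def critic_loss(critic, critic_target, actor, batch, gamma=0.99, alpha=0.1):
    """Soft Bellman residual of Equations (2)-(3); a sketch with assumed module interfaces."""
    obs, action, reward, next_obs, not_done = batch
    with torch.no_grad():
        # V_bar(s') = E_{a'~pi}[Q_bar(s', a') - alpha * log pi(a'|s')]
        next_action, next_log_prob = actor(next_obs)
        target_q = critic_target(next_obs, next_action)
        target_v = target_q - alpha * next_log_prob
        target = reward + not_done * gamma * target_v      # r + gamma * V_bar(s')
    q = critic(obs, action)
    return F.mse_loss(q, target)                           # (Q(s, a) - target)^2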
3.3. Image-based Observations and Autoencoders

Directly learning from raw images poses an additional problem of partial observability, which is formalized by a partially observable MDP (POMDP). In this setting, instead of getting a low-dimensional state st ∈ S at time t, the agent receives a high-dimensional observation ot ∈ O, which is a rendering of a potentially incomplete view of the corresponding state st of the environment (Kaelbling et al., 1998). This complicates applying RL, as the agent now needs to also learn a compact latent representation to infer the state. Fitting a high-capacity encoder using only a scarce reward signal is sample inefficient and prone to suboptimal convergence. Following prior work (Lange & Riedmiller, 2010; Finn et al., 2015) we explore unsupervised pretraining via an image-based autoencoder (AE). In practice, the AE is represented as a convolutional encoder gφ that maps an image observation ot to a low-dimensional latent vector zt, and a deconvolutional decoder fθ that reconstructs zt back to the original image ot. Both the encoder and decoder are trained simultaneously by maximizing the expected log-likelihood

    J(AE) = E_{ot∼D}[log pθ(ot|zt)],    (5)

where zt = gφ(ot). Or, in the case of a β-VAE (Kingma & Welling, 2013; Higgins et al., 2017a), we maximize the
objective below

    J(VAE) = E_{ot∼D}[E_{zt∼qφ(zt|ot)}[log pθ(ot|zt)] − βDKL(qφ(zt|ot) || p(zt))],    (6)

where the variational distribution is parametrized as qφ(zt|ot) = N(zt | µφ(ot), σφ²(ot)). The latent vector zt is then used by an RL algorithm, such as SAC, instead of the unavailable true state st.
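As a point of reference for Equation (6), the sketch below computes the (negated) β-VAE objective with a diagonal Gaussian posterior and a reparametrized sample; encoder and decoder are assumed placeholder modules, and the unit-variance Gaussian likelihood (an MSE reconstruction term) and the default β are illustrative choices rather than the authors' exact setup.

import torch
import torch.nn.functional as F

def beta_vae_loss(encoder, decoder, obs, beta=1e-4):
    """Negative of the objective in Equation (6); a sketch under assumed interfaces."""
    mu, log_var = encoder(obs)                    # parameters of q(z|o) = N(mu, sigma^2)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)          # reparametrized sample z ~ q(z|o)
    recon = decoder(z)
    recon_loss = F.mse_loss(recon, obs)           # -log p(o|z) up to a constant
    # KL(q(z|o) || N(0, I)) in closed form
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=-1).mean()
    return recon_loss + beta * kl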
[Figure 1 shows renderings of the six tasks.] Figure 1. Image-based continuous control tasks from the DeepMind Control Suite (Tassa et al., 2018) used in our experiments. Each task offers a unique set of challenges, including complex dynamics, sparse rewards, hard exploration, and other traits (see Appendix A).

4. Representation Learning with Image Reconstruction

We start by noting a dramatic gap in an agent's performance when it learns from image-based observations rather than low-dimensional proprioceptive states. Table 1 illustrates that in all cases SAC:pixel (an agent that learns from pixels) is significantly outperformed by SAC:state (an agent that learns from states). This result suggests that attaining a compact state representation is key to enabling efficient RL from images. Prior work has demonstrated that auxiliary supervision can improve representation learning, which is further confirmed in Table 1 by the superior performance of model-based methods, such as PlaNet (Hafner et al., 2018) and SLAC (Lee et al., 2019), both of which make use of several auxiliary tasks to learn better representations.

While a wide range of auxiliary objectives could be added to aid effective representation learning, we focus our attention on the most general and widely applicable one – an image reconstruction loss. Furthermore, our goal is to develop a simple and robust algorithm that has the potential to be scaled up to real world applications (e.g. robotics). Correspondingly, we avoid task dependent auxiliary losses, such as Pixel Control from Jaderberg et al. (2017), or world-models (Shelhamer et al., 2016; Hafner et al., 2018; Lee et al., 2019). As noted by Gelada et al. (2019), the latter can be brittle to train for reasons including: i) tension between reward and transition losses, which requires careful tuning, and ii) difficulty in modeling complex dynamics (which we explore further in Section 5.2).

Following Nair et al. (2018); Hafner et al. (2018); Lee et al. (2019), which use a reconstruction loss to learn the representation space and dynamics model with a variational autoencoder (Kingma & Welling, 2013; Higgins et al., 2017a), we also employ a β-VAE to learn representations, but in contrast to Hafner et al. (2018); Lee et al. (2019) we only consider reconstructing the current frame, instead of reconstructing a temporal sequence of frames. Based on evidence from Lange & Riedmiller (2010); Finn et al. (2015); Nair et al. (2018) we first try alternating between learning the policy and the β-VAE, and in Section 4.2 observe a positive correlation between the alternation frequency and the agent's performance. However, this approach does not fully close the performance gap, as the learned representation is not optimized for the task's objective. To address this shortcoming, we then attempt to additionally update the β-VAE encoder with the actor-critic gradients. Unfortunately, our investigation in Section 4.3 shows this approach to be ineffective due to severe instability in training, especially with larger β values. Based on these results, in Section 4.4 we identify two reasons behind the instability that originate from the stochastic nature of a β-VAE and the non-stationary gradient from the actor. We then propose two simple remedies and in Section 4.5 introduce our method for effective model-free off-policy RL from images.

4.1. Experimental Setup

Before carrying out our empirical study, we detail the experimental setup. A more comprehensive overview can be found in Appendix B. We evaluate all agents on six challenging control tasks (Figure 1). For brevity, on occasion, results for three tasks are shown with the remainder presented in the appendix. An image observation is represented as a stack of three consecutive 84 × 84 RGB renderings (Mnih et al., 2013) to infer temporal statistics, such as velocity and acceleration. For simplicity, we keep the hyper parameters fixed across all the tasks, except for action repeat (see Appendix B.3), which we set according to Hafner et al. (2018) for a fair comparison to the baselines. We evaluate an agent after every 10K training observations by computing an average return over 10 episodes. For a reliable comparison we run 10 random seeds and report the mean and standard deviation of the evaluation reward.
4.2. Alternating Representation Learning with a β-VAE

We first set out to confirm the benefits of an alternating approach to representation learning in off-policy RL. We conduct an experiment where we initially pretrain the convolutional encoder gφ and deconvolutional decoder fθ of a β-VAE according to the loss J(VAE) (Equation (6)) on observations collected by a random policy. The actor and critic networks of SAC are then trained for N steps using latent states zt ∼ gφ(ot) as inputs instead of image-based observations ot. We keep the encoder gφ fixed during this period. The updated policy is then used to interact with the environment to gather new transitions that are consequently stored in the replay buffer. We continue iterating between the autoencoder and actor-critic updates until convergence. Note that the gradients are never shared between the β-VAE, which learns the representation space, and the actor-critic. In Figure 2 we vary the frequency N at which the representation space is updated, from N = ∞, where the representation is never updated after the initial pretraining period, to N = 1, where the representation is updated after every policy update. We observe a positive correlation between this frequency and the agent's performance. Although the alternating scheme helps to improve the sample efficiency of the agent, it still falls short of reaching the upper bound performance of SAC:state. This is not surprising, as the learned representation space is never optimized for the task's objective.

[Figure 2: learning curves on reacher easy, ball in cup catch, and walker walk.] Figure 2. Separate β-VAE and policy training with no shared gradients, SAC+VAE:pixel (iter, N), with SAC:state shown as an upper bound. N refers to the frequency, in environment steps, at which the β-VAE is updated after initial pretraining. More frequent updates are beneficial for learning better representations, but cannot fully address the gap in performance. Full results in Appendix C.
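The alternating scheme of this subsection can be summarized by the schematic training loop below. It is only a sketch: env, agent, autoencoder and replay_buffer are hypothetical objects (a gym-style environment API and fit/encode/act/update/add methods are assumptions), and n plays the role of the update frequency N.

def alternating_training(env, agent, autoencoder, replay_buffer,
                         pretrain_steps=1000, total_steps=1_000_000, n=100):
    # Phase 1: collect seed data with a random policy and pretrain the beta-VAE.
    obs = env.reset()
    for _ in range(pretrain_steps):
        action = env.action_space.sample()
        next_obs, reward, done, _ = env.step(action)
        replay_buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs
    autoencoder.fit(replay_buffer)

    # Phase 2: alternate actor-critic updates on latent states with periodic
    # autoencoder refits; no gradients are shared between the two.
    obs = env.reset()
    for step in range(total_steps):
        z = autoencoder.encode(obs)               # encoder is frozen for the agent
        action = agent.act(z)
        next_obs, reward, done, _ = env.step(action)
        replay_buffer.add(obs, action, reward, next_obs, done)
        agent.update(replay_buffer, autoencoder)  # SAC update using latent inputs
        if n is not None and step % n == 0:
            autoencoder.fit(replay_buffer)        # representation refresh every n steps
        obs = env.reset() if done else next_obs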
4.3. Joint Representation Learning with a β-VAE

To further improve performance of the agent we seek to learn a latent representation that is well aligned with the underlying RL objective. Shelhamer et al. (2016) demonstrated that joint policy and auxiliary objective optimization improves on the pretraining approach described in Section 4.2, but this has only been shown in the on-policy regime.

Thus we now attempt to verify the feasibility of joint representation learning with a β-VAE in the off-policy setting. Specifically, we want to update the encoder network gφ with the gradients coming through the latent state zt from the actor J(π) (Equation (4)), critic J(Q) (Equation (2)), and β-VAE J(VAE) (Equation (6)) losses. We thus take the best performing variant from the previous experiment (e.g. SAC+VAE:pixel (iter, 1)) and let the actor-critic's gradients update the encoder gφ. We tune for the best β and name this agent SAC+VAE:pixel. Results in Figure 3 show that joint representation learning with a β-VAE is unstable in the off-policy setting and performs worse than the baseline that does not utilize task dependent information (e.g. SAC+VAE:pixel (iter, 1)).

[Figure 3: learning curves on reacher easy, ball in cup catch, and walker walk.] Figure 3. An unsuccessful attempt to propagate gradients from the actor-critic down to the β-VAE encoder. SAC+VAE:pixel exhibits instability in training, which leads to subpar performance compared to the baseline SAC+VAE:pixel (iter, 1), which does not use the actor-critic gradients. Full results in Appendix D.

4.4. Stabilizing Joint Representation Learning

Following an unsuccessful attempt at joint representation learning with a β-VAE in off-policy RL, we investigate the root cause of the instability.

[Figure 4: learning curves on reacher easy, ball in cup catch, and walker walk. (a) Smaller values of β reduce the stochasticity of a β-VAE and lead to better performance. (b) Preventing the actor's gradients from updating the convolutional encoder helps to improve performance even further.] Figure 4. We identify two reasons for the subpar performance of joint representation learning: (a) the stochastic nature of a β-VAE, and (b) the non-stationary actor's gradients. Full results in Appendix E.
[Figure 5: architecture schematic — a convolutional encoder (4 conv layers of 32 3×3 filters) maps the 3×[3×84×84] observation through a linear layer to the latent, which feeds the actor/critic MLPs and a deconvolutional decoder (4 deconv layers of 32 3×3 filters) that produces the reconstruction target.] Figure 5. Our algorithm (SAC+AE) augments SAC with a regularized autoencoder to achieve stable training from images in the off-policy regime. The stability comes from switching to a deterministic encoder that is carefully updated with gradients from the reconstruction J(RAE) (Equation (7)) and soft Q-learning J(Q) (Equation (2)) objectives.

We first observe that the stochastic nature of a β-VAE damages performance of the agent. The results in Figure 4a illustrate that smaller values of β improve the training stability as well as the task performance. This motivates us to instead consider a completely deterministic autoencoder.

Furthermore, we observe that updating the convolutional encoder with the actor's gradients hurts the agent's performance. In Figure 4b we observe that blocking the actor's gradients from propagating to the encoder improves results considerably. This is because updating the encoder with the J(π) loss (Equation (4)) also changes the Q-function network inside the objective, due to the convolutional encoder being shared between the policy π and the Q-function. A similar phenomenon has been observed by Mnih et al. (2013), where the authors employ a static target Q-function to stabilize TD learning. It might appear that updating the encoder with only the critic's gradients would be insufficient to learn a task dependent representation space. However, the policy π in SAC is a parametric projection of a Boltzmann distribution induced by the Q-function, see Equation (4). Thus, the Q-function contains all relevant information about the task and allows the encoder to learn task dependent representations from the critic's gradient alone.

4.5. Our Approach SAC+AE: Joint Off-Policy Representation Learning

We now introduce our approach SAC+AE – a stable off-policy RL algorithm that learns from images, derived from the above findings. We first replace the β-VAE with a deterministic autoencoder. To preserve the regularization effects of a β-VAE we adopt the RAE approach of Ghosh et al. (2019), which imposes an L2 penalty on the learned representation zt and weight decay on the decoder parameters:

    J(RAE) = E_{ot∼D}[log pθ(ot|zt) + λz ||zt||² + λθ ||θ||²],    (7)

where zt = gφ(ot), and λz, λθ are hyper parameters.

We also prevent the actor's gradients from updating the convolutional encoder, as suggested in Section 4.4. Unfortunately, this slows down signal propagation to the encoder, and thus we find it important to update the convolutional weights of the target Q-function faster than the rest of the network's parameters. We thus employ different rates τQ and τenc (with τenc > τQ) to compute Polyak averaging over the corresponding parameters of the target Q-function. Our approach is summarized in Figure 5.
policy RL algorithm from images, derived from the above
2018), a model-based method that learns a dynamics model
findings. We first replace the β-VAE with a deterministic
with deterministic and stochastic latent variables and em-
autoencoder. To preserve the regularization affects of a β-
ploys cross-entropy planning for control; and SLAC (Lee
VAE we adopt the RAE approach of Ghosh et al. (2019),
et al., 2019), which combines a purely stochastic la-
which imposes a L2 penalty on the learned representation
tent model together with an model-free soft actor-critic.
zt and weight-decay on the decoder parameters
In addition, we compare against SAC:state that learns
from low-dimensional proprioceptive state, as an upper
J(RAE) = Eot ∼D log pθ (ot |zt ) + λz ||zt ||2 + λθ ||θ||2 ,
 
bound on performance. Results in Figure 6a illustrate
(7)
that SAC+AE:pixel matches the state-of-the-art model-
based methods such as PlaNet and SLAC, despite being
where zt = gφ (ot ), and λz , λθ are hyper parameters.
extremely simple and straightforward to implement.

[Figure 6: learning curves on reacher easy, ball in cup catch, walker walk, finger spin, cartpole swingup, and cheetah run. (a) Our method demonstrates significantly improved performance over the baseline SAC:pixel. Moreover, it matches the state-of-the-art performance of world-model based algorithms, such as PlaNet and SLAC, as well as the model-free algorithm D4PG, which learns directly from raw images. Our algorithm is extremely stable, robust, and straightforward to implement. (b) Methods that rely on forward modeling, such as PlaNet and SLAC, suffer severely from the background noise, while our approach is resistant to the distractors. Examples of background distractors are shown in Figure 7.] Figure 6. Two main results of our work. In (a) we demonstrate that our simple method matches the state-of-the-art performance on DMC tasks. In (b) we outperform the baselines on more complicated tasks where the observations are altered with noise.

5.2. Performance on Noisy Observations

Performing accurate forward-modeling predictions based on noisy observations is challenging and requires learning a high fidelity model that encapsulates strong inductive biases (Watters et al., 2017). The current state-of-the-art world-model based approaches (Hafner et al., 2018; Lee et al., 2019) rely solely on a general purpose recurrent state-space model parametrized with a β-VAE, and thus are highly vulnerable to observational noise. In contrast, the representations learned with just a reconstruction loss are better suited to handle the background noise.

To confirm this, we evaluate several agents on tasks where we add simple distractors in the background, consisting of colored balls bouncing off each other and the frame (Figure 7). We use image processing to filter away the static background and replace it with this dynamic noise, as proposed in Zhang et al. (2018b). We aim to emulate a common setup in a robotics lab, where various unrelated objects can affect the robot's observations. In Figure 6b we see that methods that rely on forward modeling perform drastically worse than our approach, showing that our method is more robust to background noise.

[Figure 7: example frames.] Figure 7. Backgrounds altered with randomly moving distractors.

5.3. Generalization to Unseen Tasks

Next, we show that the latent representation space learned by our method is able to generalize to different tasks without additional fine-tuning. We take three tasks, walker stand, walker walk, and walker run from DMC, which share the same observational appearance but have different reward functions. We train SAC+AE:pixel on the walker walk task until convergence and fix the encoder. Consequently, we train two
SAC:pixel agents on walker stand and walker run. The encoder of the first agent is initialized with weights from the encoder pretrained on walker walk, while the encoder of the second agent is not. Neither of the agents uses the reconstruction signal, and both only backpropagate the critic's gradients. Results in Figure 8 illustrate that our method learns latent representations that readily generalize to unseen tasks and achieve much better performance than SAC:pixel trained from scratch.

[Figure 8: learning curves on walker stand and walker run.] Figure 8. An encoder pretrained with our method (SAC+AE:pixel) on walker walk is able to generalize to the unseen walker stand and walker run tasks. All three tasks share similar image observations, but have quite different reward structure. SAC with an encoder pretrained on walker walk significantly outperforms the baseline.
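One possible way to set up this transfer experiment is sketched below, assuming agents expose actor.encoder and critic.encoder submodules (hypothetical attribute names): the new agent's encoders are initialized from the encoder pretrained on walker walk, and no reconstruction loss is used afterwards.

import copy

def init_from_pretrained(new_agent, pretrained_agent):
    """Sketch of the Section 5.3 transfer setup under assumed attribute names."""
    weights = copy.deepcopy(pretrained_agent.critic.encoder.state_dict())
    new_agent.critic.encoder.load_state_dict(weights)
    new_agent.actor.encoder.load_state_dict(weights)   # conv weights are shared
    return new_agent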
5.4. Representation Power of the Encoder

Finally, we want to determine whether our method is able to extract sufficient information from raw images to recover the corresponding proprioceptive states. We thus train SAC+AE:pixel and SAC:pixel until convergence on cheetah run and then fix their encoders. We then learn two linear projections to map the encoders' latent embeddings of image observations into the corresponding proprioceptive states. Finally, we compare the ground truth proprioceptive states against their reconstructions. We emphasize that the image encoder accounts for over 90% of the agent's parameters, thus we believe that the encoder's latent output zt captures a significant amount of information about the corresponding internal state in both cases, even though SAC:pixel does not require this explicitly. Results in Figure 9 confirm that the internals of the task are easily extracted from the encoder grounded on pixel observations, whereas they are much more difficult to construct from the representation learned by SAC:pixel.

[Figure 9: linear reconstructions of selected cheetah run position and velocity coordinates over an episode.] Figure 9. Linear projections of the latent representation spaces learned by our method (SAC+AE:pixel) and the baseline (SAC:pixel) onto proprioceptive states. We compare the ground truth value of each proprioceptive coordinate against its reconstruction for cheetah run, and conclude that our method successfully encodes proprioceptive state information. For visual clarity we only plot 2 position (out of 8) and 2 velocity (out of 9) coordinates. Full results in Appendix G.
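The linear projection used here can be read as an ordinary least-squares fit from frozen latent codes to proprioceptive states. A minimal sketch, assuming latents (N × 50) and states (N × state_dim) arrays collected with the fixed encoder:

import numpy as np

def linear_probe(latents, states):
    """Fit a linear map (with bias) from encoder outputs to proprioceptive states
    and return its predictions for comparison against the ground truth."""
    x = np.concatenate([latents, np.ones((latents.shape[0], 1))], axis=1)
    w, *_ = np.linalg.lstsq(x, states, rcond=None)   # least-squares projection
    return x @ w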
6. Discussion

For RL agents to be effective in the real world, where vision is one of the richest sensing modalities, we need sample efficient, robust algorithms that work from pixel observations. We pinpoint two strategies to obtain sample efficiency: i) use off-policy methods and ii) use self-supervised auxiliary losses. For methods to be robust, we want auxiliary losses that do not rely on task-specific inductive biases, so we focus on a simple reconstruction loss. In this work, we provide a thorough study into combining reconstruction loss with off-policy methods for improved sample efficiency in rich observation settings. Our analysis yields two key findings. The first is that deterministic AE models outperform β-VAEs (Higgins et al., 2017a), due to additional instabilities such as bootstrapping, off-policy data, and joint training with auxiliary losses. The second is that propagating the actor's gradients through the convolutional encoder hurts performance.

Based on these results, we also recommend an effective off-policy, model-free RL algorithm for pixel observations with only reconstruction loss as an auxiliary task. It is competitive with state-of-the-art model-based methods on traditional benchmarks, but much simpler and more robust, and does not require learning a dynamics model (Figure 6a). We show through ablations the superiority of joint learning over previous methods that use an alternating training procedure with separated gradients, the necessity of a pixel reconstruction loss over reconstruction to lower-dimensional “correct” representations, and the representation power and generalization ability of our learned representation. We additionally construct settings with distractors approximating real world noise, which show how learning a world-model as an auxiliary loss can be harmful (Figure 6b), and in which our method, SAC+AE, exhibits state-of-the-art performance.

In the Appendix we provide results across all experiments on the full suite of 6 tasks chosen from DMC (Appendix A), and the full set of hyperparameters used (Appendix B). There are also additional experiments on autoencoder capacity (Appendix F), a look at the optimality of the learned latent representation (Appendix I), and the importance of action repeat (Appendix J). Finally, we open-source our codebase for the community to spur future research in image-based RL.
References

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv e-prints, 2016.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. Distributional policy gradients. In International Conference on Learning Representations, 2018.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2013.

Bellman, R. A markovian decision process. Indiana Univ. Math. J., 1957.

Dwibedi, D., Tompson, J., Lynch, C., and Sermanet, P. Learning actionable representations from visual observations. CoRR, 2018.

Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Learning visual feature spaces for robotic manipulation with deep spatial autoencoders. CoRR, 2015.

Gelada, C., Kumar, S., Buckman, J., Nachum, O., and Bellemare, M. G. Deepmdp: Learning continuous latent space models for representation learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Ghosh, P., Sajjadi, M. S. M., Vergari, A., Black, M. J., and Schölkopf, B. From variational to deterministic autoencoders. arXiv e-prints, 2019.

Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.

Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017a.

Higgins, I., Pal, A., Rusu, A. A., Matthey, L., Burgess, C. P., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. Darla: Improving zero-shot transfer in reinforcement learning. CoRR, 2017b.

Jaderberg, M., Mnih, V., Czarnecki, W., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations, 2017.

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. arXiv e-prints, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Lange, S. and Riedmiller, M. A. Deep auto-encoder neural networks in reinforcement learning. In IJCNN, 2010.

Lee, A. X., Nagabandi, A., Abbeel, P., and Levine, S. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. arXiv e-prints, 2019.

Mattner, J., Lange, S., and Riedmiller, M. Learn to swing up and balance a real pole based on raw visual input data. In Neural Information Processing, 2012.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv e-prints, 2013.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. CoRR, 2016.

Munk, J., Kober, J., and Babuska, R. Learning state representation for deep actor-critic control. In Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC), 2016.

Nair, A. V., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems 31, pp. 9191–9200. Curran Associates, Inc., 2018.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv e-prints, 2013.
Shelhamer, E., Mahmoudieh, P., Argus, M., and Darrell, T. Loss is its own reward: Self-supervision for reinforcement learning. CoRR, 2016.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.

van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. arXiv e-prints, 2015.

Watters, N., Tacchetti, A., Weber, T., Pascanu, R., Battaglia, P. W., and Zoran, D. Visual interaction networks. CoRR, 2017.

Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S. S., and Pennington, J. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. arXiv e-prints, 2018.

Yarats, D. and Kostrikov, I. Soft actor-critic (sac) implementation in pytorch. https://github.com/denisyarats/pytorch_sac, 2020.

Zhang, A., Satija, H., and Pineau, J. Decoupling dynamics and reward for transfer learning. CoRR, 2018a.

Zhang, A., Wu, Y., and Pineau, J. Natural environment benchmarks for reinforcement learning. CoRR, 2018b.

Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, 2008.
Appendix
A. The DeepMind Control Suite
We evaluate the algorithms in the paper on the DeepMind control suite (DMC) (Tassa et al., 2018) – a collection of continuous control tasks that offers an excellent testbed for reinforcement learning agents. The software emphasizes the importance of having a standardised set of benchmarks with a unified reward structure in order to measure progress reliably.
Specifically, we consider six domains (see Figure 10) that result in twelve different control tasks. Each task (Table 2)
poses a particular set of challenges to a learning algorithm. The ball in cup catch task only provides the agent with
a sparse reward when the ball is caught; the cheetah run task offers high dimensional internal state and action spaces;
the reacher hard task requires the agent to explore the environment. We refer the reader to the original paper to find
more information about the benchmarks.

Task name                              dim(O) proprioceptive   dim(O) image-based   dim(A)   Reward type
ball in cup catch                      8                       3 × 84 × 84          2        sparse
cartpole {balance,swingup}             5                       3 × 84 × 84          1        dense
cheetah run                            17                      3 × 84 × 84          6        dense
finger {spin,turn easy,turn hard}      12                      3 × 84 × 84          2        dense/sparse
reacher {easy,hard}                    7                       3 × 84 × 84          2        sparse
walker {stand,walk,run}                24                      3 × 84 × 84          6        dense

Table 2. Specifications of observation space O (proprioceptive and image-based), action space A, and the reward type for each task.

(a) ball in cup (b) cartpole (c) cheetah

(d) finger (e) reacher (f) walker

Figure 10. Our testbed consists of six domains spanning the total of twelve challenging continuous con-
trol tasks: finger {spin,turn easy,turn hard}, cartpole {balance,swingup}, cheetah run,
walker {stand,walk,run}, reacher {easy,hard}, and ball in cup catch.
B. Hyper Parameters and Setup


Our PyTorch SAC (Haarnoja et al., 2018) implementation is based on Yarats & Kostrikov (2020).

B.1. Actor and Critic Networks


We employ double Q-learning (van Hasselt et al., 2015) for the critic, where each Q-function is parametrized as a 3-layer MLP with ReLU activations after each layer except the last. The actor is also a 3-layer MLP with ReLUs that outputs the mean and covariance of the diagonal Gaussian that represents the policy. The hidden dimension is set to 1024 for both the critic and the actor.
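A possible PyTorch realization of this critic is sketched below; the latent dimension of 50 follows Appendix B.2, while the action dimension is task dependent and shown here only as an example.

import torch
import torch.nn as nn

def q_mlp(latent_dim, action_dim, hidden_dim=1024):
    """One Q-network: a 3-layer MLP with ReLU activations after every layer except the last."""
    return nn.Sequential(
        nn.Linear(latent_dim + action_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, 1),
    )

class DoubleQCritic(nn.Module):
    """Double Q-learning critic: two independent Q-heads over the latent-action pair."""
    def __init__(self, latent_dim=50, action_dim=6, hidden_dim=1024):
        super().__init__()
        self.q1 = q_mlp(latent_dim, action_dim, hidden_dim)
        self.q2 = q_mlp(latent_dim, action_dim, hidden_dim)

    def forward(self, z, action):
        za = torch.cat([z, action], dim=-1)
        return self.q1(za), self.q2(za)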

B.2. Encoder and Decoder Networks


We employ an almost identical encoder architecture to that of Tassa et al. (2018), with two minor differences. First, we add two more convolutional layers to the convnet trunk. Second, we use ReLU activations after each conv layer, instead of ELU. We employ kernels of size 3 × 3 with 32 channels for all the conv layers and set the stride to 1 everywhere, except for the first conv layer, which has stride 2. We then take the output of the convnet and feed it into a single fully-connected layer normalized by LayerNorm (Ba et al., 2016). Finally, we add a tanh nonlinearity to the 50-dimensional output of the fully-connected layer.
The actor and critic networks both have separate encoders, although we share the weights of the conv layers between them.
Furthermore, only the critic optimizer is allowed to update these weights (e.g. we truncate the gradients from the actor
before they propagate to the shared conv layers).
The decoder consists of one fully-connected layer followed by four deconv layers. We use ReLU activations after each layer, except the final deconv layer, which produces the pixel representation. Each deconv layer has kernels of size 3 × 3 with 32 channels and stride 1, except for the last layer, where the stride is 2.
We then combine the critic's encoder together with the decoder specified above into an autoencoder. Note that, because we share conv weights between the critic's and actor's encoders, the conv layers of the actor's encoder are also affected by the reconstruction signal from the autoencoder.
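The encoder described above can be sketched as follows; the four-conv-layer trunk, 3 × 3 kernels with 32 channels, stride 2 then 1, LayerNorm and tanh over a 50-dimensional output follow the text, while details such as the absence of padding and the 9-channel input (three stacked RGB frames) are assumptions.

import torch
import torch.nn as nn

class PixelEncoder(nn.Module):
    """Convolutional encoder sketch matching the description in B.2."""
    def __init__(self, obs_shape=(9, 84, 84), feature_dim=50):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
        )
        with torch.no_grad():  # infer the flattened size of the conv trunk output
            n_flat = self.convs(torch.zeros(1, *obs_shape)).flatten(1).shape[1]
        self.head = nn.Sequential(
            nn.Linear(n_flat, feature_dim), nn.LayerNorm(feature_dim), nn.Tanh()
        )

    def forward(self, obs):
        return self.head(self.convs(obs).flatten(1))

The decoder would mirror this structure with a fully-connected layer followed by four deconvolutional layers, as described above.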

B.3. Training and Evaluation Setup


We first collect 1000 seed observations using a random policy. We then collect training observations by sampling actions
from the current policy. We perform one training update every time we receive a new observation. In cases where we use
action repeat, the number of training observations is only a fraction of the environment steps (e.g. a 1000-step episode at action repeat 4 results in only 250 training observations). The action repeat used for each environment is specified in Table 3, following the values used by PlaNet and SLAC.
We evaluate our agent after every 10000 environment steps by computing an average episode return over 10 evaluation
episodes. Instead of sampling from the Gaussian policy we take its mean during evaluation.
We preserve this setup throughout all the experiments in the paper.

Task name Action repeat


cartpole swingup 8
reacher easy 4
cheetah run 4
finger spin 2
ball in cup catch 4
walker walk 2

Table 3. Action repeat parameter used per task, following PlaNet and SLAC.
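Action repeat can be implemented as a thin environment wrapper that repeats each action and accumulates the reward, so that one training observation covers several environment steps; the sketch below assumes a gym-style step/reset API rather than any specific library wrapper.

class ActionRepeat:
    """Repeat each action `k` times (Table 3) and sum the intermediate rewards."""
    def __init__(self, env, k):
        self.env, self.k = env, k

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, obs, done, info = 0.0, None, False, {}
        for _ in range(self.k):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info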
B.4. Weights Initialization


We initialize the weight matrix of fully-connected layers with the orthogonal initialization (Saxe et al., 2013) and set the
bias to be zero. For convolutional and deconvolutional layers we use delta-orthogonal initialization (Xiao et al., 2018).
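A sketch of this initialization scheme is given below. PyTorch ships an orthogonal initializer but not a delta-orthogonal one, so the conv branch places an orthogonal matrix at the kernel centre and zeros elsewhere; this only approximates Xiao et al. (2018) (for instance, it assumes square kernels and omits any gain factor) and is not the authors' exact code.

import torch
import torch.nn as nn

def weight_init(m):
    """Orthogonal init for linear layers, delta-orthogonal-style init for (de)conv layers."""
    if isinstance(m, nn.Linear):
        nn.init.orthogonal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        with torch.no_grad():
            m.weight.zero_()
            mid = m.weight.shape[2] // 2                      # centre of the (square) kernel
            ortho = torch.empty(m.weight.shape[0], m.weight.shape[1])
            nn.init.orthogonal_(ortho)
            m.weight[:, :, mid, mid] = ortho
            if m.bias is not None:
                m.bias.zero_()

# usage: model.apply(weight_init)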

B.5. Regularization
We regularize the autoencoder network using the scheme proposed in Ghosh et al. (2019). In particular, we extend the
standard reconstruction loss for a deterministic autoencoder with an L2 penalty on the learned representation z and add weight decay on the decoder parameters θ:

    J(RAE) = E_{ot∼D}[log pθ(ot|zt) + λz ||zt||² + λθ ||θ||²], where zt = gφ(ot).    (8)

We set λz = 10−6 and λθ = 10−7 .
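A sketch of this regularized objective, with the λ values above as defaults, is shown below; the MSE term stands in for the negative log-likelihood, and in practice the decoder weight decay can equivalently be passed to the decoder's optimizer.

import torch
import torch.nn.functional as F

def rae_loss(encoder, decoder, obs, target, lambda_z=1e-6, lambda_theta=1e-7):
    """RAE objective of Equation (8): reconstruction plus an L2 penalty on the latent
    and weight decay on the decoder parameters. `target` is the preprocessed
    reconstruction target (see B.6)."""
    z = encoder(obs)
    recon = decoder(z)
    recon_loss = F.mse_loss(recon, target)                   # -log p(o|z) up to a constant
    latent_penalty = z.pow(2).sum(dim=-1).mean()             # ||z||^2 term
    decoder_penalty = sum(p.pow(2).sum() for p in decoder.parameters())
    return recon_loss + lambda_z * latent_penalty + lambda_theta * decoder_penalty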

B.6. Pixels Preprocessing


We construct an observational input as a 3-stack of consecutive frames (Mnih et al., 2013), where each frame is an RGB
rendering of size 84 × 84 from the 0th camera. We then divide each pixel by 255 to scale it down to [0, 1) range. For
reconstruction targets we instead preprocess images by reducing bit depth to 5 bits as in Kingma & Dhariwal (2018).
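A sketch of this preprocessing is shown below; the exact quantization recipe of Kingma & Dhariwal (2018) may differ in details (e.g. added dequantization noise), so this is an approximation of the target preprocessing described above.

import torch

def preprocess_obs(obs):
    """Scale a uint8 frame stack to the [0, 1) range for the encoder input."""
    return obs.float() / 255.0

def preprocess_target(obs, bits=5):
    """Reduce the bit depth of the reconstruction target to `bits` (here 5) and rescale."""
    bins = 2 ** bits
    x = torch.floor(obs.float() / 2 ** (8 - bits))   # keep the top `bits` bits
    return x / bins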

B.7. Other Hyper Parameters


We also provide a comprehensive overview of all the remaining hyper parameters in Table 4.

Parameter name Value


Replay buffer capacity 1000000
Batch size 128
Discount γ 0.99
Optimizer Adam
Critic learning rate 10−3
Critic target update frequency 2
Critic Q-function soft-update rate τQ 0.01
Critic encoder soft-update rate τenc 0.05
Actor learning rate 10−3
Actor update frequency 2
Actor log stddev bounds [−10, 2]
Autoencoder learning rate 10−3
Temperature learning rate 10−4
Temperature Adam’s β1 0.5
Init temperature 0.1

Table 4. A complete overview of used hyper parameters.


C. Alternating Representation Learning with a β-VAE


Iterative pretraining, as suggested in Lange & Riedmiller (2010); Finn et al. (2015), allows for faster representation learning, which consequently boosts the final performance, yet it is not sufficient to fully close the gap; additional modifications, such as joint training, are needed. Figure 11 provides additional results for the experiment described in Section 4.2.

[Figure 11: learning curves on reacher easy, ball in cup catch, walker walk, finger spin, cartpole swingup, and cheetah run.] Figure 11. Separate β-VAE and policy training with no shared gradients, SAC+VAE:pixel (iter, N), with SAC:state shown as an upper bound. N refers to the frequency, in environment steps, at which the β-VAE is updated after initial pretraining. More frequent updates are beneficial for learning better representations, but cannot fully address the gap in performance.
D. Joint Representation Learning with a β-VAE


Additional results for the experiments from Section 4.3 are shown in Figure 4a and Figure 12.

[Figure 12: learning curves on reacher easy, ball in cup catch, walker walk, finger spin, cartpole swingup, and cheetah run.] Figure 12. An unsuccessful attempt to propagate gradients from the actor-critic down to the encoder of the β-VAE to enable joint off-policy training. The learning process of SAC+VAE:pixel exhibits instability together with subpar performance compared to the baseline SAC+VAE:pixel (iter, 1), which does not share gradients with the actor-critic.
E. Stabilizing Joint Representation Learning


Additional results for the experiments from Section 4.4 are shown in Figure 13.

[Figure 13: learning curves on reacher easy, ball in cup catch, walker walk, finger spin, cartpole swingup, and cheetah run. (a) Smaller values of β reduce the stochasticity of a β-VAE and lead to better performance. (b) Preventing the actor's gradients from updating the convolutional encoder helps to improve performance even further.] Figure 13. We identify two reasons for the subpar performance of joint representation learning: (a) the stochastic nature of a β-VAE, and (b) the non-stationary actor's gradients.
F. Capacity of the Autoencoder


We also investigate various autoencoder capacities for the different tasks. Specifically, we measure the impact of changing
the capacity of the convolutional trunk of the encoder and corresponding deconvolutional trunk of the decoder. Here, we
maintain the shared weights across convolutional layers between the actor and critic, but modify the number of convolu-
tional layers and number of filters per layer in Figure 14 across several environments. We find that SAC+AE is robust to
various autoencoder capacities, and all architectures tried were capable of extracting the relevant features from pixel space
necessary to learn a good policy. We use the same training and evaluation setup as detailed in Appendix B.3.

[Figure 14 plots: average return vs. environment steps (×10^6) on reacher_easy, ball_in_cup_catch, walker_walk, finger_spin, cartpole_swingup, and cheetah_run; curves for SAC:state, SAC:pixel, and SAC+AE:pixel with encoder capacities 2x32, 4x32, 4x64, and 6x64.]

Figure 14. Different autoencoder architectures, where we vary the number of conv layers and the number of output channels per layer
in both the encoder and decoder. For example, 4 × 32 specifies an architecture with 4 conv layers, each outputting 32 channels.
We observe that the difference in capacity has only a limited effect on final performance.
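A sketch of how such a capacity sweep can be parameterized is given below. The kernel sizes, strides, and the assumption of 9 input channels (three stacked RGB frames) are illustrative choices, not the verbatim architecture.

```python
import torch.nn as nn

# Sketch of the capacity sweep in Figure 14: labels such as "4x32" are read as
# (number of conv layers) x (filters per layer). Hyperparameters here are assumptions.
def make_conv_trunk(num_layers: int = 4, num_filters: int = 32, obs_channels: int = 9) -> nn.Sequential:
    layers = [nn.Conv2d(obs_channels, num_filters, kernel_size=3, stride=2), nn.ReLU()]
    for _ in range(num_layers - 1):
        layers += [nn.Conv2d(num_filters, num_filters, kernel_size=3, stride=1), nn.ReLU()]
    return nn.Sequential(*layers)

# Example: the four encoder variants compared in Figure 14.
variants = {label: make_conv_trunk(n, f)
            for label, (n, f) in {"2x32": (2, 32), "4x32": (4, 32),
                                  "4x64": (4, 64), "6x64": (6, 64)}.items()}
```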

G. Representation Power of the Encoder


Additional results for the experiment in Section 5.4, which demonstrates the encoder’s ability to reconstruct the proprioceptive state
from image observations, are shown in Figure 15.

[Figure 15 plots: ground-truth values and reconstructions of each proprioceptive coordinate (position[0]–position[7], velocity[0]–velocity[8]) over 200 episode steps; curves for Truth, SAC+AE:pixel, and SAC:pixel.]

Figure 15. Linear projections of the latent representation spaces learned by our method (SAC+AE:pixel) and the baseline (SAC:pixel) onto
proprioceptive states. We compare the ground-truth value of each proprioceptive coordinate against its reconstruction for cheetah run,
and conclude that our method successfully encodes proprioceptive state information. The proprioceptive state of cheetah run has 8
position and 9 velocity coordinates.
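The linear-projection evaluation can be reproduced roughly as follows. Here latents and states are assumed to be arrays of encoder outputs and the corresponding ground-truth proprioceptive states collected along an episode; the exact protocol (e.g., any train/evaluation split) is an assumption rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def linear_probe(latents: np.ndarray, states: np.ndarray):
    """Fit a linear map from latent vectors to proprioceptive states."""
    probe = LinearRegression().fit(latents, states)
    predictions = probe.predict(latents)               # curves plotted against ground truth
    per_coord_mse = np.mean((predictions - states) ** 2, axis=0)
    return predictions, per_coord_mse
```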

H. Decoding to Proprioceptive State


Learning from low-dimensional proprioceptive observations achieves better final performance with greater sample effi-
ciency (see Figure 6a for a comparison to the pixel baselines); our intuition is therefore to use these compact observations
directly as reconstruction targets to generate an auxiliary signal. Although this is an unrealistic setup, since we
do not have access to proprioceptive states in practice, we use it as a tool to understand whether such supervision is beneficial
for representation learning and can therefore lead to good performance. We augment the observational encoder gφ, which
maps an image ot into a latent vector zt, with a state decoder fθ, which restores the corresponding state st from the latent
vector zt. This leads to the auxiliary objective
E(ot ,st)∼D [ ½ ‖fθ(zt) − st‖²₂ ], where zt = gφ(ot).
We parametrize the state decoder fθ as a 3-layer MLP with hidden size 1024 and ReLU activations, and train it jointly with the actor-critic network.
Such auxiliary supervision helps less than expected, and surprisingly hurts performance in ball in cup catch, as seen
in Figure 16. Our intuition is that such low-dimensional supervision is not able to provide the rich reconstruction error
needed to fit the high-capacity convolutional encoder gφ. We therefore seek a denser auxiliary signal and turn to learning latent
representation spaces with pixel reconstructions.
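A minimal sketch of this auxiliary head is shown below. The 3-layer MLP with 1024 hidden units and ReLU activations follows the description above; the encoder gφ and the surrounding training loop are assumed to exist and are not shown.

```python
import torch.nn as nn
import torch.nn.functional as F

class StateDecoder(nn.Module):
    """f_theta: restores the proprioceptive state s_t from the latent z_t."""
    def __init__(self, latent_dim: int, state_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, z):
        return self.net(z)

def state_decoding_loss(encoder, state_decoder, obs, state):
    z = encoder(obs)                                   # z_t = g_phi(o_t)
    return 0.5 * F.mse_loss(state_decoder(z), state)   # 1/2 ||f_theta(z_t) - s_t||^2
```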

[Figure 16 plots: average return vs. environment steps (×10^6) on reacher_easy, ball_in_cup_catch, walker_walk, finger_spin, cartpole_swingup, and cheetah_run; curves for SAC:state, SAC:pixel, and SAC:pixel (state supervision).]

Figure 16. An auxiliary signal is provided by reconstructing the low-dimensional proprioceptive state from the corresponding image observation. Perhaps
surprisingly, such supervision does not guarantee sufficient signal to fit the high-capacity encoder, which we infer from the
suboptimal performance of SAC:pixel (state supervision) compared to SAC:pixel on ball in cup catch.

I. Optimality of Learned Latent Representation


We define the optimality of the learned latent representation as the ability of our model to extract and preserve all information
from the pixel observations that is relevant for learning a good policy. For example, the proprioceptive state repre-
sentation is clearly better than the pixel representation, because it allows a better policy to be learned. However, the difference in
performance between SAC:state and SAC+AE:pixel can be attributed not only to the different observation spaces, but also to the
difference in the data collected in the replay buffer. To decouple these factors and determine how much information is lost
in moving from proprioceptive states to pixel images, we measure the final task reward of policies learned from the
same fixed replay buffer, where one policy is trained on proprioceptive states and the other on pixel observations.
We first train a SAC+AE policy until convergence and save the replay buffer collected during training. Importantly,
the replay buffer stores both the pixel observations and the corresponding proprioceptive states. Note that the two
policies trained on this fixed replay buffer operate in a fully off-policy regime, so it is possible that neither will perform
as well as the policy that collected the data.
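A sketch of this fixed-buffer protocol is given below, with hypothetical agent and buffer interfaces; the only property it is meant to convey is that transitions are sampled from the saved buffer and never added to it.

```python
def train_on_fixed_buffer(agent, fixed_buffer, obs_key, num_updates=1_000_000):
    # obs_key selects "pixels" or "state" from the saved transitions; no new
    # environment interaction takes place during training.
    for _ in range(num_updates):
        batch = fixed_buffer.sample(obs_key=obs_key)
        agent.update(batch)
    return agent
```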

[Figure 17 plots: average return vs. environment steps (×10^6) on reacher_easy, ball_in_cup_catch, walker_walk, finger_spin, cartpole_swingup, and cheetah_run; curves for SAC+AE:pixel (collector), SAC+AE:pixel (fixed buffer), and SAC:state (fixed buffer).]

Figure 17. Training curves for the policy used to collect the buffer (SAC+AE:pixel (collector)), and the two policies learned on that
buffer using proprioceptive (SAC:state (fixed buffer)) and pixel observations (SAC+AE:pixel (fixed buffer)). We see that our method
actually outperforms proprioceptive observations in this setting.

In Figure 17 we find, surprisingly, that our learned latent representation outperforms the proprioceptive state on a fixed buffer.
This could be because the data in the buffer was collected by a policy that was itself learned from pixel observations, and is different
enough from the data a state-based policy would collect that SAC:state underperforms in this setting.

J. Importance of Action Repeat


We found that repeating nominal actions several times has a significant effect on learning dynamics and final reward. Prior
works (Hafner et al., 2018; Lee et al., 2019) treat action repeat as a hyperparameter of the learning algorithm, rather than
as a property of the target environment. Effectively, action repeat shortens the control horizon of the task and makes the
control dynamics more stable. Yet, action repeat can also introduce a harmful bias that prevents the agent from learning
an optimal policy, due to the injected control lag. This leaves the practitioner with the problem of finding a value of the action
repeat hyperparameter that stabilizes training without limiting control flexibility too much.
To gain more insight, we perform an ablation study in which we sweep over several choices of action repeat on multiple
control tasks and compare the results against PlaNet (Hafner et al., 2018) with its original action repeat setting,
which was also tuned per environment. We use the same setup as detailed in Appendix B.3. Specifically, we average
performance over 10 random seeds and reduce the number of training observations in inverse proportion to the action
repeat value. The results are shown in Figure 18. We observe that PlaNet’s choice of action repeat is not always optimal
for our algorithm. For example, we can significantly improve the performance of our agent on the ball in cup catch task
if, instead of applying the same nominal action four times as PlaNet suggests, we apply it only once or twice. The same is true for a
few other environments.
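For reference, the action repeat used in this sweep can be implemented with a simple wrapper like the one below, written against a generic gym-style step/reset interface (an assumption, not the exact wrapper used in our code).

```python
class ActionRepeatWrapper:
    """Apply the same nominal action `repeat` times and sum the rewards."""
    def __init__(self, env, repeat: int = 4):
        self.env = env
        self.repeat = repeat

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        obs = None
        for _ in range(self.repeat):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:                      # stop early if the episode terminates
                break
        return obs, total_reward, done, info
```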

[Figure 18 plots: average return vs. environment steps (×10^6) on reacher_easy, ball_in_cup_catch, walker_walk, finger_spin, cartpole_swingup, and cheetah_run; curves for SAC:state and SAC+AE:pixel with action repeat 1, 2, and 4. The per-task PlaNet action repeat settings are 4 (reacher_easy), 4 (ball_in_cup_catch), 2 (walker_walk), 2 (finger_spin), 8 (cartpole_swingup), and 4 (cheetah_run).]

Figure 18. We study the effect of the action repeat hyperparameter on final performance. We evaluate three settings,
in which the agent applies a sampled action once (SAC+AE:pixel (1)), twice (SAC+AE:pixel (2)), or four times (SAC+AE:pixel (4)). As
a reference, we also plot the PlaNet (Hafner et al., 2018) results with their original per-task action repeat settings. Action repeat has a significant
effect on learning. Moreover, PlaNet’s choice of hyperparameter is not always optimal for our method (e.g., on walker walk it is better
to apply an action only once than twice).
