Model-Based Reinforcement Learning via Meta-Policy Optimization
PMLR (2018)
Ignasi Clavera∗ (UC Berkeley), Jonas Rothfuss∗ (KIT, UC Berkeley), John Schulman (OpenAI)
[email protected]  [email protected]
∗ Equal contribution
1 Introduction
Most of the recent success in reinforcement learning was achieved using model-free reinforcement
learning algorithms [1, 2, 3]. Model-free (MF) algorithms tend to achieve optimal performance,
are generally applicable, and are easy to implement. However, this comes at the cost of being data-intensive, which is exacerbated when combined with high-capacity function approximators like neural networks. Their high sample complexity presents a major barrier to their application to robotic control tasks, where data gathering is expensive.
In contrast, model-based (MB) reinforcement learning methods are able to learn with significantly
fewer samples by using a learned model of the environment dynamics against which policy opti-
mization is performed. Learning dynamics models can be done in a sample efficient way since they
are trained with standard supervised learning techniques, allowing the use of off-policy data. How-
ever, accurate dynamics models can often be far more complex than good policies. For instance,
pouring water into a cup can be achieved by a fairly simple policy while modeling the underlying
dynamics of this task is highly complex. Hence, model-based methods have only been able to learn
good policies on a much more limited set of problems, and even when good policies are learned,
they typically saturate in performance at a level well below their model-free counterparts [4, 5].
Model-based approaches tend to rely on accurate (learned) dynamics models to solve a task. If
the dynamics model is not sufficiently precise, the policy optimization is prone to overfit to the deficiencies of the model, leading to suboptimal behavior or even to catastrophic failures. This
problem is known in the literature as model-bias [6]. Previous work has tried to alleviate model-bias
by characterizing the uncertainty of the models and learning a robust policy [6, 7, 8, 9, 10], often
using ensembles to represent the posterior. This paper also uses ensembles, but very differently.
We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach orthogonal to previous model-based RL methods: while traditional model-based RL methods rely on the learned dynamics models being sufficiently accurate for policy optimization, MB-MPO meta-learns a policy over an ensemble of learned models so that it can quickly adapt to any of them, thereby alleviating the strong reliance on precise models.
2 Related Work
In this section, we discuss related work, including model-based RL and approaches that combine
elements of model-based and model-free RL. Finally, we outline recent advances in the field of
meta-learning.
Model-Based Reinforcement Learning: Addressing Model Inaccuracies. Impressive results
with model-based RL have been obtained using simple linear models [16, 17, 18, 19]. However,
like Bayesian models [6, 20, 21], their application is limited to low-dimensional domains. Our
approach, which uses neural networks (NNs), is easily able to scale to complex high dimensional
control problems. NNs for model learning offer the potential to scale to higher dimensional problems
with impressive sample complexity [22, 23, 24, 25]. A major challenge when using high-capacity
dynamics models is preventing policies from exploiting model inaccuracies. Several works approach
this problem of model-bias by learning a distribution of models [26, 7, 10, 23], or by learning adap-
tive models [27, 28, 29]. We incorporate the idea of reducing model-bias by learning an ensemble
of models. However, we show that these techniques do not suffice in challenging domains, and
demonstrate the necessity of meta-learning for improving asymptotic performance.
Past work has also tried to overcome model inaccuracies through the policy optimization pro-
cess. Model Predictive Control (MPC) compensates for model imperfections by re-planning at each
step [30], but it suffers from limited credit assignment and high computational cost. Robust policy
optimization [7, 8, 9] looks for a policy that performs well across models; as a result policies tend to
be over-conservative. In contrast, we show that MB-MPO learns a robust policy in the regions where
the models agree, and an adaptive one where the models yield substantially different predictions.
Model-Based + Model-Free Reinforcement Learning. Naturally, it is desirable to combine el-
ements of model-based and model-free to attain high performance with low sample complexity.
Attempts to combine them can be broadly categorized into three main approaches. First, differ-
entiable trajectory optimization methods propagate the gradients of the policy or value function
through the learned dynamics model [31, 32] . However, the models are not explicitly trained to
approximate first order derivatives, and, when backpropagating, they suffer from exploding and
vanishing gradients [10]. Second, model-assisted MF approaches use the dynamics models to aug-
ment the real environment data by imagining policy roll-outs [33, 29, 34, 22]. These methods still
rely to a large degree on real-world data, which makes them impractical for real-world applications.
Thanks to meta-learning, our approach could, if needed, adapt quickly to the real world with few additional samples. Third, recent work fully decouples the MF module from the real environment by entirely using samples from the learned models [35, 10]. These methods, even though they consider model uncertainty, still rely on precise estimates of the dynamics to learn the policy. In contrast, we meta-learn a policy on an ensemble of models, which alleviates the strong reliance on precise models by training for adaptation when the prediction uncertainty is high. Kurutach et al. [10] can be viewed as an edge case of our algorithm in which no adaptation is performed.
Meta-Learning. Our approach makes use of meta-learning to address model inaccuracies. Meta-
learning algorithms aim to learn models that can adapt to new scenarios or tasks with few data
points. Current meta-learning algorithms can be classified into three categories. One approach in-
volves training a recurrent or memory-augmented network that ingests a training dataset and outputs
the parameters of a learner model [36, 37]. Another set of methods feeds the dataset followed by the
test data into a recurrent model that outputs the predictions for the test inputs [12, 38]. The last cat-
egory embeds the structure of optimization problems into the meta-learning algorithm [11, 39, 40].
These algorithms have been extended to the context of RL [12, 13, 15, 11]. Our work builds upon
MAML [11]. However, while in previous meta-learning methods each task is typically defined by a
different reward function, each of our tasks is defined by the dynamics of different learned models.
3 Background
3.2 Meta-Reinforcement Learning
Meta-RL aims to learn a learning algorithm that can quickly find optimal policies in MDPs $\mathcal{M}_k$ drawn from a distribution $\rho(\mathcal{M})$ over a set of MDPs. The MDPs $\mathcal{M}_k$ may differ in their reward function $r_k(s, a)$ and transition distribution $p_k(s_{t+1}|s_t, a_t)$, but share the action space $\mathcal{A}$ and state space $\mathcal{S}$.
Our approach builds on the gradient-based meta-learning framework MAML [11], which, in the RL setting, trains a parametric policy $\pi_\theta(a|s)$ to quickly improve its performance on a new task with one or a few vanilla policy gradient steps. The meta-training objective for MAML can be written as:
$$\max_\theta \; \mathbb{E}_{\mathcal{M}_k \sim \rho(\mathcal{M})} \Bigg[ \mathbb{E}_{\substack{s_{t+1} \sim p_k \\ a_t \sim \pi_{\theta'}(a_t|s_t)}} \Big[ \sum_{t=0}^{H-1} r_k(s_t, a_t) \Big] \Bigg] \quad \text{s.t.:} \quad \theta' = \theta + \alpha \, \nabla_\theta \, \mathbb{E}_{\substack{s_{t+1} \sim p_k \\ a_t \sim \pi_\theta(a_t|s_t)}} \Big[ \sum_{t=0}^{H-1} r_k(s_t, a_t) \Big] \tag{1}$$
MAML attempts to learn an initialization $\theta^*$ such that, for any task $\mathcal{M}_k \sim \rho(\mathcal{M})$, the policy attains maximum performance in the respective task after one policy gradient step.
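To make the structure of Eq. (1) concrete, the following is a minimal sketch of one gradient-based meta-update in the spirit of MAML for RL. The task sampler and the policy_gradient estimator are hypothetical placeholders, and the sketch uses a first-order approximation that drops the second-order terms of the full MAML gradient; it is an illustration, not the authors' implementation.

import numpy as np

def maml_rl_step(theta, sample_tasks, policy_gradient, alpha=0.01, beta=0.001, n_tasks=8):
    """One meta-update in the spirit of Eq. (1), using a first-order approximation.

    theta           -- flat policy parameter vector (np.ndarray)
    sample_tasks    -- hypothetical callable: n -> list of task objects (MDPs M_k)
    policy_gradient -- hypothetical callable: (task, theta) -> estimate of grad_theta J_task(theta)
    """
    meta_grad = np.zeros_like(theta)
    for task in sample_tasks(n_tasks):
        # Inner step: one vanilla policy gradient step adapts theta to the task.
        theta_adapted = theta + alpha * policy_gradient(task, theta)
        # Outer objective: return of the adapted parameters; the gradient is taken
        # at theta_adapted and second-order terms are ignored (first-order MAML).
        meta_grad += policy_gradient(task, theta_adapted)
    # Gradient ascent on the average post-adaptation return.
    return theta + beta * meta_grad / n_tasks

The full MAML objective differentiates through the inner update; the sketch above omits those second-order terms purely for brevity.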
4 Model-Based Meta-Policy-Optimization
Enabling complex and high-dimensional real robotics tasks requires extending current model-based methods to the capabilities of model-free ones while maintaining their data efficiency. Our approach, model-based meta-policy-optimization (MB-MPO), attains this goal by framing model-based RL as meta-learning a policy on a distribution of dynamics models, advocating to maximize policy adaptation, instead of robustness, when the models disagree. This not only removes the arduous task of optimizing for a single policy that performs well across differing dynamics models, but also results in better exploration properties and higher diversity of the collected samples, which leads to improved dynamics estimates.
We instantiate this general framework by employing an ensemble of learned dynamics models and meta-learning a policy that can be quickly adapted to any of the dynamics models with one policy
gradient step. In the following, we first describe how the models are learned, then explain how the
policy can be meta-trained on an ensemble of models, and, finally, we present our overall algorithm.
4.1 Model Learning
A key component of our method is learning a distribution over the real environment dynamics, represented as an ensemble of models. In order to decorrelate the models, each model differs in its random initialization and is trained on a different randomly selected subset $D_k$ of the collected real-environment samples. To address the distributional shift that occurs as the policy changes throughout the meta-optimization, we frequently collect samples under the current policy, aggregate them with the previous data $D$, and retrain the dynamics models with warm starts.
In our experiments, we consider the dynamics models to be a deterministic function of the current state $s_t$ and action $a_t$, employing a feed-forward neural network to approximate them. We follow the standard practice in model-based RL of training the neural network to predict the change in state $\Delta s = s_{t+1} - s_t$ (rather than the next state $s_{t+1}$) [22, 6]. We denote by $\hat{f}_\phi$ the function approximator for the next state, which is the sum of the input state and the output of the neural network. The objective for learning each model $\hat{f}_{\phi_k}$ of the ensemble is to find the parameter vector $\phi_k$ that minimizes the $\ell_2$ one-step prediction loss:
$$\min_{\phi_k} \; \frac{1}{|D_k|} \sum_{(s_t, a_t, s_{t+1}) \in D_k} \big\| s_{t+1} - \hat{f}_{\phi_k}(s_t, a_t) \big\|_2^2 \tag{2}$$
where $D_k$ is a sampled subset of the training dataset $D$ that stores the transitions the agent has experienced. We follow standard techniques to avoid overfitting and facilitate fast learning: 1) early stopping based on the validation loss, 2) normalizing the inputs and outputs of the neural network, and 3) weight normalization [41].
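As an illustration of Eq. (2), the sketch below trains a single ensemble member to predict the state change $\Delta s$ with PyTorch. The network width, optimizer settings, and the omission of input/output normalization and weight normalization are simplifications; none of these constants are taken from the paper.

import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """One ensemble member: predicts delta_s = s_{t+1} - s_t from (s_t, a_t)."""
    def __init__(self, obs_dim, act_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, s, a):
        # f_hat(s, a) = s + NN(s, a): the network outputs the change in state.
        return s + self.net(torch.cat([s, a], dim=-1))

def train_model(model, s, a, s_next, epochs=50, batch_size=500, lr=1e-3):
    """Minimize the l2 one-step prediction loss of Eq. (2) on a data subset D_k."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    n = s.shape[0]
    for _ in range(epochs):
        perm = torch.randperm(n)
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]
            loss = ((model(s[idx], a[idx]) - s_next[idx]) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model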
4.2 Meta-Policy Optimization on Learned Models
Given an ensemble of learned dynamics models for a particular environment, our core idea is to learn a policy that can adapt quickly to any of these models. To learn this policy, we use gradient-based meta-learning with MAML (described in Section 3.2). To properly formulate this problem in the context of meta-learning, we first need to define an appropriate task distribution. Considering the models $\{\hat{f}_{\phi_1}, \hat{f}_{\phi_2}, \dots, \hat{f}_{\phi_K}\}$, which approximate the dynamics of the true environment, we construct a uniform task distribution by embedding them into different MDPs $\mathcal{M}_k = (\mathcal{S}, \mathcal{A}, \hat{f}_{\phi_k}, r, \gamma, p_0)$. We note that, unlike the experimental settings of prior methods [12, 11, 14], in our work the reward function remains the same across tasks while the dynamics vary. Therefore, each task constitutes a different belief about what the dynamics of the true environment could be. Finally, we pose our objective as the following meta-optimization problem:
$$\max_\theta \; \frac{1}{K} \sum_{k=1}^{K} J_k(\theta'_k) \quad \text{s.t.:} \quad \theta'_k = \theta + \alpha \, \nabla_\theta J_k(\theta) \tag{3}$$
with $J_k(\theta)$ being the expected return under the policy $\pi_\theta$ on the estimated dynamics model $\hat{f}_{\phi_k}$:
$$J_k(\theta) = \mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)} \Big[ \sum_{t=0}^{H-1} r(s_t, a_t) \;\Big|\; s_{t+1} = \hat{f}_{\phi_k}(s_t, a_t) \Big] \tag{4}$$
For estimating the expectation in Eq. 4 and computing the corresponding gradients, we sample trajectories from the imagined MDPs. The rewards are computed by evaluating the reward function, which we assume to be given, at the predicted states and actions, $r(\hat{f}_{\phi_k}(s_{t-1}, a_{t-1}), a_t)$. In particular, when estimating the adaptation objectives $J_k(\theta)$, the meta-policy $\pi_\theta$ is used to sample a set of imaginary trajectories $\mathcal{T}_k$ for each model $\hat{f}_{\phi_k}$. For the meta-objective $\frac{1}{K}\sum_{k=1}^{K} J_k(\theta'_k)$, we generate trajectory roll-outs $\mathcal{T}'_k$ with the models $\hat{f}_{\phi_k}$ and the policies $\pi_{\theta'_k}$ obtained by adapting the parameters $\theta$ to the $k$-th model. Thus, no real-world data is used for the data-intensive step of meta-policy optimization.
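For concreteness, a minimal sketch of generating an imaginary rollout inside one learned model, as used to estimate $J_k(\theta)$; the policy interface (sample_action) and the reward-function signature are assumptions made for illustration.

import numpy as np

def imagined_rollout(model, policy, reward_fn, s0, horizon=200):
    """Roll out `policy` inside the learned dynamics `model` starting from s0.

    model     -- callable (s, a) -> predicted next state (a learned f_hat)
    policy    -- object with a hypothetical sample_action(s) method
    reward_fn -- known reward function r(s, a)
    Returns the imagined states, actions, and the undiscounted return.
    """
    states, actions, rewards = [s0], [], []
    s = s0
    for _ in range(horizon):
        a = policy.sample_action(s)
        r = reward_fn(s, a)   # reward is assumed given and evaluated on predictions
        s = model(s, a)       # next state comes from the model, not the real environment
        actions.append(a)
        rewards.append(r)
        states.append(s)
    return states, actions, float(np.sum(rewards))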
In practice, any policy gradient algorithm can be chosen to perform the meta-update of the policy parameters. In our implementation, we use Trust Region Policy Optimization (TRPO) [1] to maximize the meta-objective, and vanilla policy gradient (VPG) [42] for the adaptation step. To reduce the variance of the policy gradient estimates, a linear reward baseline is used.
Algorithm 1 MB-MPO
Require: inner and outer step sizes α, β
 1: Initialize the policy π_θ, the models f̂_{φ_1}, ..., f̂_{φ_K}, and D ← ∅
 2: repeat
 3:   Sample trajectories from the real environment with the adapted policies π_{θ'_1}, ..., π_{θ'_K}; add them to D
 4:   Train all models using D
 5:   for all models f̂_{φ_k} do
 6:     Sample imaginary trajectories T_k from f̂_{φ_k} using π_θ
 7:     Compute adapted parameters θ'_k = θ + α ∇_θ J_k(θ) using the trajectories T_k
 8:     Sample imaginary trajectories T'_k from f̂_{φ_k} using the adapted policy π_{θ'_k}
 9:   end for
10:   Update θ ← θ + β (1/K) Σ_k ∇_θ J_k(θ'_k) using the trajectories T'_k
11: until the policy performs well in the real environment
12: return optimal pre-update parameters θ*
4.3 Algorithm
In the following, we describe the overall algorithm of our approach (see Algorithm 1). First, we initialize the models and the policy with different random weights. Then, we proceed to the data collection step. In the first iteration, a uniform random controller is used to collect data from the real world, which is stored in a buffer D. In subsequent iterations, trajectories from the real world are collected with the adapted policies $\{\pi_{\theta'_1}, \dots, \pi_{\theta'_K}\}$ and aggregated with the trajectories from previous iterations. The models are trained on the aggregated real-environment samples following the procedure explained in Section 4.1. The algorithm proceeds by imagining trajectories from each model of the ensemble $\{\hat{f}_{\phi_1}, \dots, \hat{f}_{\phi_K}\}$ using the policy $\pi_\theta$. These trajectories are used to perform the inner adaptation policy gradient step, yielding the adapted policies $\{\pi_{\theta'_1}, \dots, \pi_{\theta'_K}\}$. Finally, we generate imaginary trajectories using the adapted policies $\pi_{\theta'_k}$ and models $\hat{f}_{\phi_k}$, and optimize the policy towards the meta-objective (as explained in Section 4.2). We iterate through these steps until the desired performance is reached. The algorithm returns the optimal pre-update parameters $\theta^*$.
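A compact sketch of the outer loop of Algorithm 1 is given below; collect_real_trajectories, train_model, policy_gradient, and meta_update are hypothetical helpers standing in for the components described above, and the parameter-handling interface (policy.params, policy.with_params) is invented for illustration.

def mb_mpo(env, policy, models, reward_fn,
           collect_real_trajectories, train_model, policy_gradient, meta_update,
           alpha=1e-3, n_iterations=100):
    """Skeleton of the MB-MPO loop (Algorithm 1); every helper passed in is a placeholder."""
    data = []                                   # buffer D of real-environment transitions
    adapted_policies = [policy] * len(models)   # stand-in for the random controller of iteration 1
    for _ in range(n_iterations):
        # Step 3: collect real trajectories with the adapted (post-update) policies.
        data += collect_real_trajectories(env, adapted_policies)
        # Step 4: refit each ensemble member on (a random subset of) the aggregated data.
        models = [train_model(model, data) for model in models]
        # Steps 5-9: one inner VPG adaptation step per model, using imagined rollouts T_k.
        adapted_policies = []
        for model in models:
            grad = policy_gradient(policy, model, reward_fn)
            adapted_policies.append(policy.with_params(policy.params + alpha * grad))
        # Step 10: meta-update of the pre-update parameters theta (e.g., with TRPO),
        # using imagined rollouts T'_k of the adapted policies in their respective models.
        policy = meta_update(policy, models, adapted_policies, reward_fn)
    return policy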
Tailored data collection. Since real-world trajectories are collected with the adapted policies (Algorithm 1, step 3), the training data is more diverse, which promotes robustness of the dynamics models. Specifically, the adapted policies tend to exploit the characteristic deficiencies of the respective dynamics models. As a result, we collect real-world data in regions where the dynamics models insufficiently approximate the true dynamics. This effect accelerates correcting the imprecision of the models, leading to faster improvement. In Appendix A.1, we experimentally show the positive effect of tailored data collection on performance.
Fast fine-tuning. Meta-learning optimizes a policy for fast adaptation [11] to a set of tasks. In our case, each task corresponds to a different belief about what the real environment dynamics might be. When optimal performance has not yet been achieved, the ensemble of models will exhibit high discrepancy in its predictions, increasing the likelihood that the real dynamics lie in the support of the belief distribution. As a result, the learned policy is likely to exhibit high adaptability towards the real environment, and fine-tuning the policy with VPG on the real environment leads to faster convergence than training the policy from scratch or from any other MB initialization.
Simplicity. Our approach, contrary to previous methods, is simple: it does not rely on parameter-noise exploration, careful reinitialization of the model weights or the policy's entropy, or hard-to-train probabilistic models, and it does not need to address model distribution mismatch [23, 10, 35].
6 Experiments
The aim of our experimental evaluation is to examine the following questions: 1) How does MB-
MPO compare against state-of-the-art model-free and model-based methods in terms of sample
complexity and asymptotic performance? 2) How does the model uncertainty influence the policy’s
plasticity? 3) How robust is our method against imperfect models?
To answer the posed questions, we evaluate our approach on six continuous control benchmark tasks in the Mujoco simulator [44]. A depiction of the environments as well as a detailed description of the experimental setup can be found in Appendix A.3. In all of the following experiments, the average returns reported for our method are those of the pre-update policy. The reported performance is an average over at least three random seeds. The source code and the experiment data are available on our supplementary website (https://sites.google.com/view/mb-mpo).
We compare our method in sample complexity and performance to four state-of-the-art model-free RL algorithms: Deep Deterministic Policy Gradient (DDPG) [2], Trust Region Policy Optimization (TRPO) [1], Proximal Policy Optimization (PPO) [45], and Actor Critic using Kronecker-Factored Trust Region (ACKTR) [46]. The results are shown in Figure 1.
Figure 1: Learning curves of MB-MPO ("ours") and four state-of-the-art model-free methods in six different Mujoco environments with a horizon of 200. MB-MPO is able to match the asymptotic performance of model-free methods with two orders of magnitude fewer samples.
In all the locomotion tasks, we are able to achieve maximum performance using between 10 and 100 times less data than model-free methods. In the most challenging domains (ant, hopper, and walker2D), the data complexity of our method is two orders of magnitude lower than that of the MF methods. In the easier tasks (the simulated PR2 and swimmer), our method achieves the same performance as the MF methods using 20-50× less data. These results highlight the benefit of MB-MPO for real robotics tasks: the amount of real-world data needed for attaining maximum return corresponds to 30 minutes of experience in the easier domains and to 90 minutes in the more complex ones.
We also compare our method against recent model-based work: Model-Ensemble Trust-Region
Policy Optimization (ME-TRPO) [10], and the model-based approach introduced in Nagabandi
et al. [22], which uses MPC for planning (MB-MPC).
Figure 2: Learning curves of MB-MPO (“ours”) and two MB methods in 6 different Mujoco en-
vironments with a horizon of 200. MB-MPO achieves better asymptotic performance and faster
convergence rate than previous MB methods.
The results, shown in Figure 2, highlight the strength of MB-MPO in complex tasks. MB-MPC struggles to perform well on tasks that require robust planning, and completely fails in tasks where medium/long-term planning is necessary (as in the case of hopper). In contrast, ME-TRPO is able to learn better policies, but the convergence to such policies is slower when compared to MB-MPO. Furthermore, while ME-TRPO converges to suboptimal policies in complex domains, MB-MPO is able to achieve maximum performance.

[Figure: heat maps of the model-ensemble standard deviation and of the KL-divergence between the pre- and post-update policy over the state space; only the panel titles and axis labels were recoverable.]
6.3 Model Uncertainty and Policy Plasticity
We hypothesize that the meta-optimization steers the policy towards higher plasticity in regions of high dynamics-model uncertainty.
Figure 4: Comparison of MB-MPO ("ours") and ME-TRPO using 5 biased and noisy dynamics models in the half-cheetah environment with a horizon of 100 time steps. A bias term b is sampled uniformly from the denoted interval in every iteration, and Gaussian noise N(b, 0.1²) is added to the predicted observation.
We pose the question of how robust our proposed algorithm is with respect to imperfect dynamics predictions. We examine it in two ways. First, with an illustrative example of a model with clearly wrong dynamics: we add biased Gaussian noise N(b, 0.1²) to the next-state prediction, where the bias b ∼ U(0, b_max) is re-sampled in every iteration for each model. Second, we present a realistic case in which long-horizon predictions are needed: bootstrapping the model predictions over long horizons leads to large compounding errors, making policy learning on such predictions challenging.
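A minimal sketch of the perturbation used in this experiment, wrapping a learned model's next-state prediction with biased Gaussian noise; the model callable and its interface are assumptions made for illustration.

import numpy as np

def make_noisy_model(model, b_max=0.5, sigma=0.1, rng=None):
    """Return a perturbed dynamics model: prediction + N(b, sigma^2) noise,
    with the bias b drawn from U(0, b_max) (re-drawn every iteration in the paper)."""
    rng = rng or np.random.default_rng()
    bias = rng.uniform(0.0, b_max)

    def noisy_model(s, a):
        pred = model(s, a)  # learned next-state prediction f_hat(s, a)
        return pred + rng.normal(loc=bias, scale=sigma, size=np.shape(pred))

    return noisy_model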
Figure 4 depicts the performance comparison between our method and ME-TRPO on the half-cheetah environment for various values of b_max. The results indicate that our method consistently outperforms ME-TRPO when exposed to biased and noisy dynamics models. ME-TRPO catastrophically fails to learn a policy in the presence of strong bias (i.e., b_max = 0.5 and b_max = 1.0), but our method, despite the strongly compromised dynamics predictions, is still able to learn a locomotion behavior with a positive forward velocity.

This property also manifests itself in long-horizon tasks. Figure 5 compares the performance of our approach with inner learning rate α = 10⁻³ against the edge case α = 0, where no adaptation takes place. For each random seed, MB-MPO steadily converges to maximum performance. However, when there is no adaptation, learning becomes unstable and different seeds exhibit different behavior: proper learning, getting stuck in sub-optimal behavior, and even unlearning good behaviors.

Figure 5: Comparison of our method with and without adaptation. Depicted is the development of average returns during training with three different random seeds on the half-cheetah environment with a horizon of 1000 time steps.
7 Conclusion
In this paper, we present a simple and generally applicable algorithm, model-based meta-policy op-
timization (MB-MPO), that learns an ensemble of dynamics models and meta-optimizes a policy for
adaptation in each of the learned models. Our experimental results demonstrate that meta-learning
a policy over an ensemble of learned models provides the recipe for reaching the same level of per-
formance as state-of-the-art model-free methods with substantially lower sample complexity. We
also compare our method against previous model-based approaches, obtaining better performance
and faster convergence. Our analysis demonstrates the ineffectiveness of prior approaches at combating model-bias, and showcases the robustness of our method against imperfect models. As a result, we are able to extend model-based RL to more complex domains and longer horizons. One direction that
merits further investigation is the usage of Bayesian neural networks, instead of ensembles, to learn
a distribution of dynamics models. Finally, an exciting direction of future work is the application of
MB-MPO to real-world systems.
Acknowledgments
We thank A. Gupta, C. Finn, and T. Kurutach for their feedback on an earlier draft of the paper. IC was supported by a La Caixa Fellowship. The research leading to these results received funding
from the EU Horizon 2020 Research and Innovation programme under grant agreement No. 731761
(IMAGINE) and was supported by Berkeley Deep Drive, Amazon Web Services, and Huawei.
References
[1] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust Region Policy Optimization. ICML,
2015.
[2] T. P. Lillicrap et al. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
[3] D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, Jan. 2016.
[4] M. P. Deisenroth, G. Neumann, and J. Peters. A survey on policy search for robotics. Found. Trends
Robot, 2, 2013.
[5] V. Pong, S. Gu, M. Dalal, and S. Levine. Temporal Difference Models: Model-Free Deep RL for Model-
Based Control. In ICLR, 2018.
[6] M. Deisenroth and C. E. Rasmussen. Pilco: A model-based and data-efficient approach to policy search.
In ICML, pages 465–472, 2011.
[7] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine. EPOpt: Learning Robust Neural Network Policies
Using Model Ensembles. 10 2016.
[8] K. Zhou, J. C. Doyle, and K. Glover. Robust and Optimal Control. Prentice-Hall, Inc., 1996.
[9] S. H. Lim, H. Xu, and S. Mannor. Reinforcement learning in robust markov decision processes. In NIPS.
2013.
[10] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Model-Ensemble Trust-Region Policy Opti-
mization. In ICLR, 2018.
[11] C. Finn, P. Abbeel, and S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.
In ICML, 2017.
[12] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning. 2016.
[13] J. X. Wang et al. Learning to reinforcement learn. CoRR, abs/1611.05763, 2017.
[14] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A Simple Neural Attentive Meta-Learner. In ICLR,
7 2018.
[15] F. Sung, L. Zhang, T. Xiang, T. M. Hospedales, and Y. Yang. Learning to learn: Meta-critic networks for
sample efficient learning. CoRR, abs/1706.09529, 2017.
[16] J. A. Bagnell and J. G. Schneider. Autonomous helicopter control using reinforcement learning policy
search methods. In ICRA, volume 2, pages 1615–1620. IEEE, 2001.
[17] P. Abbeel, M. Quigley, and A. Y. Ng. Using inaccurate models in reinforcement learning. In ICML, 2006.
[18] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown
dynamics. In NIPS, pages 1071–1079, 2014.
[19] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of
Machine Learning Research, 17(39):1–40, 2016.
[20] D. Nguyen-Tuong, M. Seeger, and J. Peters. Local gaussian process regression for real time online model
learning and control. In NIPS, pages 1193–1200, 2009.
[21] S. Kamthe and M. P. Deisenroth. Data-efficient reinforcement learning with probabilistic model predictive
control. CoRR, abs/1706.06491, 2017.
[22] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural Network Dynamics for Model-Based Deep
Reinforcement Learning with Model-Free Fine-Tuning. 8 2017.
[23] K. Chua, R. Calandra, R. Mcallister, and S. Levine. Deep Reinforcement Learning in a Handful of Trials
using Probabilistic Dynamics Models. 2019.
[24] A. Punjani and P. Abbeel. Deep learning helicopter dynamics models. In ICRA, pages 3223–3230, 2015.
[25] N. Wahlström, T. B. Schön, and M. P. Deisenroth. From pixels to torques: Policy learning with deep
dynamical models. CoRR, abs/1502.02251, 2015.
[26] S. Depeweg, F. Doshi-velez, and S. Udluft. Learning and Policy Search in Stochastic Dynamical Systems
with Bayesian Neural Networks. In ICML, 2017.
[27] I. Clavera, A. Nagabandi, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn. Learning to adapt: Meta-
learning for model-based control. CoRR, abs/1803.11347, 2018.
[28] J. Fu, S. Levine, and P. Abbeel. One-shot learning of manipulation skills with online dynamics adaptation
and neural network priors. CoRR, abs/1509.06841, 2015.
[29] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based accelera-
tion. In ICML. JMLR.org, 2016.
[30] I. Lenz, R. A. Knepper, and A. Saxena. Deepmpc: Learning deep latent features for model predictive
control. In Robotics: Science and Systems, 2015.
[31] N. Mishra, P. Abbeel, and I. Mordatch. Prediction and Control with Temporal Segment Models. In ICML,
2017.
[32] N. Heess et al. Learning Continuous Control Policies by Stochastic Value Gradients. 10 2015.
[33] R. S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4), 1991.
[34] T. Weber et al. Imagination-Augmented Agents for Deep Reinforcement Learning.
[35] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine. Model-Based Value Expansion
for Efficient Model-Free Reinforcement Learning. 2018.
[36] J. Schmidhuber. Evolutionary principles in self-referential learning. on learning now to learn: The meta-
meta-meta...-hook. Diploma thesis, Technische Universitat Munchen, Germany, 14 May 1987.
[37] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas.
Learning to learn by gradient descent by gradient descent. CoRR, abs/1606.04474, 2016.
[38] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. One-shot learning with memory-
augmented neural networks. arXiv preprint arXiv:1605.06065, 2016.
[39] M. Hüsken and C. Goerick. Fast learning for problem classes using a knowledge based network initial-
ization. In IJCNN, 2000.
[40] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2018.
[41] T. Salimans and D. P. Kingma. Weight Normalization: A Simple Reparameterization to Accelerate Train-
ing of Deep Neural Networks. In NIPS, 2 2016.
[42] J. Peters and S. Schaal. Policy gradient methods for robotics. In IROS, pages 2219–2225, Oct 2006.
[43] A. Nichol, J. Achiam, and J. Schulman. On First-Order Meta-Learning Algorithms. 2018.
[44] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In IROS, 2012.
[45] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algo-
rithms. CoRR, 2017.
[46] Y. Wu, E. Mansimov, S. Liao, R. B. Grosse, and J. Ba. Scalable trust-region method for deep reinforce-
ment learning using kronecker-factored approximation. CoRR, abs/1708.05144, 2017.
[47] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-
regret online learning. In AISTATS, 2011.
[48] G. Brockman et al. Openai gym. CoRR, abs/1606.01540, 2016.
[49] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using
generalized advantage estimation. In ICLR, 2016.
[50] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, pages 1329–1338, 2016.
A Appendix
A.1 Tailored Data Collection
We present the effects of collecting data using tailored exploration. We refer to tailored exploration as collecting data with the post-update policies, i.e., the policies adapted to each specific model. When policies are trained on learned models, they tend to exploit the deficiencies of the model and thus overfit to it. Using the post-update policies to collect data results in exploring the regions of the state space where these policies overfit and the model is inaccurate. Iteratively collecting data in the regions where the models are inaccurate has been shown to greatly improve performance [47].
Figure 6: Tailored exploration study in the half-cheetah and walker2D environments. "True" means the data is collected using tailored exploration, and "False" means it is not, i.e., the pre-update policy is used to collect data.
The effect of using tailored exploration is shown in Figure 6. In the half-cheetah and the walker, we obtain an improvement of 12% and 11%, respectively. The tailored exploration effect cannot be accomplished by robust optimization algorithms, such as ME-TRPO, which learn a single policy that is robust across models. Data collection with such a policy does not exploit the regions in which each model fails, resulting in less accurate models.
A.2 Hyperparameter Study
We perform a hyperparameter study (see Figure 7) to assess the sensitivity of MB-MPO to its parameters. Specifically, we vary the inner learning rate α, the size of the ensemble, and the number of meta gradient steps before collecting further real-environment samples. Consistent with the results in Figure 5, we find that adaptation significantly improves performance when compared to the non-adaptive case of α = 0. Increasing the number of models and meta gradient steps per iteration results in higher performance at a computational cost. However, as the computational burden increases, the performance gains diminish.
Up to a certain level, increasing the number of meta gradient steps per iteration improves performance. However, too many meta gradient steps (e.g., 60) can lead to early convergence to a suboptimal policy. This may be because the variance of the Gaussian policy distribution is also learned: the policy's variance usually decreases during training, and if the number of meta-gradient steps is too large, the policy loses its exploration capabilities too early and can hardly improve once the models become more accurate. This problem can be alleviated by using a fixed policy variance, or by adding an entropy bonus to the learning objective.
A.3 Experiment Setup
In the following, we provide a detailed description of the setup used in the experiments presented in Section 6.
Environments:
Figure 7: Hyperparameter study in the half-cheetah environment of a) the inner learning rate α, b) the number of dynamics models in the ensemble, and c) the number of meta gradient steps before collecting real environment samples and refitting the dynamics models.
Figure 8: Mujoco environments used in our experiments. From left to right: swimmer, half-cheetah, walker2D, PR2, hopper, and ant.
We benchmark MB-MPO on six continuous control benchmark tasks in the Mujoco simulator [44], shown in Fig. 8. Five of these tasks, namely swimmer, half-cheetah, walker2D, hopper, and ant, involve robotic locomotion and are provided through the OpenAI gym [48]. In the sixth task, the 7-DoF arm of the PR2 robot has to reach arbitrary end-effector positions; the PR2 is torque-controlled. The reward function is comprised of the squared distance of the end-effector (TCP) to the goal and energy/control costs.
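A plausible general form of such a reward, written here only as an illustration (the control-cost weight $c$ and the end-effector position map $x_{\mathrm{TCP}}$ are assumed, not taken from the paper):

r(s, a) \;=\; -\,\big\lVert x_{\mathrm{TCP}}(s) - x_{\mathrm{goal}} \big\rVert_2^2 \;-\; c\,\lVert a \rVert_2^2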
Policy: We use a Gaussian policy $\pi_\theta(a|s) = \mathcal{N}(a \,|\, \mu_{\theta_\mu}(s), \sigma_{\theta_\sigma})$ with diagonal covariance matrix. The mean $\mu_{\theta_\mu}(s)$ is computed by a neural network (2 hidden layers of size 32, tanh nonlinearity) which receives the current state $s$ as input. During policy optimization, both the weights $\theta_\mu$ of the neural network and the standard deviation vector $\sigma_{\theta_\sigma}$ are learned.
Advantage-Estimation: We use generalized advantage estimation (GAE) [49] with γ = 0.99 and
λ = 1 in conjunction with a linear reward baseline as in [50] to estimate advantages.
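For reference, a minimal sketch of generalized advantage estimation as configured here; with λ = 1 it reduces to discounted returns minus the baseline. The baseline values are assumed to be provided (e.g., by the linear reward baseline).

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=1.0):
    """Generalized advantage estimation for a single trajectory.

    rewards -- array of rewards r_0 .. r_{T-1}
    values  -- baseline values V(s_0) .. V(s_T) (length T + 1; last entry bootstraps, 0 if terminal)
    """
    rewards, values = np.asarray(rewards), np.asarray(values)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages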
Dynamics Model Ensemble: In all experiments (except in Figure 7b) we use an ensemble of 5 fully connected neural networks, with hidden layer sizes chosen per environment. In all models, we used weight normalization and ReLU nonlinearities. For minimizing the ℓ2 prediction error, the Adam optimizer with a batch size of 500 was employed. In the first iteration, all models are randomly initialized. In later iterations, the models are trained with warm starts using the parameters of the previous iteration. In each iteration and for each model in the ensemble, the
transition data buffer D is randomly split into a training (80%) and a validation (20%) set. After each training epoch on the shuffled training split, the validation loss is computed on the validation set. A rolling average of the validation losses with a persistence of 0.95 is maintained throughout the epochs. Each model's training is stopped individually as soon as the rolling validation loss average stops decreasing.
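A sketch of this early-stopping rule, assuming hypothetical train_one_epoch and validation_loss helpers; only the persistence value is taken from the text.

def train_with_early_stopping(model, train_split, valid_split,
                              train_one_epoch, validation_loss,
                              persistence=0.95, max_epochs=200):
    """Stop training as soon as the rolling average of the validation loss stops improving."""
    rolling = None
    best_rolling = float("inf")
    for epoch in range(max_epochs):
        train_one_epoch(model, train_split)          # hypothetical helper
        v = validation_loss(model, valid_split)      # hypothetical helper
        rolling = v if rolling is None else persistence * rolling + (1 - persistence) * v
        if rolling >= best_rolling:                  # rolling average no longer decreasing
            break
        best_rolling = rolling
    return model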
Meta-Policy Optimization: As described in Section 4.2, the policy parameters θ are optimized using the gradient-based meta-learning framework MAML. For the inner adaptation step, we use a gradient step size of α = 0.001. For maximizing the meta-objective specified in Equation 3, we use the policy gradient method TRPO [1] with KL-constraint δ = 0.01. Since computing the gradients of the meta-objective involves second-order terms such as the Hessian of the policy's log-likelihood, computing the necessary Hessian-vector products for TRPO analytically is very compute intensive. Hence, we use a finite-difference approximation of the product of the Fisher information matrix and the gradients, as suggested in [11]. If not denoted differently, 30 meta-optimization steps are performed before new trajectories are collected from the real environment.
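To make the finite-difference trick concrete, here is a generic sketch of approximating a Hessian-vector (or Fisher-vector) product by differencing gradients; grad_fn is a placeholder for the gradient of the relevant objective (e.g., the KL divergence) with respect to the parameters.

import numpy as np

def finite_difference_hvp(grad_fn, theta, v, eps=1e-5):
    """Approximate the Hessian-vector product H v via a central difference of gradients:
    H v ≈ (g(theta + eps*v) - g(theta - eps*v)) / (2*eps)."""
    g_plus = grad_fn(theta + eps * v)
    g_minus = grad_fn(theta - eps * v)
    return (g_plus - g_minus) / (2.0 * eps)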
Trajectory collection: In each algorithm iteration 4000 environment transitions (20 trajectories of
200 time steps) are collected. For the meta-optimization, 100000 imaginary environment transitions
are sampled.
A.4 Computational Complexity
In this section, we compare the computational complexity of MB-MPO against TRPO. Specifically, we report the wall-clock time that it takes both algorithms to reach maximum performance on the half-cheetah environment when running the experiments on an Amazon Web Services EC2 c4.4xlarge compute instance. Our method only requires 20% more compute time than TRPO (7 hours instead of 5.5), while attaining a 70× reduction in sample complexity. The main time bottleneck of our method compared with the model-free algorithms is training the models.
Note that for real-world experiments, our method would be significantly faster than model-free approaches, since the bottleneck would then shift towards the data collection step.