Data-Efficient Domain Randomization With Bayesian Optimization
as a way to vastly automate the finding of a source domain distribution in sim-to-real settings, which is typically done by trial and error. We validate our approach by conducting a sim-to-sim as well as two sim-to-real experiments on an underactuated nonlinear swing-up task and on a ball-in-a-cup task (Figure 1). The sim-to-sim setup examines the domain parameter adaptation mechanism of BayRn and shows that the belief about the domain distribution parameters converges to a specified ground truth parameter set. In the sim-to-real experiments, we compare the performance of policies trained with BayRn against multiple baselines based on a total number of 700 real-world rollouts. Moreover, we demonstrate that BayRn is able to work with step-based as well as episodic Reinforcement Learning (RL) algorithms as policy optimization subroutines.
The remainder of this paper is organized as follows: first, we introduce the necessary fundamentals (Section II) for BayRn (Section III). Next, we evaluate the devised method experimentally (Section IV). Subsequently, we put BayRn into context with the related work (Section V). Finally, we conclude and mention possible future research directions (Section VI).
II. BACKGROUND AND NOTATION

Optimizing control policies for Markov Decision Processes (MDPs) with unknown dynamics is generally a hard problem (Section II-A). It is specifically hard due to the simulation optimization bias [2], which occurs when transferring policies learned in one domain to another. Adapting the source domain based on real-world data requires a method suited for expensive objective function evaluations. BO is a prominent choice for this kind of problem (Section II-B).
A. Markov Decision Process

Consider a time-discrete dynamical system

    st+1 ∼ Pξ(st+1 | st, at, ξ),   s0 ∼ µ0,ξ(s0 | ξ),
    at ∼ π(at | st; θ),            ξ ∼ ν(ξ; φ),

with the continuous state st ∈ Sξ ⊆ R^ns and continuous action at ∈ Aξ ⊆ R^na at time step t. The environment, also called domain, is instantiated through its parameters ξ ∈ R^nξ (e.g., masses, friction coefficients, or time delays), which are assumed to be random variables distributed according to the probability distribution ν : R^nξ → R+ parametrized by φ. These parameters determine the transition probability density function Pξ : Sξ × Aξ × Sξ → R+ that describes the system's stochastic dynamics. The initial state s0 is drawn from the start state distribution µ0,ξ : Sξ → R+. Together with the reward function r : Sξ × Aξ → R and the temporal discount factor γ ∈ [0, 1], the system forms an MDP described by the set Mξ = {Sξ, Aξ, Pξ, µ0,ξ, r, γ}. The goal of a Reinforcement Learning (RL) agent is to maximize the expected (discounted) return, a numeric scoring function which measures the policy's performance. The expected discounted return of a stochastic domain-independent policy π(at | st; θ), characterized by its parameters θ ∈ Θ ⊆ R^nθ, is defined as

    J(θ, ξ, s0) = E_{τ∼p(τ)} [ Σ_{t=0}^{T−1} γ^t r(st, at) | θ, ξ, s0 ].

While learning from experience, the agent adapts its policy parameters. The resulting state-action-reward tuples are collected in trajectories, a.k.a. rollouts, τ = {st, at, rt}_{t=0}^{T−1}, with rt = r(st, at). To keep the notation concise, we omit the dependency on s0.
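To make the return definition above concrete, the following minimal sketch computes the Monte-Carlo estimate of the discounted return of a single rollout. It assumes NumPy and a plain list of rewards; the function name is ours and not taken from the released code [17].

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Monte-Carlo estimate of the discounted return of one rollout.

    `rewards` holds r_0, ..., r_{T-1} collected while executing the policy,
    and `gamma` is the temporal discount factor of the MDP.
    """
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))  # gamma^t for t = 0, ..., T-1
    return float(np.sum(discounts * rewards))

# Example: 600 steps with a constant reward of 0.5 and gamma = 0.9885
print(discounted_return([0.5] * 600, gamma=0.9885))
```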
B. Bayesian Optimization with Gaussian Processes

Bayesian Optimization (BO) is a sequential derivative-free global optimization strategy, which tries to optimize an unknown function f : X → R on a compact set X [10]. In order to do so, BO constructs a probabilistic model, typically a Gaussian Process (GP), for f. GPs are distributions over functions f ∼ GP(m, k) defined by a prior mean m : X → R and a positive definite covariance function k : X × X → R called the kernel. This probabilistic model is used to make decisions about where to evaluate the unknown function next. A distinctive feature of BO is to use the complete history of noisy function evaluations D = {xi, yi}_{i=0}^n with xi ∈ X and yi ∼ N(y | f(xi), ε), where ε is the variance of the observation noise. The next evaluation candidate is then chosen by maximizing a so-called acquisition function a : X → R, which typically balances exploration and exploitation. Prominent acquisition functions are Expected Improvement and Upper Confidence Bound. Through the use of priors over functions, BO has become a popular choice for sample-efficient optimization of black-box functions that are expensive to evaluate. Its sample efficiency plays well with the algorithm introduced in this paper, where a GP models the relation between the domain distribution's parameters and the resulting policy's return estimated from real-world rollouts, i.e., x ≡ φ and y ≡ Ĵ^real(θ⋆). For further information on BO and GPs, we refer the reader to [10] as well as [11].
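The following self-contained sketch illustrates the BO loop described above on a one-dimensional toy objective. It is not the authors' implementation (which builds on BoTorch, see Section III); instead it uses a scikit-learn GP with a Matérn 5/2 kernel and a hand-rolled Expected Improvement acquisition function, and the grid-based acquisition maximization is only practical for low-dimensional X.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(x_cand, gp, y_best):
    # EI(x) = E[max(f(x) - y_best, 0)] under the GP posterior
    mu, sigma = gp.predict(x_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_maximize(f, bounds, n_init=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))  # initial design
    y = np.array([f(x) for x in X])                          # noisy evaluations
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        # maximize the acquisition function on a dense candidate grid (1-D case)
        cand = np.linspace(bounds[0], bounds[1], 1000).reshape(-1, 1)
        x_next = cand[np.argmax(expected_improvement(cand, gp, y.max()))]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X[np.argmax(y)], y.max()

def toy_objective(x):
    # Stand-in for an expensive, noisy black-box evaluation
    return float(-(x[0] - 0.3) ** 2 + 0.05 * np.random.randn())

x_opt, y_opt = bo_maximize(toy_objective, bounds=(0.0, 1.0))
print(x_opt, y_opt)
```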
III. BAYESIAN DOMAIN RANDOMIZATION (BAYRN)

The problem of source domain adaptation based on returns from the target domain can be expressed as the bilevel formulation

    φ⋆ = arg max_{φ∈Φ} J^real(θ⋆(φ))   with            (1)
    θ⋆(φ) = arg max_{θ∈Θ} E_{ξ∼ν(ξ;φ)} [J(θ, ξ)],      (2)

where we refer to (1) and (2) as the upper and lower level optimization problem, respectively. Thus, the two equations state the goal of finding the set of domain distribution parameters φ⋆ that maximizes the return on the real-world target system J^real(θ⋆(φ)) when used to specify the distribution ν(ξ; φ) during training in the source domain. The space of domain parameter distributions is represented by Φ. In the following, we abbreviate θ⋆(φ) with θ⋆. At the core of BayRn, first a policy optimizer, e.g., an RL algorithm, is employed to solve the lower level problem (2) by finding a (locally) optimal policy π(θ⋆) for the current distribution of stochastic environments. This policy is evaluated on the real system for nτ rollouts, providing an estimate of the return Ĵ^real(θ⋆). Next, the upper level problem (1) is solved using BO, yielding a new domain parameter distribution which is used to randomize the simulator. In this process, the relation between the domain distribution's parameters φ and the resulting policy's return on the real system Ĵ^real(θ⋆) is modeled by a GP. The GP's mean and covariance are updated using all recorded inputs φ and the corresponding observations Ĵ^real(θ⋆). Finally, BayRn terminates when the estimated performance on the target system exceeds J^succ, the task-specific success threshold. Since the GP requires at least a few (about 5 to 10) samples to provide a meaningful posterior, BayRn has an initialization phase before the loop. In this phase, ninit source domains are randomly sampled from Φ, and subsequently a policy is trained for each of these domains. After evaluating the ninit initial policies, the GP is fed with the inputs φ1:ninit and the corresponding observations Ĵ^real(θ⋆1:ninit). The complete BayRn procedure is summarized in Algorithm 1. In principle, there are no restrictions on the choice of algorithms for solving the two stages (1) and (2). For training the GP, we used the BO implementation from BoTorch [12], which expects normalized inputs and standardized outputs. Notably, we chose the expected improvement acquisition function and a zero-mean GP prior with a Matérn 5/2 kernel.

Algorithm 1: Bayesian Domain Randomization
  input : domain parameter distribution ν(ξ; φ), parameter space Φ = [φmin, φmax], algorithm PolOpt, Gaussian Process GP, acquisition function a, hyper-parameters ninit, nτ, J^succ
  output: maximum a posteriori domain distribution parameter φ⋆ and policy π(θ⋆)
  ▷ Initialization phase
  1  Initialize an empty data set and ninit policies randomly
  2      D ← {};  π(θ1:ninit) with θ1:ninit ∼ Θ
  3  Sample ninit source domain distribution parameter sets and train in the randomized simulators
  4      φ1:ninit ∼ Φ
  5      θ⋆1:ninit ← PolOpt[π(θ1:ninit), ν(ξ; φ1:ninit)]
  6  Evaluate the ninit policies on the target domain for nτ rollouts each and estimate the returns
  7      Ĵ^real(θ⋆1:ninit) ← 1/nτ Σ_{j=1}^{nτ} J_j^real(θ⋆1:ninit)
  8  Augment the data set and update the GP's posterior distribution
  9      D ← D ∪ {φi, Ĵ^real(θ⋆i)}_{i=1}^{ninit};  GP(m, k) ← GP(m, k | D)
  ▷ Sim-to-real loop
  10 do
  11     Optimize the GP's acquisition function
  12         φ⋆ ← arg max_{φ∈Φ} a(φ, D)
  13     Train a policy using the obtained domain distribution parameter set
  14         θ⋆ ← PolOpt[π(θ), ν(ξ; φ⋆)]
  15     Evaluate the policy on the target domain for nτ rollouts and estimate the return
  16         Ĵ^real(θ⋆) ← 1/nτ Σ_{j=1}^{nτ} J_j^real(θ⋆)
  17     Augment the data set and update the GP's posterior distribution
  18         D ← D ∪ {φ⋆, Ĵ^real(θ⋆)};  GP(m, k) ← GP(m, k | D)
  19 while Ĵ^real(θ⋆) < J^succ and niter ≤ niter,max
  20 Train the maximum a posteriori policy (repeat Lines 12 and 14 once)
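The control flow of Algorithm 1 can be summarized in a few lines of Python. The sketch below is schematic: pol_opt, eval_on_real_system, sample_phi, and acq_argmax are placeholders for the policy optimizer, the real-world evaluation over nτ rollouts, the random sampling from Φ, and the GP-based acquisition maximization; they are not part of the authors' BoTorch-based implementation [12, 17]. The default values of j_succ, n_init, and n_iter_max follow Table II.

```python
def bayrn(pol_opt, eval_on_real_system, sample_phi, acq_argmax,
          n_init=5, j_succ=375.0, n_iter_max=15):
    """Schematic rendering of Algorithm 1 (BayRn)."""
    data = []                                     # D = {(phi_i, J_hat_real_i)}
    for _ in range(n_init):                       # initialization phase (Lines 1-9)
        phi = sample_phi()                        # phi_i ~ Phi
        policy = pol_opt(phi)                     # lower level problem, Eq. (2)
        data.append((phi, eval_on_real_system(policy)))

    for _ in range(n_iter_max):                   # sim-to-real loop (Lines 10-19)
        phi_star = acq_argmax(data)               # upper level problem, Eq. (1); Line 12
        policy = pol_opt(phi_star)                # Line 14
        j_hat_real = eval_on_real_system(policy)  # Lines 15-16
        data.append((phi_star, j_hat_real))       # Lines 17-18
        if j_hat_real >= j_succ:                  # success threshold reached
            break

    phi_star = acq_argmax(data)                   # Line 20: maximum a posteriori solution
    return phi_star, pol_opt(phi_star)
```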
Connection to System Identification: Unlike related methods (Section V), BayRn does not include a term in the objective function that drives the system parameters to match the observed dynamics. Instead, the BO component in BayRn is free to adapt the domain distribution parameters φ (e.g., mean or standard deviation of a body's mass) while learning in simulation such that the resulting policies perform well in the target domain. This can be seen as an indirect system identification, since with increasing iteration count the BO process will converge to sampling from regions with high real-world return. There is a connection to control-as-inference approaches, which interpret the cost as a log-likelihood function under an optimality criterion using a Boltzmann distribution construct [13, 14]. Regarding BayRn, the sequence of sampled domain distribution parameter sets strongly depends on the acquisition function and the complexity of the given problem. We argue that excluding system identification from the upper level objective (1) is sensible for the presented sim-to-real algorithm, since it learns from a randomized physics simulator, which attenuates the benefit of a well-fitted model.

IV. EXPERIMENTS

We study Bayesian Domain Randomization (BayRn) on two different platforms: 1) an underactuated rotary inverted pendulum, also known as the Furuta pendulum, with the task of swinging up the pendulum pole into an upright position, and 2) the tendon-driven 4-DoF robot arm WAM from Barrett, where the agent has to swing a ball into a cup mounted as the end-effector. First, we set up a simplified sim-to-sim experiment on the Furuta pendulum to check if the proposed algorithm's belief about the domain distribution parameters converges to a specified set of ground truth values. Next, we evaluate BayRn as well as the baseline methods SimOpt [4], Uniform Domain Randomization (UDR), and Proximal Policy Optimization (PPO) [15] or Policy learning by Weighting Exploration with the Returns (PoWER) [16] in two sim-to-real experiments. Additional details on the system description can be found in Appendix A. Furthermore, an extensive list of the chosen hyper-parameters can be found in Appendix B. A video demonstrating the sim-to-real transfer of the policies learned with BayRn can be found at www.ias.informatik.tu-darmstadt.de/Team/FabioMuratore. Moreover, the source code of BayRn and the baselines is available at [17].

A. Experimental Setup

Figure 2: Platforms with annotated domain parameters (lp, lr, mp, mr, dj, µs, ds, ls, mb).

All rollouts on the Furuta pendulum ran for 6 s at 100 Hz, collecting 600 time steps with a reward rt ∈ ]0, 1]. We decided to use a Feedforward Neural Network (FNN) policy in combination with PPO as the policy optimization (sub)routine (Table IIa). Before each rollout, the platform was reset automatically. On the physical system, this procedure includes estimating the sensors' offsets as well as running a controller which drives the device to its initial position, with the rotary pole centered and the pendulum hanging down. In simulation, the reset function causes the simulator to sample a new set of domain parameters ξ (Figure 2). Due to the underactuated nature of the dynamics, the pendulum has to be swung back and forth to put energy into the system before being able to swing the pendulum up. The Barrett WAM was operated at 500 Hz with an episode length of 3.5 s, i.e., 1750 time steps. For the ball-in-a-cup task, we chose an RBF policy commanding desired deltas to the current joint angles and angular velocities, which are passed to the robot's feed-forward controller. Hence, the only input to the policy is the normalized time. At the beginning of each rollout, the robot is driven to an initial position. When evaluating on the physical platform, the ball needs to be manually stabilized in a resting position. Once the rollout has finished, the operator enters a return value (Appendix A).
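As a rough illustration of such an open-loop, time-dependent policy, the sketch below maps the normalized time to action deltas through Gaussian radial basis functions. The number of basis functions (16) follows Table IIb, whereas the basis width, the number of commanded joints, and the class interface are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np

class RBFTimePolicy:
    """Open-loop policy: maps normalized time t in [0, 1] to action deltas.

    The weight matrix (n_basis x n_joints) is the parameter vector that an
    episodic RL algorithm such as PoWER would search over.
    """
    def __init__(self, n_basis=16, n_joints=2, width=0.05, weights=None):
        self.centers = np.linspace(0.0, 1.0, n_basis)
        self.width = width
        self.weights = np.zeros((n_basis, n_joints)) if weights is None else weights

    def features(self, t):
        feats = np.exp(-(t - self.centers) ** 2 / (2.0 * self.width ** 2))
        return feats / feats.sum()                # normalized RBF features

    def __call__(self, t):
        return self.features(t) @ self.weights    # desired deltas for each joint

policy = RBFTimePolicy()
print(policy(0.5))  # action in the middle of the rollout (all zeros for zero weights)
```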
Table I (b): Ranges of the domain distribution parameters φ for the ball-in-a-cup task.

Parameter                          Range                       Unit
string length mean E[ls]           [0.285, 0.315]              m
string length variance V[ls]       [9e−8, 2.25e−4]             m²
string damping mean E[ds]          [0, 2e−4]                   N/s
string damping variance V[ds]      [3.33e−13, 8.33e−10]        N²/s²
ball mass mean E[mb]               [0.0179, 0.0242]            kg
ball mass variance V[mb]           [4.41e−10, 4.41e−6]         kg²
joint damping mean E[dj]           [0.0, 0.1]                  N/s
joint damping variance V[dj]       [3.33e−8, 2.08e−4]          N²/s²
joint stiction mean E[µs]          [0, 0.4]                    −
joint stiction variance V[µs]      [1.33e−6, 3.33e−3]          −
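To illustrate how a single candidate φ from the table above could be turned into one randomized simulator instance, the sketch below converts each (mean, variance) pair into a sampling distribution, following the choice stated in Section IV-A (normal distributions for masses and lengths, uniform distributions for friction- and damping-like parameters). The parameter names, the dictionary layout, and the uniform-bound conversion are our own illustration, not the interface of the released code [17].

```python
import numpy as np

rng = np.random.default_rng()

def sample_domain_params(phi):
    """Draw one simulator instance xi ~ nu(xi; phi), with phi = {name: (mean, variance)}."""
    xi = {}
    for name, (mean, var) in phi.items():
        if name in ("ball_mass", "string_length"):
            # normal distribution for masses and lengths
            xi[name] = rng.normal(mean, np.sqrt(var))
        else:
            # uniform distribution with matching mean and variance (half-width = sqrt(3*var))
            half_width = np.sqrt(3.0 * var)
            xi[name] = rng.uniform(mean - half_width, mean + half_width)
    return xi

# One candidate phi lying inside the ranges of Table I (b)
phi = {
    "string_length": (0.30, 1e-4),
    "ball_mass": (0.021, 1e-6),
    "joint_damping": (0.05, 1e-4),
    "joint_stiction": (0.2, 1e-3),
}
print(sample_domain_params(phi))
```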
Figure 3: Target domain returns (a) and the associated standard deviation (b) modeled by the GP learned with BayRn in a sim-to-sim setting (brighter is higher). The ground truth domain parameters as well as the maximum a posteriori domain distribution parameters found by BayRn are displayed as a red and an orange star, respectively. The circles mark the sequence of domain parameter configurations (darker is later).
In the sim-to-real experiments, we compare BayRn to SimOpt, UDR, and PPO or PoWER. For every algorithm, we train 20 policies and execute 5 evaluation rollouts per policy. PPO as well as PoWER are set up to learn from simulations where the domain parameters are given by the platforms' data sheets or CAD models. These sets of domain parameters are called nominal. Hence, PPO and PoWER serve as baselines representing step-based and episodic RL algorithms without domain randomization or any real-world data. UDR augments an RL algorithm, here PPO or PoWER, and can be seen as the straightforward way of randomizing a simulator, as done in [7]. Each domain parameter ξ is assigned an independent probability distribution, specified by its parameters φ, i.e., mean and variance (Table I). Thus, we include UDR as a baseline method for static domain randomization. Note that UDR can, in contrast to BayRn and SimOpt, be easily parallelized, which reduces the time to train a policy significantly.

With SimOpt, Y. Chebotar et al. [4] presented a trajectory-based framework for closing the reality gap, and validated it on two state-of-the-art sim-to-real robotic manipulation tasks. SimOpt iteratively adapts the domain parameter distribution's parameters by minimizing the discrepancy between observations from the real-world system and the simulation. While BayRn formulates the upper level problem (1) solely based on the real-world returns, SimOpt minimizes a linear combination of the L1 and L2 norms between simulated and real trajectories. Moreover, SimOpt employs Relative Entropy Policy Search (REPS) [18] to update the simulator's parameters, hence turning (1) into an RL problem. The necessity of real-world trajectories renders SimOpt unusable for the ball-in-a-cup task, since the feed-forward policy is executed without recording any observations. Thus, there are no real-world trajectories with which to update the simulator.

BayRn (Section III), SimOpt, and UDR randomize the same domain parameters with identical nominal values. At the beginning of each sim-to-real experiment (Section IV-C), the domain distribution parameters φ are sampled randomly from their ranges (Table I). The main difference is that BayRn and SimOpt adapt the domain distribution parameters, while UDR does not. We chose normal distributions for masses and lengths as well as uniform distributions for parameters related to friction and damping.

B. Sim-to-sim Results

Before applying BayRn to a physical system, we examine the domain distribution parameter sampling process of the BO component in simulation. In order to provide a (qualitative) visualization, we chose to only randomize the means of the poles' masses, i.e., φ = [E[mr], E[mp]]^T. Thus, for this sim-to-sim experiment the domain distribution parameters φ are synonymous with the domain parameters ξ. Apart from that, the hyper-parameters used for executing BayRn are identical to the ones used in the sim-to-real experiments (Appendix B).
[Figure 4 panels: (a) swing-up and balance, y-axis: return, algorithms: BayRn, SimOpt, UDR, PPO; (b) ball-in-a-cup, y-axis: success probability, algorithms: BayRn, UDR, PoWER.]
Figure 4: Performance of the different algorithms across both sim-to-real tasks. For each algorithm 20 policies have been
trained, varying the random seed, and evaluated 5 times to estimate the mean return per policy (700 rollouts in total). The
median performance per algorithm is displayed by white circles, and the inner quartiles are represented by thick vertical bars.
A dashed line in (a) marks an approximate threshold where the task is solved, i.e., the rotary pole is stabilized on top in the
center. SimOpt was not applicable to our open-loop ball-in-a-cup task (b) because of its requirement for recorded observations.
As stated in Section III, BayRn was designed without an (explicit) system identification objective. However, we can see from Figure 3a that the maximizer of the GP's mean function, φ⋆ = [0.0266, 0.1084]^T, closely matches the ground truth parameters φGT = [0.0264, 0.1045]^T. Moreover, Figure 3b displays how the uncertainty about the target domain return is reduced in the vicinity of the sampled parameter configurations. There are two decisive factors for the domain distribution parameter sampling process: the acquisition function (Algorithm 1, Line 12) and the quality of the found policy (Algorithm 1, Line 14). Concerning the latter, a failed training run is indistinguishable from a successful one which fails to transfer to the target domain, since the GP only observes the estimated real-world return Ĵ^real(θ⋆).

Comparing the Furuta pendulum's nominal domain parameters φnom = [mp, mr, lp, lr]^T = [0.024, 0.095, 0.129, 0.085]^T to the means among BayRn's final estimate φ⋆mean = [0.023, 0.098, 0.123, 0.087]^T, we see that the domain parameters' means changed by less than 10 % each. Complementarily, the variances among BayRn's final estimate are φ⋆var = [6.29e−8, 5.67e−6, 4.10e−5, 1.19e−5]^T, indicating a higher uncertainty on the link lengths (relative to the means). Thus, the final domain parameters are well within the boundaries of the BO search space (Table I). In combination, these small differences result in significantly different system dynamics. We believe this to be the reason why the baselines without domain randomization completely failed to transfer.
by controlling a robotic arm. The usage of a risk-averse objective function has been explored on MuJoCo tasks in [19]. The authors also provide a Bayesian point of view.

Cully et al. [20] can be seen as an edge case of static and adaptive domain randomization, where a large set of policies is learned before execution on the physical robot and evaluated in simulation. Every policy is associated with one configuration of the so-called behavioral descriptors, which are related but not identical to domain parameters. In contrast to BayRn, there is no policy training after the initial phase. Instead of retraining or fine-tuning, the algorithm suggested in [20] reacts to performance drops, e.g., due to damage, by using BO to sequentially select a pretrained policy and measure its performance on the robot. The underlying GP models the mapping from behavior space to performance. This method demonstrated impressive damage recovery abilities on a robotic locomotion and a reaching task. However, applying it to RL poses big challenges. Most notably, the number of policies to be learned in order to populate the map scales exponentially with the dimension of the behavioral descriptors, potentially leading to a very large number of training runs.

Aside from the previous methods, Muratore et al. [2] propose an approach to estimate the transferability of a policy learned from randomized physics simulations. Moreover, the authors propose a meta-algorithm which provides a probabilistic guarantee on the performance loss when transferring the policy between two domains from the same distribution.

Static domain randomization has also been successfully applied to computer vision problems. A few examples are: (i) object detection [21], (ii) synthetic object generation for grasp planning [8], and (iii) autonomous drone flight [22].
B. Domain Randomization with Adaptive Distributions

Ruiz et al. [23] proposed the meta-algorithm "learning to simulate", which is based on a bilevel optimization problem highly similar to the one of BayRn (1, 2). However, there are two major differences. First, BayRn uses Bayesian optimization on the acquired real-world data to adapt the domain parameter distribution, whereas "learning to simulate" updates the domain parameter distribution using REINFORCE. Second, the approach in [23] has been evaluated in simulation on synthetic data, except for a semantic segmentation task. Thus, there was no dynamics-dependent interaction of the learned policy with the real world.

With SPRL, Klink et al. [24] derived a relative entropy RL algorithm that enables the agent to adapt the domain parameter distribution, typically from easy to hard instances. Hence, the overall training procedure can be interpreted as a curriculum learning problem. The authors were able to solve sim-to-sim goal reaching problems as well as a robotic sim-to-real ball-in-a-cup task, similar to the one in this paper. One decisive difference to BayRn is that the target domain parameter distribution has to be known beforehand.

The approach called Active Domain Randomization (ADR) [25] also formulates the adaptation of the domain parameter distribution as an RL problem, where different simulation instances are sampled and compared against a reference environment based on the resulting trajectories. This comparison is done by a discriminator which yields rewards proportional to the difficulty of distinguishing the simulated and real environments, hence providing an incentive to generate distinct domains. Using this reward signal, the domain parameters of the simulation instances are updated via Stein Variational Policy Gradient. Mehta et al. [25] evaluated their method in a sim-to-real experiment where a robotic arm had to reach a desired point. The strongest contrast between BayRn and ADR is the way in which new simulation environments are explored. While BayRn can rely on well-studied BO with an adjustable exploration-exploitation behavior, ADR can be fragile since it couples discriminator training and policy optimization, which results in a non-stationary process where the distribution of the domains depends on the discriminator's performance.

Paul et al. [26] introduce Fingerprint Policy Optimization which, like BayRn, employs BO to adapt the distribution of domain parameters such that using these for the subsequent training maximizes the policy's return. At first glance the approaches look similar, but there is a major difference in how the upper level problem (1) is solved. Fingerprint Policy Optimization models the relation between the current domain parameters, the current policy, and the return of the updated policy with a GP. This design decision requires feeding the policy parameters into the GP, which is prohibitively expensive if done straightforwardly. Therefore, abstractions of the policy, so-called fingerprints, are created. These handcrafted features, e.g., the Gaussian approximation of the stationary state distribution, replace the policy to reduce the input dimension. The authors tested Fingerprint Policy Optimization on three sim-to-sim tasks. In contrast, BayRn has been designed without the need to approximate the policy. Moreover, we validated the presented method in sim-to-real settings.

Yu et al. [9] intertwine policy optimization, system identification, and domain randomization. The proposed method first identifies bounds on the domain parameters which are later used for learning from the randomized simulator. The suggested policy is conditioned on a latent space projection of the domain parameters. After training in simulation, a second system identification step is executed to find the projected domain parameters which maximize the return on the physical robot. This step runs BO for a fixed number of iterations and is similar to solving the upper level problem in (1). The algorithm was evaluated on the bipedal walking robot Darwin OP2.

In Ramos et al. [27], likelihood-free inference in combination with mixture density random Fourier networks is employed to perform a fully Bayesian treatment of the simulator's parameters. Analyzing the obtained posterior over domain parameters, Ramos et al. showed that BayesSim is, in a sim-to-sim setting, able to simultaneously infer different parameter configurations which can explain the observed trajectories. The key difference between BayRn and BayesSim is the objective for updating the domain parameters. While BayesSim maximizes the model's posterior likelihood, BayRn updates the domain parameters such that the policy's return on the physical system is maximized. The biggest advantage of BayRn over BayesSim is its ability to work with very sparse real-world data, i.e., only the scalar return values.
ArXiv e-prints, vol. 1805.10662, 2018.
[27] F. Ramos, R. Possas, and D. Fox, "BayesSim: Adaptive domain randomization via probabilistic inference for robotics simulators," in RSS, University of Freiburg, Freiburg im Breisgau, Germany, June 22-26, 2019.
[28] R. Moriconi, K. S. S. Kumar, and M. P. Deisenroth, "High-dimensional Bayesian optimization with projections using quantile Gaussian processes," Optim. Lett., vol. 14, no. 1, pp. 51-64, 2020.
[29] "mujoco-py," Online. [Online]. Available: https://github.com/openai/mujoco-py

APPENDIX A
MODELING DETAILS ON THE PLATFORMS

The Furuta pendulum (Figure 1) is modeled as an underactuated nonlinear second-order dynamical system given by the solution of

    [ Jr + mp lr² + ¼ mp lp² cos²(α)    ½ mp lp lr cos(α) ] [θ̈]     [ τ − ½ mp lp² sin(α) cos(α) θ̇ α̇ − ½ mp lp lr sin(α) α̇² − dr θ̇ ]
    [ ½ mp lp lr cos(α)                 Jp + ¼ mp lp²     ] [α̈]  =  [ −¼ mp lp² sin(α) cos(α) θ̇² − ½ mp lp g sin(α) − dp α̇          ]

with the rotary angle θ and the pendulum angle α, which are defined to be zero when the rotary pole is centered and the pendulum pole is hanging down vertically. While the system's state is defined as s = [θ, α, θ̇, α̇]^T, the agent receives observations o = [sin(θ), cos(θ), sin(α), cos(α), θ̇, α̇]^T. The horizontal pole is actuated by commanding a motor voltage (action) a which regulates the servo motor's torque τ = km (a − km θ̇)/Rm. One part of the domain parameters is sampled from the distributions specified in Table Ia, while the remaining domain parameters are fixed at their nominal values given in [17]. We formulate the reward function based on an exponentiated quadratic cost

    r(st, at) = exp(−(et^T Q et + at R at))   with   et = ([0, π, 0, 0]^T − st) mod 2π.

Thus, the reward is in the range ]0, 1] for every time step.

The 4-DoF Barrett WAM (Figure 1) is simulated using MuJoCo, wrapped by mujoco-py [29]. The ball is attached to a string, which is mounted to the center of the cup's bottom plate. We model the string as a concatenation of 30 rigid bodies with two rotational joints per link (no torsion). This specific ball-in-a-cup instance can be considered difficult, since the cup's diameter is only about twice as large as the ball's, and the string is rather short with a length of 30 cm. Similar to the Furuta pendulum, one part of the domain parameters is sampled from the distributions specified in Table Ib, while the remaining domain parameters are fixed at their nominal values given in [17]. Since the feed-forward policy is executed without recording any observations, we define a discrete ternary reward function

    r(sT, aT) = 1 if the ball is in the cup,  0.5 if the ball hit the cup's upper rim,  0 else,

where the final reward is given by the operator after the rollout (r(st, at) = 0 for t < T) when running on the real system. We found the separation into three cases to be helpful during learning, with the cases being easily distinguishable from each other. While training in simulation, successful trials are identified by detecting a collision between the ball and a virtual cylinder inside the cup. Moreover, we have access to the full state, hence we augment the reward function with a cost term that punishes deviations from the initial end-effector position.

APPENDIX B
PARAMETER VALUES FOR THE EXPERIMENTS

Table II lists the hyper-parameters for all training runs during the experiments in Section IV. The reported values have been tuned but not fully optimized.

Table II: Hyper-parameter values for training the policies in Section IV. The domain distribution parameters φ are listed in Table I.

(a) swing-up and balance

Hyper-parameter                    Value
common
  PolOpt                           PPO
  policy / critic architecture     FNN 64-64 with tan-h
  optimizer                        Adam
  learning rate policy             5.97e−4
  learning rate critic             3.44e−4
  PPO clipping ratio               0.1
  iterations niter                 300
  step size ∆t                     0.01 s
  max. steps per episode T         600
  min. steps per iteration         20T
  temporal discount γ              0.9885
  adv. est. trade-off factor λ     0.965
  success threshold J^succ         375
  Q                                diag(2e−1, 1.0, 2e−2, 5e−3)
  R                                3e−3
  real-world rollouts nτ           5
UDR specific
  min. steps per iteration         30T
SimOpt specific
  max. iterations niter            15
  DistrOpt population size         500
  DistrOpt KL bound                1.0
  DistrOpt learning rate           5e−4
BayRn specific
  max. iterations niter,max        15
  initial solutions ninit          5

(b) ball-in-a-cup

Hyper-parameter                    Value
common
  PolOpt                           PoWER
  policy architecture              RBF with 16 basis functions
  iterations niter                 20
  population size npop             100
  num. importance samples nis      10
  init. exploration std σinit      π/12
  min. rollouts per iteration      20
  max. steps per episode T         1750
  step size ∆t                     0.002 s
  temporal discount γ              1
  real-world rollouts nτ           5
UDR specific
  min. steps per iteration         30T
BayRn specific
  max. iterations niter            15
  initial solutions ninit          5
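As a closing illustration of the swing-up reward defined in Appendix A, the sketch below evaluates the exponentiated quadratic cost for one state-action pair. The weighting matrices are taken from Table IIa; the state convention ([θ, α, θ̇, α̇] with the upright target α = π) follows the appendix, while the function name and the scalar action interface are our own assumptions for this example.

```python
import numpy as np

Q = np.diag([2e-1, 1.0, 2e-2, 5e-3])  # state error weights (Table IIa)
R = 3e-3                              # action weight (Table IIa)

def swingup_reward(s, a):
    """Exponentiated quadratic reward from Appendix A, bounded in ]0, 1]."""
    err = (np.array([0.0, np.pi, 0.0, 0.0]) - np.asarray(s, dtype=float)) % (2.0 * np.pi)
    return float(np.exp(-(err @ Q @ err + a * R * a)))

print(swingup_reward([0.0, np.pi, 0.0, 0.0], 0.0))  # 1.0 at the upright goal state
```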