Data-Efficient Domain Randomization With Bayesian Optimization
as a way to vastly automate the finding of a source domain distribution in sim-to-real settings, which is typically done by trial and error. We validate our approach by conducting a sim-to-sim as well as two sim-to-real experiments on an underactuated nonlinear swing-up task and on a ball-in-a-cup task (Figure 1). The sim-to-sim setup examines the domain parameter adaptation mechanism of BayRn and shows that the belief about the domain distribution parameters converges to a specified ground truth parameter set. In the sim-to-real experiments, we compare the performance of policies trained with BayRn against multiple baselines based on a total number of 700 real-world rollouts. Moreover, we demonstrate that BayRn is able to work with step-based as well as episodic Reinforcement Learning (RL) algorithms as policy optimization subroutines.
The remainder of this paper is organized as follows: first, we introduce the necessary fundamentals (Section II) for BayRn (Section III). Next, we evaluate the devised method experimentally (Section IV). Subsequently, we put BayRn into context with the related work (Section V). Finally, we conclude and mention possible future research directions (Section VI).
II. BACKGROUND AND NOTATION

Optimizing control policies for Markov Decision Processes (MDPs) with unknown dynamics is generally a hard problem (Section II-A). It is specifically hard due to the simulation optimization bias [2], which occurs when transferring policies learned in one domain to another. Adapting the source domain based on real-world data requires a method suited for expensive objective function evaluations. BO is a prominent choice for this kind of problem (Section II-B).
A. Markov Decision Process

Consider a time-discrete dynamical system

    st+1 ∼ Pξ(st+1 | st, at, ξ),   s0 ∼ µ0,ξ(s0 | ξ),
    at ∼ π(at | st; θ),            ξ ∼ ν(ξ; φ),

with the continuous state st ∈ Sξ ⊆ R^ns and continuous action at ∈ Aξ ⊆ R^na at time step t. The environment, also called domain, is instantiated through its parameters ξ ∈ R^nξ (e.g., masses, friction coefficients, or time delays), which are assumed to be random variables distributed according to the probability distribution ν : R^nξ → R+ parametrized by φ. These parameters determine the transition probability density function Pξ : Sξ × Aξ × Sξ → R+ that describes the system's stochastic dynamics. The initial state s0 is drawn from the start state distribution µ0,ξ : Sξ → R+. Together with the reward function r : Sξ × Aξ → R and the temporal discount factor γ ∈ [0, 1], the system forms an MDP described by the set Mξ = {Sξ, Aξ, Pξ, µ0,ξ, r, γ}. The goal of a Reinforcement Learning (RL) agent is to maximize the expected (discounted) return, a numeric scoring function which measures the policy's performance. The expected discounted return of a stochastic domain-independent policy π(at | st; θ), characterized by its parameters θ ∈ Θ ⊆ R^nθ, is defined as

    J(θ, ξ, s0) = E_{τ∼p(τ)} [ Σ_{t=0}^{T−1} γ^t r(st, at) | θ, ξ, s0 ].

While learning from experience, the agent adapts its policy parameters. The resulting state-action-reward tuples are collected in trajectories, a.k.a. rollouts, τ = {st, at, rt}_{t=0}^{T−1}, with rt = r(st, at). To keep the notation concise, we omit the dependency on s0.
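To make the return definition above concrete, the following minimal sketch computes the Monte-Carlo estimate of the discounted return of a single rollout. It assumes NumPy and a plain list of rewards; the function name is ours and not taken from the released code [17].

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Monte-Carlo estimate of the discounted return of one rollout.

    `rewards` holds r_0, ..., r_{T-1} collected while executing the policy,
    and `gamma` is the temporal discount factor of the MDP.
    """
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))  # gamma^t for t = 0, ..., T-1
    return float(np.sum(discounts * rewards))

# Example: 600 steps with a constant reward of 0.5 and gamma = 0.9885
print(discounted_return([0.5] * 600, gamma=0.9885))
```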
B. Bayesian Optimization with Gaussian Processes

Bayesian Optimization (BO) is a sequential derivative-free global optimization strategy, which tries to optimize an unknown function f : X → R on a compact set X [10]. In order to do so, BO constructs a probabilistic model, typically a Gaussian Process (GP), for f. GPs are distributions over functions f ∼ GP(m, k) defined by a prior mean m : X → R and a positive definite covariance function k : X × X → R called the kernel. This probabilistic model is used to make decisions about where to evaluate the unknown function next. A distinctive feature of BO is to use the complete history of noisy function evaluations D = {xi, yi}_{i=0}^n with xi ∈ X and yi ∼ N(y | f(xi), ε), where ε is the variance of the observation noise. The next evaluation candidate is then chosen by maximizing a so-called acquisition function a : X → R, which typically balances exploration and exploitation. Prominent acquisition functions are Expected Improvement and Upper Confidence Bound. Through the use of priors over functions, BO has become a popular choice for sample-efficient optimization of black-box functions that are expensive to evaluate. Its sample efficiency plays well with the algorithm introduced in this paper, where a GP models the relation between the domain distribution's parameters and the resulting policy's return estimated from real-world rollouts, i.e., x ≡ φ and y ≡ Ĵ^real(θ⋆). For further information on BO and GPs, we refer the reader to [10] as well as [11].
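The following self-contained sketch illustrates the BO loop described above on a one-dimensional toy objective. It is not the authors' implementation (which builds on BoTorch, see Section III); instead it uses a scikit-learn GP with a Matérn 5/2 kernel and a hand-rolled Expected Improvement acquisition function, and the grid-based acquisition maximization is only practical for low-dimensional X.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(x_cand, gp, y_best):
    # EI(x) = E[max(f(x) - y_best, 0)] under the GP posterior
    mu, sigma = gp.predict(x_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_maximize(f, bounds, n_init=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))  # initial design
    y = np.array([f(x) for x in X])                          # noisy evaluations
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        # maximize the acquisition function on a dense candidate grid (1-D case)
        cand = np.linspace(bounds[0], bounds[1], 1000).reshape(-1, 1)
        x_next = cand[np.argmax(expected_improvement(cand, gp, y.max()))]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X[np.argmax(y)], y.max()

def toy_objective(x):
    # Stand-in for an expensive, noisy black-box evaluation
    return float(-(x[0] - 0.3) ** 2 + 0.05 * np.random.randn())

x_opt, y_opt = bo_maximize(toy_objective, bounds=(0.0, 1.0))
print(x_opt, y_opt)
```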
III. BAYESIAN DOMAIN RANDOMIZATION (BAYRN)

The problem of source domain adaptation based on returns from the target domain can be expressed as the bilevel formulation

    φ⋆ = arg max_{φ∈Φ} J^real(θ⋆(φ))   with            (1)
    θ⋆(φ) = arg max_{θ∈Θ} E_{ξ∼ν(ξ;φ)} [J(θ, ξ)],      (2)

where we refer to (1) and (2) as the upper and lower level optimization problem, respectively. Thus, the two equations state the goal of finding the set of domain distribution parameters φ⋆ that maximizes the return on the real-world target system J^real(θ⋆(φ)) when used to specify the distribution ν(ξ; φ) during training in the source domain. The space of domain parameter distributions is represented by Φ. In the following, we abbreviate θ⋆(φ) with θ⋆. At the core of BayRn, first a policy optimizer, e.g., an RL algorithm, is employed to solve the lower level problem (2) by finding a (locally) optimal policy π(θ⋆) for the current distribution of stochastic environments. This policy is evaluated on the real system for nτ rollouts, providing an estimate of the return Ĵ^real(θ⋆). Next, the upper level problem (1) is solved using BO, yielding a new domain parameter distribution which is used to randomize the simulator. In this process, the relation between the domain distribution's parameters φ and the resulting policy's return on the real system Ĵ^real(θ⋆) is modeled by a GP. The GP's mean and covariance are updated using all recorded inputs φ and the corresponding observations Ĵ^real(θ⋆). Finally, BayRn terminates when the estimated performance on the target system exceeds J^succ, the task-specific success threshold. Since the GP requires at least a few (about 5 to 10) samples to provide a meaningful posterior, BayRn has an initialization phase before the loop. In this phase, ninit source domains are randomly sampled from Φ, and subsequently a policy is trained for each of these domains. After evaluating the ninit initial policies, the GP is fed with the inputs φ1:ninit and the corresponding observations Ĵ^real(θ⋆1:ninit). The complete BayRn procedure is summarized in Algorithm 1. In principle, there are no restrictions on the choice of algorithms for solving the two stages (1) and (2). For training the GP, we used the BO implementation from BoTorch [12], which expects normalized inputs and standardized outputs. Notably, we chose the expected improvement acquisition function and a zero-mean GP prior with a Matérn 5/2 kernel.

Algorithm 1: Bayesian Domain Randomization
  input : domain parameter distribution ν(ξ; φ), parameter space Φ = [φmin, φmax], algorithm PolOpt, Gaussian Process GP, acquisition function a, hyper-parameters ninit, nτ, J^succ
  output: maximum a posteriori domain distribution parameter φ⋆ and policy π(θ⋆)
  ▷ Initialization phase
  1  Initialize an empty data set and ninit policies randomly
  2      D ← {};  π(θ1:ninit) with θ1:ninit ∼ Θ
  3  Sample ninit source domain distribution parameter sets and train in the randomized simulators
  4      φ1:ninit ∼ Φ
  5      θ⋆1:ninit ← PolOpt[π(θ1:ninit), ν(ξ; φ1:ninit)]
  6  Evaluate the ninit policies on the target domain for nτ rollouts each and estimate the returns
  7      Ĵ^real(θ⋆1:ninit) ← 1/nτ Σ_{j=1}^{nτ} J_j^real(θ⋆1:ninit)
  8  Augment the data set and update the GP's posterior distribution
  9      D ← D ∪ {φi, Ĵ^real(θ⋆i)}_{i=1}^{ninit};  GP(m, k) ← GP(m, k | D)
  ▷ Sim-to-real loop
  10 do
  11     Optimize the GP's acquisition function
  12         φ⋆ ← arg max_{φ∈Φ} a(φ, D)
  13     Train a policy using the obtained domain distribution parameter set
  14         θ⋆ ← PolOpt[π(θ), ν(ξ; φ⋆)]
  15     Evaluate the policy on the target domain for nτ rollouts and estimate the return
  16         Ĵ^real(θ⋆) ← 1/nτ Σ_{j=1}^{nτ} J_j^real(θ⋆)
  17     Augment the data set and update the GP's posterior distribution
  18         D ← D ∪ {φ⋆, Ĵ^real(θ⋆)};  GP(m, k) ← GP(m, k | D)
  19 while Ĵ^real(θ⋆) < J^succ and niter ≤ niter,max
  20 Train the maximum a posteriori policy (repeat Lines 12 and 14 once)
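The control flow of Algorithm 1 can be summarized in a few lines of Python. The sketch below is schematic: pol_opt, eval_on_real_system, sample_phi, and acq_argmax are placeholders for the policy optimizer, the real-world evaluation over nτ rollouts, the random sampling from Φ, and the GP-based acquisition maximization; they are not part of the authors' BoTorch-based implementation [12, 17]. The default values of j_succ, n_init, and n_iter_max follow Table II.

```python
def bayrn(pol_opt, eval_on_real_system, sample_phi, acq_argmax,
          n_init=5, j_succ=375.0, n_iter_max=15):
    """Schematic rendering of Algorithm 1 (BayRn)."""
    data = []                                     # D = {(phi_i, J_hat_real_i)}
    for _ in range(n_init):                       # initialization phase (Lines 1-9)
        phi = sample_phi()                        # phi_i ~ Phi
        policy = pol_opt(phi)                     # lower level problem, Eq. (2)
        data.append((phi, eval_on_real_system(policy)))

    for _ in range(n_iter_max):                   # sim-to-real loop (Lines 10-19)
        phi_star = acq_argmax(data)               # upper level problem, Eq. (1); Line 12
        policy = pol_opt(phi_star)                # Line 14
        j_hat_real = eval_on_real_system(policy)  # Lines 15-16
        data.append((phi_star, j_hat_real))       # Lines 17-18
        if j_hat_real >= j_succ:                  # success threshold reached
            break

    phi_star = acq_argmax(data)                   # Line 20: maximum a posteriori solution
    return phi_star, pol_opt(phi_star)
```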
Connection to System Identification: Unlike related methods (Section V), BayRn does not include a term in the objective function that drives the system parameters to match the observed dynamics. Instead, the BO component in BayRn is free to adapt the domain distribution parameters φ (e.g., mean or standard deviation of a body's mass) while learning in simulation such that the resulting policies perform well in the target domain. This can be seen as an indirect system identification, since with increasing iteration count the BO process will converge to sampling from regions with high real-world return. There is a connection to control-as-inference approaches, which interpret the cost as a log-likelihood function under an optimality criterion using a Boltzmann distribution construct [13, 14]. Regarding BayRn, the sequence of sampled domain distribution parameter sets strongly depends on the acquisition function and the complexity of the given problem. We argue that excluding system identification from the upper level objective (1) is sensible for the presented sim-to-real algorithm, since it learns from a randomized physics simulator, which attenuates the benefit of a well-fitted model.

IV. EXPERIMENTS

We study Bayesian Domain Randomization (BayRn) on two different platforms: 1) an underactuated rotary inverted pendulum, also known as the Furuta pendulum, with the task of swinging up the pendulum pole into an upright position, and 2) the tendon-driven 4-DoF robot arm WAM from Barrett, where the agent has to swing a ball into a cup mounted as the end-effector. First, we set up a simplified sim-to-sim experiment on the Furuta pendulum to check if the proposed algorithm's belief about the domain distribution parameters converges to a specified set of ground truth values. Next, we evaluate BayRn as well as the baseline methods SimOpt [4], Uniform Domain Randomization (UDR), and Proximal Policy Optimization (PPO) [15] or Policy learning by Weighting Exploration with the Returns (PoWER) [16] in two sim-to-real experiments. Additional details on the system description can be found in Appendix A. Furthermore, an extensive list of the chosen hyper-parameters can be found in Appendix B. A video demonstrating the sim-to-real transfer of the policies learned with BayRn can be found at www.ias.informatik.tu-darmstadt.de/Team/FabioMuratore. Moreover, the source code of BayRn and the baselines is available at [17].

A. Experimental Setup

Figure 2: Platforms with annotated domain parameters (lp, lr, mp, mr, dj, µs, ds, ls, mb).

All rollouts on the Furuta pendulum ran for 6 s at 100 Hz, collecting 600 time steps with a reward rt ∈ ]0, 1]. We decided to use a Feedforward Neural Network (FNN) policy in combination with PPO as the policy optimization (sub)routine (Table IIa). Before each rollout, the platform was reset automatically. On the physical system, this procedure includes estimating the sensors' offsets as well as running a controller which drives the device to its initial position, with the rotary pole centered and the pendulum hanging down. In simulation, the reset function causes the simulator to sample a new set of domain parameters ξ (Figure 2). Due to the underactuated nature of the dynamics, the pendulum has to be swung back and forth to put energy into the system before being able to swing the pendulum up. The Barrett WAM was operated at 500 Hz with an episode length of 3.5 s, i.e., 1750 time steps. For the ball-in-a-cup task, we chose an RBF policy commanding desired deltas to the current joint angles and angular velocities, which are passed to the robot's feed-forward controller. Hence, the only input to the policy is the normalized time. At the beginning of each rollout, the robot is driven to an initial position. When evaluating on the physical platform, the ball needs to be manually stabilized in a resting position. Once the rollout has finished, the operator enters a return value (Appendix A).
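As a rough illustration of such an open-loop, time-dependent policy, the sketch below maps the normalized time to action deltas through Gaussian radial basis functions. The number of basis functions (16) follows Table IIb, whereas the basis width, the number of commanded joints, and the class interface are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np

class RBFTimePolicy:
    """Open-loop policy: maps normalized time t in [0, 1] to action deltas.

    The weight matrix (n_basis x n_joints) is the parameter vector that an
    episodic RL algorithm such as PoWER would search over.
    """
    def __init__(self, n_basis=16, n_joints=2, width=0.05, weights=None):
        self.centers = np.linspace(0.0, 1.0, n_basis)
        self.width = width
        self.weights = np.zeros((n_basis, n_joints)) if weights is None else weights

    def features(self, t):
        feats = np.exp(-(t - self.centers) ** 2 / (2.0 * self.width ** 2))
        return feats / feats.sum()                # normalized RBF features

    def __call__(self, t):
        return self.features(t) @ self.weights    # desired deltas for each joint

policy = RBFTimePolicy()
print(policy(0.5))  # action in the middle of the rollout (all zeros for zero weights)
```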
Table I (b): Ranges of the domain distribution parameters φ for the ball-in-a-cup task.

Parameter                          Range                       Unit
string length mean E[ls]           [0.285, 0.315]              m
string length variance V[ls]       [9e−8, 2.25e−4]             m²
string damping mean E[ds]          [0, 2e−4]                   N/s
string damping variance V[ds]      [3.33e−13, 8.33e−10]        N²/s²
ball mass mean E[mb]               [0.0179, 0.0242]            kg
ball mass variance V[mb]           [4.41e−10, 4.41e−6]         kg²
joint damping mean E[dj]           [0.0, 0.1]                  N/s
joint damping variance V[dj]       [3.33e−8, 2.08e−4]          N²/s²
joint stiction mean E[µs]          [0, 0.4]                    −
joint stiction variance V[µs]      [1.33e−6, 3.33e−3]          −
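To illustrate how a single candidate φ from the table above could be turned into one randomized simulator instance, the sketch below converts each (mean, variance) pair into a sampling distribution, following the choice stated in Section IV-A (normal distributions for masses and lengths, uniform distributions for friction- and damping-like parameters). The parameter names, the dictionary layout, and the uniform-bound conversion are our own illustration, not the interface of the released code [17].

```python
import numpy as np

rng = np.random.default_rng()

def sample_domain_params(phi):
    """Draw one simulator instance xi ~ nu(xi; phi), with phi = {name: (mean, variance)}."""
    xi = {}
    for name, (mean, var) in phi.items():
        if name in ("ball_mass", "string_length"):
            # normal distribution for masses and lengths
            xi[name] = rng.normal(mean, np.sqrt(var))
        else:
            # uniform distribution with matching mean and variance (half-width = sqrt(3*var))
            half_width = np.sqrt(3.0 * var)
            xi[name] = rng.uniform(mean - half_width, mean + half_width)
    return xi

# One candidate phi lying inside the ranges of Table I (b)
phi = {
    "string_length": (0.30, 1e-4),
    "ball_mass": (0.021, 1e-6),
    "joint_damping": (0.05, 1e-4),
    "joint_stiction": (0.2, 1e-3),
}
print(sample_domain_params(phi))
```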
Figure 3: Target domain returns (a) and the associated standard deviation (b) modeled by the GP learned with BayRn in a sim-to-sim setting (brighter is higher). The ground truth domain parameters as well as the maximum a posteriori domain distribution parameters found by BayRn are displayed as a red and an orange star, respectively. The circles mark the sequence of domain parameter configurations (darker is later).
In the sim-to-real experiments, we compare BayRn to SimOpt, UDR, and PPO or PoWER. For every algorithm, we train 20 policies and execute 5 evaluation rollouts per policy. PPO as well as PoWER are set up to learn from simulations where the domain parameters are given by the platforms' data sheets or CAD models. These sets of domain parameters are called nominal. Hence, PPO and PoWER serve as baselines representing step-based and episodic RL algorithms without domain randomization or any real-world data. UDR augments an RL algorithm, here PPO or PoWER, and can be seen as the straightforward way of randomizing a simulator, as done in [7]. Each domain parameter ξ is assigned an independent probability distribution, specified by its parameters φ, i.e., mean and variance (Table I). Thus, we include UDR as a baseline method for static domain randomization. Note that UDR can, in contrast to BayRn and SimOpt, be easily parallelized, which reduces the time to train a policy significantly.

With SimOpt, Y. Chebotar et al. [4] presented a trajectory-based framework for closing the reality gap, and validated it on two state-of-the-art sim-to-real robotic manipulation tasks. SimOpt iteratively adapts the domain parameter distribution's parameters by minimizing the discrepancy between observations from the real-world system and the simulation. While BayRn formulates the upper level problem (1) solely based on the real-world returns, SimOpt minimizes a linear combination of the L1 and L2 norms between simulated and real trajectories. Moreover, SimOpt employs Relative Entropy Policy Search (REPS) [18] to update the simulator's parameters, hence turning (1) into an RL problem. The necessity of real-world trajectories renders SimOpt unusable for the ball-in-a-cup task, since the feed-forward policy is executed without recording any observations. Thus, there are no real-world trajectories with which to update the simulator.

BayRn (Section III), SimOpt, and UDR randomize the same domain parameters with identical nominal values. At the beginning of each sim-to-real experiment (Section IV-C), the domain distribution parameters φ are sampled randomly from their ranges (Table I). The main difference is that BayRn and SimOpt adapt the domain distribution parameters, while UDR does not. We chose normal distributions for masses and lengths as well as uniform distributions for parameters related to friction and damping.

B. Sim-to-sim Results

Before applying BayRn to a physical system, we examine the domain distribution parameter sampling process of the BO component in simulation. In order to provide a (qualitative) visualization, we chose to only randomize the means of the poles' masses, i.e., φ = [E[mr], E[mp]]^T. Thus, for this sim-to-sim experiment the domain distribution parameters φ are synonymous with the domain parameters ξ. Apart from that, the hyper-parameters used for executing BayRn are identical to the ones used in the sim-to-real experiments (Appendix B).
[Figure 4 panels: (a) swing-up and balance, y-axis: return, algorithms: BayRn, SimOpt, UDR, PPO; (b) ball-in-a-cup, y-axis: success probability, algorithms: BayRn, UDR, PoWER.]
Figure 4: Performance of the different algorithms across both sim-to-real tasks. For each algorithm 20 policies have been
trained, varying the random seed, and evaluated 5 times to estimate the mean return per policy (700 rollouts in total). The
median performance per algorithm is displayed by white circles, and the inner quartiles are represented by thick vertical bars.
A dashed line in (a) marks an approximate threshold where the task is solved, i.e., the rotary pole is stabilized on top in the
center. SimOpt was not applicable to our open-loop ball-in-a-cup task (b) because of its requirement for recorded observations.
As stated in Section III, BayRn was designed without an (explicit) system identification objective. However, we can see from Figure 3a that the maximizer of the GP's mean function, φ⋆ = [0.0266, 0.1084]^T, closely matches the ground truth parameters φGT = [0.0264, 0.1045]^T. Moreover, Figure 3b displays how the uncertainty about the target domain return is reduced in the vicinity of the sampled parameter configurations. There are two decisive factors for the domain distribution parameter sampling process: the acquisition function (Algorithm 1, Line 12) and the quality of the found policy (Algorithm 1, Line 14). Concerning the latter, a failed training run is indistinguishable from a successful one which fails to transfer to the target domain, since the GP only observes the estimated real-world return Ĵ^real(θ⋆).

Comparing the Furuta pendulum's nominal domain parameters φnom = [mp, mr, lp, lr]^T = [0.024, 0.095, 0.129, 0.085]^T to the means among BayRn's final estimate φ⋆mean = [0.023, 0.098, 0.123, 0.087]^T, we see that the domain parameters' means changed by less than 10 % each. Complementarily, the variances among BayRn's final estimate are φ⋆var = [6.29e−8, 5.67e−6, 4.10e−5, 1.19e−5]^T, indicating a higher uncertainty on the link lengths (relative to the means). Thus, the final domain parameters are well within the boundaries of the BO search space (Table I). In combination, these small differences result in significantly different system dynamics. We believe this to be the reason why the baselines without domain randomization completely failed to transfer.
by controlling a robotic arm. The usage of a risk-averse objective function has been explored on MuJoCo tasks in [19]. The authors also provide a Bayesian point of view.

Cully et al. [20] can be seen as an edge case of static and adaptive domain randomization, where a large set of policies is learned before execution on the physical robot and evaluated in simulation. Every policy is associated with one configuration of the so-called behavioral descriptors, which are related but not identical to domain parameters. In contrast to BayRn, there is no policy training after the initial phase. Instead of retraining or fine-tuning, the algorithm suggested in [20] reacts to performance drops, e.g., due to damage, by using BO to sequentially select a pretrained policy and measure its performance on the robot. The underlying GP models the mapping from behavior space to performance. This method demonstrated impressive damage recovery abilities on a robotic locomotion and a reaching task. However, applying it to RL poses big challenges. Most notably, the number of policies to be learned in order to populate the map scales exponentially with the dimension of the behavioral descriptors, potentially leading to a very large number of training runs.

Aside from the previous methods, Muratore et al. [2] propose an approach to estimate the transferability of a policy learned from randomized physics simulations. Moreover, the authors propose a meta-algorithm which provides a probabilistic guarantee on the performance loss when transferring the policy between two domains from the same distribution.

Static domain randomization has also been successfully applied to computer vision problems. A few examples are: (i) object detection [21], (ii) synthetic object generation for grasp planning [8], and (iii) autonomous drone flight [22].
B. Domain Randomization with Adaptive Distributions

Ruiz et al. [23] proposed the meta-algorithm "learning to simulate", which is based on a bilevel optimization problem highly similar to the one of BayRn (1, 2). However, there are two major differences. First, BayRn uses Bayesian optimization on the acquired real-world data to adapt the domain parameter distribution, whereas "learning to simulate" updates the domain parameter distribution using REINFORCE. Second, the approach in [23] has been evaluated in simulation on synthetic data, except for a semantic segmentation task. Thus, there was no dynamics-dependent interaction of the learned policy with the real world.

With SPRL, Klink et al. [24] derived a relative entropy RL algorithm that enables the agent to adapt the domain parameter distribution, typically from easy to hard instances. Hence, the overall training procedure can be interpreted as a curriculum learning problem. The authors were able to solve sim-to-sim goal reaching problems as well as a robotic sim-to-real ball-in-a-cup task, similar to the one in this paper. One decisive difference to BayRn is that the target domain parameter distribution has to be known beforehand.

The approach called Active Domain Randomization (ADR) [25] also formulates the adaptation of the domain parameter distribution as an RL problem, where different simulation instances are sampled and compared against a reference environment based on the resulting trajectories. This comparison is done by a discriminator which yields rewards proportional to the difficulty of distinguishing the simulated and real environments, hence providing an incentive to generate distinct domains. Using this reward signal, the domain parameters of the simulation instances are updated via Stein Variational Policy Gradient. Mehta et al. [25] evaluated their method in a sim-to-real experiment where a robotic arm had to reach a desired point. The strongest contrast between BayRn and ADR is the way in which new simulation environments are explored. While BayRn can rely on well-studied BO with an adjustable exploration-exploitation behavior, ADR can be fragile since it couples discriminator training and policy optimization, which results in a non-stationary process where the distribution of the domains depends on the discriminator's performance.

Paul et al. [26] introduce Fingerprint Policy Optimization which, like BayRn, employs BO to adapt the distribution of domain parameters such that using these for the subsequent training maximizes the policy's return. At first glance the approaches look similar, but there is a major difference in how the upper level problem (1) is solved. Fingerprint Policy Optimization models the relation between the current domain parameters, the current policy, and the return of the updated policy with a GP. This design decision requires feeding the policy parameters into the GP, which is prohibitively expensive if done straightforwardly. Therefore, abstractions of the policy, so-called fingerprints, are created. These handcrafted features, e.g., the Gaussian approximation of the stationary state distribution, replace the policy to reduce the input dimension. The authors tested Fingerprint Policy Optimization on three sim-to-sim tasks. In contrast, BayRn has been designed without the need to approximate the policy. Moreover, we validated the presented method in sim-to-real settings.

Yu et al. [9] intertwine policy optimization, system identification, and domain randomization. The proposed method first identifies bounds on the domain parameters which are later used for learning from the randomized simulator. The suggested policy is conditioned on a latent space projection of the domain parameters. After training in simulation, a second system identification step is executed to find the projected domain parameters which maximize the return on the physical robot. This step runs BO for a fixed number of iterations and is similar to solving the upper level problem in (1). The algorithm was evaluated on the bipedal walking robot Darwin OP2.

In Ramos et al. [27], likelihood-free inference in combination with mixture density random Fourier networks is employed to perform a fully Bayesian treatment of the simulator's parameters. Analyzing the obtained posterior over domain parameters, Ramos et al. showed that BayesSim is, in a sim-to-sim setting, able to simultaneously infer different parameter configurations which can explain the observed trajectories. The key difference between BayRn and BayesSim is the objective for updating the domain parameters. While BayesSim maximizes the model's posterior likelihood, BayRn updates the domain parameters such that the policy's return on the physical system is maximized. The biggest advantage of BayRn over BayesSim is its ability to work with very sparse real-world data, i.e., only the scalar return values.
ArXiv e-prints, vol. 1805.10662, 2018.
[27] F. Ramos, R. Possas, and D. Fox, "BayesSim: Adaptive domain randomization via probabilistic inference for robotics simulators," in RSS, University of Freiburg, Freiburg im Breisgau, Germany, June 22-26, 2019.
[28] R. Moriconi, K. S. S. Kumar, and M. P. Deisenroth, "High-dimensional Bayesian optimization with projections using quantile Gaussian processes," Optim. Lett., vol. 14, no. 1, pp. 51-64, 2020.
[29] "mujoco-py," Online. [Online]. Available: https://github.com/openai/mujoco-py

APPENDIX A
MODELING DETAILS ON THE PLATFORMS

The Furuta pendulum (Figure 1) is modeled as an underactuated nonlinear second-order dynamical system given by the solution of

    [ Jr + mp lr² + ¼ mp lp² cos²(α)    ½ mp lp lr cos(α) ] [θ̈]     [ τ − ½ mp lp² sin(α) cos(α) θ̇ α̇ − ½ mp lp lr sin(α) α̇² − dr θ̇ ]
    [ ½ mp lp lr cos(α)                 Jp + ¼ mp lp²     ] [α̈]  =  [ −¼ mp lp² sin(α) cos(α) θ̇² − ½ mp lp g sin(α) − dp α̇          ]

with the rotary angle θ and the pendulum angle α, which are defined to be zero when the rotary pole is centered and the pendulum pole is hanging down vertically. While the system's state is defined as s = [θ, α, θ̇, α̇]^T, the agent receives observations o = [sin(θ), cos(θ), sin(α), cos(α), θ̇, α̇]^T. The horizontal pole is actuated by commanding a motor voltage (action) a which regulates the servo motor's torque τ = km (a − km θ̇)/Rm. One part of the domain parameters is sampled from the distributions specified in Table Ia, while the remaining domain parameters are fixed at their nominal values given in [17]. We formulate the reward function based on an exponentiated quadratic cost

    r(st, at) = exp(−(et^T Q et + at R at))   with   et = ([0, π, 0, 0]^T − st) mod 2π.

Thus, the reward is in the range ]0, 1] for every time step.

The 4-DoF Barrett WAM (Figure 1) is simulated using MuJoCo, wrapped by mujoco-py [29]. The ball is attached to a string, which is mounted to the center of the cup's bottom plate. We model the string as a concatenation of 30 rigid bodies with two rotational joints per link (no torsion). This specific ball-in-a-cup instance can be considered difficult, since the cup's diameter is only about twice as large as the ball's, and the string is rather short with a length of 30 cm. Similar to the Furuta pendulum, one part of the domain parameters is sampled from the distributions specified in Table Ib, while the remaining domain parameters are fixed at their nominal values given in [17]. Since the feed-forward policy is executed without recording any observations, we define a discrete ternary reward function

    r(sT, aT) = 1 if the ball is in the cup,  0.5 if the ball hit the cup's upper rim,  0 else,

where the final reward is given by the operator after the rollout (r(st, at) = 0 for t < T) when running on the real system. We found the separation into three cases to be helpful during learning, with the cases being easily distinguishable from each other. While training in simulation, successful trials are identified by detecting a collision between the ball and a virtual cylinder inside the cup. Moreover, we have access to the full state, hence we augment the reward function with a cost term that punishes deviations from the initial end-effector position.

APPENDIX B
PARAMETER VALUES FOR THE EXPERIMENTS

Table II lists the hyper-parameters for all training runs during the experiments in Section IV. The reported values have been tuned but not fully optimized.

Table II: Hyper-parameter values for training the policies in Section IV. The domain distribution parameters φ are listed in Table I.

(a) swing-up and balance

Hyper-parameter                    Value
common
  PolOpt                           PPO
  policy / critic architecture     FNN 64-64 with tan-h
  optimizer                        Adam
  learning rate policy             5.97e−4
  learning rate critic             3.44e−4
  PPO clipping ratio               0.1
  iterations niter                 300
  step size ∆t                     0.01 s
  max. steps per episode T         600
  min. steps per iteration         20T
  temporal discount γ              0.9885
  adv. est. trade-off factor λ     0.965
  success threshold J^succ         375
  Q                                diag(2e−1, 1.0, 2e−2, 5e−3)
  R                                3e−3
  real-world rollouts nτ           5
UDR specific
  min. steps per iteration         30T
SimOpt specific
  max. iterations niter            15
  DistrOpt population size         500
  DistrOpt KL bound                1.0
  DistrOpt learning rate           5e−4
BayRn specific
  max. iterations niter,max        15
  initial solutions ninit          5

(b) ball-in-a-cup

Hyper-parameter                    Value
common
  PolOpt                           PoWER
  policy architecture              RBF with 16 basis functions
  iterations niter                 20
  population size npop             100
  num. importance samples nis      10
  init. exploration std σinit      π/12
  min. rollouts per iteration      20
  max. steps per episode T         1750
  step size ∆t                     0.002 s
  temporal discount γ              1
  real-world rollouts nτ           5
UDR specific
  min. steps per iteration         30T
BayRn specific
  max. iterations niter            15
  initial solutions ninit          5
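As a closing illustration of the swing-up reward defined in Appendix A, the sketch below evaluates the exponentiated quadratic cost for one state-action pair. The weighting matrices are taken from Table IIa; the state convention ([θ, α, θ̇, α̇] with the upright target α = π) follows the appendix, while the function name and the scalar action interface are our own assumptions for this example.

```python
import numpy as np

Q = np.diag([2e-1, 1.0, 2e-2, 5e-3])  # state error weights (Table IIa)
R = 3e-3                              # action weight (Table IIa)

def swingup_reward(s, a):
    """Exponentiated quadratic reward from Appendix A, bounded in ]0, 1]."""
    err = (np.array([0.0, np.pi, 0.0, 0.0]) - np.asarray(s, dtype=float)) % (2.0 * np.pi)
    return float(np.exp(-(err @ Q @ err + a * R * a)))

print(swingup_reward([0.0, np.pi, 0.0, 0.0], 0.0))  # 1.0 at the upright goal state
```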