
Robust High-speed Running for Quadruped Robots via Deep Reinforcement Learning

Guillaume Bellegarda and Quan Nguyen

arXiv:2103.06484v1 [cs.RO] 11 Mar 2021

Guillaume Bellegarda and Quan Nguyen are with the Dynamic Robotics and Control Laboratory, Department of Aerospace and Mechanical Engineering, University of Southern California (USC). [email protected]. This work is supported by the USC Viterbi School of Engineering.

Abstract— Deep reinforcement learning has emerged as a popular and powerful way to develop locomotion controllers for quadruped robots. Common approaches have largely focused on learning actions directly in joint space, or learning to modify and offset foot positions produced by trajectory generators. Both approaches typically require careful reward shaping and training for millions of time steps, and with trajectory generators introduce human bias into the resulting control policies. In this paper, we instead explore learning foot positions in Cartesian space, which we track with impedance control, for a task of running as fast as possible subject to environmental disturbances. Compared with other action spaces, we observe less needed reward shaping, much improved sample efficiency, the emergence of natural gaits such as galloping and bounding, and ease of sim-to-sim transfer. Policies can be learned in only a few million time steps, even for challenging tasks of running over rough terrain with loads of over 100% of the nominal quadruped mass. Training occurs in PyBullet, and we perform a sim-to-sim transfer to Gazebo, where our quadruped is able to run at over 4 m/s without a load, and 3.5 m/s with a 10 kg load, which is over 83% of the nominal quadruped mass. Video results can be found at https://youtu.be/roE1vxpEWfw.

Fig. 1: Dynamic quadruped locomotion with LASER mini. Top: Running in PyBullet [14], with a 10 kg load (83% of nominal mass) on flat terrain (left), and a 6 kg load (50% of nominal mass) over rough random terrain up to 0.04 m in height (right). Middle: 1 second of snapshots in Gazebo, running at 4 m/s. Bottom: 1 second of snapshots in Gazebo, running at 3.5 m/s with a 10 kg mass.

I. INTRODUCTION

Traditional control methods have made significant advances toward real world robot locomotion [1]–[3]. Such methods typically rely on solving online optimization problems (MPC) using simplified dynamics models, which require significant domain knowledge, and may not generalize to new environments not explicitly considered during development (i.e. uneven slippery terrain). There is also considerable bias in the resulting motions due to the (for example) pre-determined foot swing trajectories, and use of Raibert heuristics [4], which are based on a linear inverted pendulum model, for foot placement planning. Such biases raise questions on optimality, with respect to speed and energy efficiency.

By contrast, to make control policies more robust to external disturbances and new unseen environments, deep learning has emerged as an attractive and generalizable formulation. Deep reinforcement learning in particular has recently shown impressive results in learning control policies for quadrotors [5], [6], bipeds [7], and quadrupeds [8]–[13]. Typically such methods train from scratch (i.e. use little or no prior information about the system) and rely on extensive simulation with randomized environment parameters before transferring to hardware.

For legged robots, such methods have largely focused on learning policies directly in joint space through position control, or more recently, mapping outputs from higher level trajectory generators [12] to joint positions. Previous works have shown that for learning in joint space with quadrupeds similar to the one considered here, complex reward functions with many terms must be carefully crafted to learn successful policies. With trajectory generators, there are questions about optimality with respect to both speed and efficiency, as the resulting policies are biased by design choices.

Separately, several recent works have shown that the choice of action space has a large effect on both training time and policy quality, for general robotic systems ranging from manipulators [15]–[18] to legged systems [19]–[21]. These works show that choosing the right action space is critical to learning an optimal policy, and in particular suggest that selecting actions in Cartesian space with impedance control can outperform learning in joint space, depending on the task.

Contribution: In contrast to the previously discussed works in learning quadruped locomotion, in this work we propose learning fast running controllers directly in Cartesian space, where the policy chooses desired end effector positions for each quadruped leg. Joint torques are then computed with Cartesian space PD control.
This action space provides more structure than learning in joint space, including a more obvious action mapping for the agent, yet avoids biases introduced from trajectory generators. Results show that benefits from this approach include:
• little needed reward shaping, and thus much improved sample efficiency
• the emergence of high-speed, robust, and natural gaits such as bounding and galloping
• ease of sim-to-sim transfer, where the quadruped runs at speeds significantly outperforming model-based approaches, even when carrying loads over 80% of the nominal quadruped mass

The rest of this paper is organized as follows. Section II provides background details on the robot model and reinforcement learning. Section III describes our design choices and training setup for learning fast and robust dynamic locomotion in Cartesian space. Section IV shows results from learning our controller, sim-to-sim transfer, and comparison with other approaches, and a brief conclusion is given in Section V.

II. BACKGROUND

A. Robot Model

In this paper, we will validate our running controller on the quadruped model LASER (Legged-Agile-Smart-Efficient-Robot). The LASER robot is a highly dynamic quadruped platform built from Unitree's A1 robot (see Figure 2). The LASER robot has low-inertia legs and high torque density electric motors with planetary gear reduction, and it is capable of ground force control without using any force or torque sensors. The LASER robot uses these high-performance actuators for all the hip, thigh, and knee joints to enable full 3D control of ground reaction forces. It is also equipped with contact sensors on each foot.

The LASER legs feature a large range of motion: the hip joints have a range of motion of ±46°, the thigh joints have a range of motion from −60° to 240°, and the knee joints have a range from −154.5° to −52.5°. The hip and knee designs allow the robot to operate identically forward, backward and flipped upside-down. The robot parameters are summarized in Table I.

Each of LASER's actuators consists of a custom high torque density electric motor coupled to a single-stage 9:1 planetary gear reduction. The lower link is driven by a bar linkage which passes through the upper link. The legs are serially actuated, but to keep leg inertia low, the hip and knee actuators are co-axially located at the hip of each leg. The actuation capabilities of the LASER robot are summarized in Table II.

Fig. 2: Robot Configuration. Overview of LASER robot and leg configuration.

TABLE I: Physical Robot Parameters
Parameter          Symbol       Value    Units
Mass               m            12       kg
Body Inertia       Ixx          0.0168   kg·m²
                   Iyy          0.0565   kg·m²
                   Izz          0.0647   kg·m²
Body Length        l_body       0.361    m
Body Width         w_body       0.194    m
Body Height        h_body       0.114    m
Leg Link Lengths   l1, l2       0.2      m

TABLE II: Actuator Parameters
Parameter         Value   Units
Gear Ratio        9       –
Max Torque        33.5    N·m
Max Joint Speed   21      rad/s

B. Reinforcement Learning

In the reinforcement learning framework [22], an agent interacts with an environment modeled as a Markov Decision Process (MDP). An MDP is given by a 4-tuple (S, A, P, R), where S is the set of states, A is the set of actions available to the agent, P : S × A × S → R is the transition function, where P(s_{t+1} | s_t, a_t) gives the probability of being in state s_t, taking action a_t, and ending up in state s_{t+1}, and R : S × A × S → R is the reward function, where R(s_t, a_t, s_{t+1}) gives the expected reward for being in state s_t, taking action a_t, and ending up in state s_{t+1}. The goal of an agent is to interact with the environment by selecting actions that will maximize future rewards.

There are several popular algorithms for determining the optimal policy π to maximize the expected return, such as PPO [23], TRPO [24], SAC [25], and TD3 [26]. Although we expect any RL algorithm to work in our framework, in this paper we use Proximal Policy Optimization (PPO) [23], a state-of-the-art on-policy algorithm for continuous control problems, which optimizes the following clipped surrogate objective:

L^CLIP(θ) = Ê_t [ min( r_t(θ) Â_t , clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]

where Â_t is an estimator of the advantage function at time step t as in [27], and r_t(θ) denotes the probability ratio

r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)

where π_θ is a stochastic policy, and θ_old is the vector of policy parameters before the update. This objective seeks to penalize too large of a policy update, which means penalizing deviations of r_t(θ) from 1.
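To make the clipped objective concrete, the following NumPy sketch evaluates L^CLIP for a batch of logged action log-probabilities and advantage estimates. It is a minimal illustration, not the authors' implementation; the batch values are made up, and ε = 0.2 matches the clipping parameter later listed in Table IV.

```python
import numpy as np

def ppo_clipped_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Average clipped surrogate objective L^CLIP over a batch.

    log_prob_new: log pi_theta(a_t|s_t) under the current policy parameters
    log_prob_old: log pi_theta_old(a_t|s_t) logged at rollout time
    advantages:   advantage estimates A_hat_t (e.g. from GAE)
    """
    ratio = np.exp(log_prob_new - log_prob_old)          # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))       # E_t[min(...)]

# Illustrative batch of three transitions with made-up numbers.
lp_new = np.array([-1.02, -0.88, -1.40])
lp_old = np.array([-1.00, -1.00, -1.00])
adv = np.array([0.5, -0.3, 1.2])
print(ppo_clipped_objective(lp_new, lp_old, adv))
```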
III. LEARNING LOCOMOTION CONTROLLERS WITH IMPEDANCE CONTROL

In this section we describe our reinforcement learning framework and design decisions for learning fast locomotion controllers for quadruped robots.

A. Action Space

We propose learning desired foot positions in Cartesian space, giving an action space of a ∈ R¹², which will then be tracked with impedance control. Compared with learning in joint space, this gives a naturally limited action space with intuitive ranges that produces an exact mapping to the environment.

More precisely, the agent chooses Cartesian foot positions in the leg frame from each hip location in the range [−0.2, 0.2] m for x, [−0.05, 0.05] m for y, and [−0.33, −0.15] m for z, which the policy selects at 100 Hz. These ranges are chosen to allow for speed in the body x direction, small corrections in the body y direction, as well as maintaining a minimum height and clearance of obstacles in the z direction. The foot positions for each leg are then tracked with Cartesian PD control at 1 kHz with:

τ = J(q)ᵀ [ K_p (p_d − p) − K_d v ]    (1)

where J(q) is the foot Jacobian at joint configuration q, p_d are the desired foot positions learned with RL, p and v are the current foot positions and velocities in the leg frame, and K_p and K_d are diagonal matrices of proportional and derivative gains in Cartesian coordinates. We note it is possible to successfully learn locomotion controllers for a wide range of Cartesian PD gains, but we present results for (K_p = 700 I₃, K_d = 12 I₃).
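The 1 kHz Cartesian PD loop of Equation (1) can be sketched per leg as below. This is a minimal sketch, assuming the foot Jacobian and leg-frame foot state are supplied by the robot's kinematics (which the paper does not list), with the reported gains K_p = 700 I₃ and K_d = 12 I₃; the scale_action helper anticipates the [−1, 1] action scaling described in the next paragraph.

```python
import numpy as np

KP = 700.0 * np.eye(3)   # Cartesian proportional gains (from the paper)
KD = 12.0 * np.eye(3)    # Cartesian derivative gains

def leg_torques(p_des, p, v, J):
    """Equation (1): tau = J(q)^T [Kp (p_d - p) - Kd v] for one leg.

    p_des: desired foot position in the leg frame, chosen by the policy (3,)
    p, v:  current foot position and velocity in the leg frame (3,)
    J:     3x3 foot Jacobian at the current joint configuration
    """
    force = KP @ (p_des - p) - KD @ v   # virtual spring-damper at the foot
    return J.T @ force                  # map Cartesian force to hip/thigh/knee torques

def scale_action(a,
                 low=np.array([-0.2, -0.05, -0.33]),
                 high=np.array([0.2, 0.05, -0.15])):
    """Map a policy output in [-1, 1]^3 to the Cartesian foot-position ranges."""
    return low + 0.5 * (a + 1.0) * (high - low)
```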
As is common in the reinforcement learning literature, our policy network outputs actions in the range [−1, 1], which are then scaled to the ranges discussed above.

B. Observation Space

The observation space consists of the full robot state, other than the body xy-coordinates. More precisely, the robot state consists of: body state (height, orientation quaternion, linear and angular velocities), joint state (positions, velocities), foot state (positions, velocities), and foot-in-contact booleans. Each of these parameters can either be directly measured or estimated from LASER's onboard sensors.

We also include the time remaining in the current rollout to avoid violation of the Markov property, as suggested in [28]. Since each rollout has a maximum length of 10 seconds, at test time once 8 seconds have elapsed, we fix this value to 2 for all future observations supplied to the policy network. The observation space is thus s ∈ R⁶⁴. These values are first normalized before being used for training purposes by reinforcement learning.
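One consistent way to assemble the 64-dimensional observation is sketched below (1 + 4 + 3 + 3 + 12 + 12 + 12 + 12 + 4 + 1 = 64). The ordering and the build_observation helper are assumptions for illustration; the paper specifies only the components, the total dimension, and that the values are normalized.

```python
import numpy as np

def build_observation(body_height, body_quat, lin_vel, ang_vel,
                      joint_pos, joint_vel, foot_pos, foot_vel,
                      foot_contacts, time_remaining):
    """Stack the state described in Sec. III-B into s in R^64 (assumed ordering)."""
    obs = np.concatenate([
        [body_height],             # body height (1)
        body_quat,                 # orientation quaternion (4)
        lin_vel, ang_vel,          # body linear and angular velocity (3 + 3)
        joint_pos, joint_vel,      # 12 joints (12 + 12)
        foot_pos, foot_vel,        # 4 feet x 3D, in the leg frames (12 + 12)
        foot_contacts,             # foot-in-contact booleans (4)
        [max(time_remaining, 2.0)] # fixed to 2 once 8 s have elapsed (test time)
    ])
    assert obs.shape == (64,)
    return obs
```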
C. Reward Function

In order to learn fast locomotion policies while encouraging energy efficiency and not falling down, we use the following reward function:

R(s_t, a_t, s_{t+1}) = w_1 min(x_{b,t+1} − x_{b,t}, d_max) − w_2 ∫_t^{t+1} |τ · q̇| dt + 0.01    (2)

where w_1 = 2, w_2 = 0.008, x_b is the body x-coordinate, d_max = 0.06 caps the maximum reward for forward velocity, τ are the motor torques, and q̇ are the motor velocities.

The first term measures the progress in the world x-direction. We find that without limiting this term, and even with dynamics randomization, the agent is likely exploiting simulator dynamics and thus achieving unrealistic speeds. Limiting this reward to what corresponds to 6 m/s mitigates this effect and ensures a successful sim-to-sim transfer.

The second term measures energy expenditure over the time interval, and the agent also receives a small survival bonus. Although the learning algorithm is robust to different weight choices, our experiments show these proposed weights provide a good balance between speed and energy efficiency.

An episode terminates after 1000 time steps, corresponding to 10 seconds, or if the quadruped loses balance and falls, in which case the agent receives a penalty of −10.
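A per-step sketch of Equation (2) is shown below. Approximating the energy integral by the instantaneous power over one 10 ms control step, reading |τ · q̇| as the absolute mechanical power, and returning the −10 fall penalty in place of the step reward are implementation assumptions not spelled out in the paper.

```python
import numpy as np

W1, W2 = 2.0, 0.008   # reward weights from the paper
D_MAX = 0.06          # caps forward progress per step (corresponds to ~6 m/s)
DT = 0.01             # policy time step (100 Hz)

def step_reward(x_body_next, x_body, torques, joint_vels, fell):
    """Reward of Eq. (2) for one control step, plus the fall penalty."""
    if fell:
        return -10.0
    progress = W1 * min(x_body_next - x_body, D_MAX)
    # Energy term: one-step approximation of the integral of |tau . qdot|.
    energy = W2 * np.abs(np.dot(torques, joint_vels)) * DT
    return progress - energy + 0.01   # small survival bonus
```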
D. Training Environment Details

We use PyBullet [14] as the physics engine for training and simulation purposes, and the LASER quadruped model introduced in Sec. II-A. Similar to the previously discussed studies in Sec. I on learning robust controllers with deep reinforcement learning, we make use of dynamics randomization toward avoiding the exploitation of simulator dynamics, and to ease the sim-to-sim transfer. Our dynamics randomization choices are summarized in Table III, and we discuss them below.

TABLE III: Randomized physical parameters and their ranges.
Parameter                 Lower Bound                Upper Bound
Mass (each body link)     80%                        120%
Added mass                0 kg                       15 kg
Added mass base offset    [−0.15, −0.05, −0.05] m    [0.15, 0.05, 0.05] m
Coefficient of friction   0.5                        1

At the start of each environment reset, the mass of each body part (base, hips, thighs, calves, feet) is individually perturbed randomly by up to 20% of its nominal value, and the inertia is set accordingly. The coefficient of friction of the ground (and any random objects added to the simulation) is also randomly chosen in [0.5, 1].

For added robustness, and to test the capabilities of the system, a random mass is attached to the base with probability 0.8. This mass is uniformly chosen to be in [0, 15] kg, and randomly placed within [0.15, 0.05, 0.05] m Cartesian offsets from the base center of mass.
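A minimal PyBullet sketch of these reset-time randomizations follows, mirroring the ranges of Table III. Lumping the added payload into the base link mass (rather than attaching a separate body at the sampled offset) is a simplification; the nominal_masses bookkeeping and the rng argument are illustrative assumptions.

```python
import numpy as np
import pybullet as p

def randomize_dynamics(robot_id, plane_id, nominal_masses, rng):
    """Randomize link masses, ground friction, and an added payload at reset.

    nominal_masses: dict mapping link index (-1 = base) to nominal mass [kg].
    """
    # Each link mass individually perturbed by up to +/-20% of nominal.
    for link, m_nom in nominal_masses.items():
        p.changeDynamics(robot_id, link, mass=m_nom * rng.uniform(0.8, 1.2))

    # Ground friction coefficient in [0.5, 1].
    p.changeDynamics(plane_id, -1, lateralFriction=rng.uniform(0.5, 1.0))

    # With probability 0.8, add a payload of up to 15 kg near the base COM.
    if rng.random() < 0.8:
        added_mass = rng.uniform(0.0, 15.0)
        offset = rng.uniform([-0.15, -0.05, -0.05], [0.15, 0.05, 0.05])
        base_mass = p.getDynamicsInfo(robot_id, -1)[0]
        # Simplification: fold the payload into the base mass; a faithful setup
        # would attach a separate body at `offset` from the base COM.
        p.changeDynamics(robot_id, -1, mass=base_mass + added_mass)
        return added_mass, offset
    return 0.0, np.zeros(3)
```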
IV. RESULTS

In this section we discuss results from using our method to learn fast locomotion controllers. All policies are learned with PPO [23], and our neural networks are multi-layer perceptrons with two hidden layers of 200 and 100 neurons respectively, with tanh activation. Other training hyperparameters are listed in Table IV.

TABLE IV: PPO Hyperparameters.
Parameter                                Value
Horizon (T)                              4096
Optimizer                                Adam
Learning rate                            1 · 10⁻⁴
Number of epochs                         10
Minibatch size                           128
Discount (γ)                             0.99
GAE parameter (λ)                        0.95
Clipping parameter (ε)                   0.2
VF coeff. (c1)                           1
Number of hidden layers (all networks)   2
Number of hidden units per layer         [200, 100]
Nonlinearity                             tanh

Fig. 3: Motion snapshots of galloping in ideal conditions at 6 m/s in PyBullet.

Fig. 4: Motion snapshots of running at 4 m/s while carrying a 10 kg load.

Fig. 5: Motion snapshots of running over random rough terrain up to 0.04 m in height, while carrying a 6 kg load. The quadruped is able to recover from feet catching on the box edges, shown here for the rear right foot.

Snapshots from executing the learned policy are shown over flat terrain in Figure 3, with a 10 kg load in Figure 4, over rough terrain with blocks of up to 0.04 m high and a load of 6 kg in Figure 5, and from transferring the policy to Gazebo in Figure 1, shown running with and without a 10 kg mass. The reader is encouraged to watch the supplementary video for clearer visualizations.

For our experiments, we are specifically interested in the following questions:
1. How does environmental noise affect the agent's ability to learn, as well as resulting policies and gaits?
2. Which environmental noise parameters are most critical for a successful sim-to-sim transfer?
3. How does action space choice affect the sim-to-sim transfer?

1) Training curves: Figure 6 shows training curves for the environment described in Sec. III-D over flat terrain, as well as for randomly varying rough terrain. In this latter case, at the start of each environment reset, we generate 100 random boxes up to 0.04 m high, up to 1 m wide and with random orientation. These are randomly distributed in a 20 × 6 m grid in front of the agent, and walls are added at ±3 m on the y axis so the agent cannot learn to avoid the rough terrain. Results show that training converges within a few million timesteps on both flat and rough terrain.

There is a classic trade-off between speed and robustness on the rough terrain, as the agent needs to lift its feet higher to clear the terrain, which necessitates expending more energy as well as decreasing forward velocity. Also, we note that the agent is still likely to catch its feet on the sides of the boxes, but is able to recover in most cases if the added mass is not too large, or if the speed is not too high. As a reactive controller can only do so well in such instances, in future work we plan to incorporate exteroceptive measurements such as vision to run over even rougher terrain.
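The rough-terrain generation just described can be sketched in PyBullet as below. Treating "up to 1 m wide" as the full box footprint, the uniform sampling ranges, and placing boxes only in front of the robot are assumptions; the walls at ±3 m are omitted for brevity.

```python
import numpy as np
import pybullet as p

def generate_rough_terrain(rng, n_boxes=100, max_height=0.04, max_width=1.0,
                           area_length=20.0, area_width=6.0):
    """Scatter static random boxes in a 20 x 6 m region in front of the robot."""
    box_ids = []
    for _ in range(n_boxes):
        half_extents = [rng.uniform(0.05, max_width / 2),   # half footprint x
                        rng.uniform(0.05, max_width / 2),   # half footprint y
                        rng.uniform(0.005, max_height / 2)] # half height (top <= 0.04 m)
        col = p.createCollisionShape(p.GEOM_BOX, halfExtents=half_extents)
        pos = [rng.uniform(1.0, area_length),
               rng.uniform(-area_width / 2, area_width / 2),
               half_extents[2]]                             # resting on the ground plane
        yaw = rng.uniform(0.0, np.pi)                       # random orientation about z
        orn = p.getQuaternionFromEuler([0.0, 0.0, yaw])
        box_ids.append(p.createMultiBody(baseMass=0.0,      # static obstacle
                                         baseCollisionShapeIndex=col,
                                         basePosition=pos,
                                         baseOrientation=orn))
    return box_ids
```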
Fig. 6: Episode reward mean while training in our proposed action space (episode reward mean vs. number of training timesteps). We compare training over flat terrain only with training over randomly varying rough terrain, which consists of random boxes up to 0.04 m in height.

2) Effects of Dynamics Randomization on Sim-to-Sim Transfer: To evaluate the requirement and usefulness of our dynamics randomization choices, we compare policies trained with systematically increasing environment noise parameters in PyBullet, and their performance at test time in a sim-to-sim transfer. Each trained policy is run in Gazebo for 10 trials, where a trial lasts for a maximum of 20 seconds or terminates early due to a fall, and we evaluate the mean distance the robot is able to run in this time. For these tests we use the nominal quadruped model (without any noise in the mass/inertia, nor added load) in Gazebo, and we load and query our trained policy network with the TensorFlow C++ API.

TABLE V: Sim-to-sim success rate for 10 trials from training with varying dynamics randomization. A successful trial represents running without falling for 20 seconds.
Dynamics Randomization              Mean Dist. (m)   Success Rate
None                                <2               0
Varying µ only                      5                0
Varying µ, added load               7                0
Varying µ, 10% mass                 5                0
Varying µ, 20% mass                 4                0
Varying µ, 10% mass, added load     14               0
Varying µ, 20% mass, added load     78.9             1

Results from various training noise are summarized in Table V. Notably, with full randomization as described in Sec. III-D, we achieve a 100% success rate without any falls, where the agent consistently runs just under 80 m in 20 seconds. When using any subset of the dynamics randomization, we observe that the quadruped always falls before the 20 seconds are up. In these cases, the distance traveled is estimated up to just prior to the base hitting the ground. As the significantly varying added load adds the most randomization, it is intuitive that this training noise is most helpful for the sim-to-sim transfer.
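The trial protocol can be summarized by a short evaluation loop. The env and policy wrappers below are hypothetical stand-ins for the Gazebo simulation and the policy network queried through the TensorFlow C++ API; only the 10-trial, 20-second structure and the two reported metrics follow the text.

```python
import numpy as np

def evaluate_policy(env, policy, n_trials=10, trial_seconds=20.0, dt=0.01):
    """Mean distance and success rate over repeated sim-to-sim trials.

    `env` and `policy` are hypothetical wrappers: env.reset()/env.step(a) expose
    the simulation (step returns the next observation and a fall flag), env.body_x()
    returns the body x-position, and policy(obs) returns the 12-D foot-position action.
    A trial succeeds if the robot runs the full 20 s without falling.
    """
    distances, successes = [], 0
    for _ in range(n_trials):
        obs, fell = env.reset(), False
        x_start = env.body_x()
        for _ in range(int(trial_seconds / dt)):
            obs, fell = env.step(policy(obs))
            if fell:
                break
        distances.append(env.body_x() - x_start)
        successes += int(not fell)
    return np.mean(distances), successes / n_trials
```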
3) Comparison with Joint PD: We compare training in Cartesian space with impedance control with training directly in joint space. Recent works have found that with proper reward shaping and dynamics randomization, policies trained in joint space can transfer well from simulation to hardware [8]–[10]. However, such works have focused on robustness, and speeds lower than those attainable with our quadruped.

We tested training fast running policies in joint space with a range of joint gains, namely k_p ∈ [20, 100] and k_d ∈ [0.1, 1], using the same observation space, dynamics randomization, and simple reward function from Equation 2. We trained at least 3 policies for each set of gains tested, and also tested varying the action space to limit the joint ranges to those seen when training in Cartesian space, toward ensuring a fair comparison. For all combinations tested, the agent is able to learn to run without falling in PyBullet, for the same speeds as with training in Cartesian space. However, almost all policies converge to unnatural-looking gaits, where one or both rear legs extend outward and back, which appears to be used more for balancing. This behavior remains the same even if we limit the action space to the joint ranges used by policies trained in Cartesian space.

Although still able to successfully run in PyBullet, the policies learned in joint space do not transfer well to another simulator. The unnatural-looking gaits may be the result of the agent exploiting the simulator dynamics. We summarize the best results obtained for policies trained with various joint gains in Table VI. We run 10 trials for each policy in the transfer to Gazebo, and record the average distance traveled in 20 seconds, as well as the success rate of not falling.

TABLE VI: Sim-to-sim performance of 10 trials from policies learned in joint space with various joint PD gains. All joint gains can produce fast and stable (though unnatural-looking) locomotion in PyBullet, but do not transfer well to another simulator.
kp   kd    Mean Dist.   Success Rate
30   0.1   59           0.9
50   0.1   20           0.2
20   0.5   11           0
30   0.5   24           0.1
40   0.5   56           0.7
50   0.5   37           0.9
60   0.5   26           0.7
70   0.5   15           0.4
50   1     36           0.4
60   1     24           0.3
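For reference, the joint-space baseline summarized in Table VI applies per-joint PD to desired joint angles output by the policy. A minimal sketch, assuming a zero desired joint velocity:

```python
import numpy as np

def joint_pd_torques(q_des, q, q_dot, kp, kd):
    """Joint-space baseline: tau = kp (q_d - q) - kd qdot, applied per joint.

    q_des:    desired joint angles output by the policy (12,)
    q, q_dot: measured joint angles and velocities (12,)
    kp, kd:   scalar gains, swept over kp in [20, 100] and kd in [0.1, 1] in Sec. IV-3
    """
    return kp * (q_des - q) - kd * q_dot
```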
4) Performance with Varying Loads: We next test the full performance in the sim-to-sim transfer of the policy learned in Cartesian space as described in Sec. III. Figure 7 shows the body velocity from sample policy executions in Gazebo with varying added masses. For relatively "small" added masses (i.e. 5 kg, though this is over 40% of the nominal robot mass), there is little noticeable difference in performance from the ideal case. As the added load mass increases, it becomes more difficult for the agent to maintain higher speeds, though still possible to make forward progress.

Fig. 7: Body velocity in Gazebo with varying added masses from sample policy executions.

In Figures 8, 9, 10, we compare the body height and orientation, as well as joint velocities and torques for the front right leg (though the other legs are similar) while running with no load vs. with a 10 kg load. The plots show 1 second of time when the agent has reached a steady-state velocity in both cases. Due to the added load, the height of the quadruped is lower when carrying the 10 kg load. While the joint states are similar in both cases, the agent must use more torque in the 10 kg load instance, as can be reasonably expected. In both cases, the maximum joint velocities and torques are reached for the thigh joint, as well as the maximum velocity for the calf joint, implying we reach near the dynamic capabilities of the system. In the 10 kg load case, the maximum torque is applied for longer periods of time.

Fig. 8: Body height and orientation for 1 second of running in Gazebo, in the ideal case and with a 10 kg load.

Fig. 9: Front right leg joint velocities for 1 second of running in Gazebo, in the ideal case and with a 10 kg load.

Fig. 10: Front right leg joint torques for 1 second of running in Gazebo, in the ideal case and with a 10 kg load.

5) Comparison with Model-Based Methods: To put our results in perspective, we compare our locomotion speed with Unitree's controller, as well as the current best model-based approaches for similar quadruped robots. To the best of our knowledge, however, maximum speed while load-carrying has not been heavily optimized for in these cases.

The maximum continuous running speed reported by Unitree for A1 is 3.3 m/s [29], without any load carrying. For similar sized quadrupeds, a recent method combining whole-body impulse control and MPC has reached 3.7 m/s on MIT Mini-Cheetah on a treadmill [30].

Our results show higher running speeds in the nominal case, and comparable running speeds with loads representing 83% of the nominal robot mass.

V. CONCLUSION

In this work, we proposed learning high-speed quadruped running in Cartesian space tracked with impedance control. Compared to other recently proposed action spaces, results showed little needed reward shaping, improved sample efficiency, the emergence of natural gaits such as galloping and bounding, and ease of sim-to-sim transfer. We also showed robustness to environmental disturbances such as unperceived, uneven terrain as well as unknown added masses representing over 100% of the nominal mass of the robot, while still running at speeds comparable to the current state-of-the-art model-based approaches without loads. In addition to previously proposed dynamics randomization, we find that training with unknown added loads can play a significant role in successful sim-to-sim transfers.

Future work will further validate the method with a sim-to-real transfer. We also plan to incorporate exteroceptive measurements such as vision to improve dynamic capabilities over rougher terrain.
REFERENCES

[1] J. Di Carlo, P. M. Wensing, B. Katz, G. Bledt, and S. Kim, "Dynamic locomotion in the MIT Cheetah 3 through convex model-predictive control," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1–9.
[2] C. Gehring, S. Coros, M. Hutter, C. Dario Bellicoso, H. Heijnen, R. Diethelm, M. Bloesch, P. Fankhauser, J. Hwangbo, M. Hoepflinger, and R. Siegwart, "Practice makes perfect: An optimization-based approach to controlling agile motions for a quadruped robot," IEEE Robotics Automation Magazine, vol. 23, no. 1, pp. 34–43, 2016.
[3] C. D. Bellicoso, F. Jenelten, C. Gehring, and M. Hutter, "Dynamic locomotion through online nonlinear motion optimization for quadrupedal robots," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2261–2268, 2018.
[4] M. H. Raibert, Legged Robots That Balance. MIT Press, 1986.
[5] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter, "Control of a quadrotor with reinforcement learning," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2096–2103, 2017.
[6] A. Molchanov, T. Chen, W. Hönig, J. A. Preiss, N. Ayanian, and G. S. Sukhatme, "Sim-to-(multi)-real: Transfer of low-level robust control policies to multiple quadrotors," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 59–66.
[7] Z. Xie, P. Clary, J. Dao, P. Morais, J. Hurst, and M. van de Panne, "Learning locomotion skills for Cassie: Iterative design and sim-to-real," in Proceedings of the Conference on Robot Learning, ser. Proceedings of Machine Learning Research, L. P. Kaelbling, D. Kragic, and K. Sugiura, Eds., vol. 100. PMLR, 30 Oct–01 Nov 2020, pp. 317–329.
[8] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, "Learning agile and dynamic motor skills for legged robots," Science Robotics, vol. 4, no. 26, 2019.
[9] J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter, "Learning quadrupedal locomotion over challenging terrain," Science Robotics, vol. 5, no. 47, 2020. [Online]. Available: https://robotics.sciencemag.org/content/5/47/eabc5986
[10] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, "Sim-to-real: Learning agile locomotion for quadruped robots," in Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania, June 2018.
[11] X. B. Peng, E. Coumans, T. Zhang, T.-W. Lee, J. Tan, and S. Levine, "Learning agile robotic locomotion skills by imitating animals," 2020.
[12] A. Iscen, K. Caluwaerts, J. Tan, T. Zhang, E. Coumans, V. Sindhwani, and V. Vanhoucke, "Policies modulating trajectory generators," in Conference on Robot Learning. PMLR, 2018, pp. 916–926.
[13] M. Rahme, I. Abraham, M. L. Elwin, and T. D. Murphey, "Dynamics and domain randomized gait modulation with Bezier curves for sim-to-real legged locomotion," arXiv preprint arXiv:2010.12070, 2020.
[14] E. Coumans and Y. Bai, "PyBullet, a Python module for physics simulation for games, robotics and machine learning," http://pybullet.org, 2016–2019.
[15] R. Martín-Martín, M. A. Lee, R. Gardner, S. Savarese, J. Bohg, and A. Garg, "Variable impedance control in end-effector space: An action space for reinforcement learning in contact-rich tasks," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 1010–1017.
[16] F. Stulp, J. Buchli, A. Ellmer, M. Mistry, E. A. Theodorou, and S. Schaal, "Model-free reinforcement learning of impedance control in stochastic environments," IEEE Transactions on Autonomous Mental Development, vol. 4, no. 4, pp. 330–341, 2012.
[17] J. Buchli, E. Theodorou, F. Stulp, and S. Schaal, "Variable impedance control - a reinforcement learning approach," in Proceedings of Robotics: Science and Systems, Zaragoza, Spain, June 2010.
[18] J. Luo, E. Solowjow, C. Wen, J. A. Ojea, A. M. Agogino, A. Tamar, and P. Abbeel, "Reinforcement learning on variable impedance controller for high-precision robotic assembly," in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 3080–3087.
[19] X. B. Peng and M. van de Panne, "Learning locomotion skills using DeepRL: Does the choice of action space matter?" in Proceedings of the ACM SIGGRAPH / Eurographics Symposium on Computer Animation, ser. SCA '17. Association for Computing Machinery, 2017.
[20] G. Bellegarda and K. Byl, "Training in task space to speed up and guide reinforcement learning," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 2693–2699.
[21] G. Bellegarda and Q. Nguyen, "Robust quadruped jumping via deep reinforcement learning," arXiv preprint arXiv:2011.07089, 2020.
[22] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, ser. Adaptive Computation and Machine Learning. MIT Press, 1998.
[23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
[24] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, "Trust region policy optimization," CoRR, vol. abs/1502.05477, 2015. [Online]. Available: http://arxiv.org/abs/1502.05477
[25] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10–15 Jul 2018, pp. 1861–1870. [Online]. Available: http://proceedings.mlr.press/v80/haarnoja18b.html
[26] S. Fujimoto, H. Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in International Conference on Machine Learning, 2018, pp. 1582–1591.
[27] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," CoRR, vol. abs/1506.02438, 2015.
[28] F. Pardo, A. Tavakoli, V. Levdik, and P. Kormushev, "Time limits in reinforcement learning," in International Conference on Machine Learning. PMLR, 2018, pp. 4045–4054.
[29] Unitree Robotics. (2021, February) A1. [Online]. Available: https://www.unitree.com/products/a1/
[30] D. Kim, J. Di Carlo, B. Katz, G. Bledt, and S. Kim, "Highly dynamic quadruped locomotion via whole-body impulse control and model predictive control," arXiv preprint arXiv:1909.06586, 2019.
