Deep Tracking Control
This is the accepted version of Science Robotics Vol. 9, Issue 86, eadh5401 (2024)
DOI: 10.1126/scirobotics.adh5401
Legged locomotion is a complex control problem that requires both accuracy and robustness to cope with real-
world challenges. Legged systems have traditionally been controlled using trajectory optimization with inverse
dynamics. Such hierarchical model-based methods are appealing due to intuitive cost function tuning, accurate
planning, generalization, and most importantly, the insightful understanding gained from more than one decade
of extensive research. However, model mismatch and violation of assumptions are common sources of faulty
operation. Simulation-based reinforcement learning, on the other hand, results in locomotion policies with
unprecedented robustness and recovery skills. Yet, all learning algorithms struggle with sparse rewards emerging
from environments where valid footholds are rare, such as gaps or stepping stones. In this work, we propose
a hybrid control architecture that combines the advantages of both worlds to simultaneously achieve greater
robustness, foot-placement accuracy, and terrain generalization. Our approach utilizes a model-based planner
to roll out a reference motion during training. A deep neural network policy is trained in simulation, aiming to
track the optimized footholds. We evaluate the accuracy of our locomotion pipeline on sparse terrains, where
pure data-driven methods are prone to fail. Furthermore, we demonstrate superior robustness in the presence of
slippery or deformable ground when compared to model-based counterparts. Finally, we show that our proposed
tracking controller generalizes across different trajectory optimization methods not seen during training. In
conclusion, our work unites the predictive capabilities and optimality guarantees of online planning with the
inherent robustness attributed to offline learning.
INTRODUCTION

Trajectory optimization (TO) is a commonly deployed instance of optimal control for designing motions of legged systems and has a long history of successful applications in rough environments since the early 2010s [1, 2]. These methods require a model of the robot's kinematics and dynamics during runtime, along with a parametrization of the terrain. Until recently, most approaches have used simple models such as single rigid body [3] or inverted pendulum dynamics [4, 5], or have ignored the dynamic effects altogether [6]. Research has shifted towards more complex formulations, including centroidal [7] or full-body dynamics [8]. The resulting trajectories are tracked by a whole-body control (WBC) module, which operates at the control frequency and utilizes full-body dynamics [9]. Despite the diversity and agility of the resulting motions, there remains a considerable gap between simulation and reality due to unrealistic assumptions. Most problematic assumptions include perfect state estimation, occlusion-free vision, known contact states, zero foot-slip, and perfect realization of the planned motions. Sophisticated hand-engineered state machines are required to detect and respond to various special cases not accounted for in the modeling process. Nevertheless, highly dynamic jumping maneuvers performed by Boston Dynamics' bipedal robot Atlas demonstrate the potential power of TO.

Reinforcement learning (RL) has emerged as a powerful tool in recent years for synthesizing robust legged locomotion. Unlike model-based control, RL does not rely on explicit models. Instead, behaviors are learned, most often in simulation, through random interactions of agents with the environment. The result is a closed-loop control policy, typically represented by a deep neural network, that maps raw observations to actions. Handcrafted state machines become obsolete because all relevant corner cases are eventually visited during training. End-to-end policies, trained from user commands to joint target positions, have been deployed successfully on quadrupedal robots such as ANYmal [10, 11]. More advanced teacher-student structures have substantially improved the robustness, enabling legged robots to overcome obstacles through touch [12] and perception [13]. Although locomotion across gaps and stepping stones is theoretically possible, good exploration strategies are required to learn from the emerging sparse reward signals. So far, these terrains could only be handled by specialized policies, which intentionally overfit to one particular scenario [14] or a selection of similar terrain types [15–18]. Despite promising results, distilling a unifying locomotion policy may be difficult and has only been shown with limited success [19].

Some of the shortcomings that appear in RL can be mitigated using optimization-based methods. While the problem of sparse gradients still exists, two important advantages can be exploited: First, cost-function and constraint gradients can be computed with a small number of samples.
Fig. 1. Robust and precise locomotion in various indoor and outdoor environments. The marriage of model-free and model-based control
allows legged robots to be deployed in environments where steppable contact surfaces are sparse (bottom left) and environmental uncertainties
are high (top right).
Second, poor local optima can be avoided by pre-computing footholds [5, 8], pre-segmenting the terrain into steppable areas [7, 20], or by smoothing out the entire gradient landscape [21]. Another advantage of TO is its ability to plan actions ahead and predict future interactions with the environment. If model assumptions are generic enough, this allows for great generalization across diverse terrain geometries [7, 21].

The sparse gradient problem has been addressed extensively in the learning community. A notable line of research has focused on learning a specific task while imitating expert behavior. The expert provides a direct demonstration for solving the task [22, 23], or is used to impose a style while discovering the task [24–26]. These approaches require collecting expert data, commonly done offline, either through re-targeted motion capture data [24–26] or a TO technique [22, 23]. The reward function can now be formulated to be dense, meaning that agents can collect non-trivial rewards even if they do not initially solve the task. Nonetheless, the goal is not to preserve the expert's accuracy but rather to lower the sample and reward complexity by leveraging existing knowledge.

To further decrease the gap between the expert and the policy performance, we speculate that the latter should have insight into the expert's intentions. This requires online generation of expert data, which can be conveniently achieved using any model-based controller. Unfortunately, rolling out trajectories is often orders of magnitude more expensive than a complete learning iteration. To circumvent this problem, one possible alternative is to approximate the expert with a generative model, for instance, by sampling footholds from a uniform distribution [15, 16] or from a neural network [17, 27, 28]. However, for the former group, it might be challenging to capture the distribution of an actual model-based controller, while the latter group still does not solve the exploration problem itself.

In this work, we propose to guide exploration through the solution of TO. As such data will be available both on- and offline, we refer to it as "reference" and not expert motion. We utilize a hierarchical structure introduced in DeepLoco [28], where a high-level planner proposes footholds at a lower rate, and a low-level controller follows the footholds at a higher rate. Instead of using a neural network to generate the foothold plan, we leverage TO. Moreover, we do not only use the target footholds as an indicator for a rough high-level direction but as a demonstration of optimal foot placement.

The idea of combining model-based and model-free approaches is not new in the literature. For instance, supervised [29] and unsupervised [30, 31] learning has been used to warm-start nonlinear solvers. RL has been used to imitate [22, 23] or correct [32] motions obtained by solving TO problems. Conversely, model-based methods have been used to check the feasibility of learned high-level commands [27] or to track learned acceleration profiles [33]. Compared to [32], we do not learn corrective joint torques around an existing WBC, but instead learn the mapping from reference signals to joint positions in an end-to-end fashion.

To generate the reference data, we rely on an efficient TO method called terrain-aware motion optimization for legged systems (TAMOLS) [21]. It optimizes over footholds and base pose simultaneously, thereby enabling the robot to operate at its kinematic limits. We let the policy observe only a small subset of the solution, namely planar footholds, desired joint positions, and the contact schedule. We found that these observations are more robust under the common pitfalls of model-based control, while still providing enough information to solve the locomotion task. In addition, we limit computational costs arising from solving the optimization problems by utilizing a variable update rate. During deployment, the optimizer runs at the fastest possible rate to account for model uncertainties and external disturbances.

Our approach incorporates elements introduced in [14], such as time-based rewards and position-based goal tracking. However, we reward desired foothold positions at planned touch-down instead of rewarding a desired base pose at an arbitrarily chosen time. Finally, we use an asymmetric actor-critic structure similar to [22], where we provide privileged ground truth information to the value function and noisified measurements to the network policy.

We trained more than 4000 robots in parallel for two weeks on challenging ground covering a surface area of more than 76000 m². Throughout the entire training process, we generated and learned from about 23 years of optimized trajectories. The combination of offline training and online re-planning results in accurate, agile, and robust locomotion. As showcased in Fig. 1 and movie 1, with our hybrid control pipeline, ANYmal [34] can skillfully traverse parkours with high precision and confidently overcome uncertain environments with high robustness. Without the need for any post-training, the tracking policy can be deployed zero-shot with different TO methods at different update rates. Moreover, movie 2 demonstrates successful deployment in search-and-rescue scenarios, which demand both accurate foot placement and robust recovery skills. The contributions of our work are therefore twofold: Firstly, we enable the deployment of model-based planners in rough and uncertain real-world environments. Secondly, we create a single unifying locomotion policy that generalizes beyond the limitations imposed by state-of-the-art RL methods.

RESULTS

In order to evaluate the effectiveness of our proposed pipeline, hereafter referred to as Deep Tracking Control (DTC), we compared it with four different approaches: two model-based controllers, TAMOLS [21] and a nonlinear model predictive control (MPC) method presented in [7], and two data-driven methods, as introduced in [13] and [11]. We refer to those as baseline-to-1 (TAMOLS), baseline-to-2 (MPC), baseline-rl-1 (teacher/student policy), and baseline-rl-2 (RL policy), respectively. These baselines mark the state of the art in MPC and RL prior to this work, and they have been tested and deployed under various conditions. Unless noted otherwise, all experiments were conducted in the real world.

Evaluation of Robustness

We conducted three experiments to evaluate the robustness of our hybrid control pipeline. The intent is to demonstrate survival skills on slippery ground, and recovery reflexes when visual data is not consistent with proprioception or is absent altogether. We rebuilt harsh environments that are likely to be encountered on sites of natural disasters, where debris might further break down when stepped onto, and construction sites, where oil patches create slippery surfaces.

In the first experiment, we placed a rectangular cover plate with an area of 0.78 × 1.19 m² on top of a box with the same length and width, and height 0.37 m (Fig. 2 A). The cover plate was shifted to the front by half of the box's length. ANYmal was then steered over the cover plate, which pitched down as soon as its center of mass passed beyond the edge of the box. Facing only forward and backward, the plate's movement was not detected through the depth cameras and could only be perceived through proprioceptive sensors. Despite the error between map and odometry reaching up to 0.4 m, the robot managed to successfully balance itself. This experiment was repeated three times with consistent outcomes.

In our second experiment (Fig. 2 B), we created an obstacle parkour with challenging physical properties. A large wooden box with a sloped front face was placed next to a wet and slippery whiteboard. We increased the difficulty by placing a soft foam box in front, and a rolling transport cart on top of the wooden box. The robot was commanded to walk over the objects with random reference velocities for approximately 45 seconds, after which the objects were redistributed to their original locations to account for any potential displacement.
Fig. 2. Evaluation of robustness. (A) ANYmal walks along a loose cover plate that eventually pitches forward (left to right, top to bottom).
The third row shows ANYmal’s perception of the surroundings during the transition and recovery phase. (B) The snapshots are taken at critical
time instances when walking on slippery ground, just before complete recovery. (C) ANYmal climbs upstairs with disabled perception (top to
bottom). The collision of the right-front end-effector with the stair tread triggers a swing reflex, visualized in orange.
This experiment was repeated five times. Despite not being trained on movable or deforming obstacles, the robot demonstrated its recovery skills in all five trials without any falls.

The tracking policy was trained with perceptive feedback, meaning that the policy and the motion planner had partial or complete insight into the local geometrical landscape. Nevertheless, the locomotion policy was still capable of overcoming many obstacles completely blind. To simulate a scenario with damaged depth sensors, we let ANYmal blindly walk over a stair with two treads, each 0.18 m high and 0.29 m wide (Fig. 2 C). The experiment was repeated three times up and down, with an increasing heading velocity selected from {±0.5, ±0.75, ±1.0} m/s. In some cases, a stair tread was higher than the swing motion of a foot. Thanks to a learned swing reflex, the stair set could be successfully cleared in all trials. We note that the same stair set was passed by a blindfolded version of baseline-rl-1 [13], which was trained in a complex teacher/student environment. In contrast, our method relies on an asymmetric actor/critic structure, achieving a similar level of robustness. Accompanying video clips can be found in the supplementary movie S1.

Evaluation of Accuracy

We demonstrate the precision of foothold tracking by devising a complex motion that requires the robot to perform a turn-in-place maneuver on a small surface area of 0.94 × 0.44 m². The robot was commanded to walk up a slope onto a narrow table, then to execute a complete 360 deg turn, and finally to descend onto a pallet. Some snapshots of the experiment are provided in Fig. 3 A, whereas the full video is contained in movie S2.

To evaluate the quality of the foothold tracking, we collected data while ANYmal walked on flat ground. Each experiment lasted for approximately 20 s and was repeated with eight different heading velocities selected from {±1.0, ±0.8, ±0.6, ±0.4} m/s. We measured the tracking error as the smallest horizontal distance between a foot and its associated foothold during a stance phase. As shown in Fig. 3 B, the footholds could be tracked with a very high precision of 2.3 cm and a standard deviation of 0.48 cm when averaged over the broad spectrum of heading velocity commands.

The choice of observations hides the optimized base pose from the policy. Some terrains within the training environment can be seen as a combination of gaps and boxes, where each box is surrounded by a gap. During training, TAMOLS placed the footholds sufficiently far away from the box to avoid stepping into the gap. This allowed the policy to learn climbing maneuvers without knee joint collisions. Baseline-to-2, being aware of the spatial coordinates of the knees, naturally produces a similar foothold pattern, even in the absence of the gap.

Benchmark Against Model-Based Control

TO was proven to be effective in solving complex locomotion tasks in simulation, such as the parkour shown in Fig. 4 A. This parkour has been successfully traversed by ANYmal using baseline-to-1 and baseline-to-2, while it was found to be non-traversable for baseline-rl-1 and baseline-rl-2 [7]. With our proposed approach, ANYmal could cross the same obstacle parkour in simulation back and forth at a speed of 1 m/s, which was 20 % faster than baseline-to-1. The corresponding video clip can be found in movie S3.

Model-based controllers react sensitively to violations of model assumptions, which hinders applications in real-world scenarios, where, for instance, uncertainties in friction coefficients, contact states, and visual perception may be large. This issue is exemplified in Fig. 4 B, where baseline-to-1 was used to guide ANYmal over a flat floor with an invisible gap. When the right front foot stepped onto the trap, the planned and executed motions deviated from each other. This triggered a sequence of heuristic recovery strategies. For large mismatches, however, such scripted reflexes were not effective and resulted in failure. DTC uses the same high-level planner but incorporates learned recovery and reflex skills. This allowed the robot to successfully navigate through the trap. The robustness is rooted in the ability to ignore both perception and reference motion while relying only on proprioception. Such behavior was learned in simulation by experiencing simulated map drift. The experiment was repeated five times with baseline-to-1, five times with baseline-to-2, and five times with our method, consistently leading to similar results. The video clips corresponding to the above experiments can be found in movie S3. The movie is further enriched with a comparison of baseline-to-2 against DTC on soft materials, which impose very similar challenges.
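The tracking-error metric used in the accuracy evaluation above (the smallest horizontal distance between a foot and its associated foothold during a stance phase) can be computed as in the following minimal sketch. This is an illustration only, not the authors' evaluation code; the array layout and names (foot_xy_during_stance, foothold_xy) are hypothetical.

```python
import numpy as np

def foothold_tracking_error(foot_xy_during_stance: np.ndarray,
                            foothold_xy: np.ndarray) -> float:
    """Smallest horizontal (xy) distance between a foot and its planned
    foothold over one stance phase.

    foot_xy_during_stance: (T, 2) measured planar foot positions while in contact.
    foothold_xy:           (2,)   planned foothold from the trajectory optimizer.
    """
    dists = np.linalg.norm(foot_xy_during_stance - foothold_xy[None, :], axis=1)
    return float(dists.min())

# Errors of many stance phases would then be averaged per velocity command.
```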
[Fig. 3 plot panels; see caption below. Panel (A) shows the artistic motion; panels (B) and (C) plot the planar foothold tracking error (0 to 0.1 m) of the four legs LF, RF, LH, and RH against heading velocity commands from −1 to 1 m/s.]
Fig. 3. Evaluation of tracking performance. (A) ANYmal climbs up a narrow table, turns, and descends back down to a box. The second
image in the second row shows the robot’s perception of the environment. (B) Euclidean norm of the planar foothold error, averaged over 20 s of
operation using a constant heading velocity. The solid/dashed curves represent the average/maximum tracking errors. (C) Same representation as
in (B), but the data was collected with baseline-to-2. (D) DTC deployed with baseline-to-2, enabling ANYmal to climb up a box of 0.48 m.
Fig. 4. Benchmarking against model-based control. (A) DTC successfully traverses an obstacle parkour (left to right) in simulation with a
heading velocity of 1 m/s. (B) Baseline-to-1 falls after stepping into a gap hidden from the perception (left to right). (C) ANYmal successfully
overcomes a trapped floor using our hybrid control architecture (left to right).
With DTC, the robot crossed a sequence of four gaps of the same width while relying only on online-generated maps.

The limitations of baseline-rl-1 were previously demonstrated [7] on the obstacle parkour of Fig. 4 A, showing its inability to cross the stepping stones. We showcase the generality of our proposed control framework by conducting three experiments on stepping stones in the real world, each with an increased level of difficulty. The first experiment (Fig. 6 A) required ANYmal to traverse a field of equally sized stepping stones, each providing a contact surface of 0.2 × 0.2 m². The robot passed the 2.0 m long field 10 times. Despite the varying heading velocity commands, the robot accurately hit the correct stepping stones as indicated by the solution of the TO. For the second experiment (Fig. 6 B), we increased the height of two randomly selected stones. The parkour was successfully crossed four out of four times. In the final experiment (Fig. 6 C), we distributed three elevated platforms a, b, and c, connected by loose wooden blocks of sizes 0.31 × 0.2 × 0.2 m³ and 0.51 × 0.2 × 0.2 m³. This environment posed considerable challenges as the blocks may move and flip over when stepped on. Following the path a → b → a → b → c → a, the robot missed only one stepping stone, which, however, did not lead to failure. Video clips of the stepping stones experiments are provided in movie S5.

Simulation-Based Ablation Study

During training, we computed a solution to the TO problem after variable time intervals, but mainly after each foot touch-down. Although such a throttled rate greatly reduced computational costs, it also led to poor reactive behavior in the presence of quickly changing external disturbances, dynamic obstacles, or map occlusion. Moreover, during training the optimizer was updated using privileged observations, whereas, in reality, the optimizer is subject to elevation map drift, wrongly estimated friction coefficients, and unpredicted external forces. To compensate for such modeling errors, we deploy the optimizer in MPC fashion. We investigated the locomotion performance as a function of the optimizer update rate. Using the experimental setup outlined in the supplementary methods (section "Experimental Setup for Evaluation of Optimizer Rate"), we collected a total of six days of data in simulation. A robot was deemed "successful" if it could walk from the center to the border of its assigned terrain patch, "failed" if its torso made contact with the environment within its patch, and "stuck" otherwise. We report success and failure rates in Fig. 7 A. Accordingly, when increasing the update rate from 1 Hz to 50 Hz, the failure rate dropped by 7.11 %, whereas the success rate increased by 4.25 %.

In the second set of experiments, we compared our approach against baseline-rl-2 as well as against the same policy trained within our training environment. We refer to the emerging policy as baseline-rl-3. More details regarding the experimental setup can be found in the supplementary methods (section "Experimental Setup for Performance Evaluation"). As depicted in Fig. 7 B i, our approach exhibited a substantially higher success rate than baseline-rl-2. By learning on the same terrains, baseline-rl-3 could catch up but still did not match our performance in terms of success rate. The difference mainly originates from the fact that the retrained baseline still failed to solve sparse-structured terrains. To highlight this observation, we evaluated the performance on four terrain types with sparse ("stepping stones", "beams", "gaps", and "pallets") and on four types with dense stepping locations ("stairs", "pit", "rough slope", and "rings"). On all considered terrain types, our approach outperformed baseline-rl-2 by a huge margin (Fig. 7 B ii), thereby demonstrating that learned locomotion generally does not extrapolate well to unseen scenarios. We performed equally well as baseline-rl-3 on dense terrains, but scored notably higher on sparse-structured terrains. This result suggests that the proposed approach itself was effective and that favorable locomotion skills were not encouraged by the specific training environment.

In an additional experiment, we investigated the effects of erroneous predictions of the high-level planner on locomotion performance. We did so by adding a drift value to the elevation map, sampled uniformly from the interval (0, 0.5) m.
Fig. 5. Benchmarking against reinforcement learning. (A) Baseline-rl-1 attempts to cross a small gap. ANYmal initially manages to recover
from mis-stepping with its front legs but subsequently gets stuck as its hind legs fall inside the gap. (B) Using baseline-rl-1, the robot stumbles
along a narrow beam. (C) With DTC, the robot can pass four consecutive large gaps (left to right) without getting stuck or falling. (D) ANYmal
is crossing a long beam using the proposed control framework.
Fig. 6. Evaluation of the locomotion performance on stepping stones. (A) ANYmal reliably crosses a field of flat stepping stones (left to
right). (B) The robot crosses stepping stones of varying heights (left to right). The two tall blocks are highlighted in blue. (C) ANYmal navigates
through a field of loosely connected stepping stones, following the path a → b → a → b → c → a.
[Fig. 7 plot panels; see caption below. Legends: success, failed, and stuck rates; methods ours, baseline-rl-2, and baseline-rl-3; terrain types stairs, pit, rough slope, rings, stepping stones, pallets, beams, and gaps; panel (C) "tracking performance and success rate as a function of map drift (all terrains)" on flat and rough ground; panel (D) learning curves with and without joint-space foothold observations.]
Fig. 7. Simulation results and ablation studies. (A) Success and failure rates of DTC, recorded for different update rates of the optimizer.
The upper limit of 50 Hz is imposed by the policy frequency. (B) Comparison against baseline policies. (i) Evaluation on all 120 terrains. (ii)
Evaluation on terrains where valid footholds are dense (white background) and sparse (gray background). (C) Influence of elevation map drift on
the locomotion performance, quantified by tracking error (i), success rate on rough (ii), and on flat ground (iii). (D) Average terrain level (i) and
average foothold reward (ii) scored during training.
Contrary to training, the motion was optimized over the perturbed height map. Other changes to the experimental setup are described in the supplementary methods (section "Experimental Setup for Performance Evaluation under Drift"). As visualized in Fig. 7 C, we collected tracking error, success, and failure rates with simulated drift on flat and rough ground. The tracking error grew mostly linearly with the drift value (Fig. 7 C i). On rough terrains, the success rate remained constant for drift values smaller than 0.1 m, and decreased linearly for larger values (Fig. 7 C ii). On the other hand, success and failure rates were not affected by drift on flat ground (Fig. 7 C iii).

We found that providing joint positions computed for the upcoming touch-down event greatly improved convergence time and foothold tracking performance. This signal encodes the foothold location in joint space, thus providing a useful hint for foothold tracking. It also simplifies the learning process, as the network is no longer required to implicitly learn the inverse kinematics (IK). Evidence for our claims is given in Fig. 7 D, showing two relevant learning curves. Tracking accuracy is represented by the foothold rewards, whereas technical skills are quantified using the average terrain level [11]. Both scores are substantially higher if the footholds could be observed in both task and joint space.

DISCUSSION

This work demonstrates the potential of a hybrid locomotion pipeline that combines the accurate foot placement and dynamic agility of state-of-the-art TO with the inherent robustness and reflex behaviors of novel RL control strategies. Our approach enables legged robots to overcome complex environments that either method alone would struggle with. As such terrains are commonly found in construction sites, mines, and collapsed buildings, our work could help advance the deployment of autonomous legged machines in the fields of construction, maintenance, and search-and-rescue.

We have rigorously evaluated the performance in extensive real-world experiments over the course of about half a year. We included gaps, stepping stones, narrow beams, and tall boxes in our tests, and demonstrated that our method outperformed the RL baseline controller on every single terrain. Next, we evaluated the robustness on slippery and soft ground, each time outperforming two model-based controllers.

Furthermore, we have shown that the emerging policy can track the motion of two different planners utilizing the same trotting gait. This was possible because the observed footholds seem to be mostly invariant under the choice of the optimizer. However, certain obstacles may encourage the deployed planner to produce footprint patterns that otherwise do not emerge during training. In this case, we would expect a degraded tracking performance.

In addition to our main contribution, we have demonstrated several other notable results. First, our policy, which was trained exclusively with visual perception, is still able to generalize to blind locomotion. Second, a simple multilayer perceptron (MLP) trained with an asymmetric actor/critic setup achieves similar robust behaviors as much more complex teacher/student trainings [12, 13]. Third, our locomotion policy can handle a lot of noise and drift in the visual data without relying on complicated gated networks, which might be difficult to tune and train [13].

Contrary to our expectations, the proposed training environment was found to not be more sample efficient than similar unifying RL approaches [11, 13]. The large number of epochs required for convergence suggests that foothold accuracy is something intrinsically complicated to learn.

In this work, we emphasized that TO and RL share complementary properties and that no single best method exists to address the open challenges in legged locomotion. The proposed control architecture leverages this observation by combining the planning capabilities of the former and the robustness properties of the latter. It does, by no means, constitute a universal recipe to integrate the two approaches in an optimal way for a generic problem. Moreover, one could even extend the discussion with self- and unsupervised learning, indirect optimal control, dynamic programming, and stochastic optimal control. Nevertheless, our results may motivate future research to incorporate the aspect of planning into the concept of RL.

We see several promising avenues for future research. Many successful data-driven controllers have the ability to alter the stride duration of the trotting gait. We expect a further increase in survival rate and technical skills if the network policy could suggest an arbitrary contact schedule to the motion optimizer. Moreover, a truly hybrid method, in which the policy can directly modify the cost function of the planner, may be able to generate more diversified motions. Our results indicate that IK is difficult to learn. To increase the sample efficiency and improve generalization across different platforms, a more sophisticated network structure could exploit prior knowledge of analytical IK. Another potential research direction may focus on leveraging the benefits of sampling trajectories from an offline buffer. This could substantially reduce the training time and allow for the substitution of TAMOLS with a more accurate TO method, or even expert data gathered from real animals.

MATERIALS AND METHODS

Motivation

To motivate the specific architectural design, we first identify the strengths and weaknesses of the two most commonly used control paradigms in legged locomotion.

TO amounts to open-loop control, which produces suboptimal solutions in the presence of stochasticity, modeling errors, and small prediction windows. Unfortunately, these methods also introduce many assumptions, mostly to reduce computation time or achieve favorable numerical properties. For instance, the feet are almost always pre-selected interaction points to prevent complex collision constraints, contact and actuator dynamics are usually omitted or smoothed out to circumvent stiff optimization problems, and the contact schedule is often pre-specified to avoid the combinatorial problem imposed by the switched system dynamics. Despite a large set of strong assumptions, real-time capable planners are not always truly real-time. The reference trajectories are updated at around 5 Hz [31] to 100 Hz [7] and realized between 400 Hz and 1000 Hz. In other words, these methods do not plan fast enough to catch up with the errors they are making. While structural [2] or environmental [7, 20] decomposition may further contribute to the overall suboptimality, they were found useful for extracting good local solutions on sparse terrains. Because the concept of planning is not restricted to the tuning domain, model-based approaches tend to generalize well across different terrain geometries [7, 21]. Moreover, since numerical solvers perform very cheap and sparse operations on the elevation map, the map resolution can be arbitrarily small, facilitating accurate foothold planning.

RL, on the other hand, leads to policies that represent global closed-loop control strategies. Deep neural networks are large-capacity models and, as such, can represent locomotion policies without introducing any assumption about the terrain or the system (except for being Markovian). They exhibit good interpolation in-between visited states but do not extrapolate well to unseen environments. Despite their large size, the inference time is usually relatively small. The integration of an actuator model has been demonstrated to improve sim-to-real transfer [10], while the stochasticity in the system dynamics and training environment can effectively be utilized to synthesize robust behaviors [12, 13].
In contrast to TO, the elevation map is typically down-sampled [11, 13] to avoid immense memory consumption during training.

In summary, TO might be better suited if good generalization and high accuracy are required, whereas RL is the preferred method if robustness is of concern or onboard computational power is limited. As locomotion combines challenges from both of these fields, we formulate the goal of this work as follows: RL shall be used to train a low-level tracking controller that provides markedly more robustness than classical inverse dynamics. The accuracy and planning capabilities of model-based TO shall be leveraged on a low level to synthesize a unifying locomotion strategy that supports diverse and generalizing motions.

Reference Motions

Designing a TO problem for control always involves a compromise that trades off physical accuracy and generalization against good numerical conditioning, low computation time, convexity, smoothness, availability of derivatives, and the necessity of a high-quality initial guess. In our work, we generate the trajectories using TAMOLS [21]. Unlike other similar methods, it does not require terrain segmentation nor pre-computation of footholds, and its solutions are robust under varying initial guesses. The system dynamics and kinematics are simplified, allowing for fast updates. During deployment, we also compare against baseline-to-2, which builds on more complex kinodynamics. Due to the increased computation time and in particular the computationally demanding map-processing pipeline, this method is not well-suited to be used directly within the learning process (the training time would be expected to be about eight times larger).

We added three crucial features to TAMOLS: First, we enable parallelization on CPU, which allows multiple optimization problems to be solved simultaneously. Second, we created a python interface using pybind11 [35], enabling it to run in a python-based environment. Finally, we assume that the measured contact state always matches the desired contact state. This renders the TO independent of contact estimation, which typically is the most fragile module in a model-based controller.

The optimizer requires a discretized 2.5D representation of its environment, a so-called elevation map, as input. We extract the map directly from the simulator by sampling the height across a fixed grid. For both training and deployment, we use a fixed trotting gait with a stride duration of 0.93 s and a swing phase of 0.465 s, and set the resolution of the grid map to 0.04 × 0.04 m².

Overview of the Training Environment

The locomotion policy π(a | o) is a stochastic distribution of actions a ∈ A that are conditioned on observations o ∈ O, parametrized by an MLP. The action space comprises target joint positions that are tracked using a PD controller, following the approach in [10] and related works [12–14].

Given the state s ∈ S, we extract the solution at the next time step x′(s) ∈ X ⊆ S from the optimizer, which includes the four footholds p_i*, i = 0, ..., 3, the joint positions q* at touch-down time, and the base trajectory evaluated at the next time step. The base trajectory consists of the base pose b*(Δt), twist ḃ*(Δt), and linear and angular acceleration b̈*(Δt). More details can be found in Fig. 8 A. We then sample an action from the policy. It is used to forward simulate the system dynamics, yielding a new state s′ ∈ S, as illustrated in Fig. 8 B.

To define a scalar reward r(s, s′, x′, a), we use a monotonically decreasing function of the error between the optimized and measured states, that is r ∝ x′(s) ⊖ x(s′). The minus operator ⊖ is defined on the set X, the vector x′(s) is the optimized state, and x(s′) is the state of the simulator after extracting it on the corresponding subset. The policy network can also be understood as a learned model reference adaptive controller with the optimizer being the reference model.

In this work, we use an asymmetric actor/critic method for training. The value function approximation V(o, õ) uses privileged observations õ ∈ Õ as well as policy observations o.

Observation Space

The value function is trained on policy observations and privileged observations, while the policy network is trained on the former only [22]. All observations are given in the robot-centric base frame. The definition of the observation vector is given below, whereas noise distributions and dimensionalities of the observation vectors can be found in the supplementary methods and Table 2.

Policy Observations

The policy observations comprise proprioceptive measurements such as base twist, gravity vector, joint positions, and joint velocities. The history only includes previous actions [11]. Additional observations are extracted from the model-based planner, including planar coordinates of foothold positions (xy coordinates), desired joint positions at touch-down time, desired contact state, and time left in the current phase. The latter two are per-leg quantities that fully describe the gait pattern. Footholds only contain planar coordinates since the height can be extracted from the height scan.

The height scan, which is an additional part of the observation space, enables the network to anticipate a collision-free swing leg trajectory. In contrast to similar works, we do not construct a sparse elevation map around the base [11, 27] or the feet [13]. Instead, we sample along a line connecting the current foot position with the desired foothold (Fig. 8 A). This approach has several advantages: First, the samples can be denser by only scanning terrain patches that are most relevant for the swing leg. Second, it prevents the network from extracting other information from the map, which is typically exposed to most uncertainty (for instance, occlusion, reflection, odometry drift, discretization error, etc.). Third, it allows us to conveniently model elevation map drift as a per-foot quantity, which means that each leg can have its own drift value.

We use analytical IK to compute the desired joint positions. As the motion optimizer may not provide a swing trajectory, as is the case for TAMOLS, we completely skip the swing phase. This means that the IK is computed with the desired base pose and the measured foot position for a stance leg, and the target foothold for a swing leg.

It is worth noting that we do not provide the base pose reference as an observation. This was found to reduce sensitivity to mapping errors and to render the policy independent of the utilized planner. Finally, to allow the network to infer the desired walking direction, we add the reference twist (before optimization) to the observation space.

Privileged Observations

The privileged observations contain the optimized base pose, base twist, and base linear and angular acceleration, extracted one time step ahead. In addition, the critics can observe signals confined to the simulator, such as the external base wrench, external foot forces, the measured contact forces, friction coefficients, and elevation map drift.

Reward Functions

The total reward is computed as a weighted combination of several individual components, which can be categorized as follows: "tracking" of reference motions, encouraging "consistent" behavior, and other "regularization" terms necessary for successful sim-to-real transfer. The reward functions are explained below, whereas weights and parameters are reported in Table 3.
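To make the planner-derived part of the policy observation described above more concrete, the following sketch assembles the per-leg quantities and samples the height scan along the line from the current foot position to the desired foothold. It is a simplified illustration under stated assumptions (a dense 2.5D height map with 0.04 m resolution and a fixed number of samples per leg); names such as PlannerOutput, sample_height, and planner_observation are hypothetical, not part of the released implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class PlannerOutput:
    """Subset of the TO solution observed by the policy (per Fig. 8 A)."""
    footholds_xy: np.ndarray      # (4, 2) planar target footholds, base frame
    q_touchdown: np.ndarray       # (12,)  desired joint positions at touch-down
    contact_state: np.ndarray     # (4,)   desired contact flags per leg
    phase_time_left: np.ndarray   # (4,)   time left in the current phase per leg

def sample_height(height_map: np.ndarray, resolution: float, xy: np.ndarray) -> np.ndarray:
    """Nearest-cell lookup in a 2.5D elevation map centered at the map origin."""
    idx = np.clip(np.round(xy / resolution).astype(int) + np.array(height_map.shape) // 2,
                  0, np.array(height_map.shape) - 1)
    return height_map[idx[:, 0], idx[:, 1]]

def line_height_scan(foot_xy: np.ndarray, foothold_xy: np.ndarray,
                     height_map: np.ndarray, resolution: float = 0.04,
                     n_samples: int = 10) -> np.ndarray:
    """Heights sampled along the line from the current foot position to its foothold."""
    alphas = np.linspace(0.0, 1.0, n_samples)[:, None]
    points = (1.0 - alphas) * foot_xy[None, :] + alphas * foothold_xy[None, :]
    return sample_height(height_map, resolution, points)

def planner_observation(plan: PlannerOutput, feet_xy: np.ndarray,
                        height_map: np.ndarray) -> np.ndarray:
    """Concatenate the planner-derived part of the policy observation vector."""
    scans = [line_height_scan(feet_xy[i], plan.footholds_xy[i], height_map)
             for i in range(4)]
    return np.concatenate([plan.footholds_xy.ravel(), plan.q_touchdown,
                           plan.contact_state, plan.phase_time_left,
                           np.concatenate(scans)])
```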
Research Article ETH Zurich 13
Fig. 8. Overview of the training method and deployment strategy. (A) The optimized solution provides footholds p_i*, desired base pose b*, twist ḃ*, and acceleration b̈* (extracted one policy step Δt ahead), as well as desired joint positions q*. Additionally, a height scan h is sampled between the foot position p_i and the corresponding foothold. (B) Training environment: The optimizer runs in parallel to the simulation. At each leg touch-down, a new solution x′ is generated. The policy π drives the system response s′ toward the optimized solution x′(s), which is encouraged using the reward function r. Actor observations are perturbed with the noise vector n, while critics and the TO receive ground truth data. (C) Deployment: Given the optimized footholds, the network computes target joint positions that are tracked using a PD control law. The state estimator (state) returns the estimated robot state, which is fed back into the policy and the optimizer. (D) The list of terrain types includes a) stairs, b) combinations of slopes and gaps, c) pyramids, d) sloped rough terrain, e) stepping stones, f) objects with randomized poses, g) boxes with tilted surfaces, h) rings, i) pits, j) beams, k) hovering objects with randomized poses, and l) pallets.
Base Pose Tracking

To achieve tracking of the reference base pose trajectory, we use

r_Bn = exp( −σ_Bn · || b*^(n)(t + Δt) ⊖ b^(n)(t) ||^2 ),    (1)

where n ∈ {0, 1, 2} is the derivative order, b(t) is the measured base pose, b*(t + Δt) is the desired base pose sampled from the reference trajectory one policy step Δt ahead, and ⊖ denotes the quaternion difference for the base orientation and the vector difference otherwise. We refer to the above reward function as a "soft" tracking task because large values can be scored even if the tracking error does not perfectly vanish.

To further analyze the reward function, we decompose the base trajectory into three segments. The "head" starts at time zero, the "tail" stops at the prediction horizon, and the "middle" connects these two segments with each other. A logarithmic reward function would prioritize the tracking of the trajectory head, while a linear penalty would focus on making progress along the whole trajectory at once. In contrast, the exponential shape of the reward function splits the tracking task into several steps. During the initial epochs, the tracking errors of the trajectory middle and tail will likely be relatively large and thus do not contribute notably to the reward gradient. As a result, the network will minimize the tracking error of the trajectory head. Once its effect on the gradient diminishes, the errors corresponding to the trajectory middle will dominate the gradient landscape. In the final training stages, tracking is mostly improved around the trajectory tail.

Foothold Tracking

We choose a logarithmic function

r_pi = −ln( || p_i* − p_i ||^2 + ε ),    (2)

to learn foothold tracking, where p_i is the current foot position of leg i ∈ {0, ..., 3}, p_i* is the corresponding desired foothold, and 0 < ε ≪ 1 is a small number ensuring the function is well defined. The above reward function may be termed a "hard" tracking task, as the maximum value can only be scored if the error reaches zero. As the tracking improves, the gradients will become larger, resulting in even tighter tracking toward the later training stages.

A dense reward structure typically encourages a stance foot to be dragged along the ground to further minimize the tracking error. To prevent such drag motions from emerging, the above reward is given for each foot at most once during one complete gait cycle: more specifically, if and only if the leg is intended to be in contact and the norm of the contact force indicates a contact, that is if || f_i || > 1, then the agent receives the reward.

Consistency

In RL for legged locomotion, hesitating to move over challenging terrains is a commonly observed phenomenon that prevents informative samples from being gathered and thus impedes the agent's performance. This behavior can be explained by insufficient exploration: The majority of agents fail to solve a task while a small number of agents achieve higher average rewards by refusing to act. To overcome this local optimum, we propose to encourage consistency by rewarding actions that are intended by previous actions. In our case, we measure consistency as the similarity between two consecutive motion optimizations. If the solutions are similar, the agent is considered to be "consistent". We measure similarity as the Euclidean distance between two adjacent solutions and write

r_c = Σ_{δt·j + t_0 ∈ (T_a ∩ T_b)} −δt · || b_a*(δt·j + t_{0,a}) ⊖ b_b*(δt·j + t_{0,b}) || − w_p · || p_a* − p_b* ||.    (3)

Here, p_t* with t = {a, b} is a vector of stacked footholds, w_p > 0 is a relative weight, δt = 0.01 s is the discretization time of the base trajectory, and t_0 is the time elapsed since the solution was retrieved. The index a refers to the most recent solution, while b refers to the previous solution. It is important to note that the two solution vectors x_a and x_b, from which we extract the base and footholds, are only defined on their respective time intervals given by the optimization horizon τ_h, i.e., t_a ∈ T_a = [0, τ_{h,a}] and t_b ∈ T_b = [0, τ_{h,b}].

Regularization

To ensure that the robot walks smoothly, we employ two different penalty terms enforcing complementary constraints. The first term, r_r1 = −Σ_i | v_i^T f_i |, discourages foot-scuffing and end-effector collisions by penalizing power measured at the feet. The second term, r_r2 = −Σ_i ( q̇_i^T τ_i )^2, penalizes joint power to prevent arbitrary motions, especially during the swing phase. Other regularization terms are stated in the supplementary methods (section "Implementation Details").

Training Environment

To train the locomotion policy, we employ a custom version of Proximal Policy Optimization (PPO) [36] and a training environment that is mostly identical to that introduced in [11]. It is explained in more detail in the supplementary methods (section "Training Details") and Table 1. Simulation and back-propagation are performed on GPU, while the optimization problems are solved on CPU.

Termination

We use a simple termination condition where an episode is terminated if the base of the robot makes contact with the terrain.

Domain Randomization

We inject noise into all observations except for those designated as privileged. At each policy step, a noise vector n is sampled from a uniform distribution and added to the observation vector, with the only exceptions of the desired joint positions and the height scan.

For the elevation map, we add noise before extracting the height scan. The noise is sampled from an approximate Laplace distribution where large values are less common than small ones. We perturb the height scan with a constant offset, which is sampled from another approximate Laplace distribution for each foot separately. Both perturbations discourage the network from relying extensively on perceptive feedback and help to generalize to various perceptive uncertainties caused by odometry drift, occlusion, and soft ground.

All robots are artificially pushed by adding a twist offset to the measured twist at regular time instances. Friction coefficients are randomized per leg once at initialization time. To render the motion robust against disturbances, we perturb the base with an external wrench and the feet with external forces. The latter slightly stiffens up the swing motion but improves tracking performance in the presence of unmodeled joint frictions and link inertia. The reference twist is resampled in constant time intervals and then held constant.

The solutions for the TO problems are obtained using ground truth data, which include the true friction coefficients, the true external base wrench, and a noise-free height map. In the presence of simulated noise, drift, and external disturbances, the policy network is therefore trained to reconstruct a base trajectory that the optimizer would produce given the ground truth data. However, there is a risk that the network learns to remove the drift from the height scan by analyzing the desired joint positions. During hardware deployment, such a reconstruction will fail because the optimizer is subject to the same height drift. To mitigate this issue, we introduce noise to the desired joint position observations, sampled from a uniform distribution with boundaries proportional to the drift value.
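As a numerical illustration of the tracking rewards in Eqs. 1 to 3 and the two regularization penalties above, consider the following sketch. It is a minimal re-implementation for illustration only; weight values, array shapes, and the vector-difference simplification are assumptions rather than the reported training configuration (Table 3).

```python
import numpy as np

def base_tracking_reward(b_star_n, b_n, sigma_n):
    """Eq. 1: soft exponential tracking of the n-th base-trajectory derivative.
    A plain vector difference stands in for the generalized difference
    (quaternion difference for orientation, vector difference otherwise)."""
    err = np.asarray(b_star_n) - np.asarray(b_n)
    return np.exp(-sigma_n * float(err @ err))

def foothold_reward(p_star, p, desired_contact, contact_force, eps=1e-4):
    """Eq. 2: hard logarithmic foothold tracking, granted only when the leg is
    meant to be in contact and a contact is measured (||f|| > 1)."""
    if not (desired_contact and np.linalg.norm(contact_force) > 1.0):
        return 0.0
    return -np.log(np.linalg.norm(np.asarray(p_star) - np.asarray(p)) ** 2 + eps)

def consistency_reward(b_star_a, b_star_b, p_star_a, p_star_b, dt=0.01, w_p=1.0):
    """Eq. 3: similarity of two consecutive optimizations, evaluated on the
    overlapping part of their horizons (arrays assumed pre-aligned in time)."""
    n = min(len(b_star_a), len(b_star_b))
    base_term = sum(-dt * np.linalg.norm(b_star_a[j] - b_star_b[j]) for j in range(n))
    return base_term - w_p * np.linalg.norm(np.asarray(p_star_a) - np.asarray(p_star_b))

def regularization_penalties(foot_velocities, foot_forces, joint_velocities, joint_torques):
    """r_r1 penalizes power at the feet (scuffing/collisions); r_r2 penalizes joint power."""
    r_r1 = -sum(abs(float(v @ f)) for v, f in zip(foot_velocities, foot_forces))
    r_r2 = -sum(float(qd @ tau) ** 2 for qd, tau in zip(joint_velocities, joint_torques))
    return r_r1, r_r2
```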
Terrain Curriculum cameras of the same type. For the second Version C, the depth cameras
We use a terrain curriculum as introduced in [11]. Before the training were replaced with two identical Robosense Bpearl dome LiDAR
process, terrain patches of varying types and difficulties are generated, sensors. For the outdoor experiments, we mostly used this robot, as
and each agent is assigned a terrain patch. As an agent acquires more the Bpearls tend to be more robust against lighting conditions. Motion
skills and can navigate the current terrain, its level is upgraded, which optimization and the forward propagation of the network policy are
means it will be re-spawned on the same terrain type, but with a harder done on a single Intel core-i7 8850H machine. Elevation mapping [37]
difficulty. We have observed that the variety of terrains encountered runs on a dedicated onboard Nvidia Jetson.
during training influences the sim-to-real transfer. We thus have in-
cluded a total of 12 different terrain types with configurable parameters REFERENCES
(Fig. 8 D), leading to a total of 120 distinguishable terrain patches. The
terrain types classify different locomotion behaviors, s.a. climbing 1. J. Z. Kolter, M. P. Rodgers, A. Y. Ng, A control architecture for
(“stairs”, “pits”, “boxes”, “pyramids”), reflexing (“rough”, “rings”, quadruped locomotion over rough terrain, 2008 IEEE International
“flying objects”), and walking with large steps (“gaps”, “pallets”, “step- Conference on Robotics and Automation, 811–818 (2008).
2. M. Kalakrishnan, J. Buchli, P. Pastor, M. Mistry, S. Schaal, Fast, robust
ping stones”, “beams”, “objects with randomized poses”). Our terrain
quadruped locomotion over challenging terrain, 2010 IEEE International
curriculum consists of 10 levels, where one of the configurable param- Conference on Robotics and Automation, 2665–2670 (2010).
eters is modulated to increase or decrease its difficulty. This results in 3. A. W. Winkler, C. D. Bellicoso, M. Hutter, J. Buchli, Gait and trajectory
a total of 1200 terrain patches, each with a size of 8 × 8 m2 , summing optimization for legged systems through phase-based end-effector
up to a total area of 76800 m2 , which is approximately the size of 14 parameterization, IEEE Robotics and Automation Letters 1560–1567
football fields or 10 soccer fields. (2018).
4. C. Mastalli, I. Havoutis, M. Focchi, D. G. Caldwell, C. Semini, Mo-
Training tion planning for quadrupedal locomotion: Coupled planning, terrain
mapping, and whole-body control, IEEE Transactions on Robotics 1635–
Solving the TO problems at the policy frequency during training was 1648 (2020).
found to provoke poor local optima. In such a case, the optimizer 5. F. Jenelten, T. Miki, A. E. Vijayan, M. Bjelonic, M. Hutter, Perceptive lo-
adapts the solution after each policy step: If the agent is not able to comotion in rough terrain – online foothold optimization, IEEE Robotics
follow the reference trajectory, the optimizer will adapt to the new and Automation Letters 5370–5376 (2020).
state s.t. the tracking problem becomes feasible again. This means that 6. P. Fankhauser, M. Bjelonic, C. Dario Bellicoso, T. Miki, M. Hutter,
the agent can exhibit “lazy” behavior and still collect some rewards. Robust rough-terrain locomotion with a quadrupedal robot, 2018 IEEE
We prevent such a local optimum by updating the optimizer only at International Conference on Robotics and Automation (ICRA), 5761–
a leg touch-down (after 0.465 seconds). This also greatly reduces 5768 (2018).
7. R. Grandia, F. Jenelten, S. Yang, F. Farshidian, M. Hutter, Perceptive
learning time because computational costs are reduced by a factor of
locomotion through nonlinear model-predictive control, IEEE Transac-
23. After a robot fell (on average, once every 18 seconds), was pushed
tions on Robotics 1–20 (2023).
(after 10 seconds) or its twist commands changed (three times per 8. C. Mastalli, W. Merkt, G. Xin, J. Shim, M. Mistry, I. Havoutis, S. Vijayaku-
episode), the optimized trajectories are no longer valid. To guarantee mar, Agile maneuvers in legged robots: a predictive control approach
that the locomotion policy generalizes across different update rates, we (2022).
additionally recompute the solution in all those scenarios. 9. C. D. Bellicoso, F. Jenelten, C. Gehring, M. Hutter, Dynamic locomotion
We trained the policy with a massive parallelization of 642 = 4096 through online nonlinear motion optimization for quadrupedal robots,
robots, for a total of 90000 epochs. Each epoch consisted of 45 learning IEEE Robotics and Automation Letters 2261–2268 (2018).
iterations where each iteration covered a duration of 0.02 seconds. 10. J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun,
Considering the variable update rate explained previously, this resulted M. Hutter, Learning agile and dynamic motor skills for legged robots,
Science Robotics p. eaau5872 (2019).
in a total of 8295 days (or 23 years) of optimized trajectories. The
11. N. Rudin, D. Hoeller, P. Reist, M. Hutter, Learning to walk in minutes
policy can be deployed after about one day of training (6000 epochs),
using massively parallel deep reinforcement learning, 5th Annual Con-
reaches 90 % of its peak performance after three days (20000 epochs), ference on Robot Learning (2021).
and is fully converged after two weeks (90000 epochs). 12. J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, M. Hutter, Learning
In comparison, the baseline-rl-1 policy was trained for 4000 epochs with 1000 parallelized robots over 5 consecutive days. Each epoch lasted for 5 seconds, resulting in a throughput of 46 simulated seconds per second. Our policy was trained for 14 days, with each epoch lasting for 0.9 seconds, leading to a throughput of 27 simulated seconds per second. Thus, despite generating 1.6 years of desired motions per day, our approach has only a 1.7 times lower throughput than the baseline.
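These figures can be cross-checked from the quantities reported above; the short calculation below is purely illustrative.

    # Cross-check of the reported training statistics (illustrative only;
    # all numbers are taken from the text above).
    print(45 * 0.02)                        # 0.9 s per epoch, as stated
    baseline_sim_s = 4000 * 5.0 * 1000      # epochs x epoch length x robots
    baseline_wall_s = 5 * 24 * 3600         # 5 days of training
    print(baseline_sim_s / baseline_wall_s) # ~46 simulated seconds per second
    print(46 / 27)                          # ~1.7x lower throughput for our policy
    print(8295 / 14 / 365)                  # ~1.6 years of optimized motion per day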
Deployment

We deploy the policy at a frequency of 50 Hz without any fine-tuning. The motion optimizer runs at the largest possible rate in a separate thread. For TAMOLS with a trotting gait, this is around 400 Hz and for baseline-to-2 around 100 Hz (both are faster than the policy frequency). At each step, the policy queries the most recent solution from the thread pool and extracts it ∆t = 0.02 s ahead of the most recent time index.
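The interaction between the asynchronous optimizer thread and the 50 Hz policy can be summarized by the following minimal sketch; planner_thread, latest_solution, and sample are placeholder names rather than the actual interface.

    # Minimal sketch of the asynchronous planner query at deployment
    # (placeholder names; the optimizer runs in its own thread).
    POLICY_DT = 0.02  # the policy runs at 50 Hz

    def policy_step(policy, planner_thread, observation):
        # The optimizer keeps overwriting its most recent solution at ~400 Hz
        # (TAMOLS, trotting) or ~100 Hz (baseline-to-2); the policy never blocks.
        solution = planner_thread.latest_solution()
        # Evaluate the reference 0.02 s ahead of the solution's most recent
        # time index, i.e. at the time the action will take effect.
        reference = solution.sample(solution.latest_time + POLICY_DT)
        return policy(observation, reference)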
For our experiments, we used three different ANYmal robots [34], two of version C and one of version D, for which we trained different policies. ANYmal C is by default equipped with four Intel RealSense D435 depth cameras, whereas ANYmal D has eight depth cameras.
5. F. Jenelten, T. Miki, A. E. Vijayan, M. Bjelonic, M. Hutter, Perceptive locomotion in rough terrain – online foothold optimization, IEEE Robotics and Automation Letters 5370–5376 (2020).
6. P. Fankhauser, M. Bjelonic, C. Dario Bellicoso, T. Miki, M. Hutter, Robust rough-terrain locomotion with a quadrupedal robot, 2018 IEEE International Conference on Robotics and Automation (ICRA), 5761–5768 (2018).
7. R. Grandia, F. Jenelten, S. Yang, F. Farshidian, M. Hutter, Perceptive locomotion through nonlinear model-predictive control, IEEE Transactions on Robotics 1–20 (2023).
8. C. Mastalli, W. Merkt, G. Xin, J. Shim, M. Mistry, I. Havoutis, S. Vijayakumar, Agile maneuvers in legged robots: a predictive control approach (2022).
9. C. D. Bellicoso, F. Jenelten, C. Gehring, M. Hutter, Dynamic locomotion through online nonlinear motion optimization for quadrupedal robots, IEEE Robotics and Automation Letters 2261–2268 (2018).
10. J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, M. Hutter, Learning agile and dynamic motor skills for legged robots, Science Robotics p. eaau5872 (2019).
11. N. Rudin, D. Hoeller, P. Reist, M. Hutter, Learning to walk in minutes using massively parallel deep reinforcement learning, 5th Annual Conference on Robot Learning (2021).
12. J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, M. Hutter, Learning quadrupedal locomotion over challenging terrain, Science Robotics p. eabc5986 (2020).
13. T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, M. Hutter, Learning robust perceptive locomotion for quadrupedal robots in the wild, Science Robotics p. eabk2822 (2022).
14. N. Rudin, D. Hoeller, M. Bjelonic, M. Hutter, Advanced skills by learning locomotion and local navigation end-to-end, 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2497–2503 (2022).
15. Z. Xie, H. Y. Ling, N. H. Kim, M. van de Panne, Allsteps: Curriculum-driven learning of stepping stone skills, Computer Graphics Forum 39 (2020).
16. H. Duan, A. Malik, J. Dao, A. Saxena, K. Green, J. Siekmann, A. Fern, J. Hurst, Sim-to-real learning of footstep-constrained bipedal dynamic walking, 2022 International Conference on Robotics and Automation (ICRA), 10428–10434 (2022).
17. W. Yu, D. Jain, A. Escontrela, A. Iscen, P. Xu, E. Coumans, S. Ha, J. Tan, T. Zhang, Visual-locomotion: Learning to walk on complex terrains with vision, Proceedings of the 5th Conference on Robot Learning, A. Faust, D. Hsu, G. Neumann, eds., 1291–1302 (PMLR, 2022).
18. A. Agarwal, A. Kumar, J. Malik, D. Pathak, Legged locomotion in challenging terrains using egocentric vision, 6th Annual Conference on Robot Learning (2022).
19. K. Caluwaerts, A. Iscen, J. C. Kew, W. Yu, T. Zhang, D. Freeman, K.-H. Lee, L. Lee, S. Saliceti, V. Zhuang, N. Batchelor, S. Bohez, F. Casarini, J. E. Chen, O. Cortes, E. Coumans, A. Dostmohamed, G. Dulac-Arnold, A. Escontrela, E. Frey, R. Hafner, D. Jain, B. Jyenis, Y. Kuang, E. Lee, L. Luu, O. Nachum, K. Oslund, J. Powell, D. Reyes, F. Romano, F. Sadeghi, R. Sloat, B. Tabanpour, D. Zheng, M. Neunert, R. Hadsell, N. Heess, F. Nori, J. Seto, C. Parada, V. Sindhwani, V. Vanhoucke, J. Tan, Barkour: Benchmarking animal-level agility with quadruped robots (2023).
20. R. J. Griffin, G. Wiedebach, S. McCrory, S. Bertrand, I. Lee, J. Pratt, Footstep planning for autonomous walking over rough terrain, 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids), 9–16 (2019).
21. F. Jenelten, R. Grandia, F. Farshidian, M. Hutter, Tamols: Terrain-aware motion optimization for legged systems, IEEE Transactions on Robotics 3395–3413 (2022).
22. P. Brakel, S. Bohez, L. Hasenclever, N. Heess, K. Bousmalis, Learning coordinated terrain-adaptive locomotion by imitating a centroidal dynamics planner, 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 10335–10342 (2022).
23. M. Bogdanovic, M. Khadiv, L. Righetti, Model-free reinforcement learning for robust locomotion using demonstrations from trajectory optimization, Frontiers in Robotics and AI 9 (2022).
24. X. B. Peng, P. Abbeel, S. Levine, M. van de Panne, Deepmimic: Example-guided deep reinforcement learning of physics-based character skills, ACM Transactions on Graphics 37 (2018).
25. X. B. Peng, Z. Ma, P. Abbeel, S. Levine, A. Kanazawa, Amp: Adversarial motion priors for stylized physics-based character control, ACM Transactions on Graphics 40 (2021).
26. S. Bohez, S. Tunyasuvunakool, P. Brakel, F. Sadeghi, L. Hasenclever, Y. Tassa, E. Parisotto, J. Humplik, T. Haarnoja, R. Hafner, M. Wulfmeier, M. Neunert, B. Moran, N. Siegel, A. Huber, F. Romano, N. Batchelor, F. Casarini, J. Merel, R. Hadsell, N. Heess, Imitate and repurpose: Learning reusable robot movement skills from human and animal behaviors (2022).
27. V. Tsounis, M. Alge, J. Lee, F. Farshidian, M. Hutter, Deepgait: Planning and control of quadrupedal gaits using deep reinforcement learning, IEEE Robotics and Automation Letters 3699–3706 (2020).
28. X. B. Peng, G. Berseth, K. Yin, M. van de Panne, Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning, ACM Transactions on Graphics (Proc. SIGGRAPH 2017) 36 (2017).
29. O. Melon, M. Geisert, D. Surovik, I. Havoutis, M. Fallon, Reliable trajectories for dynamic quadrupeds using analytical costs and learned initializations (2020).
30. D. Surovik, O. Melon, M. Geisert, M. Fallon, I. Havoutis, Learning an expert skill-space for replanning dynamic quadruped locomotion over obstacles, Proceedings of the 2020 Conference on Robot Learning, J. Kober, F. Ramos, C. Tomlin, eds., 1509–1518 (PMLR, 2021).
31. O. Melon, R. Orsolino, D. Surovik, M. Geisert, I. Havoutis, M. Fallon, Receding-horizon perceptive trajectory optimization for dynamic legged locomotion with learned initialization, 2021 IEEE International Conference on Robotics and Automation (ICRA), 9805–9811 (2021).
32. S. Gangapurwala, M. Geisert, R. Orsolino, M. Fallon, I. Havoutis, Rloc: Terrain-aware legged locomotion using reinforcement learning and optimal control, IEEE Transactions on Robotics 2908–2927 (2022).
33. Z. Xie, X. Da, B. Babich, A. Garg, M. van de Panne, Glide: Generalizable quadrupedal locomotion in diverse environments with a centroidal model, Algorithmic Foundations of Robotics XV, S. M. LaValle, J. M. O'Kane, M. Otte, D. Sadigh, P. Tokekar, eds., 523–539 (Springer International Publishing, Cham, 2023).
34. M. Hutter, C. Gehring, D. Jud, A. Lauber, C. D. Bellicoso, V. Tsounis, J. Hwangbo, K. Bodie, P. Fankhauser, M. Bloesch, R. Diethelm, S. Bachmann, A. Melzer, M. Hoepflinger, Anymal - a highly mobile and dynamic quadrupedal robot, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 38–44 (2016).
35. W. Jakob, J. Rhinelander, D. Moldovan, pybind11 – seamless operability between c++11 and python (2017). https://github.com/pybind/pybind11.
36. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, CoRR abs/1707.06347 (2017).
37. T. Miki, L. Wellhausen, R. Grandia, F. Jenelten, T. Homberger, M. Hutter, Elevation mapping for locomotion and navigation using gpu (2022).
38. F. Jenelten, J. He, F. Farshidian, M. Hutter, Evaluation of tracking performance and robustness for a hybrid locomotion controller, https://doi.org/10.5061/dryad.b5mkkwhkq.

ACKNOWLEDGMENTS

Funding: This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme grant agreement No 852044. This research was supported by the Swiss National Science Foundation (SNSF) as part of project No 188596, and by the Swiss National Science Foundation through the National Centre of Competence in Research Robotics (NCCR Robotics). Author Contribution: F.J. formulated the main ideas, trained and tested the policy using baseline-to-1, and conducted most of the experiments. J.H. interfaced baseline-to-2 with the tracking policy and conducted the box-climbing experiments. F.F. contributed to the Theory and improved some of the original ideas. All authors helped to write, improve, and refine the paper. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All (other) data needed to evaluate the conclusions in the paper are present in the paper or the Supplementary Materials. Data-sets and code to generate all our figures are made publicly available [38].