
SCIENCE ROBOTICS | RESEARCH ARTICLE

ANIMAL ROBOTS

Learning robust perceptive locomotion for quadrupedal robots in the wild

Takahiro Miki1*, Joonho Lee1, Jemin Hwangbo2, Lorenz Wellhausen1, Vladlen Koltun3, Marco Hutter1

1Robotic Systems Lab, ETH-Zürich, Zürich, Switzerland. 2Robotics & Artificial Intelligence Lab, KAIST, Daejeon, Korea. 3Intelligent Systems Lab, Intel, Jackson, WY, USA.
*Corresponding author. Email: [email protected]

Copyright © 2022 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works.

Legged robots that can operate autonomously in remote and hazardous environments will greatly increase opportunities for exploration into underexplored areas. Exteroceptive perception is crucial for fast and energy-efficient locomotion: Perceiving the terrain before making contact with it enables planning and adaptation of the gait ahead of time to maintain speed and stability. However, using exteroceptive perception robustly for locomotion has remained a grand challenge in robotics. Snow, vegetation, and water visually appear as obstacles on which the robot cannot step or are missing altogether due to high reflectance. In addition, depth perception can degrade due to difficult lighting, dust, fog, reflective or transparent surfaces, sensor occlusion, and more. For this reason, the most robust and general solutions to legged locomotion to date rely solely on proprioception. This severely limits locomotion speed because the robot has to physically feel out the terrain before adapting its gait accordingly. Here, we present a robust and general solution to integrating exteroceptive and proprioceptive perception for legged locomotion. We leverage an attention-based recurrent encoder that integrates proprioceptive and exteroceptive input. The encoder is trained end to end and learns to seamlessly combine the different perception modalities without resorting to heuristics. The result is a legged locomotion controller with high robustness and speed. The controller was tested in a variety of challenging natural and urban environments over multiple seasons and completed an hour-long hike in the Alps in the time recommended for human hikers.

INTRODUCTION
Legged robots can carry out missions in challenging environments that are too far or too dangerous for humans, such as hazardous areas and the surfaces of other planets. Legs can walk over challenging terrain with steep slopes, steps, and gaps that may impede wheeled or tracked vehicles of similar size. There has been notable progress in legged robotics (1–5), and several commercial platforms are being deployed in the real world (6–10).
However, until now, legged robots could not match the performance of animals in traversing challenging real-world terrain. Many legged animals, such as humans and dogs, can briskly walk or run in such environments by foreseeing the upcoming terrain and planning their footsteps based on visual information (11). Animals naturally combine proprioception and exteroception to adapt to highly irregular terrain shape and surface properties such as slipperiness or softness, even when visual perception is limited. Endowing legged robots with this ability is a grand challenge in robotics.
One of the biggest difficulties lies in the reliable interpretation of incomplete and noisy perception for control. Exteroceptive information provided by onboard sensors is incomplete and often unreliable in real-world environments. Stereo camera–based depth sensors, which most existing legged robots rely on (6, 9, 12), require texture to perform stereo matching and consequently struggle with low-texture surfaces or when parts of the image are under- or overexposed. Time-of-flight (ToF) cameras often fail to perceive dark surfaces and become noisy under sunlight (13). In general, sensors that rely on light to infer distance are prone to producing artifacts on highly reflective surfaces, because the sensors assume that light travels in a straight path. In addition, depth sensors by nature cannot distinguish soft unstable surfaces, such as vegetation, from rigid ones.
An elevation map is commonly used to represent geometric terrain information extracted from depth sensor measurements (14–17). It relies on the robot's estimated pose and is therefore affected by errors in this estimate. Other common sources of uncertainty in the map are occlusion or temporal inconsistency of the measurements due to dynamic objects. Most existing methods that rely on onboard terrain perception are still vulnerable to these failures.
Conventional approaches assume that the terrain information and any uncertainties encoded in the map are reasonably accurate, and the focus shifts solely to generating the motion. Offline methods use a prescanned terrain map, compute a handcrafted cost function over the map, and optimize a trajectory that is replayed on the robot (18, 19). They assume perfect knowledge of the full terrain and robot states and plan complex motions with long planning times. Online methods generally use a similar approach but use only onboard resources to construct a map and continuously replan trajectories during execution (20–24). Recently, faster locomotion has been achieved by reducing the planning time with heuristics (25–27) or by using convolutional neural networks to calculate foothold costs more efficiently (27). A bipedal robot, Atlas, has demonstrated parkour over complex obstacles (28); it leverages a preplanned motion reference and optimizes its motion online by using onboard LiDAR data. Overall, the focus of all the approaches mentioned above is on picking footholds and generating trajectories given accurate terrain information. Some works (14, 17) represent the statistical uncertainty of the measurements in the map, but its use is limited to heuristically defined foot placement rules that avoid risky areas (24). Such methods can only handle explicitly modeled uncertainties and are not robust to the variety of perception failures encountered in the wild.
Data-driven methods have recently been introduced to incorporate more complex dynamics without compromising real-time performance.

Learning-based quadrupedal or bipedal locomotion for simulated characters has been achieved by using reinforcement learning (RL) (29–32), and realistic robot models were used in recent works (33). However, these works were only conducted in simulation. Recently, RL-based locomotion controllers have been successfully transferred to physical robots (3, 4, 34–40). Hwangbo et al. (3, 41) realized quadrupedal locomotion and recovery on flat ground with a physical robot by using learned actuator dynamics to facilitate simulation-to-reality (sim-to-real) transfer. Lee et al. (4) extended this approach and enabled rough-terrain locomotion by simulating challenging terrain in a privileged training setup with an adaptive curriculum. Peng et al. (35) used imitation learning to transfer animal motion to a legged robot. However, these methods do not use any visual information.
To add exteroceptive information to locomotion learning, Gangapurwala et al. (42) combined a learning-based foothold planner and a model-based whole-body motion controller to transfer policies to the real world in a laboratory setting. Their applications are limited to rigid terrain with mostly flat surfaces and are still constrained in their deployment range. Their performance is tightly bound to the quality of the map, which often becomes unreliable in the field.
In both model-based and learning-based approaches, the assumption of flawless map quality precludes the application of these methods in uncontrolled outdoor environments. Handling uncertainties in terrain perception remains an open problem. Existing controllers avoid catastrophic failures by simply refraining from using visual information in outdoor environments (2, 4, 38) or by adding heuristically defined reflex rules (43, 44).
Here, we present a terrain-aware locomotion controller for quadrupedal robots that overcomes the limitations of previous approaches and enables robust traversal of harsh natural terrain at unprecedented speeds (Movie 1). At its core, the controller is based on a principled solution to incorporating exteroceptive perception into locomotion control.
The key component is a recurrent encoder that combines proprioception and exteroception into an integrated belief state. The encoder is trained in simulation to capture ground-truth information about the terrain given exteroceptive observations that may be incomplete, biased, and noisy. The belief state encoder is trained end to end to integrate proprioceptive and exteroceptive data without resorting to heuristics. It learns to take advantage of the foresight afforded by exteroception to plan footholds and accelerate locomotion when exteroception is reliable and can seamlessly fall back to robust proprioceptive locomotion when needed. The learned controller thus combines the best of both worlds: the speed and efficiency afforded by exteroception and the robustness of proprioception.
The controller is trained via privileged learning (45). We first train a teacher policy via RL with full access to privileged information in the form of the ground-truth state of the environment. This privileged training enables the teacher policy to discover the optimal behavior given perfect knowledge of the terrain. We then train a student policy that only has access to information that is available in the field on the physical robot. The student policy is built around our belief state encoder and trained via imitation learning. The student policy learns to predict the teacher's optimal action given only partial and noisy observations of the environment.
Once the student policy is trained, we deploy it on the robot without any fine-tuning. The controller gets onboard sensor observations and a desired velocity command and outputs each joint's target position as the action. The robot perceives the environment by leveraging a robot-centric elevation map. The elevation map serves as an abstraction layer between sensors and the locomotion controller, making our method independent of depth sensor choices. It works without fine-tuning across different sensors, such as stereo cameras or LiDAR. Because the policy was trained to handle large noise, bias, and gaps in the elevation map, the robot can continue walking even when mapping fails or the sensors are physically broken.
The presented approach achieves substantial improvements over the state of the art (4) in locomotion speed and obstacle traversability while maintaining exceptional robustness. Our key contribution is a method for combining multimodal perception, together with a demonstration through extensive hardware experiments that the resulting control policy is robust against various exteroceptive failures. Handling exteroception failures has been a challenging problem in robotics. Our approach constitutes a general framework for robust deployment of complex autonomous machines in the wild.

RESULTS
Fast and robust locomotion in the wild
We deployed our controller in a wide variety of terrain, as shown in Fig. 1 and Movie 1. This includes alpine, forest, underground, and urban environments. The controller was consistently robust and had zero falls during all deployments. Because of the exteroceptive perception, the robot could anticipate the terrain and adapt its motion to achieve fast and smooth walking. This was particularly notable for structures that require high foot clearance, such as stairs and large obstacles. The robot was able to leverage exteroceptive input to conquer terrain that was beyond the capabilities of prior work that did not use exteroception (4).
ANYmal successfully traversed challenging natural environments with steep inclination, slippery surfaces, grass, and snow (Fig. 1, A to J). The robot was robust under these conditions, even when occlusion and surface properties such as high reflectance impeded exteroception. Our controller was also robustly deployed in underground environments with loose gravel, sand, dust, water, and limited illumination (Fig. 1, K to N).
Urban environments also present important challenges (Fig. 1, O to R). For traversing stairs, the state-of-the-art quadrupedal robot Spot from Boston Dynamics requires that a dedicated mode is engaged, and the robot must be properly oriented with respect to the stairs [(44), p. 33]. In contrast, our controller does not require any special mode for stairs and can traverse stairs natively in any direction and any orientation, such as sideways, diagonally, and turning around on the stairway. See movie S1 for demonstrations of smooth and robust stair traversal in arbitrary directions with our controller.
The controller was also robust to combinations of different challenges, as can be seen with snow on stairs in Fig. 1R. Snow makes stairs slippery and yields incomplete and erroneous exteroceptive data. Depth sensors either fail due to the high reflectivity of snow or estimate the surface profile to be on top of the snow, whereas the robot's legs sink below this level. Foot slippage in snow can also cause large drift in the kinematic pose estimation (46), making the map even more inconsistent. Nevertheless, the controller remained consistently robust, with zero failures in this regime as well.


Fig. 1. Robust locomotion in the wild. The presented locomotion controller was extensively tested in a variety of complex environments such as natural [(A) to (J)],
underground [(K) to (N)] or various stairs [(O) to (R)] over multiple seasons. The controller overcame a whole spectrum of real-world challenges, often encountering them
in combination. These include slippery surfaces [(M) and (R)], steep inclinations [(C) to (E)], complex terrain, and vegetation in natural environments [(B), (C), (F), and (I)]. In
search-and-rescue scenarios, the controller dealt with steep stairs [(F), (G), and (O) to (R)], unknown payloads (I), and perception-degrading fog (P). Reflective surfaces (N),
loose ground [(K) and (M)], low light, and water puddles were encountered in underground cave systems [(K) to (N)]. Soft and slippery snow piled up in the winter [(J) and
(R)]. The controller traversed these environments with zero failures.

A hike in the Alps
To further evaluate the robustness of our controller, we conducted a hiking experiment in which we tested whether ANYmal could complete an hour-long hiking loop on the Etzel mountain in Switzerland. The hiking route was 2.2 km long, with an elevation gain of 120 m. Completing the trail required traversing steep inclinations, high steps, rocky surfaces, slippery ground, and tree roots (Fig. 2). As seen in Movie 2, ANYmal completed the entire hike without any failure, stopping only to fix a detached shoe and swap batteries.
The robot was able to reach the summit in 31 min, which is faster than the expected human hiking duration indicated in the official signage (35 min, as shown in Fig. 2), and finished the entire path in 78 min, virtually the same duration suggested by a hiking planner (76 min), which rates the hike "difficult" (47). The difficulty levels are chosen from "easy," "moderate," and "difficult," calculated by combining the required fitness level, sport type, and the technical complexity (48).

Fig. 2. A hike on the Etzel mountain in Switzerland, completed by ANYmal with our locomotion controller. The 2.2-km route, with 120 m of elevation gain and inclinations up to 38%, encompasses a variety of challenging terrains [(A) to (I)]. ANYmal reached the summit faster than the human time indicated in the official signage and finished the entire route in virtually the same time as given by a hiking guide (47).

During the hike, the controller faced various challenges. The ascending path reached inclinations of up to 38% with rocky and wet surfaces (Fig. 2, B and C). On the descent through a forest, tree roots formed intricate obstacles, and the ground proved very slippery (Fig. 2, G and H).
Vegetation above the robot sometimes introduced severe artifacts into the estimated elevation map. Despite all the challenges, the robot finished the hike without any human help and without a single fall.

Exteroceptive challenges
In this section, we examine how the terrain was perceived by the robot under conditions that are challenging for exteroception. The robot perceives the environment in the form of height samples from an elevation map constructed from point-cloud input, as seen in Fig. 3A. We used LiDAR in some experiments (Fig. 3, D to G) and active stereo cameras in others (Fig. 3, B and C) to test the robustness of the controller to the sensing modality.
We encountered many circumstances in which exteroception provides incomplete or misleading input. As shown in Fig. 3 (B to G), the estimated elevation map can be unreliable due to sensing failures, limitations of the 2.5D height map representation, or viewpoint restrictions due to onboard sensing.
Because most depth sensors rely on light to infer distance, either through ToF measurements or stereo disparity, they commonly struggle with reflective or translucent surfaces. Figure 3B shows such a sensing failure, where the reflective metal floor induced large depth outliers that appear as a trench in the elevation map. Figure 3C shows a sensing failure in the presence of snow. Because snow is highly reflective and has very little texture, stereo cameras could not infer depth, which led to an empty map.
The 2.5D elevation map representation cannot accurately represent overhanging objects such as tree branches or low ceilings (17). These were integrated into the height field and were misrepresented as tall obstacles (Fig. 3D). In addition, because the map cannot distinguish between rigid and soft materials, the map gave misleading information in soft vegetation or deep snow (Fig. 3E).
Slippery or deformable surfaces caused odometry drift because they violate the assumption of stable footholds commonly adopted by kinematic pose estimators (46). Because map construction relies on such pose estimation to register consecutive input point clouds, the map became inaccurate in such circumstances (Fig. 3F). Furthermore, because the sensors were only located on the robot itself, areas behind structures were occluded and not presented in the map, which was especially problematic during uphill walking (Fig. 3G).
Overall, our controller could handle all of these challenging conditions gracefully, without a single failure. The belief state estimator was trained to assess the reliability of exteroceptive information and made use of it to the extent possible. When exteroceptive information was incomplete, noisy, or misleading, the controller could always gracefully degrade to proprioceptive locomotion, which was shown to be robust (4). The controller thus aims to achieve the best of both worlds: achieving fast predictive locomotion when exteroceptive information is informative but seamlessly retaining the robustness of proprioceptive control when it is not.

Movie 1. Wild ANYmal: Robust zero-shot perceptive locomotion.

Evaluating the contribution of exteroception
We conducted controlled experiments to quantitatively evaluate the contribution of exteroception. We compared our controller with a proprioceptive baseline (4) that does not use exteroception.
First, we compared the success rate of overcoming fixed-height steps as shown in Fig. 4A. Wooden steps of various heights (from 12 to 36.5 cm) were placed ahead of the robot, which performed 10 trials to overcome each step with a fixed velocity command. A trial was considered successful if the robot overcame the step within 5 s. The success rate of the proprioceptive baseline dropped at a 20-cm step height, when the front legs started frequently getting stuck at the step (Fig. 4B). Even when the front legs successfully overcame the step, the hind legs often failed to fully step up. In contrast, our controller reliably traversed steps of up to 30.5 cm in height. Because our controller could anticipate the step, it lifted its legs higher without making physical contact first and leaned its body forward to let the hind leg swing over the step (Fig. 4A). Up to this height, the dominant failure mode was the robot evading the step sideways rather than falling. When approaching steps higher than 32 cm, our controller hesitated to walk forward because it learned that steps of such height are at or above the robot's physical limits and are likely to incur a high cost.
We also tested the two controllers in an obstacle course, as shown in Fig. 4 (C and D). In this experiment, the robot was given a fixed path over the obstacles and tracked it using a pure pursuit controller (49). The path traverses several types of obstacles: an inclined platform, a raised platform, stairs, and a pile of blocks. The platforms are 20 cm high, the stairs are 17 cm high and 29 cm deep each, and the blocks are each 20 cm in both height and depth. Our controller followed the given path smoothly without any assistance, as shown in Fig. 4C. The exteroceptive perception provided advance information on the upcoming obstacles, allowing the controller to adjust the robot's motion before it made contact with the obstacles, facilitating fast and smooth motion through the obstacle course. The baseline, on the other hand, failed to track the path without human assistance. During execution, it got stuck on all three obstacles, and we had to lift and push the robot to continue the experiment (Fig. 4D).
In addition, we measured the maximum locomotion speed of both controllers over flat ground and in the presence of obstacles. Figure 4E shows the experimental setup. We gave the controller a constant forward, lateral, or turning command and recorded the velocity on flat ground and over a 20-cm step. Note that the baseline controller only receives a directional command and learns to walk as fast as possible in the commanded direction (4).


Our controller walked at 1.2 m/s, whereas the baseline could only achieve 0.6 m/s on flat ground in both the forward and lateral directions. The difference became even more pronounced over the obstacle. Our controller could traverse the obstacle without any notable slowdown, whereas the baseline was stymied. The turning velocity showed the biggest difference between the baseline policy and ours. Our controller could turn at 3 rad/s, but the baseline policy could only turn at 0.6 rad/s: a fivefold difference.
These results show gains by our controller over the proprioceptive baseline. Exteroception enabled our controller to traverse challenging environments more successfully and at higher speeds in comparison with pure proprioception. Further quantitative performance evaluation is provided in section S2.

Evaluating robustness with belief state visualization
To examine how our controller integrates proprioception and exteroception, we conducted a number of controlled experiments. We tested with two types of obstacles that provide ambiguous or misleading exteroceptive input: an opaque foam obstacle that appears solid but cannot support a foothold and a solid but transparent obstacle. We placed each obstacle ahead of the robot and commanded the robot to walk forward at a constant velocity.
The sensors perceived the foam block as solid, and the robot consequently prepared to step on it but could not achieve a stable foothold due to the deformation of the foam. Figure 5A shows how the internal belief state (blue) was revised as the robot encountered the misleading obstacle: The controller initially trusted the exteroceptive input (red) but quickly revised its estimate of terrain height upon contact. Once the correct belief had been formed, it was retained even after the foot left the ground, showing that the controller retains past information due to its recurrent structure.
The transparent obstacle is a block made of clear acrylic plates that were not accurately perceived by the onboard sensors (Fig. 5B).


The robot therefore walked as if it were on flat ground until it made contact with the step, at which point it revised its estimate of the terrain profile upward and changed its gait accordingly.
In the next experiment, we simulated complete exteroception failure by physically covering the sensors, thus making them fully uninformative (Fig. 5, C and D). The robot was commanded to walk up and down two steps of stairs. With unobstructed sensors, the controller traversed the stairs gracefully, without any unintended contact with the stair risers, adjusting its footholds and body posture to step down the stairs softly. When the sensors were covered, the map had no information, and the controller received random noise as input. Under this condition, the robot made contact with the riser of the first stair, which could not be perceived in advance, revised its estimate of the terrain profile, adjusted its gait accordingly, and successfully climbed the stairs. On the way down, the blinded robot made a hard landing with its front feet but kept its balance and stepped down softly with its hind legs.
Last, we tested locomotion over an elevated slippery surface (Fig. 5E). After the robot stepped onto the slippery platform, it detected the low friction and adapted its behavior to step faster and keep its balance. The momentarily sliding feet violated the assumptions of the kinematic pose estimator, which, in turn, destabilized the estimated elevation map and rendered exteroception uninformative during this time. The controller seamlessly fell back on proprioception until the estimated elevation map stabilized and exteroception became informative again.

Movie 2. Hiking at Etzel.

Fig. 3. Exteroceptive representation and challenges. Our locomotion controller perceives the environment through height samples (red dots) from an elevation map (A). The controller is robust to many perception challenges commonly encountered in the field: missing map information due to sensing failure (B, C, and G) and misleading map information due to nonrigid terrain (D and E) and pose estimation drift (F).


Fig. 4. We compared the presented controller with a proprioceptive baseline. An experiment with steps of varying height shows that our controller can overcome notably higher obstacles than the baseline (A and B). Our method completes an obstacle course in less than half the time of the baseline and without requiring any human help (C and D). As seen in the graphs, our controller could follow the command more precisely. Note that the directional command plotted in (F) is scaled to 0.6 m/s. (E and F) Our controller can maintain double the linear velocity of the baseline and achieves a fivefold increase in turning speed. The arrows indicate when the robot reached the step (G and H).

DISCUSSION
We have presented a fast and robust quadrupedal locomotion controller for challenging terrain. The controller seamlessly integrates exteroceptive and proprioceptive input. Exteroceptive perception enables the robot to traverse the environment quickly and gracefully by anticipating the terrain and adapting its gait accordingly before contact is made. When exteroceptive perception is misleading, incomplete, or missing altogether, the controller smoothly transitions to proprioceptive locomotion. The controller remains robust under all conditions, including when the robot is effectively blind. The integration of exteroceptive and proprioceptive inputs is learned end to end and does not require any hand-coded rules or heuristics. The result is a rough-terrain legged locomotion controller that combines the speed and grace of vision-based locomotion with the high robustness of proprioception.


Fig. 5. Internal belief state inspection during perceptive failure using a learned belief decoder. Red dots indicate height samples given as input to the policy. Blue dots show the controller's internal estimate of the terrain profile. (A) After stepping on a soft obstacle that cannot support a foothold, the policy correctly revises its estimate of the terrain profile downward. (B) A transparent obstacle is correctly incorporated into the terrain profile after contact is made. (C) With operational sensors, the robot swiftly and gracefully climbs the stairs, with no spurious contacts. (D) When the robot is blinded by covering the sensors, the policy can no longer anticipate the terrain but remains robust and successfully traverses the stairs. (E) When stepping onto a slippery platform, the policy identifies low friction and compensates for the induced pose estimation drift. The graph shows a decoded friction coefficient.


Fig. 6. Overview of the training methods and deployment. We first train a teacher policy with access to privileged simulation data using RL. This teacher policy is then distilled into a student policy, which is trained to imitate the teacher's actions and to reconstruct the ground-truth environment state from noisy observations. We deploy the student policy zero-shot on real hardware using height samples from a robot-centric elevation map.

This combination of speed and high robustness has been validated through controlled experiments and extensive deployments in the wild, including an hour-long hiking route in the Alps that is rated "difficult" (47). The entire route was completed by the robot without human assistance (other than reattaching a detached shoe and swapping the batteries) in the recommended time for completion of this route by human hikers.
Our work expands the operational domain of legged robots and opens up previously unexplored frontiers in autonomous navigation. Navigation planners no longer need to identify ground type or to switch modes during autonomous operation. Our controller was used as the default controller in the Defense Advanced Research Projects Agency Subterranean Challenge missions of team Cerberus (50, 51), which won the first prize in the finals (52). In this challenge, our controller drove ANYmals to operate autonomously over extended periods of time in underground environments with rough terrain, obstructions, and degraded sensing in the presence of dust, fog, water, and smoke (53). Our controller played a crucial role because it enabled four ANYmals to explore over 1700 m in all three types of courses (tunnel, urban, and cave) without a single fall.

Possible extensions
Future work could explicitly use the uncertainty information in the belief state. Currently, the policy uses uncertainty only implicitly to estimate the terrain. For example, in front of a narrow cliff or a stepping stone, the elevation map does not provide sufficient information due to occlusion. Therefore, the policy assumes a continuous surface and, as a result, the robot might step off and fall. Explicitly estimating uncertainty may allow the policy to become more careful when exteroceptive input is unreliable, for example, using the robot's foot to probe the ground if the policy is unsure about it. In addition, our current implementation obtains perceptual information through an intermediate state in the form of an elevation map, rather than directly ingesting raw sensor data.


This has the advantage that the model is independent of the specific exteroceptive sensors. (We use LiDAR and stereo cameras in different deployments, with no retraining or fine-tuning.) However, the elevation map representation omits detail that may be present in the raw sensory input and may provide additional information concerning material and texture. Furthermore, our elevation map construction relies on a classical pose estimation module that is not trained jointly with the rest of the system. Appropriately folding the processing of raw sensory input into the network may further enhance the speed and robustness of the controller. In addition, an occlusion model could be learned, such that the policy understands that there is an occlusion behind the cliff and avoids stepping off it. Another limitation is the inability to perform locomotion tasks that would require maneuvers very different from normal walking, for example, recovering from a leg stuck in a narrow hole or climbing onto high ledges.

Fig. 7. Details of robust terrain perception components. (A) During student training, random noise is added to the height samples. The noise is sampled from a Gaussian distribution N(0, z^l ∈ ℝ^8), where each z_i^l controls a different noise component i per leg l. (B) We use multiple noise configurations z to simulate different operating conditions. "Zero noise" is applied during teacher training, whereas "nominal noise" represents normal mapping conditions during student training. "Large offset" noise simulates large map offsets due to pose estimation drift or deformable terrain surfaces. "Large noise" simulates a complete lack of terrain information due to occlusion or sensor failure. (C) The student policy belief encoder incorporates a recurrent core and an attentional gate that integrates the proprioceptive and exteroceptive modalities. The gate explicitly controls which aspects of the exteroceptive data should pass through. (D) The belief decoder has a gate for reconstructing the exteroceptive data. It is only used during training and for introspection into the belief state.

MATERIALS AND METHODS
Overview
We train a neural network policy in simulation and then perform zero-shot sim-to-real transfer. Our method consists of three stages, as illustrated in Fig. 6.
First, a teacher policy is trained with RL to follow a random target velocity over randomly generated terrain with random disturbances. The policy has access to privileged information such as noiseless terrain measurements, ground friction, and the disturbances that were introduced.
In the second stage, a student policy is trained to reproduce the teacher policy's actions without using this privileged information. The student policy constructs a belief state to capture unobserved information using a recurrent encoder and outputs an action based on this belief state. During training, we leverage two losses: a behavior cloning loss and a reconstruction loss. The behavior cloning loss aims to imitate the teacher policy. The reconstruction loss encourages the encoder to produce an informative internal representation.
Last, we transfer the learned student policy to the physical robot and deploy it in the real world with onboard sensors. The robot constructs an elevation map by integrating depth data from onboard sensors and samples height readings from the constructed elevation map to form the exteroceptive input to the policy. This exteroceptive input is combined with proprioceptive sensory data and is given to the neural network, which produces actuator commands.


Problem formulation
We formulate our control problem in discrete-time dynamics, where the environment is fully defined by the state s_t at time step t. The policy performs an action a_t and observes the environment via o_t, which comes from an observation model O(o_t | s_t, a_t). Then, the environment moves to the next state s_{t+1} with transition probability P(s_{t+1} | s_t, a_t) and returns a reward r_{t+1}.
When all states are observable such that o_t = s_t, this can be considered a Markov decision process (MDP). When there is unobservable information, however, such as external forces or full terrain information in our case, the dynamics are modeled as a partially observable Markov decision process (POMDP).
The RL objective is to find a policy π* that maximizes the expected discounted reward over the future trajectory, such that

$$\pi^{*} = \operatorname*{argmax}_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$$

A number of RL algorithms have been developed to solve fully observable MDPs and are readily available to be used for training. However, the case of POMDPs is more challenging because the state is not fully observable. This is often overcome by constructing a belief state b_t from a history of observations {o_0, ⋯, o_t} in an attempt to capture the full state. In deep RL, this is frequently done by stacking a sequence of previous observations (54) or by using architectures that can compress past information, such as a recurrent neural network (RNN) (55, 56) or a temporal convolutional network (4, 57).
Training a complex neural network policy that handles sequential data naively from scratch can be time-consuming (4). Therefore, we use privileged learning (45), in which we first train a teacher policy with privileged information and then distill the teacher policy into a student policy via supervised learning.
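To make this objective concrete, the following minimal Python sketch (an illustration added here, not code from the article) evaluates the discounted return of a single recorded reward sequence:

import numpy as np

def discounted_return(rewards, gamma=0.99):
    # sum_{t=0}^{T-1} gamma^t * r_t for one finite rollout
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

# A rollout with constant reward 1.0; as T grows, the return
# approaches 1 / (1 - gamma) = 100.
print(discounted_return([1.0] * 500))

In practice, the expectation in the objective is estimated by averaging such returns over many simulated rollouts.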


Training environment
We use RaiSim (58) as our simulator to build the training environment. There, we simulate multiple ANYmal-C robots on randomly generated rough terrain in parallel with an integrated actuator model (3) to close the reality gap.
Terrain
We define parameterized terrain as shown in Fig. 6 (1). The terrain is modeled as a height map; further details are provided in section S4.
In addition to terrains composed of a variety of slopes and steps, we modeled four different types of stairs in the training environment: standard, open, ledged, and random. We use boxes to form the stairs, because stair risers modeled by a height map are not perfectly vertical; we observed that the policy exploited these nonvertical edges in simulation, resulting in poor sim-to-real transfer.
Domain randomization
We randomize the masses of the robot's body and legs, the initial joint position and velocity, and the initial body orientation and velocity in each episode. In addition, external force and torque are applied to the body of the robot, and the friction coefficients of the feet are occasionally set to a low value to introduce slippage.
Termination
We terminate a training episode and start a new one when the robot reaches an undesirable state. Termination criteria are body collision with the ground, large body tilt, and exceeding the joint torque limit of the actuators. These criteria help shape the motion and obtain constraint-satisfying behaviors.

Teacher policy training
In the first stage of training, we aim to find an optimal reference control policy that has access to perfect, privileged information and enables ANYmal to follow a desired command velocity over randomly generated terrain. The desired command is generated randomly as a vector v_des ∈ ℝ³ = (v_x, v_y, ω), where v_x, v_y represent the longitudinal and lateral velocity, and ω represents the yaw velocity, all in the robot's body frame.
We used proximal policy optimization (PPO) (59) to train the teacher policy. The teacher is modeled as a Gaussian policy, a_t ∼ N(μ_θ(o_t = s_t), σI), where μ_θ is implemented by a multilayer perceptron (MLP) parameterized by θ, and σ represents the variance for each action.
Observation and action
The teacher observation is defined as o_t^teacher = (o_t^p, o_t^e, s_t^p), where o_t^p refers to the proprioceptive observation, o_t^e refers to the exteroceptive observation, and s_t^p refers to the privileged state. o_t^p contains the body velocity, orientation, joint position and velocity history, action history, and each leg's phase. o_t^e is a vector of height samples around each foot with five different radii. The privileged state s_t^p includes contact states, contact forces, contact normals, friction coefficient, thigh and shank contact states, external forces and torques applied to the body, and swing phase duration.
Our action space is inspired by central pattern generators (4). Each leg l = {1, 2, 3, 4} keeps a phase variable φ_l and defines a nominal trajectory based on the phase. The nominal trajectory is a stepping motion of the foot tip, and we calculate the nominal joint target q_i(φ_l) for each joint actuator i = {1, ⋯, 12} using inverse kinematics. The action from the policy is the phase difference Δφ_l and the residual joint position target Δq_i. More details of the observation and action space are in section S5.
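The following simplified Python sketch illustrates this action parameterization (the function and variable names are ours, and the nominal foot trajectory with its inverse kinematics is stubbed out as a placeholder nominal_ik):

import numpy as np

def apply_action(phases, delta_phases, residual_q, nominal_ik):
    # phases:       (4,) current phase of each leg
    # delta_phases: (4,) phase differences output by the policy
    # residual_q:   (12,) residual joint position targets from the policy
    # nominal_ik:   placeholder mapping one leg's phase to its 3 nominal joint targets
    phases = (phases + delta_phases) % (2.0 * np.pi)
    q_nominal = np.concatenate([nominal_ik(p) for p in phases])  # (12,)
    q_target = q_nominal + residual_q  # final joint position targets
    return phases, q_target

The policy thus shapes both the timing of each step (through the phase differences) and its shape (through the residuals).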


Policy architecture
We model the teacher policy π as an MLP. It consists of three MLP components: the exteroceptive encoder, the privileged encoder, and the main network, as shown in Fig. 6. The exteroceptive encoder g_e receives o_t^e and outputs a smaller latent representation l_t^e:

$$l_t^{e} = g_e(o_t^{e})$$

The privileged encoder g_p receives the privileged state s_t^p and outputs a latent representation l_t^priv:

$$l_t^{\mathrm{priv}} = g_p(s_t^{p})$$

These encoders compress each input to more compact representations and facilitate reuse of some of the teacher policy components by the student policy. More details on each layer are in section S6.
Rewards
We define a positive reward for following the command velocity and a negative reward for violating some imposed constraints. The command-following reward is defined as follows:

$$r_{\mathrm{command}} = \begin{cases} 1.0, & \text{if } v_{\mathrm{des}} \cdot v > |v_{\mathrm{des}}| \\ \exp\!\left(-\left(v_{\mathrm{des}} \cdot v - |v_{\mathrm{des}}|\right)^{2}\right), & \text{otherwise} \end{cases} \qquad (1)$$

where v_des ∈ ℝ² is the desired horizontal velocity, and v ∈ ℝ² is the current horizontal body velocity with respect to the body frame. The same reward is applied to the yaw command as well. We penalize the velocity component orthogonal to the desired velocity as well as the body velocity around roll, pitch, and yaw. In addition, we use shaping rewards for body orientation, joint torque, joint velocity, joint acceleration, and foot slippage as well as shank and knee collision.
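Transcribed into Python, Eq. 1 reads as follows (a direct transcription of the equation above; v_des and v are 2D NumPy vectors, and the same function applies to the yaw rate):

import numpy as np

def command_reward(v_des, v):
    # Eq. 1: full reward once the projected velocity exceeds the
    # commanded speed, squared-exponential falloff otherwise.
    projection = np.dot(v_des, v)
    target = np.linalg.norm(v_des)
    if projection > target:
        return 1.0
    return float(np.exp(-(projection - target) ** 2))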


The body orientation reward was used to avoid strange postures of the body. The joint-related reward terms were used to avoid overly aggressive motion. The foot slippage and collision reward terms were used to discourage slipping and collisions. We tuned the reward terms by looking at the policy's behavior in simulation. In addition to the traversal performance, we checked the smoothness of the locomotion. All reward terms are specified in section S7.
Curriculum
We use two curricula to ramp up the difficulty as the policy's performance improves. One curriculum adjusts the terrain difficulty using an adaptive method (4), and the other changes elements such as reward or applied disturbances using a logistic function (3).
For the terrain curriculum, a particle filter updates the terrain parameters such that they remain challenging but achievable at any point during policy training (4). The second curriculum multiplies the magnitude of domain randomization and some reward terms (joint velocity, joint acceleration, orientation, slip, and thigh and shank contact) by a factor that is monotonically increasing and asymptotically trending to 1:

$$c_{k+1} = (c_k)^{d}$$

where c_k is the curriculum factor at the k-th iteration, and 0 < d < 1 is the convergence rate.
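For intuition, because 0 < c_k < 1 and 0 < d < 1, each update increases the factor monotonically toward 1; a short numerical check with illustrative values:

c, d = 0.3, 0.997  # illustrative initial factor and convergence rate
for _ in range(5000):
    c = c ** d  # monotonically increasing toward 1 for 0 < c < 1
print(c)  # ≈ 0.9999996, i.e., essentially full difficulty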
Student policy training
After we train a teacher policy that can traverse various terrain with the help of privileged information, we distill it into a student policy that only has access to information that is available on the real robot. We use the same training environment as for the teacher policy but add additional noise to the student height sample observation: o_t^student = (o_t^p, n(o_t^e)), where n(o_t^e) is a noise model applied to the height sample input. The noise model simulates different failure cases of exteroception frequently encountered during field deployment and is detailed below.
When there is large noise in the exteroception, the terrain becomes effectively unobservable; thus, the dynamics are considered a POMDP. In addition, the privileged states are not observable because there are no sensors to measure them directly. Therefore, the policy needs to consider the sequential correlation of observations to estimate the unobservable states. We propose to use a recurrent belief state encoder that combines sequences of both exteroception and proprioception to estimate the unobservable states as a belief state.
The student policy consists of a recurrent belief state encoder and an MLP, as shown in Fig. 6 (2). We denote the hidden state of the recurrent network by h_t. The belief state encoder takes o_t^student and h_t as input and outputs a latent vector b_t, which we refer to as the belief state. The goal is to match the belief state b_t with the feature vector (l_t^e, l_t^priv) of the teacher policy that encodes all locomotion-relevant information. We then pass o_t^p and b_t to the MLP, which computes the output action. The MLP structure remains the same as for the teacher policy, such that we can reuse the learned weights of the teacher policy to initialize the student network and speed up training.
Training is performed in a supervised fashion by minimizing two losses: a behavior cloning loss and a reconstruction loss. The behavior cloning loss is defined as the squared distance between the student action and the teacher action given the same state and command. The reconstruction loss is the squared distance between the noiseless height sample and privileged information (o_t^e, s_t^p) and their reconstruction from the belief state. We generate samples by rolling out the student policy to increase robustness (60, 61).
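A minimal PyTorch sketch of these two losses (tensor names and the equal weighting are illustrative assumptions, not the article's exact implementation):

import torch

def student_loss(student_action, teacher_action, reconstruction, target):
    # Behavior cloning: squared distance between student and teacher
    # actions for the same state and command.
    bc_loss = ((student_action - teacher_action) ** 2).sum(dim=-1).mean()
    # Reconstruction: squared distance between the decoded belief and
    # the noiseless height samples plus privileged state (o_t^e, s_t^p).
    recon_loss = ((reconstruction - target) ** 2).sum(dim=-1).mean()
    return bc_loss + recon_loss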
Height sample randomization
During student training, we inject random noise into the height samples using a parameterized noise model n(õ_t^e | o_t^e, z), z ∈ ℝ^(8×4). We apply two different types of measurement noise when sampling the heights, as shown in Fig. 7A:
1) Shifting scan points laterally.
2) Perturbing the height values.
Each noise value is sampled from a Gaussian distribution, and the noise parameter z defines the variance. Both types of noise are applied in three different scopes, all with their own noise variance: per scan point, per foot, and per episode. The noise values per scan point and per foot are resampled at every time step, while the episodic noise remains constant for all scan points.
In addition, we define three mapping conditions with associated noise parameters z to simulate changing map quality and error sources, as shown in Fig. 7B:
1) Nominal noise assuming good map quality during regular operation.
2) Large offsets through high per-foot noise to simulate map offsets due to pose estimation drift or deformable terrain.
3) Large noise magnitude for each scan point to simulate a complete lack of terrain information due to occlusion or mapping failure.
These three mapping conditions are selected at the beginning of each training episode in a ratio of 60, 30, and 10%.
Last, we divide each training terrain into cells and add an additional offset to the height sample, depending on which cell it was sampled from. This simulates transitions between areas with different terrain characteristics, such as vegetation and deep snow. The parameter vector z is also part of a learning curriculum, and its magnitude increases linearly with training duration. The height sample representation is specified in more detail in section S8.
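A simplified sketch of the three noise scopes for the vertical component (array shapes and parameter names are ours; the full model additionally shifts scan points laterally and draws its variances from z):

import numpy as np

def perturb_heights(heights, sigma_point, sigma_foot, episode_offset, rng):
    # heights:        (4, n) height samples, one row per foot
    # sigma_point:    std of per-scan-point noise, resampled every step
    # sigma_foot:     std of per-foot noise, resampled every step
    # episode_offset: (4,) per-foot offsets held constant within an episode
    per_point = rng.normal(0.0, sigma_point, size=heights.shape)
    per_foot = rng.normal(0.0, sigma_foot, size=(heights.shape[0], 1))
    return heights + per_point + per_foot + episode_offset[:, None]

# Example: nominal-style noise on 10 samples per foot.
rng = np.random.default_rng(0)
noisy = perturb_heights(np.zeros((4, 10)), 0.01, 0.05, rng.normal(0.0, 0.1, 4), rng)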
Belief state encoder
The recurrent belief state encoder encodes states that are not directly observable. To integrate proprioceptive and exteroceptive data, we introduce a gated encoder as shown in Fig. 7C, inspired by gated RNN models (62, 63) and multimodal information fusion (64–66).
The encoder learns an adaptive gating factor that controls how much exteroceptive information should pass through. First, proprioception o_t^p, exteroceptive features from noisy observations l_t^e = g_e(õ_t^e), and the hidden state h_t are encoded by the RNN module into the intermediate belief state b_t'. Then, the attention vector α is computed from b_t'. It controls how much exteroceptive information enters the final belief state b_t:

$$b_t^{\prime},\, h_{t+1} = \mathrm{RNN}(o_t^{p}, l_t^{e}, h_t)$$

$$\alpha = \sigma(g_a(b_t^{\prime}))$$

$$b_t = g_b(b_t^{\prime}) + l_t^{e} \odot \alpha$$

Here, g_a and g_b are fully connected neural networks, and σ(⋅) is the sigmoid function.
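The gated update above can be sketched in PyTorch as follows (layer sizes are illustrative, the exact architecture is given in section S6, and here the GRU output serves as both the intermediate belief b_t' and the next hidden state h_{t+1}):

import torch
import torch.nn as nn

class GatedBeliefEncoder(nn.Module):
    def __init__(self, n_prop, n_ext, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(n_prop + n_ext, hidden)
        self.g_a = nn.Linear(hidden, n_ext)  # produces the attention gate alpha
        self.g_b = nn.Linear(hidden, n_ext)  # produces the gated belief

    def forward(self, o_prop, l_ext, h):
        # b'_t, h_{t+1} = RNN(o^p_t, l^e_t, h_t)
        h_next = self.rnn(torch.cat([o_prop, l_ext], dim=-1), h)
        # alpha = sigmoid(g_a(b'_t)) controls how much exteroception passes
        alpha = torch.sigmoid(self.g_a(h_next))
        # b_t = g_b(b'_t) + l^e_t ⊙ alpha
        b = self.g_b(h_next) + l_ext * alpha
        return b, h_next

The sigmoid gate lets the network suppress the exteroceptive features entirely when they are uninformative, which is what produces the graceful fallback to proprioception observed in the experiments.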


The same gate is used in the decoder, where it is used to reconstruct the privileged information and the height samples (Fig. 7D). This is used to calculate a reconstruction loss that encourages the belief state to capture veridical information about the environment.
We use the GRU (62) as our RNN architecture. The evaluation of the effectiveness of the gate structure is presented in section S9.

Deployment
We deployed our controller on the ANYmal-C robot with two different sensor configurations, using either two Robosense Bpearl (67) dome LiDAR sensors or four Intel RealSense D435 depth cameras (68). We trained our policy in PyTorch (69) and deployed it on the robot zero-shot, without any fine-tuning. We build a robot-centric 2.5D elevation map at 20 Hz by estimating the robot's pose and registering the point-cloud readings from the sensors accordingly. The policy runs at 50 Hz and samples the heights from the latest elevation map, filling in a randomly sampled value if no map information is available at a query location.
We developed an elevation mapping pipeline for fast terrain mapping on a graphics processing unit to parallelize point-cloud processing. We follow a similar approach to that used by Fankhauser et al. (17) to update the map in a Kalman filter fashion and additionally perform drift compensation and ray casting to obtain a more consistent map. This fast mapping implementation was crucial to maintain fast processing rates and keep up with the fast locomotion speeds achieved by our controller.
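The height-query step can be sketched as follows (grid indexing, the boundary clipping, and the fill distribution are illustrative assumptions; the actual GPU mapping pipeline differs):

import numpy as np

def sample_heights(elevation_map, query_xy, resolution, origin_xy, rng):
    # elevation_map: (H, W) grid of heights, NaN where no data is available
    # query_xy:      (n, 2) query positions in the map frame
    idx = np.floor((query_xy - origin_xy) / resolution).astype(int)
    idx = np.clip(idx, 0, np.array(elevation_map.shape) - 1)
    heights = elevation_map[idx[:, 0], idx[:, 1]].copy()
    missing = np.isnan(heights)
    # Fill unobserved cells with a randomly sampled value, mirroring the
    # deployment behavior when no map information exists at a query point.
    heights[missing] = rng.normal(0.0, 0.1, size=int(missing.sum()))
    return heights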
Sections S1 to S9 27. O. A. Villarreal-Magaña, V. Barasuol, M. Camurri, M. Focchi, L. Franceschi, M. Pontil,
Figs. S1 and S2 D. G. Caldwell, C. Semini, Fast and continuous foothold adaptation for dynamic
Tables S1 to S5 locomotion through CNNs. IEEE Robot. Autom. Lett. 4, 2140–2147 (2019).
Movies S1 to S4 28. Boston Dynamics, Atlas | partners in parkour (2021); https://youtu.be/tF4DML7FIWk
[online; accessed September 2021].
29. X. B. Peng, G. Berseth, M. Van de Panne, Terrain-adaptive locomotion skills using deep
REFERENCES AND NOTES reinforcement learning. ACM Trans. Graph. 35, 1–12 (2016).
1. M. Raibert, K. Blankespoor, G. Nelson, R. Playter, BigDog, the rough-terrain quadruped robot. IFAC Proceedings Volumes 41, 10822–10825 (2008).
2. B. Katz, J. Di Carlo, S. Kim, Mini Cheetah: A platform for pushing the limits of dynamic quadruped control, in 2019 International Conference on Robotics and Automation (ICRA) (IEEE, 2019), pp. 6295–6301.
3. J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, M. Hutter, Learning agile and dynamic motor skills for legged robots. Sci. Robot. 4, eaau5872 (2019).
4. J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, M. Hutter, Learning quadrupedal locomotion over challenging terrain. Sci. Robot. 5, eabc5986 (2020).
5. H.-W. Park, P. M. Wensing, S. Kim, Jumping over obstacles with MIT Cheetah 2. Robot. Auton. Syst. 136, 103703 (2021).
6. Boston Dynamics, Spot (2021); www.bostondynamics.com/spot [online; accessed March 2021].
7. C. Gehring, P. Fankhauser, L. Isler, R. Diethelm, S. Bachmann, M. Potz, L. Gerstenberg, M. Hutter, ANYmal in the field: Solving industrial inspection of an offshore HVDC platform with a quadrupedal robot, in Field and Service Robotics (Springer, 2021), pp. 247–260.
8. Agility Robotics, Robots (2021); www.agilityrobotics.com/robots [online; accessed June 2021].
9. Unitree Robotics, A1 (2021); www.unitree.com/products/a1/ [online; accessed March 2021].
10. Ghost Robotics, Vision 60 (2021); www.ghostrobotics.io/ [online; accessed June 2021].
11. J. S. Matthis, J. L. Yates, M. M. Hayhoe, Gaze and the control of foot placement when walking in natural terrain. Curr. Biol. 28, 1224–1233 (2018).
12. ANYbotics, ANYmal (2021); www.anybotics.com/anymal-autonomous-legged-robot/ [online; accessed June 2021].
13. P. Fankhauser, M. Bloesch, D. Rodriguez, R. Kaestner, M. Hutter, R. Siegwart, Kinect v2 for mobile robot navigation: Evaluation and modeling, in 2015 International Conference on Advanced Robotics (ICAR) (IEEE, 2015), pp. 388–394.
14. C. Ye, J. Borenstein, A new terrain mapping method for mobile robots obstacle negotiation, in Unmanned Ground Vehicle Technology V (International Society for Optics and Photonics, 2003), pp. 52–62.
17. P. Fankhauser, M. Bloesch, M. Hutter, Probabilistic terrain mapping for mobile robots with uncertain localization. IEEE Robot. Autom. Lett. 3, 3019–3026 (2018).
18. …terrain locomotion, in 2010 IEEE International Conference on Robotics and Automation (IEEE, 2010), pp. 3589–3595.
19. P. D. Neuhaus, J. E. Pratt, M. J. Johnson, Comprehensive summary of the Institute for Human and Machine Cognition's experience with LittleDog. Int. J. Robot. Res. 30, 216–235 (2011).
20. J. Z. Kolter, Y. Kim, A. Y. Ng, Stereo vision and terrain modeling for quadruped robots, in 2009 IEEE International Conference on Robotics and Automation (IEEE, 2009), pp. 1557–1564.
21. I. Havoutis, J. Ortiz, S. Bazeille, V. Barasuol, C. Semini, D. G. Caldwell, Onboard perception-based trotting and crawling with the hydraulic quadruped robot (HyQ), in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IEEE, 2013), pp. 6052–6057.
22. C. Mastalli, M. Focchi, I. Havoutis, A. Radulescu, S. Calinon, J. Buchli, D. G. Caldwell, C. Semini, Trajectory and foothold optimization using low-dimensional models for rough terrain locomotion, in 2017 IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2017), pp. 1096–1103.
23. D. Belter, P. Łabęcki, P. Skrzypczyński, Adaptive motion planning for autonomous rough terrain traversal with a walking robot. J. Field Robot. 33, 337–370 (2016).
24. P. Fankhauser, M. Bjelonic, C. D. Bellicoso, T. Miki, M. Hutter, Robust rough-terrain locomotion with a quadrupedal robot, in 2018 IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2018), pp. 5761–5768.
25. F. Jenelten, T. Miki, A. E. Vijayan, M. Bjelonic, M. Hutter, Perceptive locomotion in rough terrain–online foothold optimization. IEEE Robot. Autom. Lett. 5, 5370–5376 (2020).
26. D. Kim, D. Carballo, J. Di Carlo, B. Katz, G. Bledt, B. Lim, S. Kim, Vision aided dynamic exploration of unstructured terrain with a small-scale quadruped robot, in 2020 IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2020), pp. 2464–2470.
27. O. A. Villarreal-Magaña, V. Barasuol, M. Camurri, M. Focchi, L. Franceschi, M. Pontil, D. G. Caldwell, C. Semini, Fast and continuous foothold adaptation for dynamic locomotion through CNNs. IEEE Robot. Autom. Lett. 4, 2140–2147 (2019).
28. Boston Dynamics, Atlas | partners in parkour (2021); https://youtu.be/tF4DML7FIWk [online; accessed September 2021].
29. X. B. Peng, G. Berseth, M. Van de Panne, Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Trans. Graph. 35, 1–12 (2016).
30. X. B. Peng, G. Berseth, K. Yin, M. Van De Panne, DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph. 36, 1–13 (2017).
31. X. B. Peng, P. Abbeel, S. Levine, M. van de Panne, DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph. 37, 143:1–143:14 (2018).
32. Z. Xie, H. Y. Ling, N. H. Kim, M. van de Panne, ALLSTEPS: Curriculum-driven learning of stepping stone skills, in Computer Graphics Forum (Wiley Online Library, 2020), pp. 213–224.
33. V. Tsounis, M. Alge, J. Lee, F. Farshidian, M. Hutter, DeepGait: Planning and control of quadrupedal gaits using deep reinforcement learning. IEEE Robot. Autom. Lett. 5, 3699–3706 (2020).
34. J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, V. Vanhoucke, Sim-to-real: Learning agile locomotion for quadruped robots, in Robotics: Science and Systems, Pittsburgh, PA, USA, 26 to 30 June 2018 (2018).
35. X. B. Peng, E. Coumans, T. Zhang, T.-W. E. Lee, J. Tan, S. Levine, Learning agile robotic locomotion skills by imitating animals, in Robotics: Science and Systems (2020).
36. Y. Yang, K. Caluwaerts, A. Iscen, T. Zhang, J. Tan, V. Sindhwani, Data efficient reinforcement learning for legged robots, in Conference on Robot Learning (PMLR, 2020), pp. 1–10.
37. Z. Xie, P. Clary, J. Dao, P. Morais, J. Hurst, M. van de Panne, Learning locomotion skills for Cassie: Iterative design and sim-to-real, in Proceedings of the Conference on Robot Learning, L. P. Kaelbling, D. Kragic, K. Sugiura, Eds. (PMLR, 2020), pp. 317–329.
38. J. Siekmann, K. Green, J. Warila, A. Fern, J. Hurst, Blind bipedal stair traversal via sim-to-real reinforcement learning, in Robotics: Science and Systems (2021).
39. A. Kumar, Z. Fu, D. Pathak, J. Malik, RMA: Rapid motor adaptation for legged robots, in Robotics: Science and Systems (2021).
40. C. Yang, K. Yuan, Q. Zhu, W. Yu, Z. Li, Multi-expert learning of adaptive legged locomotion. Sci. Robot. 5, eabb2174 (2020).
41. J. Lee, J. Hwangbo, M. Hutter, Robust recovery controller for a quadrupedal robot using deep reinforcement learning. arXiv:1901.07517 (2019).

42. S. Gangapurwala, M. Geisert, R. Orsolino, M. Fallon, I. Havoutis, RLOC: Terrain-aware legged locomotion using reinforcement learning and optimal control. arXiv:2012.03094 (2020).
43. M. Focchi, R. Orsolino, M. Camurri, V. Barasuol, C. Mastalli, D. G. Caldwell, C. Semini, Heuristic planning for rough terrain locomotion in presence of external disturbances and variable perception quality, in Advances in Robotics Research: From Lab to Market (Springer, 2020), pp. 165–209.
44. Boston Dynamics, Spot user guide release 2.0 version A (2021); www.generationrobots.com/media/spot-boston-dynamics/spot-user-guide-r2.0-va.pdf [online; accessed June 2021].
45. D. Chen, B. Zhou, V. Koltun, P. Krähenbühl, Learning by cheating, in Conference on Robot Learning (PMLR, 2020), pp. 66–75.
46. M. Bloesch, M. Hutter, M. A. Hoepflinger, S. Leutenegger, C. Gehring, C. D. Remy, R. Siegwart, State estimation for legged robots-consistent fusion of leg kinematics and IMU. Robotics 17, 17–24 (2013).
47. Komoot, Etzel Kulm loop hike (2021); https://bit.ly/35bjfyE [online; accessed June 2021].
48. Komoot, Komoot help guides (2021); https://d21buns5ku92am.cloudfront.net/67683/documents/40488-Komoot [online; accessed December 2021].
49. R. C. Coulter, Implementation of the pure pursuit path tracking algorithm, Tech. Rep. (Carnegie Mellon University Robotics Institute, 1992).
50. M. Tranzatto, F. Mascarich, L. Bernreiter, C. Godinho, M. Camurri, S. M. K. Khattak, T. Dang, V. Reijgwart, J. Loeje, D. Wisth, S. Zimmermann, H. Nguyen, M. Fehr, L. Solanka, R. Buchanan, M. Bjelonic, N. Khedekar, M. Valceschini, F. Jenelten, M. Dharmadhikari, T. Homberger, P. De Petris, L. Wellhausen, M. Kulkarni, T. Miki, S. Hirsch, M. Montenegro, C. Papachristos, F. Tresoldi, J. Carius, G. Valsecchi, J. Lee, K. Meyer, X. Wu, J. Nieto, A. Smith, M. Hutter, R. Y. Siegwart, M. Mueller, M. Fallon, K. Alexis, CERBERUS: Autonomous legged and aerial robotic exploration in the tunnel and urban circuits of the DARPA Subterranean Challenge. J. Field Robot. (2021).
51. CERBERUS, Team CERBERUS (2021); www.subt-cerberus.org/ [online; accessed June 2021].
52. DARPA, DARPA Subterranean Challenge competition results finals (2021); www.subtchallenge.com/results.html [online; accessed November 2021].
53. DARPA, DARPA Subterranean Challenge competition rules final event (2021); www.subtchallenge.com [online; accessed June 2021].
54. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning, in Advances in Neural Information Processing Systems, Deep Learning Workshop (2013).
55. P. Zhu, X. Li, P. Poupart, G. Miao, On improving deep reinforcement learning for POMDPs. arXiv:1704.07978 (2017).
56. O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, D. Silver, Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).
57. S. Bai, J. Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271 (2018).
58. J. Hwangbo, J. Lee, M. Hutter, Per-contact iteration method for solving contact dynamics. IEEE Robot. Autom. Lett. 3, 895–902 (2018).
59. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms. arXiv:1707.06347 (2017).
60. S. Ross, G. Gordon, J. D. Bagnell, A reduction of imitation learning and structured prediction to no-regret online learning, in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (JMLR Workshop and Conference Proceedings, 2011), pp. 627–635.
61. W. M. Czarnecki, R. Pascanu, S. Osindero, S. Jayakumar, G. Swirszcz, M. Jaderberg, Distilling policy distillation, in Proceedings of Machine Learning Research, K. Chaudhuri, M. Sugiyama, Eds. (PMLR, 2019), pp. 1331–1340.
62. K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, in Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014), pp. 1724–1734.
63. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
64. T. Anzai, K. Takahashi, Deep gated multi-modal learning: In-hand object pose changes estimation using tactile and image data, in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE, 2020), pp. 9361–9368.
65. J. Kim, J. Koh, Y. Kim, J. Choi, Y. Hwang, J. W. Choi, Robust deep multi-modal learning based on gated information fusion network, in Asian Conference on Computer Vision (Springer, 2019), pp. 90–106.
66. J. Arevalo, T. Solorio, M. Montes-y Gómez, F. A. González, Gated multimodal units for information fusion, in ICLR Workshop (2017).
67. Robosense, RS-Bpearl (April 2021); www.robosense.ai/en/rslidar/RS-Bpearl.
68. Intel RealSense (April 2021); www.intelrealsense.com/.
69. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett, Eds. (Curran Associates Inc., 2019), pp. 8024–8035.

Funding: The project was funded, in part, by the Intel Network on Intelligent Systems, the Swiss National Science Foundation (SNF) through the National Centre of Competence in Research Robotics and project no. 188596, and the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement nos. 852044, 780883, and 101016970. This work has been conducted as part of ANYmal Research, a community to advance legged robotics. Author contributions: T.M. formulated the main idea of combining inputs from multiple modalities. J.L. and J.H. designed and tested the initial setup. T.M. developed the software and trained the controller. T.M. and L.W. set up the perception pipeline on the robot. T.M. conducted most of the indoor experiments. T.M., J.L., and L.W. conducted the outdoor experiments. All authors refined ideas, contributed to the experiment design, analyzed the data, and wrote the paper. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper or the Supplementary Materials.

Submitted 5 July 2021
Accepted 20 December 2021
Published 19 January 2022
10.1126/scirobotics.abk2822
