DiffuseLoco: Real-Time Legged Locomotion Control with Diffusion from Offline Datasets
arXiv:2404.19264v1 [cs.RO] 30 Apr 2024
Fig. 1. Snapshots of (top) a quadrupedal robot trotting with our diffusion-based real-time locomotion control policy via onboard computing; (middle) our multi-skill offline-learned policy performing a sequence of challenging skills, including hopping, bipedal locomotion, and pacing, with smooth skill transitioning; (bottom) a bipedal robot controlled by our multi-skill policy transitioning from walking to running stably. We present DiffuseLoco, a scalable framework that leverages diffusion models to learn legged locomotion control exclusively from offline datasets. DiffuseLoco learns a state-of-the-art policy that is able to perform a diverse set of agile locomotion skills with a single policy, exhibits robustness in the real world, and is versatile to various sources of offline data. We encourage viewers to watch the supplementary videos of these runs.
Abstract—This work introduces DiffuseLoco, a framework for training multi-skill diffusion-based policies for dynamic legged locomotion from offline datasets, enabling real-time control of diverse skills on robots in the real world. Offline learning at scale has led to breakthroughs in computer vision, natural language processing, and robotic manipulation domains. However, scaling up learning for legged robot locomotion, especially with multiple skills in a single policy, presents significant challenges for prior online reinforcement learning methods. To address this challenge, we propose a novel, scalable framework that leverages diffusion models to directly learn from offline multimodal datasets with a diverse set of locomotion skills. With design choices tailored for real-time control in dynamical systems, including receding horizon control and delayed inputs, DiffuseLoco is capable of reproducing multimodality in performing various locomotion skills, transferring zero-shot to real quadrupedal robots, and being deployed on edge computing devices. Furthermore, DiffuseLoco demonstrates free transitions between skills and robustness against environmental variations. Through extensive benchmarking in real-world experiments, DiffuseLoco exhibits better stability and velocity tracking performance compared to prior reinforcement learning and non-diffusion-based behavior cloning baselines. The design choices are validated via comprehensive ablation studies. This work opens new possibilities for scaling up learning-based legged locomotion controllers through the scaling of large, expressive models and diverse offline datasets.

I. INTRODUCTION

Learning from large-scale offline datasets has led to breakthroughs in a large variety of domains, such as computer vision and natural language processing, where scaling up both the size of models and datasets leads to improved performance and generalization [32, 74]. This has led to the development of powerful generative models, like diffusion models, which are able to model complex multi-modal data distributions [64, 67] and generate high-quality images and videos.

In robotics, learning from diverse offline datasets has also been shown to be an effective and scalable approach for developing more versatile policies for domains such as robotic manipulation [3, 4] and autonomous driving [8, 13, 54].

However, these domains typically involve agents that have
low-dimensional action spaces (e.g., end-effector trajectories), with low re-planning frequency, on inherently stable systems (e.g., robot arms or cars). For dynamical systems featuring higher degrees of freedom and more complex dynamics, such as legged robots, data-driven approaches have largely focused on online reinforcement learning (RL) techniques [27, 43, 56]. Unlike offline learning, online RL is difficult to scale to both large models and datasets due to the requirement of online rollouts. Most prior works have focused on dynamic motions with specialized single-skill models, and scaling towards a single model that can reproduce a diverse set of challenging locomotion skills remains an open problem.

To this end, we present a novel approach that emphasizes learning agile legged locomotion skills at scale by solving the two aforementioned challenges: offline learning from various data sources and the ability to learn a set of diverse skills. We propose DiffuseLoco, a framework that leverages expressive diffusion models to effectively learn the multi-modalities that exist in a diverse offline dataset without manual skill labeling. Once trained, our controllers can execute robust locomotion skills on real-world legged robots for real-time control.

The primary contributions of this work include:
1) A state-of-the-art multi-skill controller, leveraging expressive diffusion models, that learns agile bipedal walking and various quadrupedal locomotion skills within a single policy and can be deployed zero-shot on real-world quadrupedal robots.
2) A novel framework that directly learns from a diverse offline dataset for real-time control of legged robots, showing the benefits and potential of offline learning at scale for locomotion skills in a practical real-world setting.
3) Extensive real-world validation showing higher stability and lower velocity tracking errors compared to baselines, while demonstrating multi-modal behaviors with skill transitioning and robustness on terrains with varying ground friction.

This work opens up the possibility of leveraging large-scale learning to create diverse and agile multi-skill controllers for legged locomotion from offline datasets. For the first time, we show that it is feasible to zero-shot transfer such a diverse locomotion policy learned from a static dataset to real-world applications. This approach offers a scalable and versatile framework for learning-based control, allowing for continuous expansion of the dataset and integration of diverse skills from various data sources. Codebase and checkpoints will be open-sourced upon the acceptance of this work.

II. RELATED WORK

Our proposed framework leverages diffusion models for multi-skill legged locomotion control. In this section, we review the most closely related works on learning-based legged locomotion algorithms and applications of diffusion models in robotics.

A. Multi-skill Reinforcement Learning in Locomotion

Recent advances in model-free RL have demonstrated promising results in developing agile locomotion skills for legged robots in the real world [6, 14, 40, 47, 56, 66]. Among them, prior works have demonstrated impressive performance on highly agile skills such as jumping, running, and sharp turning on bipedal robots [44, 91], and walking on two feet with quadrupedal robots [20, 42, 71], which requires a high degree of agility and robustness with a floating-base robot. However, the majority of these agile locomotion skills are trained with single-skill RL and remain unscalable to large-scale learning of multiple locomotion skills.

A simple and natural idea for learning multi-skill locomotion is to train separate policies for each skill and then coordinate them through extra high-level planning [23, 26, 33, 88] or distill them into a small-scale model via imitation learning such as DAgger [93]. Due to the inherent coordination difficulty and the requirement of online distillation, these methods remain unscalable to an increasing number of skills.

In comparison, learning multi-skill policies directly from scratch typically involves parameterized motions [62, 70] with limited sets of applicable motions, and, more popularly, motion imitation methods through either reward shaping [11, 34, 92] or adversarial imitation learning [16, 39, 83, 89]. However, this approach still faces challenges such as needing extra model-based trajectory optimization or well-trained expert policies to acquire references for agile motions, and the limited expressiveness of the simple models used in online RL frameworks when learning diverse skills.

In general, learning a diverse, agile multi-skill policy with online RL remains challenging. For example, while existing RL methods have successfully combined skills like jumping and trotting [89, 92], or quadrupedal walking and standing on hind legs with wheels (without bipedal walking) [78], a combination of more diverse and agile skills, such as stable bipedal walking and quadrupedal hopping in a single policy, has not yet been demonstrated.

B. Offline Learning in Locomotion Control

Compared to online learning, offline learning offers better scalability, a simpler training scheme, and an effective way to re-use data, yet prior works on learning low-level locomotion control from offline datasets remain limited. Most prior works focus on simple simulated tasks, such as Gym locomotion tasks, with behavior cloning (BC) [76] and offline RL [38]. Among them, some leverage Q-learning on offline datasets [35, 36, 51] or supervised learning techniques [9, 28, 81, 86]. However, these tasks are oversimplified and do not adequately consider the complexities of real-world scenarios.

An alternative is the use of offline data as a foundation for online learning [50, 77]. Among them, Smith et al. [71] develop baseline policies from offline datasets to bootstrap online learning on real robots. Yet, this approach still requires online learning.

Another recent work develops offline learning for humanoid locomotion on real robots [61], with the scope limited
to only one walking skill. In comparison, the efficacy of learning completely from offline datasets, especially at a larger scale than a few simple skills, remains unproven in legged locomotion control.

C. Diffusion Models in Robotics

Recent advances have seen increasing applications of diffusion models in control and planning systems. Some prior works integrate them into the learning pipeline, including as discriminators in adversarial IL for legged control [79] and as reward models for RL [52, 80]. However, the use of small multi-layer perceptron (MLP) networks as policies presents similar challenges in exploring diverse, multi-skill learning [81]. Diffusion models are also employed in high-level trajectory planning [24, 29], safe planning [85], and goal generation for low-level controllers [1, 31], as well as in enhancing visuomotor planning for manipulation tasks [48, 55, 63]. However, most of these prior works are limited to simulation environments only.

Emerging efforts to apply diffusion models to real-world robot manipulation include using diffusion to manage a variety of manipulation tasks with visual inputs and incorporating self-supervised learning and language conditioning [10, 12, 21, 41]. Yoneda et al. [90] leverage the reverse diffusion process for shared autonomy with a human user in end-effector planning. Additionally, hierarchical frameworks are being developed to handle tasks requiring multiple skills, pushing towards generalist policies [2, 53, 84]. However, these prior works primarily focus on high-level planning for manipulation systems, featuring a low-dimensional action space (e.g., end-effector position), low re-planning frequency (e.g., around 10 Hz), and inherently more stable dynamics.

In contrast, the use of diffusion models for high-frequency, low-level control remains limited [7]. The most relevant work uses online RL to train diffusion-based actor policies in simpler simulation settings [87], but transitioning to real-world applications with high-frequency feedback control presents significant challenges due to the instability and rapid dynamics of legged robots [82]. This work aims to leverage diffusion models in low-level control for legged locomotion and demonstrate the advantages of multimodality and scalability in the real world.

Fig. 2: Overview of the three stages of DiffuseLoco. First, we generate or utilize an offline dataset with demonstrations of a set of skills gathered with different methods (left). Then, we train the DiffuseLoco policy with the DDPM loss on trajectories within the dataset (middle). Finally, the DiffuseLoco policy is deployed on robots in the real world and executes a diverse set of agile skills (right).

Algorithm 1 DiffuseLoco Algorithm
1: Initialize: N source policies π^1_src, ..., π^N_src; empty or existing offline dataset D_src; diffusion model π_θ
2: // OBTAIN DATA FROM π_src
3: repeat
4:   Sample n uniformly from {1, ..., N}
5:   Sample environment dynamics p(s′ | s, a)
6:   Reset source-specific state s^src_t
7:   for t = 1 to T do
8:     if t mod random_goal_step = 0 then
9:       Sample goal g
10:    end if
11:    Act source policy a^src_t = π^n_src(s^src_t, g_t)
12:    Record source-agnostic state s_t
13:    Step environment s′^src_t = environment.step(a^src_t)
14:    Record source-agnostic action a_t
15:    Add data to offline dataset D_src ← (s_t, a_t, g_t)
16:  end for
17: until desired
18: // TRAIN ON OFFLINE DATASET
19: for each epoch do
20:   for each (s_traj, a_traj, g_traj) in D_src do
21:     Compute the loss L(θ) with Eqn. 5
22:     Update model parameters θ to minimize the loss L(θ)
23:   end for
24: end for
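For concreteness, the sketch below mirrors Algorithm 1 in Python. The environment and source-policy interfaces (reset, sample_goal, act, step, proprioception, last_pd_targets) and the noise scheduler are placeholder assumptions, not the authors' implementation; since Eqn. 5 is not reproduced in this excerpt, the training step uses the standard DDPM noise-prediction MSE described in Sec. III.

```python
import random
import torch
import torch.nn.functional as F

def collect_offline_dataset(source_policies, make_env, num_episodes, T, resample_every):
    """Stage 1 of Algorithm 1: roll out source policies and record
    source-agnostic (state, action, goal) tuples in D_src."""
    dataset = []
    for _ in range(num_episodes):
        policy = random.choice(source_policies)       # sample n ~ U{1, ..., N}
        env = make_env()                              # samples dynamics p(s'|s, a)
        src_state = env.reset()                       # source-specific state s^src_t
        goal = env.sample_goal()
        for t in range(T):
            if t % resample_every == 0:
                goal = env.sample_goal()              # re-sample command within range
            src_action = policy.act(src_state, goal)  # a^src_t = pi^n_src(s^src_t, g_t)
            s_t = env.proprioception()                # source-agnostic state s_t
            src_state = env.step(src_action)
            a_t = env.last_pd_targets()               # source-agnostic action a_t
            dataset.append((s_t, a_t, goal))          # D_src <- (s_t, a_t, g_t)
    return dataset

def train_epoch(model, scheduler, loader, optimizer, K):
    """Stage 2 of Algorithm 1: DDPM noise-prediction training on (s, a, g) segments."""
    for s_traj, a_traj, g_traj in loader:
        k = torch.randint(0, K, (a_traj.shape[0],))       # diffusion step k
        eps = torch.randn_like(a_traj)                    # Gaussian noise eps^k
        noisy_a = scheduler.add_noise(a_traj, eps, k)     # corrupt the action sequence
        eps_pred = model(noisy_a, s_traj, g_traj, k)      # predicted noise eps_theta
        loss = F.mse_loss(eps_pred, eps)                  # regress to the true noise
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```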
III. VERSATILE FRAMEWORK FOR DIFFUSION LOCOMOTION CONTROL FROM OFFLINE DATASET

In this section, we provide an overview of DiffuseLoco, a framework designed to generate and utilize offline datasets for training multi-skill locomotion policies scalably. DiffuseLoco is based on a diffusion model and is designed to train a low-level multi-skill locomotion policy from offline datasets containing multiple agile locomotion skills with diverse behaviors. DiffuseLoco represents the state of the art in multi-skill locomotion, as it is able to learn both bipedal and quadrupedal locomotion skills in a unified policy and can be deployed robustly zero-shot on hardware. DiffuseLoco addresses several challenges inherent to learning from offline data, including generating diverse skills for the same goals, handling data variability, and ensuring real-time control in real-world environments.

A schematic illustration of the DiffuseLoco framework is shown in Figure 2. Our framework consists of three stages:

a) Data Source: We start by collecting a new, or utilizing an existing, offline dataset consisting of multiple skills. To generate or expand a dataset, we first obtain single-skill control policies as the source policies. These policies are conditioned on given goals g (commands), such as velocity commands and base heights, that are skill-agnostic. Note that the source policies can be obtained with different methods. Under the assumption that the frequency of low-level control is the same, the observation and action spaces of the source policies can be vastly different. Thus, we need to collect a set of source-agnostic state and action pairs across all source policies. For legged robots, the widely used source-agnostic states and actions are the proprioceptive feedback s_t directly from the robot and the joint-level PD targets a_t sent directly to the robot, respectively. As illustrated in Algo. 1, we start an episode where the robot is controlled by the i-th source policy π^i. The policy acts on a source-specific state s^src_t, and only the source-agnostic state-action-goal pairs (s_t, a_t, g_t) are collected during the rollout of the robot's closed-loop dynamics until it reaches the maximum episode length T. In addition, the goal g_t is re-sampled within the command range after a time interval within the episode. In this work, we leverage cheap simulation data as an example, but since DiffuseLoco effectively re-uses data, we can scalably extend to more expensive data collection processes, potentially from real-world hardware. The details of dataset generation in our experiments can be found in Appendix D.

b) Training: In the second stage, we train our DiffuseLoco policy from the offline dataset in an end-to-end manner. Let the input state and goal history length be h and the output action prediction length be n. During training, we sample a segment of state trajectory s_traj and the corresponding action and goal sequences, a_traj and g_traj. We sample a diffusion step k randomly from {1, ..., K} and sample a Gaussian noise ε^k to add to the action sequence. Then, a transformer-based denoising model takes the noisy action sequence along with the state trajectory s_traj, goal trajectory g_traj, and diffusion step k as input, and predicts the added noise as ε_θ. The predicted noise ε_θ is then regressed to match the true noise ε^k with a mean square error loss. In this way, the denoising model learns to generate sequences of low-level actions conditioned on robot states and goals from the dataset. Details of the model architecture and training objective are introduced in Sec. IV.

c) Deployment: In the last stage, we zero-shot transfer the trained DiffuseLoco policy to the robot hardware. During deployment, the DiffuseLoco policy takes a sequence of pure noise sampled from a Gaussian distribution and denoises it conditioned on the state trajectory s_traj from the robot hardware and the given goal g_traj. The denoising process is repeated for K iterations to generate a sequence of actions, but only the immediate action a_t is used as the robot's joint-level PD targets. After executing this action, the DiffuseLoco policy takes a new sequence of states from the robot and updates the immediate action from the newly generated action sequence. This is designed to align with the Receding Horizon Control (RHC) framework, instead of interpolating the action sequence at high frequency as previously used by other diffusion-based work [31, 41]. RHC enables DiffuseLoco to replan rapidly with the fast-changing states of the robot to ensure up-to-date actions while keeping future steps in account. However, since the diffusion model has a large number of parameters and the diffusion process involves multiple denoising steps, we must accelerate the inference to be faster than the control frequency to achieve this RHC manner. The acceleration techniques are detailed in Appendix E; they ultimately enable running the DiffuseLoco policy on an edge-compute device that can be mounted on the robot.

IV. DIFFUSION MODEL FOR REAL-TIME CONTROL

Having introduced the framework of DiffuseLoco, we now develop its backbone: a diffusion model for locomotion control, shown in Fig. 3, with a special focus on design choices for real-time control and inference acceleration.

A. DDPM for Control

To model multi-modal behaviors from diverse datasets, we leverage Denoising Diffusion Probabilistic Models (DDPM) [22] with a transformer backbone to model different skills that can be applied to achieve a common goal. DDPM is a class of generative models in which the generative process is modeled as a denoising procedure, often referred to as Stochastic Langevin Dynamics, expressed in the following equation:

x^{k−1} = α ( x^k − γ ε_θ(x^k, k) ) + N(0, σ² I)    (1)

where N(0, σ² I) denotes the sampled noise from a DDPM scheduler, and α, γ, and σ are its hyperparameters: α regulates the rate at which noise is added at each step, γ represents the denoising strength, and σ defines the noise level. For clarity, we now use subscripts t−a:t−b to indicate trajectories from timestep t−a to t−b, replacing the subscript traj. To generate the action trajectory for control, an initial noisy action sequence a^K_{t:t+n} is sampled from Gaussian noise, and the DDPM, conditioned on states s_{t−h:t}, goals g_{t−h:t}, and previous actions a_{t−h−1:t−1}, undergoes K iterations of denoising steps.

Unlike previous works applying DDPM in manipulation [12], the inclusion of previous actions, i.e., I/O history, helps the policy better perform system identification and state estimation for legged locomotion control, as evaluated in [44]. Furthermore, instead of concatenating state and goal into a single embedding [12, 41], we leverage the transformer's attention mechanism to assign different attention weights to the separately embedded robot I/O and static goals, enabling the policy to adjust its focus between adapting to dynamic environments and achieving goals. We find that these modifications result in better command tracking performance and robustness, as shown in Sec. VII.

The denoising process yields a sequence of intermediate actions characterized by progressively decreasing noise levels, a^K, a^{K−1}, ..., a^0, until the desired noise-free output a^0 is attained. This process can be expressed as the following equation:

a^{k−1}_{t:t+n} = α ( a^k_{t:t+n} − γ ε_θ(a^k_{t−h−1:t+n}, s_{t−h:t}, g_{t−h:t}, k) ) + N(0, σ² I)    (2)

where a^k_{t:t+n} represents the output at the k-th iteration, and ε_θ(a^k_{t−h−1:t+n}, s_{t−h:t}, g_{t−h:t}, k) represents the predicted noise from the denoising model.
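A sketch of the deployment loop implied by Eq. (2) and the RHC scheme: each control step starts from fresh Gaussian noise, runs K conditioned denoising iterations, and executes only the first action of the predicted chunk as joint-level PD targets before replanning. The `model`, `robot`, and scheduler objects are placeholders (the scheduler follows the common step(noise_pred, k, sample) convention), not the authors' code.

```python
import torch

@torch.no_grad()
def denoise_action_chunk(model, scheduler, s_hist, g_hist, a_hist, n, action_dim, K):
    """One inference pass: K denoising iterations -> action chunk a^0_{t:t+n} (Eq. 2)."""
    a_k = torch.randn(1, n, action_dim)                    # a^K ~ N(0, I)
    for k in reversed(range(K)):
        eps = model(a_k, s_hist, g_hist, a_hist, k)        # eps_theta(a^k, s, g, k)
        a_k = scheduler.step(eps, k, a_k).prev_sample      # a^{k-1}
    return a_k                                             # noise-free chunk a^0

def control_loop(model, scheduler, robot, n, action_dim, K, h):
    """Receding horizon control at the robot's control frequency (30 Hz here)."""
    s_hist, g_hist, a_hist = robot.init_histories(h)       # length-h I/O and goal history
    while True:
        chunk = denoise_action_chunk(model, scheduler, s_hist, g_hist, a_hist,
                                     n, action_dim, K)
        a_t = chunk[:, 0]                                  # execute only the first action
        robot.apply_pd_targets(a_t)                        # joint-level PD targets
        s_hist, g_hist, a_hist = robot.update_histories(a_t)  # replan from the new state
```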
[Fig. 3 (architecture diagram; caption not recovered in this extraction): the robot I/O sequence at time t — delayed states s_{t−3:t} and actions a_{t−4:t−1} — and the goal/command sequence g_{t−3:t} (e.g., commanded velocity in m/s) are embedded by separate MLPs and, together with the diffusion timestep, condition the diffusion model.]
Fig. 5: Foot contact map indicating stable walking and skill switching with DiffuseLoco policy and velocity commands. The
red circle denotes the legs that are in contact with the ground. The robot initially walks using trotting skill, indicated by a
purple background, then switches to pacing, shown in green, following a command change that involves a sudden stop and
resume. We emphasize DiffuseLoco’s ability to maintain different modalities for stable walking under the same command,
switching modalities only when necessary.
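As a toy illustration of what the contact map in Fig. 5 encodes (our own example, not the paper's analysis code): in a trot the diagonal leg pairs are in stance together, while in a pace the same-side pairs are, so the active gait can be read directly from which feet touch the ground simultaneously.

```python
def classify_gait(contacts):
    """contacts: boolean stance flags for one timestep, keyed FL, FR, RL, RR."""
    diagonal = (contacts["FL"] == contacts["RR"]) and (contacts["FR"] == contacts["RL"])
    lateral  = (contacts["FL"] == contacts["RL"]) and (contacts["FR"] == contacts["RR"])
    if diagonal and not lateral:
        return "trot"   # diagonal pairs (FL+RR, FR+RL) move together
    if lateral and not diagonal:
        return "pace"   # same-side pairs (FL+RL, FR+RR) move together
    return "other"      # standing, transition, or asymmetric contact

print(classify_gait({"FL": True, "FR": False, "RL": False, "RR": True}))  # trot
print(classify_gait({"FL": True, "FR": False, "RL": True, "RR": False}))  # pace
```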
In conclusion, in addressing question a), we have demonstrated that DiffuseLoco is invariant to the source of the offline demonstrations and can be trained with data from multiple specialized RL frameworks. More importantly, DiffuseLoco shows better scalability in learning diverse skills than existing RL frameworks have yet illustrated, affirming its state-of-the-art capability as an answer to question b). We believe that the potential of DiffuseLoco extends significantly beyond the five skills showcased here, suggesting promising avenues for future work.

VI. QUANTITATIVE ANALYSIS

In this section, we seek to quantitatively compare DiffuseLoco against various existing multi-skill RL and non-diffusion behavior cloning (BC) baselines. Since there is no existing RL baseline that learns the five skills in the previous section, we refer only to quadrupedal walking skills, including pacing and trotting, as a case study for the performance of DiffuseLoco.

A. Task and Baselines

This analysis includes walking for four meters under five goals (commands) with different velocities. The goals are the following: move forward at three different speeds, 0.3 m/s, 0.5 m/s, and 0.7 m/s, and make a left turn and a right turn at 0.3 rad/s. We record the actual linear velocities via a Kalman-filter state estimator [18], and the number of trials in which the robot does not fall over throughout the trial as the stability metric. We repeat each experiment five times and report the mean and standard deviation across the five runs.

Skill information is often unscalable or unavailable during training and deployment. For a broader range of applicability, we limit the scope of comparisons to non-skill-conditioned multi-skill RL and non-diffusion BC baselines. Specifically, the RL baselines include:
• Adversarial Motion Priors (AMP) [16]: An MLP policy trained using AMP with RL (PPO) and a style reward from both pacing and trotting reference motions. We directly use the open-sourced checkpoint from [16]. We note that although several skill-conditioned RL policies [39, 83, 89] have been introduced since [16], yielding better sim-to-real results, progress in unconditioned multi-skill policies has been limited.
• AMP with history steps (AMP w/ H): To align with DiffuseLoco, we train an AMP policy with 8 steps of state and action history, with the same setup as [16] and a similar evaluation return in simulation.

Furthermore, we compare DiffuseLoco with non-diffusion BC policies, which can be categorized into autoregressive token prediction [9, 25] and action sequence prediction as used in [19]. We adopt baselines for each category:
• Transformer with Autoregressive Token Prediction (TF): A Generative Pretrained Transformer (GPT) [5] policy similar to a decision transformer [9] without reward conditioning. It generates only one timestep of action.
• Transformer with Receding Horizon Control (TF w/ RHC): A transformer policy with the same future-step action predictions. The model's architecture is identical to DiffuseLoco, and it is trained for the same number of epochs as DiffuseLoco.
Goal (Task)       Metric          AMP            AMP w/ H       TF             TF w/ RHC      DiffuseLoco (Ours)
0.3 m/s Forward   Stability (%)   100            100            80             100            100
                  Ev (%)          90.44 ± 1.87   90.63 ± 4.79   75.75 ± 6.07   39.28 ± 2.34   33.22 ± 12.48

TABLE I: Performance benchmark across different baselines and our DiffuseLoco policy in the real world. Stability (higher is better) measures the number of trials in which the robot stays stable and does not fall over. Ev (lower is better) measures the deviation from the desired velocity in percentage. The experiments are conducted with different command settings (left). Each command is repeated non-stop for five trials, and we report the average and standard deviation of the metrics across the five trials.
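The caption defines Ev only verbally. One plausible formalization, written here as an assumption rather than the paper's exact definition, averages the relative error between the Kalman-filter velocity estimate v_t and the commanded velocity v_cmd over a trial of length T:

```latex
E_v \;=\; \frac{100\%}{T}\sum_{t=1}^{T} \frac{\lvert v_t - v_{\mathrm{cmd}} \rvert}{\lvert v_{\mathrm{cmd}} \rvert}
```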
Remark 2: Typically, previous work uses DAgger-style algorithms [65] to better cope with distribution shift, but these methods require access to the expert policy during training and an online learning environment. As a more scalable and versatile framework, we limit our focus to learning entirely from offline datasets.

... environment, both AMP and AMP w/ H are able to control the robot to track different velocities without falling over. Besides better sim-to-real transfer, DiffuseLoco with a diffusion model is able to efficiently learn the multi-modality presented by the different skills for the same locomotion task, and is thus able to perform valid and coordinated locomotion skills without mode collapse, as in the example given in Fig. 5. This helps DiffuseLoco achieve both better stability and task-completion (velocity tracking) performance compared with the AMP-based policy baselines.

C. DiffuseLoco versus Non-Diffusion Behavior Cloning

We further compare DiffuseLoco with non-diffusion BC methods. For locomotion tasks, smooth and temporally consistent actions are a necessity for stability and robustness. Looking at Table I, we find that DiffuseLoco outperforms TF and TF w/ RHC in both stability and robustness of the locomotion policy. Using one-step action output, we find that TF lacks robustness and fails the 0.7 m/s forward and left-turn tasks completely. This is likely because single-step action prediction makes the policy less aware of future actions, leading to more jittering behavior.

With receding horizon control, TF w/ RHC overcomes most of the jittering problem and can complete most of the tasks. However, we note that for more agile motion such as the 0.7 m/s forward task, the stability metric drops drastically to merely 40%. This is likely because the reconstruction loss used in TF w/ RHC training tends to overfit the action trajectories in the dataset, resulting in a less robust policy in out-of-distribution scenarios (such as in the real world).

This is especially evident when looking at the training curves for TF w/ RHC versus DiffuseLoco shown in Figure 7, where TF w/ RHC overfits significantly to the training dataset and the evaluation loss stays high. Note that the model architecture is kept identical across TF w/ RHC and DiffuseLoco, so only the loss functions are different. In addition, the evaluation loss here is calculated with samples from the same distribution

In comparison, our DiffuseLoco shows more stable and smooth motions, measured by both the stability metric and the magnitude of the body's angular velocity. On average, DiffuseLoco achieves 10.40% lower magnitude of body oscillation over all trials. As a result, the smoother locomotion skill helps DiffuseLoco achieve, on average, 38.97% lower tracking error compared to TF w/ RHC. Based on this observation, we suggest that DDPM-style training is more suitable for imitating locomotion tasks than prior behavior cloning methods.

Fig. 7: Loss curves for training DiffuseLoco and the TF w/ RHC baseline. (a): Training loss. (b): Evaluation loss. We report the mean and standard deviation across three seeds. We find that even though TF w/ RHC achieves a low reconstruction loss in training, its evaluation loss stays higher than DiffuseLoco's and increases at the end. This indicates that TF w/ RHC tends to overfit the offline dataset while DiffuseLoco does not, with the same number of parameters.

VII. ABLATION STUDY ON DESIGN CHOICES

In this section, we further evaluate the design choices used to build the DiffuseLoco policy in simulation and the real world through extensive ablation studies. We use the same experimental setup as the previous section. For brevity of the main content, further ablation studies can be found in Appendix C.

A. Ablation Components

To validate our design choices, we ablate DiffuseLoco with the following critical components and compare them to our real-world benchmark.
• Without Receding Horizon Control (DL w/o RHC): Replace RHC with one-step prediction in an autoregressive manner and keep the diffusion model.
• Without Domain Randomization (DL w/o Rand): Trained on a dataset generated without domain randomization, except for the ground friction coefficient.
• DDIM Inference: We develop two DDIM baselines to investigate how training and inference steps affect performance in locomotion control:
  – 100 training + 10 inference steps (DDIM-100/10)
  – 10 training + 5 inference steps (DDIM-10/5)
  Compared with our DiffuseLoco, DDIM-100/10 has the same number of inference steps, and DDIM-10/5 has the same number of training steps.
• U-Net as the backbone (U-Net): Replace the Transformer with a U-Net as the backbone, adjusted to the same parameter count.

B. Single-step output versus RHC

To isolate the effects of RHC, we test a variant of DiffuseLoco without RHC (DL w/o RHC), finding that it struggles with faster speed goals and exhibits significant jittering behavior, as detailed in Table II. This suggests that single-step token-prediction models like GPT are less suitable for legged locomotion control than diffusion models, which predict sequences of future actions.

C. Sampling Techniques

As discussed earlier, popular diffusion-based frameworks like DDIM often reduce sampling iterations for inference acceleration, trading off output quality for speed, often with
Goal (Task)       Metric          DL w/o RHC      DL w/o Rand    DDIM-100/10    DDIM-10/5      U-Net          DiffuseLoco (Ours)
0.3 m/s Forward   Stability (%)   100             100            100            100            100            100
                  Ev (%)          75.09 ± 18.98   50.45 ± 2.70   56.89 ± 2.43   47.09 ± 2.40   81.31 ± 1.90   33.22 ± 12.48
Turn Right        Stability (%)   100             100            100            100            100            100
                  Ev (%)          18.61 ± 2.40    8.18 ± 3.94    6.47 ± 2.49    7.42 ± 2.90    89.63 ± 3.36   2.22 ± 1.03

TABLE II: Performance ablation study across different ablations and the DiffuseLoco policy in real-world experiments. Stability (higher is better) measures the number of trials in which the robot stays stable and does not fall over. Ev (lower is better) measures the deviation from the desired velocity in percentage. The experiments are conducted with different command settings (left). Each command is repeated non-stop for five trials, and we report the average and standard deviation of the metrics across the five trials.
ten times fewer iterations [72]. While this approach suits tasks like image generation, which tolerate some variance, it underperforms in quadrupedal locomotion control. As shown in Table II, both DDIM-100/10 and DDIM-10/5 exhibit worse stability and higher velocity tracking errors. Noticeably, the 100-training-step, 10-inference-step variant demonstrates limping behavior and fails two trials. Tracking errors for the two variants increase by 50.69% and 42.04%, respectively, compared to DiffuseLoco.

Thus, we believe that the noisier control signals from the DDIM pipeline likely disrupt the control of inherently unstable floating-base dynamic systems like legged robots. An interesting future work direction could be control-specific sampling techniques that accelerate diffusion models without compromising stability and performance.

D. Model Architecture Effects

In addition, we compare against another commonly used architecture in diffusion models, a CNN-based U-Net, as the backbone of DiffuseLoco. Qualitatively, the U-Net policy is shaky and inconsistent, and quantitatively, it has one of the highest velocity tracking errors, with worsened stability due to its shaky actions. We reckon that this is because CNNs are not the best fit for temporal data and also lack separate attention weights for goal conditioning. This is consistent with prior work [12] finding that U-Net underperforms the Transformer, especially in dynamical systems with high action rates.

E. Dataset Effects

Lastly, we explore how dataset characteristics influence the robustness and performance of DiffuseLoco in real-world scenarios. Consistent with previous findings that diversity in training data, such as noise insertion, mitigates compounding error [37], we demonstrate that increasing the variety of dynamics parameters in the simulation environments where we collect data also enhances robustness. As shown in Table II, training DiffuseLoco on a dataset with dynamics randomization leads to a 44.26% increase in both robustness and stability compared to the DL w/o Rand baseline. Specifically, in the challenging 0.7 m/s forward task, DL w/o Rand falls in 3 out of 5 trials. This ablation study points to the potential of altering the dataset, by adding either more diversity (potentially including real-world data) or more fault-recovery behaviors, to further enhance the robustness of DiffuseLoco.

VIII. DISCUSSION AND FUTURE WORK

We have presented DiffuseLoco, a scalable learning framework that learns diverse, agile legged locomotion skills from multi-modal offline datasets and robustly transfers to real-world robots in real time. Leveraging diffusion models to capture the multi-modality in the offline dataset, DiffuseLoco learns a state-of-the-art controller that combines bipedal walking and quadrupedal locomotion skills within one policy and transitions freely among the skills.

A. A Scalable Approach for Locomotion Skills

In this work, we focus on the scalability of learning legged locomotion control. Inspired by the successes of large-scale learning in other robotics tasks, we leverage the most scalable approach, learning from offline datasets, with special attention to the versatility of the data sources and the multi-modality of different skills in a large-scale dataset. With an expressive diffusion model as the backbone, we are able to absorb demonstrations learned with various existing RL algorithms with potentially different observation and action spaces, and effectively execute skills, such as trotting and pacing, that represent different modalities given identical commands. With the five diverse skills presented in this work as a testimony, we show the scalability of DiffuseLoco towards a generalist policy for locomotion control tasks.
B. Benefits over Other Multi-skill Policies

As we show in our thorough real-world benchmark (Sec. VI), DiffuseLoco demonstrates smoother actions and improved stability and velocity tracking in real-world conditions compared to the non-diffusion behavior cloning baselines commonly used in prior works. When not conditioned on explicit skill labels, DiffuseLoco shows better sim-to-real transfer performance for multi-skill locomotion compared to AMP policies, which often face a significant sim-to-real gap due to mode collapse in generator-discriminator methods. DiffuseLoco avoids these issues, providing stable, coherent, and effective control with smooth skill switching and stable execution under consistent commands.

C. Large-scale Offline Dataset for Locomotion

In the experiments, DiffuseLoco demonstrates scalability and robustness by utilizing diverse offline datasets, as discussed in Sec. VII-E. Unlike online learning, which mostly depends on simulation, it offers a practical path toward scalable real-world data collection and learning. Similar to [17], we also hypothesize that DiffuseLoco could adapt to datasets containing different robot morphologies, thus allowing for broader deployment and better generalization. Furthermore, as a popular direction in the manipulation field, integrating vision and language instructions into the goal-conditioning dataset could further enhance DiffuseLoco's versatility and applicability in future works.

IX. ACKNOWLEDGEMENT

This work was supported in part by NSF 2303735 for POSE, in part by NSF 2238346 for CAREER, in part by The AI Institute, and in part by InnoHK of the Government of the Hong Kong Special Administrative Region via the Hong Kong Centre for Logistics Robotics.

REFERENCES

[1] Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.
[2] Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pre-trained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023.
[3] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, 2023.
[4] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deeksha Manjunath, Igor Mordatch, Ofir Nachum, Carolina Parada, Jodilyn Peralta, Emily Perez, Karl Pertsch, Jornell Quiambao, Kanishka Rao, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Kevin Sayed, Jaspiar Singh, Sumedh Sontakke, Austin Stone, Clayton Tan, Huong Tran, Vincent Vanhoucke, Steve Vega, Quan Vuong, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. RT-1: Robotics Transformer for Real-World Control at Scale, 2023.
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[6] Guillermo A Castillo, Bowen Weng, Wei Zhang, and Ayonga Hereid. Robust feedback motion policy design using reinforcement learning on a 3D Digit bipedal robot. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5136–5143. IEEE, 2021.
[7] Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, and Jun Zhu. Score Regularized Policy Optimization through Diffusion Behavior, 2023.
[8] Jianyu Chen, Bodi Yuan, and Masayoshi Tomizuka. Deep imitation learning for autonomous driving in generic urban scenarios with enhanced safety. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2884–2890. IEEE, 2019.
[9] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision Transformer: Reinforcement Learning via Sequence Modeling, 2021.
[10] Lili Chen, Shikhar Bahl, and Deepak Pathak. PlayFusion: Skill acquisition via diffusion from language-annotated play. In Conference on Robot Learning, pages 2012–2029. PMLR, 2023.
[11] Xuxin Cheng, Kexin Shi, Ananye Agarwal, and Deepak Pathak. Extreme parkour with legged robots. arXiv preprint arXiv:2309.14341, 2023.
[12] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action
diffusion. arXiv preprint arXiv:2303.04137, 2023. ence on Computer Vision and Pattern Recognition, pages
[13] Felipe Codevilla, Matthias Müller, Antonio López, 16750–16761, 2023.
Vladlen Koltun, and Alexey Dosovitskiy. End-to-end [25] Xiaoyu Huang, Dhruv Batra, Akshara Rai, and Andrew
driving via conditional imitation learning. In 2018 IEEE Szot. Skill transformer: A monolithic policy for mobile
international conference on robotics and automation manipulation. In Proceedings of the IEEE/CVF Inter-
(ICRA), pages 4693–4700. IEEE, 2018. national Conference on Computer Vision, pages 10852–
[14] Jeremy Dao, Kevin Green, Helei Duan, Alan Fern, 10862, 2023.
and Jonathan Hurst. Sim-to-real learning for bipedal [26] Xiaoyu Huang, Zhongyu Li, Yanzhen Xiang, Yiming
locomotion under unsensed dynamic loads. In 2022 Ni, Yufeng Chi, Yunhao Li, Lizhi Yang, Xue Bin Peng,
International Conference on Robotics and Automation and Koushil Sreenath. Creating a dynamic quadrupedal
(ICRA), pages 10449–10455. IEEE, 2022. robotic goalkeeper with reinforcement learning. In 2023
[15] Ricard Durall, Avraam Chatzimichailidis, Peter Labus, IEEE/RSJ International Conference on Intelligent Robots
and Janis Keuper. Combating mode collapse in gan and Systems (IROS), pages 2715–2722. IEEE, 2023.
training: An empirical analysis using hessian eigenvalues. [27] Jemin Hwangbo, Joonho Lee, A. Dosovitskiy, Dario
arXiv preprint arXiv:2012.09673, 2020. Bellicoso, Vassilios Tsounis, V. Koltun, and M. Hutter.
[16] Alejandro Escontrela, Xue Bin Peng, Wenhao Yu, Learning agile and dynamic motor skills for legged
Tingnan Zhang, Atil Iscen, Ken Goldberg, and Pieter robots. Science Robotics, 4, 2019. doi: 10.1126/
Abbeel. Adversarial motion priors make good substitutes scirobotics.aau5872.
for complex reward functions. In 2022 IEEE/RSJ Inter- [28] Michael Janner, Qiyang Li, and Sergey Levine. Offline
national Conference on Intelligent Robots and Systems reinforcement learning as one big sequence modeling
(IROS), pages 25–32. IEEE, 2022. problem. Advances in neural information processing
[17] Gilbert Feng, Hongbo Zhang, Zhongyu Li, Xue Bin systems, 34:1273–1286, 2021.
Peng, Bhuvan Basireddy, Linzhu Yue, Zhitao Song, Lizhi [29] Michael Janner, Yilun Du, Joshua B. Tenenbaum, and
Yang, Yunhui Liu, Koushil Sreenath, et al. Genloco: Gen- Sergey Levine. Planning with Diffusion for Flexible
eralized locomotion controllers for quadrupedal robots. Behavior Synthesis, 2022.
In Conference on Robot Learning, pages 1893–1903. [30] Gwanghyeon Ji, Juhyeok Mun, Hyeongjun Kim, and
PMLR, 2023. Jemin Hwangbo. Concurrent training of a control policy
[18] Thomas Flayols, Andrea Del Prete, Patrick Wensing, and a state estimator for dynamic and robust legged
Alexis Mifsud, Mehdi Benallegue, and Olivier Stasse. locomotion. IEEE Robotics and Automation Letters, 7
Experimental evaluation of simple estimators for hu- (2):4630–4637, 2022.
manoid robots. In 2017 IEEE-RAS 17th International [31] Ivan Kapelyukh, Vitalis Vosylius, and Edward Johns.
Conference on Humanoid Robotics (Humanoids), pages DALL-E-Bot: Introducing Web-Scale Diffusion Models
889–895. IEEE, 2017. to Robotics. IEEE Robotics and Automation Letters, 8
[19] Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mo- (7):3956–3963, July 2023. ISSN 2377-3774. doi: 10.
bile aloha: Learning bimanual mobile manipulation with 1109/lra.2023.3272516. URL http://dx.doi.org/10.1109/
low-cost whole-body teleoperation. arXiv preprint LRA.2023.3272516.
arXiv:2401.02117, 2024. [32] J. Kaplan, Sam McCandlish, T. Henighan, Tom B.
[20] Yuni Fuchioka, Zhaoming Xie, and Michiel Van de Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec
Panne. Opt-mimic: Imitation of optimized trajectories Radford, Jeff Wu, and Dario Amodei. Scaling Laws for
for dynamic quadruped behaviors. In 2023 IEEE Interna- Neural Language Models. ArXiv, abs/2001.08361, 2020.
tional Conference on Robotics and Automation (ICRA), [33] Sunwoo Kim, Maks Sorokin, Jehee Lee, and Sehoon Ha.
pages 5092–5098. IEEE, 2023. Humanconquad: human motion control of quadrupedal
[21] Huy Ha, Pete Florence, and Shuran Song. Scaling up and robots using deep reinforcement learning. In SIGGRAPH
distilling down: Language-guided robot skill acquisition. Asia 2022 Emerging Technologies, pages 1–2, 2022.
In Conference on Robot Learning, pages 3766–3777. [34] Arnaud Klipfel, Nitish Sontakke, Ren Liu, and Sehoon
PMLR, 2023. Ha. Learning a single policy for diverse behaviors on
[22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising a quadrupedal robot using scalable motion imitation. In
Diffusion Probabilistic Models, 2020. 2023 IEEE/RSJ International Conference on Intelligent
[23] David Hoeller, Nikita Rudin, Dhionis Sako, and Marco Robots and Systems (IROS), pages 2768–2775. IEEE,
Hutter. Anymal parkour: Learning agile navigation for 2023.
quadrupedal robots. Science Robotics, 9(88):eadi7566, [35] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline
2024. reinforcement learning with implicit q-learning. arXiv
[24] Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, preprint arXiv:2110.06169, 2021.
Tengyu Liu, Yixin Zhu, Wei Liang, and Song-Chun Zhu. [36] Aviral Kumar, Aurick Zhou, G. Tucker, and S. Levine.
Diffusion-based generation, optimization, and planning Conservative Q-Learning for Offline Reinforcement
in 3d scenes. In Proceedings of the IEEE/CVF Confer- Learning. ArXiv, abs/2006.04779, 2020.
[37] Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, jciech Zaremba, and Pieter Abbeel. Overcoming ex-
and Ken Goldberg. Dart: Noise injection for robust ploration in reinforcement learning with demonstrations.
imitation learning. In Conference on robot learning, In 2018 IEEE international conference on robotics and
pages 143–156. PMLR, 2017. automation (ICRA), pages 6292–6299. IEEE, 2018.
[38] Sergey Levine, Aviral Kumar, George Tucker, and Justin [51] Mitsuhiko Nakamoto, Yuexiang Zhai, Anikait Singh,
Fu. Offline reinforcement learning: Tutorial, review, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar,
and perspectives on open problems. arXiv preprint and Sergey Levine. Cal-QL: Calibrated Offline RL Pre-
arXiv:2005.01643, 2020. Training for Efficient Online Fine-Tuning. arXiv preprint
[39] Chenhao Li, Sebastian Blaes, Pavel Kolev, Marin Vlastel- arXiv:2303.05479, 2023.
ica, Jonas Frey, and Georg Martius. Versatile skill control [52] Felipe Nuti, Tim Franzmeyer, and João F Henriques.
via self-supervised adversarial imitation of unlabeled Extracting Reward Functions from Diffusion Models.
mixed motions. In 2023 IEEE international conference arXiv preprint arXiv:2306.01804, 2023.
on robotics and automation (ICRA), pages 2944–2950. [53] Octo Model Team, Dibya Ghosh, Homer Walke, Karl
IEEE, 2023. Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey
[40] Chenhao Li, Marin Vlastelica, Sebastian Blaes, Jonas Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You
Frey, Felix Grimminger, and Georg Martius. Learning Liang Tan, Dorsa Sadigh, Chelsea Finn, and Sergey
agile skills via adversarial imitation of rough partial Levine. Octo: An Open-Source Generalist Robot Policy.
demonstrations. In Conference on Robot Learning, pages https://octo-models.github.io, 2023.
342–352. PMLR, 2023. [54] Yunpeng Pan, Ching-An Cheng, Kamil Saigol, Keuntaek
[41] Xiang Li, Varun Belagali, Jinghuan Shang, and Lee, Xinyan Yan, Evangelos Theodorou, and Byron
Michael S. Ryoo. Crossway Diffusion: Improving Boots. Agile autonomous driving using end-to-end deep
Diffusion-based Visuomotor Policy via Self-supervised imitation learning. arXiv preprint arXiv:1709.07174,
Learning, 2024. 2017.
[42] Yunfei Li, Jinhan Li, Wei Fu, and Yi Wu. Learning Agile [55] Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave
Bipedal Motions on a Quadrupedal Robot. arXiv preprint Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcar-
arXiv:2311.05818, 2023. cel Macua, Shan Zheng Tan, Ida Momennejad, Katja
[43] Zhongyu Li, Xuxin Cheng, Xue Bin Peng, Pieter Abbeel, Hofmann, and Sam Devlin. Imitating Human Behaviour
Sergey Levine, Glen Berseth, and Koushil Sreenath. with Diffusion Models, 2023.
Reinforcement learning for robust parameterized loco- [56] X. B. Peng, Erwin Coumans, Tingnan Zhang, T. Lee, Jie
motion control of bipedal robots. In 2021 IEEE Interna- Tan, and S. Levine. Learning Agile Robotic Locomotion
tional Conference on Robotics and Automation (ICRA), Skills by Imitating Animals. ArXiv, abs/2004.00784,
pages 2811–2817. IEEE, 2021. 2020. doi: 10.15607/rss.2020.xvi.064.
[44] Zhongyu Li, Xue Bin Peng, Pieter Abbeel, Sergey [57] Xue Bin Peng, Marcin Andrychowicz, Wojciech
Levine, Glen Berseth, and Koushil Sreenath. Rein- Zaremba, and Pieter Abbeel. Sim-to-real transfer of
forcement Learning for Versatile, Dynamic, and Robust robotic control with dynamics randomization. In 2018
Bipedal Locomotion Control, 2024. IEEE international conference on robotics and automa-
[45] Qiayuan Liao, Zhongyu Li, Akshay Thirugnanam, Jun tion (ICRA), pages 3803–3810. IEEE, 2018.
Zeng, and Koushil Sreenath. Walking in narrow spaces: [58] Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and
Safety-critical locomotion control for quadrupedal robots Angjoo Kanazawa. Amp: Adversarial motion priors for
with duality-based optimization. In 2023 IEEE/RSJ In- stylized physics-based character control. ACM Transac-
ternational Conference on Intelligent Robots and Systems tions on Graphics (ToG), 40(4):1–20, 2021.
(IROS), pages 2723–2730. IEEE, 2023. [59] Carole G. Prevost, Andre Desbiens, and Eric Gagnon.
[46] Kanglin Liu, Wenming Tang, Fei Zhou, and Guoping Extended Kalman Filter for State Estimation and Tra-
Qiu. Spectral regularization for combating mode collapse jectory Prediction of a Moving Object Detected by an
in gans. In Proceedings of the IEEE/CVF international Unmanned Aerial Vehicle. In 2007 American Control
conference on computer vision, pages 6382–6390, 2019. Conference, pages 1805–1810, 2007. doi: 10.1109/ACC.
[47] Gabriel B Margolis, Ge Yang, Kartik Paigwar, Tao Chen, 2007.4282823.
and Pulkit Agrawal. Rapid locomotion via reinforcement [60] Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell,
learning. arXiv preprint arXiv:2205.02824, 2022. Jitendra Malik, and Koushil Sreenath. Learning Hu-
[48] Utkarsh A. Mishra, Shangjie Xue, Yongxin Chen, and manoid Locomotion with Transformers. arXiv preprint
Danfei Xu. Generative Skill Chaining: Long-Horizon arXiv:2303.03381, 2023.
Skill Planning with Diffusion Models, 2023. [61] Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan
[49] Vaishnavh Nagarajan and J Zico Kolter. Gradient descent Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil
GAN optimization is locally stable. Advances in neural Sreenath, and Jitendra Malik. Humanoid Locomotion as
information processing systems, 30, 2017. Next Token Prediction. arXiv preprint arXiv:2402.19469,
[50] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wo- 2024.
[62] Alexander Reske, Jan Carius, Yuntao Ma, Farbod of Highly Dynamic Motions with Rigid and Articulated
Farshidian, and Marco Hutter. Imitation learning from Soft Quadrupeds. arXiv preprint arXiv:2309.09682,
mpc for quadrupedal multi-gait control. In 2021 IEEE 2023.
International Conference on Robotics and Automation [78] Eric Vollenweider, Marko Bjelonic, Victor Klemm,
(ICRA), pages 5014–5020. IEEE, 2021. Nikita Rudin, Joonho Lee, and Marco Hutter. Ad-
[63] Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf vanced skills through multiple adversarial motion priors
Lioutikov. Goal-Conditioned Imitation Learning using in reinforcement learning. In 2023 IEEE International
Score-based Diffusion Policies, 2023. Conference on Robotics and Automation (ICRA), pages
[64] Robin Rombach, Andreas Blattmann, Dominik Lorenz, 5120–5126. IEEE, 2023.
Patrick Esser, and Björn Ommer. High-Resolution Image [79] Bingzheng Wang, Guoqiang Wu, Teng Pang, Yan Zhang,
Synthesis with Latent Diffusion Models, 2022. and Yilong Yin. DiffAIL: Diffusion Adversarial Imitation
[65] Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bag- Learning, 2023.
nell. A Reduction of Imitation Learning and Structured [80] Hsiang-Chun Wang, Shang-Fu Chen, and Shao-Hua Sun.
Prediction to No-Regret Online Learning, 2011. Diffusion Model-Augmented Behavioral Cloning. arXiv
[66] Nikita Rudin, David Hoeller, Philipp Reist, and Marco preprint arXiv:2302.13335, 2023.
Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on Robot Learning, pages 91–100. PMLR, 2022.
[67] Dohoon Ryu and Jong Chul Ye. Pyramidal Denoising Diffusion Probabilistic Models, 2022.
[68] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, 2017.
[69] Yecheng Shao, Yongbin Jin, Xianwei Liu, Weiyan He, Hongtao Wang, and Wei Yang. Learning free gait transition for quadruped robots via phase-guided controller. IEEE Robotics and Automation Letters, 2021.
[70] Yecheng Shao, Yongbin Jin, Xianwei Liu, Weiyan He, Hongtao Wang, and Wei Yang. Learning Free Gait Transition for Quadruped Robots Via Phase-Guided Controller. IEEE Robotics and Automation Letters, 7:1230–1237, 2022. doi: 10.1109/LRA.2021.3136645.
[71] Laura Smith, J Chase Kew, Tianyu Li, Linda Luu, Xue Bin Peng, Sehoon Ha, Jie Tan, and Sergey Levine. Learning and adapting agile locomotion skills by transferring experience. arXiv preprint arXiv:2304.09834, 2023.
[72] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models, 2022.
[73] Zhi Su, Xiaoyu Huang, Daniel Ordoñez-Apraez, Yunfei Li, Zhongyu Li, Qiayuan Liao, Giulio Turrisi, Massimiliano Pontil, Claudio Semini, Yi Wu, et al. Leveraging Symmetry in RL-based Legged Locomotion Control. arXiv preprint arXiv:2403.17320, 2024.
[74] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and A. Gupta. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 843–852, 2017. doi: 10.1109/ICCV.2017.97.
[75] The Linux Foundation. Open neural network exchange. https://onnx.ai/, 2024. Accessed: 2024-01-29.
[76] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.
[77] Francesco Vezzi, Jiatao Ding, Antonin Raffin, Jens Kober, and Cosimo Della Santina. Two-Stage Learning
[81] Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning, 2023.
[82] Eric R Westervelt, Jessy W Grizzle, Christine Chevallereau, Jun Ho Choi, and Benjamin Morris. Feedback control of dynamic bipedal robot locomotion. CRC Press, 2018.
[83] Jinze Wu, Yufei Xue, and Chenkun Qi. Learning multiple gaits within latent space for quadruped robots. arXiv preprint arXiv:2308.03014, 2023.
[84] Zhou Xian, Nikolaos Gkanatsios, Theophile Gervet, Tsung-Wei Ke, and Katerina Fragkiadaki. ChainedDiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation. In Conference on Robot Learning, pages 2323–2339. PMLR, 2023.
[85] Wei Xiao, Tsun-Hsuan Wang, Chuang Gan, and Daniela Rus. SafeDiffuser: Safe Planning with Diffusion Probabilistic Models. arXiv preprint arXiv:2306.00148, 2023.
[86] Haoran Xu, Li Jiang, Li Jianxiong, and Xianyuan Zhan. A policy-guided imitation approach for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:4085–4098, 2022.
[87] Long Yang, Zhixiong Huang, Fenghao Lei, Yucun Zhong, Yiming Yang, Cong Fang, Shiting Wen, Binbin Zhou, and Zhouchen Lin. Policy Representation via Diffusion Probability Model for Reinforcement Learning, 2023.
[88] Ruihan Yang, Huazhe Xu, Yi Wu, and Xiaolong Wang. Multi-task reinforcement learning with soft modularization. Advances in Neural Information Processing Systems, 33:4767–4777, 2020.
[89] Ruihan Yang, Zhuoqun Chen, Jianhan Ma, Chongyi Zheng, Yiyu Chen, Quan Nguyen, and Xiaolong Wang. Generalized animal imitator: Agile locomotion with versatile motion prior. arXiv preprint arXiv:2310.01408, 2023.
[90] Takuma Yoneda, Luzhe Sun, Bradly Stadie, Ge Yang, and Matthew Walter. To the Noise and Back: Diffusion for Shared Autonomy. arXiv preprint arXiv:2302.12244, 2023.
[91] Fangzhou Yu, Ryan Batke, Jeremy Dao, Jonathan Hurst, Kevin Green, and Alan Fern. Dynamic bipedal turning through sim-to-real reinforcement learning. In 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids), pages 903–910. IEEE, 2022.
[92] Chong Zhang, Jiapeng Sheng, Tingguang Li, He Zhang, Cheng Zhou, Qingxu Zhu, Rui Zhao, Yizheng Zhang, and Lei Han. Learning Highly Dynamic Behaviors for Quadrupedal Robots. arXiv preprint arXiv:2402.13473, 2024.
[93] Ziwen Zhuang, Zipeng Fu, Jianren Wang, Christopher Atkeson, Soeren Schwertfeger, Chelsea Finn, and Hang Zhao. Robot parkour learning. arXiv preprint arXiv:2309.05665, 2023.
APPENDIX A
MORE RESULTS ON SKILL TRANSITIONING
Fig. 10: Skill Transitioning: Bounce to Stand with Emergent Intermediate Pacing Skill
steps of actions, but only executes the very first step of actions. This is in contrast to previous work that infers a sequence of actions at a lower frequency and uses interpolation to obtain high-frequency actions [31, 41]. Such a setup allows us to replan rapidly as the robot's state changes quickly, while still taking future steps into account. As we evaluated in Sec. VII-B, using RHC is critical for improving the smoothness and consistency
of a legged locomotion control policy.
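As a rough illustration of this receding horizon loop, consider the following sketch; the object and method names (e.g., policy.denoise_action_sequence, n_pred) are placeholders for illustration, not the actual DiffuseLoco API.

```python
# Minimal sketch of receding horizon control (RHC) for a diffusion policy.
# All names here (policy, robot, n_pred) are illustrative placeholders.

def rhc_control_loop(policy, robot, n_pred=8, control_hz=50):
    """At every control step, predict n_pred future actions but apply only the first."""
    dt = 1.0 / control_hz
    history = robot.init_io_history()           # past states/actions fed to the policy
    while robot.is_running():
        goal = robot.current_command()          # e.g., desired velocity command
        # Denoise a whole action sequence a_{t:t+n_pred} conditioned on history and goal.
        action_seq = policy.denoise_action_sequence(history, goal)  # shape: (n_pred, act_dim)
        robot.apply_action(action_seq[0])       # execute only the very first action
        history = robot.update_io_history(history, action_seq[0])
        robot.sleep(dt)                         # replan at the full control frequency
```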
B. Architecture

The DiffuseLoco policy leverages an encoder-decoder transformer DDPM. First, the robot's past I/O trajectory (s_{t-h-1:t-1}, a_{t-h-2:t-2}) and the given goal sequence g_{t-h-1:t-1} are transformed into separate I/O and goal embeddings by two 2-layer MLP encoders, respectively. Then, we sample noise ϵ^(k) for diffusion time step k with the DDPM scheduler and add it to the ground-truth action a from the offline dataset to produce a noisy action a^k_{t:t+n} = a_{t:t+n} + ϵ^k. The noisy action a^k_{t:t+n} is then passed through an MLP layer to form the action embedding. The noisy action tokens are then passed through 6 Transformer decoder layers, each of which is composed of an 8-head cross-attention layer. Each layer computes the attention weights for the noisy action tokens querying all the state embeddings, the goal embeddings, and the timestep embedding reflecting the current diffusion timestep k. We apply causal attention masks to the state embeddings and the goal embeddings separately. The predicted noise ϵ_θ(a_{t-h-2:t+n}, s_{t-h-1:t-1}, g_{t-h-1:t-1}, k) is then computed from each corresponding output token of the decoder stack. We then supervise the output to predict the added noise with Eqn. 5 in order to find the optimal parameters θ of the denoising model ϵ_θ.
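To make the training objective concrete, below is a minimal, non-authoritative sketch of one noise-prediction training step as described above (the analogue of the objective in Eqn. 5); the denoiser and scheduler objects, shapes, and names are assumptions for illustration, not the released DiffuseLoco code.

```python
# Sketch of one DDPM training step for the action denoiser (names are illustrative).
import torch
import torch.nn.functional as F

def ddpm_training_step(denoiser, scheduler, io_hist, goal_hist, actions, num_diffusion_steps=10):
    """
    denoiser: model predicting the added noise eps_theta(noisy_actions, io_hist, goal_hist, k)
    actions:  ground-truth future action chunk a_{t:t+n}, shape (B, n, act_dim)
    """
    B = actions.shape[0]
    # Sample a diffusion timestep k and Gaussian noise for each sample in the batch.
    k = torch.randint(0, num_diffusion_steps, (B,), device=actions.device)
    eps = torch.randn_like(actions)
    # Corrupt the clean actions according to the noise scheduler.
    noisy_actions = scheduler.add_noise(actions, eps, k)
    # The transformer decoder predicts the added noise, conditioned on I/O history, goals, and k.
    eps_pred = denoiser(noisy_actions, io_hist, goal_hist, k)
    # Supervise the predicted noise against the true noise (MSE objective).
    return F.mse_loss(eps_pred, eps)
```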
C. Hyperparameters

The hyperparameters are summarized in Table III.

APPENDIX C
MORE ABLATION STUDIES ON DESIGN CHOICES

A. Use of Goal-conditioning

Here, we evaluate the impact of goal-conditioning, hypothesizing that it enhances tracking performance and stability. In the DL w/o Goal baseline, we do not add the goal-conditioning encoder; instead, the goal is concatenated with the robot's I/O. Considering the noisy base velocity estimations on real robots, we utilize simulation environments with extensive dynamics randomization to perform a large number of trials and obtain more systematic results. As shown in Fig. 15, DiffuseLoco achieves a 15.4% reduction in linear velocity tracking error and a 14.5% reduction in angular velocity tracking error compared to the DL w/o Goal baseline. Moreover, over 64 trials with identical commands, DL w/o Goal falls over four times, or 6.25% of all trials, whereas DiffuseLoco experiences no failures. This pattern persists in real-world testing, where DL w/o Goal fails one trial in a 0.7 m/s forward test.

Fig. 15: Comparison of failure rates and tracking errors between DL w/o Goal and DiffuseLoco (ours) in simulation. The left y-axis is the metric for Failed Episodes; the right y-axis indicates the tracking error for linear velocity and angular velocity.

These results underscore the importance of goal-conditioning with distinct attention weights for dynamic system control, revealing that the robot's I/O history and goals, governed by physics and arbitrary objectives respectively, should not be merged into one embedding space.
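To make this ablation concrete, the sketch below contrasts the two conditioning schemes; the module names, layer sizes, and dimensions are illustrative assumptions rather than the actual DiffuseLoco implementation.

```python
# Sketch contrasting goal-conditioning via a separate encoder (ours) with
# concatenating the goal into the I/O stream (DL w/o Goal). Names are illustrative.
import torch
import torch.nn as nn

class SeparateGoalEncoding(nn.Module):
    """Goal and I/O history get distinct embeddings, so cross-attention can weight them separately."""
    def __init__(self, io_dim, goal_dim, embed_dim=256):
        super().__init__()
        self.io_encoder = nn.Sequential(nn.Linear(io_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.goal_encoder = nn.Sequential(nn.Linear(goal_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, io_hist, goal_hist):
        return self.io_encoder(io_hist), self.goal_encoder(goal_hist)   # two token streams

class ConcatGoalEncoding(nn.Module):
    """Baseline: the goal is concatenated with the robot's I/O before a single encoder."""
    def __init__(self, io_dim, goal_dim, embed_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(io_dim + goal_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, io_hist, goal_hist):
        return self.encoder(torch.cat([io_hist, goal_hist], dim=-1))    # one merged token stream
```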
APPENDIX D
DETAILS OF THE OFFLINE LOCOMOTION DATASET

Here, we introduce the details of creating the offline locomotion dataset used in this work. We explain the state, goal, and action spaces, followed by a brief introduction to the source policies and the dynamics randomization used to diversify the data. We collect a total of 4 million state-action-goal pairs in the offline dataset for the quadrupedal robot tasks, and 10 million transitions for the bipedal robot tasks.

A. State Space

The state space is the robot's proprioceptive feedback. In the quadrupedal locomotion control case, this consists of the measured motor positions q_m, measured motor velocities q̇_m, base orientation q_{ψ,θ,ϕ}, and base angular velocities q̇_{ψ,θ,ϕ}. Note that we exclude quantities from the estimation of base velocity (q̇_{x,y}) to prevent additional estimation errors.
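For illustration, assembling this proprioceptive state might look like the following sketch; the 12-joint layout and ordering are assumptions for a generic quadruped, not the exact DiffuseLoco format.

```python
# Sketch of assembling the proprioceptive state for a generic 12-joint quadruped.
# The dimensions and ordering are assumptions for illustration only.
import numpy as np

def build_state(motor_pos, motor_vel, base_rpy, base_ang_vel):
    """
    motor_pos, motor_vel: measured joint positions/velocities, shape (12,)
    base_rpy:             base orientation (roll, pitch, yaw), shape (3,)
    base_ang_vel:         base angular velocities, shape (3,)
    Estimated linear base velocity is deliberately left out to avoid estimator error.
    """
    return np.concatenate([motor_pos, motor_vel, base_rpy, base_ang_vel])  # shape (30,)
```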
B. Goal Space

The goal of the locomotion task is the command given to the policy. For quadrupedal robots, the command includes the desired sagittal velocity q_x^d in the range of 0 m/s to 1 m/s, the desired base height from 0.2 m to 0.6 m, and the desired turning velocity q_ψ^d from -1 rad/s to 1 rad/s.
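For illustration only, sampling a quadruped command uniformly within these documented ranges might look like this (the dictionary keys are placeholders):

```python
# Sketch: sample a quadruped goal command within the documented ranges.
import numpy as np

def sample_quadruped_goal(rng=np.random.default_rng()):
    return {
        "sagittal_velocity": rng.uniform(0.0, 1.0),   # m/s
        "base_height":       rng.uniform(0.2, 0.6),   # m
        "turning_velocity":  rng.uniform(-1.0, 1.0),  # rad/s
    }
```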
C. Action Space

The action space is the robot's joint-level commands. In this work, we use the desired motor position q_m^d as the action. This is then used by joint-level PD controllers to compute motor torques τ at a higher frequency.
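The joint-level PD law is standard; as a sketch (the gain values are placeholders, not the actual deployment gains):

```python
# Sketch of the joint-level PD controller that turns desired motor positions into torques.
# Gains are placeholder values, not the robot's actual gains.

KP, KD = 20.0, 0.5  # example proportional/derivative gains

def pd_torque(q_desired, q_measured, qdot_measured):
    """tau = Kp * (q_d - q_m) - Kd * qdot_m, evaluated at a higher rate than the policy."""
    return KP * (q_desired - q_measured) - KD * qdot_measured
```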
D. Source Policy

We obtain the source policies using three different RL methods. We leverage proximal policy optimization (PPO) [68] to optimize each of the source policies, and we train the policies in simulation (Isaac Gym [66]). We evenly distribute the data generated from each of the source skill-specific policies.
1) Adversarial Motion Prior (AMP): For skills trained with AMP, we provide a reference motion retargeted from motion capture data of a dog [56], and incorporate a GAN-style discriminator to encourage the robot to imitate the reference motion without extended reward engineering. The reward of this method is then formulated as a motion imitation term (provided by the discriminator [58]) and a task term (e.g., tracking error).
2) Central Pattern Generator Guidance (CPG): For skills trained with CPG guidance, we follow the formulation in [69] and provide nominal reference motions, namely strictly periodic motions generated by a Central Pattern Generator with phase signals (a minimal sketch of such phase signals is given after the source policy descriptions below). Specifically, for the hopping skill, the phase selections for all legs are 0, and for the bouncing skill, the phase selections are 0 for the front legs and π for the hind legs. The reward is composed of a task term (e.g., velocity tracking error), a motion tracking term (e.g., reference motion tracking error), and smoothing terms (e.g., action rate).
3) Symmetry Augmented RL: We train the bipedal locomotion skill for quadrupedal robots following a symmetry-augmented RL policy [73] to achieve a symmetric gait pattern that is crucial for sim-to-real transfer. Specifically, the data collection process is augmented by the addition of symmetric states and actions. The reward includes a task term (e.g., velocity tracking error), a gait pattern term (e.g., feet clearance height), and smoothing terms (e.g., action rate).

4) Other Methods: Although the source policies used to collect the dataset in this work are all RL-based, our framework is general and can also include data generated by model-based optimal controllers (such as from [45]) and others. The requirement is to align the state and action spaces among the different source policies, and the control frequency of the policies should be kept the same.
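As referenced in the CPG paragraph above, the following is a minimal sketch of phase-offset reference signals; the sinusoidal form, frequency, and amplitude are assumptions for illustration, not the actual CPG formulation of [69].

```python
# Sketch of CPG-style phase signals with per-leg offsets (FL, FR, RL, RR).
# A sinusoid is assumed purely for illustration; [69] defines the actual CPG.
import numpy as np

def cpg_phase_reference(t, frequency_hz=2.0, phase_offsets=(0.0, 0.0, np.pi, np.pi), amplitude=0.2):
    """Return one periodic reference value per leg at time t.
    phase_offsets=(0, 0, 0, 0)    -> hopping (all legs in phase)
    phase_offsets=(0, 0, pi, pi)  -> bouncing (front legs vs. hind legs out of phase)
    """
    phases = 2.0 * np.pi * frequency_hz * t + np.asarray(phase_offsets)
    return amplitude * np.sin(phases)
```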
E. Dynamics Randomization

In order to diversify the training dataset for DiffuseLoco, we also include the same amount of dynamics randomization [57] during the training of the source policies and during the data generation with these policies. Specifically, in each episode in simulation, the dynamics parameters are randomized. These include the motors' PD gains, the mass of the robot's base (up to the weight of the onboard compute), the ground friction, and random changes in base velocity. The randomization ranges are adapted from each source policy's original method.

APPENDIX E
REAL-TIME INFERENCE ACCELERATION

Although the diffusion model targets real-time use, it cannot meet the real-time targets without further tuning. Compared to previous work that uses a Transformer with 2M parameters for locomotion control [60], our model is roughly 3 times larger (6.8M parameters) and needs to be forwarded 10 times for each inference. Thus, additional effort is needed to accelerate the diffusion model on the robot's edge computing device, such as the setup shown in Fig. 17. In this section, we explore several methods to accelerate the inference process of the diffusion model so that it can run in real time onboard.

A. Acceleration Framework

Our DiffuseLoco policy has 6.8 million parameters, which exceeds the cache capacity of most modern mobile processors. Furthermore, a typical consumer-grade central processing unit (CPU) is not optimized for the operators used in transformer networks; the graphics processing unit (GPU) is better suited to the high-dimensional matrix and vector operations involved. To ensure the portability of the setup, we use an accessible NVIDIA mobile GPU as the deployment platform. For real-time deployment, an acceleration pipeline is built into the DiffuseLoco framework to convert and optimize our model for the target compute platforms. The operators of the model are first extracted with ONNX [75]. Then, TensorRT is used to refine the execution graph and compile the resulting execution pipeline onto the target GPU. Through these domain-specific architecture optimizations, the operations and memory access patterns are tuned to utilize the full capability of the GPU. With this approach, the speed of each denoising iteration is increased by about 7x compared to the native PyTorch implementation, and the maximum inference frequency (with 10 denoising iterations) is increased from 17.0 Hz to 116.5 Hz. To showcase the effect of this acceleration approach, we benchmarked the inference frequency of the policy on the hardware platforms we have access to, as shown in Fig. 16.

Fig. 16: Inference frequency of the policy across hardware platforms (RTX 4090, RTX 4060 M, RTX 2080, GTX 1070, GTX TITAN X, Core i9-9900K, Ryzen 7 1700) and runtimes (PyTorch – FP32, ONNX – FP32, TensorRT – FP32).
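As a rough sketch of the first stage of such a pipeline (ONNX export prior to TensorRT compilation), where the input names, shapes, and file path are illustrative assumptions rather than the actual DiffuseLoco export script:

```python
# Sketch: export a (placeholder) denoiser module to ONNX as the first stage of the
# acceleration pipeline; TensorRT (e.g., its trtexec tool) would then compile the
# exported graph for the target GPU.
import torch

def export_policy_to_onnx(policy, io_hist_shape=(1, 8, 30), goal_shape=(1, 8, 3),
                          noisy_action_shape=(1, 4, 12), path="diffuseloco_policy.onnx"):
    """Trace the denoiser with dummy inputs and save an ONNX graph (shapes are assumptions)."""
    dummy_inputs = (
        torch.randn(noisy_action_shape),      # noisy action chunk
        torch.randn(io_hist_shape),           # I/O history
        torch.randn(goal_shape),              # goal history
        torch.zeros(1, dtype=torch.long),     # diffusion timestep k
    )
    torch.onnx.export(policy, dummy_inputs, path, opset_version=17,
                      input_names=["noisy_actions", "io_history", "goals", "timestep"],
                      output_names=["predicted_noise"])
```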