DiffuseLoco: Real-Time Legged Locomotion Control with Diffusion from Offline Datasets
arXiv:2404.19264v1 [cs.RO] 30 Apr 2024
Fig. 1. Snapshots of (top) a quadrupedal robot trotting with our diffusion-based real-time locomotion control policy via onboard computing; (middle) our multi-skill offline-learned policy performing a sequence of challenging skills, including hopping, bipedal locomotion, and pacing, with smooth skill transitioning; (bottom) a bipedal robot controlled by our multi-skill policy transitioning from walking to running stably. We present DiffuseLoco, a scalable framework that leverages diffusion models to learn legged locomotion control exclusively from offline datasets. DiffuseLoco learns a state-of-the-art policy that is able to perform a diverse set of agile locomotion skills with a single policy, exhibits robustness in the real world, and is versatile to various sources of offline data. We encourage viewers to watch the supplementary videos of these runs.
Abstract—This work introduces DiffuseLoco, a framework for training multi-skill diffusion-based policies for dynamic legged locomotion from offline datasets, enabling real-time control of diverse skills on robots in the real world. Offline learning at scale has led to breakthroughs in computer vision, natural language processing, and robotic manipulation domains. However, scaling up learning for legged robot locomotion, especially with multiple skills in a single policy, presents significant challenges for prior online reinforcement learning methods. To address this challenge, we propose a novel, scalable framework that leverages diffusion models to directly learn from offline multimodal datasets with a diverse set of locomotion skills. With design choices tailored for real-time control in dynamical systems, including receding horizon control and delayed inputs, DiffuseLoco is capable of reproducing multimodality in performing various locomotion skills, transferring zero-shot to real quadrupedal robots, and being deployed on edge computing devices. Furthermore, DiffuseLoco demonstrates free transitions between skills and robustness against environmental variations. Through extensive benchmarking in real-world experiments, DiffuseLoco exhibits better stability and velocity tracking performance compared to prior reinforcement learning and non-diffusion-based behavior cloning baselines. The design choices are validated via comprehensive ablation studies. This work opens new possibilities for scaling up learning-based legged locomotion controllers through the scaling of large, expressive models and diverse offline datasets.

I. INTRODUCTION

Learning from large-scale offline datasets has led to breakthroughs in a large variety of domains, such as computer vision and natural language processing, where scaling up both the size of models and datasets leads to improved performance and generalization [32, 74]. This has led to the development of powerful generative models, like diffusion models, which are able to model complex multi-modal data distributions [64, 67] and generate high-quality images and videos.

In robotics, learning from diverse offline datasets has also been shown to be an effective and scalable approach for developing more versatile policies for domains such as robotic manipulation [3, 4] and autonomous driving [8, 13, 54].

However, these domains typically involve agents that have
low-dimensional action spaces (e.g., end-effector trajectories), with low re-planning frequency, on inherently stable systems (e.g., robot arms or cars). For dynamical systems featuring higher degrees of freedom and more complex dynamics, such as legged robots, data-driven approaches have largely focused on online reinforcement learning (RL) techniques [27, 43, 56]. Unlike offline learning, online RL is difficult to scale to both large models and datasets due to the requirement of online rollouts. Most prior works have focused on dynamic motions with specialized single-skill models, and scaling towards a single model that can reproduce a diverse set of challenging locomotion skills remains an open problem.

To this end, we present a novel approach that emphasizes learning agile legged locomotion skills at scale by solving the two aforementioned challenges: offline learning from various data sources and the ability to learn a set of diverse skills. We propose DiffuseLoco, a framework that leverages expressive diffusion models to effectively learn the multi-modalities that exist in a diverse offline dataset without manual skill labeling. Once trained, our controllers can execute robust locomotion skills on real-world legged robots for real-time control.

The primary contributions of this work include:
1) A state-of-the-art multi-skill controller, leveraging expressive diffusion models, that learns agile bipedal walking and various quadrupedal locomotion skills within a single policy and can be deployed zero-shot on real-world quadrupedal robots.
2) A novel framework that directly learns from a diverse offline dataset for real-time control of legged robots, showing the benefits and potential of offline learning at scale for locomotion skills in a practical real-world setting.
3) Extensive real-world validation showing higher stability and lower velocity tracking errors compared to baselines, while demonstrating multi-modal behaviors with skill transitioning and robustness on terrains with varying ground friction.

This work opens up the possibility of leveraging large-scale learning to create diverse and agile multi-skill controllers for legged locomotion from offline datasets. For the first time, we show that it is feasible to zero-shot transfer such a diverse locomotion policy learned from a static dataset to real-world applications. This approach offers a scalable and versatile framework for learning-based control, allowing for continuous expansion of the dataset and integration of diverse skills from various data sources. Codebase and checkpoints will be open-sourced upon the acceptance of this work.

II. RELATED WORK

Our proposed framework leverages diffusion models for multi-skill legged locomotion control. In this section, we review the most closely related works on learning-based legged locomotion algorithms and applications of diffusion models in robotics.

A. Multi-skill Reinforcement Learning in Locomotion

Recent advances in model-free RL have demonstrated promising results in developing agile locomotion skills for legged robots in the real world [6, 14, 40, 47, 56, 66]. Among them, prior works have demonstrated impressive performance on highly agile skills such as jumping, running, and sharp turning on bipedal robots [44, 91], and walking on two feet with quadrupedal robots [20, 42, 71], which requires a high degree of agility and robustness with a floating-base robot. However, the majority of these agile locomotion skills are trained with single-skill RL and remain unscalable to large-scale learning of multiple locomotion skills.

A simple and natural idea for learning multi-skill locomotion is to train separate policies for each skill and then coordinate them through extra high-level planning [23, 26, 33, 88] or distill them into a small-scale model via imitation learning such as DAgger [93]. Due to the inherent coordination difficulty and the requirement of online distillation, these methods remain unscalable to an increasing number of skills.

In comparison, learning multi-skill policies directly from scratch typically involves parameterized motions [62, 70] with limited sets of applicable motions, and, more popularly, motion imitation methods through either reward shaping [11, 34, 92] or adversarial imitation learning [16, 39, 83, 89]. However, this approach still faces challenges such as needing extra model-based trajectory optimization or well-trained expert policies to acquire references for agile motions, and the limited expressiveness of the simple models used in online RL frameworks when learning diverse skills.

In general, learning a diverse, agile multi-skill policy with online RL remains challenging. For example, while existing RL methods have successfully combined skills like jumping and trotting [89, 92], or quadrupedal walking and standing on hind legs with wheels (without bipedal walking) [78], a combination of more diverse and agile skills, such as stable bipedal walking and quadrupedal hopping in a single policy, has not yet been demonstrated.

B. Offline Learning in Locomotion Control

Compared to online learning, offline learning offers better scalability, a simpler training scheme, and an effective way to re-use data, yet prior works on learning low-level locomotion control from offline datasets remain limited. Most prior works focus on simple simulated tasks, such as Gym locomotion tasks, with behavior cloning (BC) [76] and offline RL [38]. Among them, some leverage Q-learning on offline datasets [35, 36, 51] or supervised learning techniques [9, 28, 81, 86]. However, these tasks are oversimplified and do not adequately consider the complexities of real-world scenarios.

An alternative is the use of offline data as a foundation for online learning [50, 77]. Among them, Smith et al. [71] develop baseline policies from offline datasets to bootstrap online learning on real robots. Yet, this approach still requires online learning.

Another recent work develops offline learning for humanoid locomotion on real robots [61], with the scope limited
to only one walking skill. In comparison, the efficacy of learning completely from offline datasets, especially at a larger scale than a few simple skills, remains unproven in legged locomotion control.

C. Diffusion Models in Robotics

Recent advances have seen increasing applications of diffusion models in control and planning systems. Some prior works integrate them into the learning pipeline, including as discriminators in adversarial IL for legged control [79] and as reward models for RL [52, 80]. However, the use of small multi-layer perceptron (MLP) networks as policies presents similar challenges in exploring diverse, multi-skill learning [81]. Diffusion models are also employed in high-level trajectory planning [24, 29], safe planning [85], and goal generation for low-level controllers [1, 31], as well as in enhancing visuomotor planning for manipulation tasks [48, 55, 63]. However, most of these prior works are limited to simulation environments only.

Emerging efforts to apply diffusion models to real-world robot manipulation include using diffusion to manage a variety of manipulation tasks with visual inputs and incorporating self-supervised learning and language conditioning [10, 12, 21, 41]. Yoneda et al. [90] leverage the reverse diffusion process for shared autonomy with a human user in end-effector planning. Additionally, hierarchical frameworks are being developed to handle tasks requiring multiple skills, pushing towards generalist policies [2, 53, 84]. However, these prior works primarily focus on high-level planning for manipulation systems, featuring a low-dimensional action space (e.g., end-effector position), low re-planning frequency (e.g., around 10 Hz), and inherently more stable dynamics.

In contrast, the use of diffusion models for high-frequency, low-level control remains limited [7]. The most relevant work uses online RL to train diffusion-based actor policies in simpler simulation settings [87], but transitioning to real-world applications with high-frequency feedback control presents significant challenges due to the instability and rapid dynamics of legged robots [82]. This work aims to leverage diffusion models in low-level control for legged locomotion and demonstrate the advantages of multimodality and scalability in the real world.

Fig. 2: Overview of the three stages of DiffuseLoco. First, we generate or utilize an offline dataset with demonstrations of a set of skills gathered with different methods (left). Then, we train the DiffuseLoco policy with the DDPM loss on trajectories within the dataset (middle). Finally, the DiffuseLoco policy is deployed on robots in the real world and executes a diverse set of agile skills (right).

Algorithm 1 DiffuseLoco Algorithm
1: Initialize: N source policies π^1_src, ..., π^N_src; empty or existing offline dataset D_src; diffusion model π_θ
2: // OBTAIN DATA FROM π_src
3: repeat
4:   Sample n uniformly from {1, ..., N}
5:   Sample environment dynamics p(s′ | s, a)
6:   Reset source-specific state s^src_t
7:   for t = 1 to T do
8:     if t mod random_goal_step = 0 then
9:       Sample goal g
10:    end if
11:    Act source policy a^src_t = π^n_src(s^src_t, g_t)
12:    Record source-agnostic state s_t
13:    Step environment s′^src_t = environment.step(a^src_t)
14:    Record source-agnostic action a_t
15:    Add data to offline dataset D_src ← (s_t, a_t, g_t)
16:  end for
17: until desired
18: // TRAIN ON OFFLINE DATASET
19: for each epoch do
20:   for each (s_traj, a_traj, g_traj) in D_src do
21:     Compute the loss L(θ) with Eqn. 5
22:     Update model parameters θ to minimize the loss L(θ)
23:   end for
24: end for
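For concreteness, the sketch below mirrors Algorithm 1 in Python. The environment and source-policy interfaces (reset, sample_goal, act, step, proprioception, last_pd_targets) and the noise scheduler are placeholder assumptions, not the authors' implementation; since Eqn. 5 is not reproduced in this excerpt, the training step uses the standard DDPM noise-prediction MSE described in Sec. III.

```python
import random
import torch
import torch.nn.functional as F

def collect_offline_dataset(source_policies, make_env, num_episodes, T, resample_every):
    """Stage 1 of Algorithm 1: roll out source policies and record
    source-agnostic (state, action, goal) tuples in D_src."""
    dataset = []
    for _ in range(num_episodes):
        policy = random.choice(source_policies)       # sample n ~ U{1, ..., N}
        env = make_env()                              # samples dynamics p(s'|s, a)
        src_state = env.reset()                       # source-specific state s^src_t
        goal = env.sample_goal()
        for t in range(T):
            if t % resample_every == 0:
                goal = env.sample_goal()              # re-sample command within range
            src_action = policy.act(src_state, goal)  # a^src_t = pi^n_src(s^src_t, g_t)
            s_t = env.proprioception()                # source-agnostic state s_t
            src_state = env.step(src_action)
            a_t = env.last_pd_targets()               # source-agnostic action a_t
            dataset.append((s_t, a_t, goal))          # D_src <- (s_t, a_t, g_t)
    return dataset

def train_epoch(model, scheduler, loader, optimizer, K):
    """Stage 2 of Algorithm 1: DDPM noise-prediction training on (s, a, g) segments."""
    for s_traj, a_traj, g_traj in loader:
        k = torch.randint(0, K, (a_traj.shape[0],))       # diffusion step k
        eps = torch.randn_like(a_traj)                    # Gaussian noise eps^k
        noisy_a = scheduler.add_noise(a_traj, eps, k)     # corrupt the action sequence
        eps_pred = model(noisy_a, s_traj, g_traj, k)      # predicted noise eps_theta
        loss = F.mse_loss(eps_pred, eps)                  # regress to the true noise
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```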
III. VERSATILE FRAMEWORK FOR DIFFUSION LOCOMOTION CONTROL FROM OFFLINE DATASET

In this section, we provide an overview of DiffuseLoco, a framework designed to generate and utilize offline datasets for training multi-skill locomotion policies scalably. DiffuseLoco is based on a diffusion model and is designed to train a low-level multi-skill locomotion policy from offline datasets containing multiple agile locomotion skills with diverse behaviors. DiffuseLoco represents the state of the art in multi-skill locomotion, as it is able to learn both bipedal and quadrupedal locomotion skills in a unified policy and can be deployed robustly zero-shot on hardware. DiffuseLoco addresses several challenges inherent to learning from offline data, including generating diverse skills for the same goals, handling data variability, and ensuring real-time control in real-world environments.

A schematic illustration of the DiffuseLoco framework is shown in Figure 2. Our framework consists of three stages:

a) Data Source: We start by collecting a new, or utilizing an existing, offline dataset consisting of multiple skills. To generate or expand a dataset, we first obtain single-skill control policies as the source policies. These policies are conditioned on given goals g (commands), such as velocity commands and base heights, that are skill-agnostic. Note that the source policies can be obtained with different methods. Under the assumption that the frequency of low-level control is the same, the observation and action spaces of the source policies can be vastly different. Thus, we need to collect a set of source-agnostic state and action pairs across all source policies. For legged robots, the widely used source-agnostic states and actions are the proprioceptive feedback s_t directly from the robot and the joint-level PD targets a_t sent directly to the robot, respectively. As illustrated in Algo. 1, we start an episode where the robot is controlled by the i-th source policy π^i. The policy acts on a source-specific state s^src_t, and only the source-agnostic state-action-goal pairs (s_t, a_t, g_t) are collected during the rollout of the robot's closed-loop dynamics until it reaches the maximum episode length T. In addition, the goal g_t is re-sampled within the command range after a time interval within the episode. In this work, we leverage cheap simulation data as an example, but since DiffuseLoco effectively re-uses data, we can scalably extend to more expensive data collection processes, potentially from real-world hardware. The details of dataset generation in our experiments can be found in Appendix D.

b) Training: In the second stage, we train our DiffuseLoco policy from the offline dataset in an end-to-end manner. Let the input state and goal history length be h and the output action prediction length be n. During training, we sample a segment of state trajectory s_traj and the corresponding action and goal sequences, a_traj and g_traj. We sample a diffusion step k randomly from {1, ..., K} and sample a Gaussian noise ε^k to add to the action sequence. Then, a transformer-based denoising model takes the noisy action sequence along with the state trajectory s_traj, goal trajectory g_traj, and diffusion step k as input, and predicts the added noise as ε_θ. The predicted noise ε_θ is then regressed to match the true noise ε^k with a mean square error loss. In this way, the denoising model learns to generate sequences of low-level actions conditioned on robot states and goals from the dataset. Details of the model architecture and training objective are introduced in Sec. IV.

c) Deployment: In the last stage, we zero-shot transfer the trained DiffuseLoco policy to the robot hardware. During deployment, the DiffuseLoco policy takes a sequence of pure noise sampled from a Gaussian distribution and denoises it conditioned on the state trajectory s_traj from the robot hardware and the given goal g_traj. The denoising process is repeated for K iterations to generate a sequence of actions, but only the immediate action a_t is used as the robot's joint-level PD targets. After executing this action, the DiffuseLoco policy takes a new sequence of states from the robot and updates the immediate action from the newly generated action sequence. This is designed to align with the Receding Horizon Control (RHC) framework, instead of interpolating the action sequence at high frequency as previously used by other diffusion-based work [31, 41]. RHC enables DiffuseLoco to replan rapidly with the fast-changing states of the robot to ensure up-to-date actions while keeping future steps in account. However, since the diffusion model has a large number of parameters and the diffusion process involves multiple denoising steps, we must accelerate the inference to be faster than the control frequency to achieve this RHC manner. The acceleration techniques are detailed in Appendix E; they ultimately enable running the DiffuseLoco policy on an edge-compute device that can be mounted on the robot.

IV. DIFFUSION MODEL FOR REAL-TIME CONTROL

Having introduced the framework of DiffuseLoco, we now develop its backbone: a diffusion model for locomotion control, shown in Fig. 3, with a special focus on design choices for real-time control and inference acceleration.

A. DDPM for Control

To model multi-modal behaviors from diverse datasets, we leverage Denoising Diffusion Probabilistic Models (DDPM) [22] with a transformer backbone to model different skills that can be applied to achieve a common goal. DDPM is a class of generative models in which the generative process is modeled as a denoising procedure, often referred to as Stochastic Langevin Dynamics, expressed in the following equation:

x^{k−1} = α ( x^k − γ ε_θ(x^k, k) ) + N(0, σ² I)    (1)

where N(0, σ² I) denotes the sampled noise from a DDPM scheduler, and α, γ, and σ are its hyperparameters: α regulates the rate at which noise is added at each step, γ represents the denoising strength, and σ defines the noise level. For clarity, we now use subscripts t−a:t−b to indicate trajectories from timestep t−a to t−b, replacing the subscript traj. To generate the action trajectory for control, an initial noisy action sequence a^K_{t:t+n} is sampled from Gaussian noise, and the DDPM, conditioned on states s_{t−h:t}, goals g_{t−h:t}, and previous actions a_{t−h−1:t−1}, undergoes K iterations of denoising steps.

Unlike previous works applying DDPM in manipulation [12], the inclusion of previous actions, i.e., I/O history, helps the policy better perform system identification and state estimation for legged locomotion control, as evaluated in [44]. Furthermore, instead of concatenating state and goal into a single embedding [12, 41], we leverage the transformer's attention mechanism to assign different attention weights to the separately embedded robot I/O and static goals, enabling the policy to adjust its focus between adapting to dynamic environments and achieving goals. We find that these modifications result in better command tracking performance and robustness, as shown in Sec. VII.

The denoising process yields a sequence of intermediate actions characterized by progressively decreasing noise levels, a^K, a^{K−1}, ..., a^0, until the desired noise-free output a^0 is attained. This process can be expressed as the following equation:

a^{k−1}_{t:t+n} = α ( a^k_{t:t+n} − γ ε_θ(a^k_{t−h−1:t+n}, s_{t−h:t}, g_{t−h:t}, k) ) + N(0, σ² I)    (2)

where a^k_{t:t+n} represents the output at the k-th iteration, and ε_θ(a^k_{t−h−1:t+n}, s_{t−h:t}, g_{t−h:t}, k) represents the predicted noise from the denoising model.
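A sketch of the deployment loop implied by Eq. (2) and the RHC scheme: each control step starts from fresh Gaussian noise, runs K conditioned denoising iterations, and executes only the first action of the predicted chunk as joint-level PD targets before replanning. The `model`, `robot`, and scheduler objects are placeholders (the scheduler follows the common step(noise_pred, k, sample) convention), not the authors' code.

```python
import torch

@torch.no_grad()
def denoise_action_chunk(model, scheduler, s_hist, g_hist, a_hist, n, action_dim, K):
    """One inference pass: K denoising iterations -> action chunk a^0_{t:t+n} (Eq. 2)."""
    a_k = torch.randn(1, n, action_dim)                    # a^K ~ N(0, I)
    for k in reversed(range(K)):
        eps = model(a_k, s_hist, g_hist, a_hist, k)        # eps_theta(a^k, s, g, k)
        a_k = scheduler.step(eps, k, a_k).prev_sample      # a^{k-1}
    return a_k                                             # noise-free chunk a^0

def control_loop(model, scheduler, robot, n, action_dim, K, h):
    """Receding horizon control at the robot's control frequency (30 Hz here)."""
    s_hist, g_hist, a_hist = robot.init_histories(h)       # length-h I/O and goal history
    while True:
        chunk = denoise_action_chunk(model, scheduler, s_hist, g_hist, a_hist,
                                     n, action_dim, K)
        a_t = chunk[:, 0]                                  # execute only the first action
        robot.apply_pd_targets(a_t)                        # joint-level PD targets
        s_hist, g_hist, a_hist = robot.update_histories(a_t)  # replan from the new state
```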
[Fig. 3 (architecture diagram; caption not recovered in this extraction): the robot I/O sequence at time t — delayed states s_{t−3:t} and actions a_{t−4:t−1} — and the goal/command sequence g_{t−3:t} (e.g., commanded velocity in m/s) are embedded by separate MLPs and, together with the diffusion timestep, condition the diffusion model.]
Fig. 5: Foot contact map indicating stable walking and skill switching with DiffuseLoco policy and velocity commands. The
red circle denotes the legs that are in contact with the ground. The robot initially walks using trotting skill, indicated by a
purple background, then switches to pacing, shown in green, following a command change that involves a sudden stop and
resume. We emphasize DiffuseLoco’s ability to maintain different modalities for stable walking under the same command,
switching modalities only when necessary.
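As a toy illustration of what the contact map in Fig. 5 encodes (our own example, not the paper's analysis code): in a trot the diagonal leg pairs are in stance together, while in a pace the same-side pairs are, so the active gait can be read directly from which feet touch the ground simultaneously.

```python
def classify_gait(contacts):
    """contacts: boolean stance flags for one timestep, keyed FL, FR, RL, RR."""
    diagonal = (contacts["FL"] == contacts["RR"]) and (contacts["FR"] == contacts["RL"])
    lateral  = (contacts["FL"] == contacts["RL"]) and (contacts["FR"] == contacts["RR"])
    if diagonal and not lateral:
        return "trot"   # diagonal pairs (FL+RR, FR+RL) move together
    if lateral and not diagonal:
        return "pace"   # same-side pairs (FL+RL, FR+RR) move together
    return "other"      # standing, transition, or asymmetric contact

print(classify_gait({"FL": True, "FR": False, "RL": False, "RR": True}))  # trot
print(classify_gait({"FL": True, "FR": False, "RL": True, "RR": False}))  # pace
```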
In conclusion, in addressing question a), we have demonstrated that DiffuseLoco is invariant to the source of the offline demonstrations and can be trained with data from multiple specialized RL frameworks. More importantly, DiffuseLoco shows better scalability in learning diverse skills than existing RL frameworks have yet illustrated, affirming its state-of-the-art capability as an answer to question b). We believe that the potential of DiffuseLoco extends significantly beyond the five skills showcased here, suggesting promising avenues for future work.

VI. QUANTITATIVE ANALYSIS

In this section, we seek to quantitatively compare DiffuseLoco against various existing multi-skill RL and non-diffusion behavior cloning (BC) baselines. Since there is no existing RL baseline that learns the five skills in the previous section, we refer only to quadrupedal walking skills, including pacing and trotting, as a case study for the performance of DiffuseLoco.

A. Task and Baselines

This analysis includes walking for four meters under five goals (commands) with different velocities. The goals are the following: move forward at three different speeds, 0.3 m/s, 0.5 m/s, and 0.7 m/s, and make a left turn and a right turn at 0.3 rad/s. We record the actual linear velocities via a Kalman-filter state estimator [18], and the number of trials in which the robot does not fall over throughout the trial as the stability metric. We repeat each experiment five times and report the mean and standard deviation across the five runs.

Skill information is often unscalable or unavailable during training and deployment. For a broader range of applicability, we limit the scope of comparisons to non-skill-conditioned multi-skill RL and non-diffusion BC baselines. Specifically, the RL baselines include:
• Adversarial Motion Priors (AMP) [16]: An MLP policy trained using AMP with RL (PPO) and a style reward from both pacing and trotting reference motions. We directly use the open-sourced checkpoint from [16]. We note that although several skill-conditioned RL policies [39, 83, 89] have been introduced since [16], yielding better sim-to-real results, progress in unconditioned multi-skill policies has been limited.
• AMP with history steps (AMP w/ H): To align with DiffuseLoco, we train an AMP policy with 8 steps of state and action history, with the same setup as [16] and a similar evaluation return in simulation.

Furthermore, we compare DiffuseLoco with non-diffusion BC policies, which can be categorized into autoregressive token prediction [9, 25] and action sequence prediction as used in [19]. We adopt baselines for each category:
• Transformer with Autoregressive Token Prediction (TF): A Generative Pretrained Transformer (GPT) [5] policy similar to a decision transformer [9] without reward conditioning. It generates only one timestep of action.
• Transformer with Receding Horizon Control (TF w/ RHC): A transformer policy with the same future-step action predictions. The model's architecture is identical to DiffuseLoco, and it is trained for the same number of epochs as DiffuseLoco.
Goal (Task)       Metric          AMP            AMP w/ H       TF             TF w/ RHC      DiffuseLoco (Ours)
0.3 m/s Forward   Stability (%)   100            100            80             100            100
                  Ev (%)          90.44 ± 1.87   90.63 ± 4.79   75.75 ± 6.07   39.28 ± 2.34   33.22 ± 12.48

TABLE I: Performance benchmark across different baselines and our DiffuseLoco policy in the real world. Stability (higher is better) measures the number of trials in which the robot stays stable and does not fall over. Ev (lower is better) measures the deviation from the desired velocity in percentage. The experiments are conducted with different command settings (left). Each command is repeated non-stop for five trials, and we report the average and standard deviation of the metrics across the five trials.
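The caption defines Ev only verbally. One plausible formalization, written here as an assumption rather than the paper's exact definition, averages the relative error between the Kalman-filter velocity estimate v_t and the commanded velocity v_cmd over a trial of length T:

```latex
E_v \;=\; \frac{100\%}{T}\sum_{t=1}^{T} \frac{\lvert v_t - v_{\mathrm{cmd}} \rvert}{\lvert v_{\mathrm{cmd}} \rvert}
```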
Remark 2: Typically, previous work uses DAgger-style algorithms [65] to better cope with distribution shift, but these methods require access to the expert policy during training and an online learning environment. As a more scalable and versatile framework, we limit our focus to learning entirely from offline datasets.

... environment, both AMP and AMP w/ H are able to control the robot to track different velocities without falling over. Besides better sim-to-real transfer, DiffuseLoco with a diffusion model is able to efficiently learn the multi-modality presented by the different skills for the same locomotion task, and is thus able to perform valid and coordinated locomotion skills without mode collapse, as in the example given in Fig. 5. This helps DiffuseLoco achieve both better stability and task-completion (velocity tracking) performance compared with the AMP-based policy baselines.

C. DiffuseLoco versus Non-Diffusion Behavior Cloning

We further compare DiffuseLoco with non-diffusion BC methods. For locomotion tasks, smooth and temporally consistent actions are a necessity for stability and robustness. Looking at Table I, we find that DiffuseLoco outperforms TF and TF w/ RHC in both stability and robustness of the locomotion policy. Using one-step action output, we find that TF lacks robustness and fails the 0.7 m/s forward and left-turn tasks completely. This is likely because single-step action prediction makes the policy less aware of future actions, leading to more jittering behavior.

With receding horizon control, TF w/ RHC overcomes most of the jittering problem and can complete most of the tasks. However, we note that for more agile motion such as the 0.7 m/s forward task, the stability metric drops drastically to merely 40%. This is likely because the reconstruction loss used in TF w/ RHC training tends to overfit the action trajectories in the dataset, resulting in a less robust policy in out-of-distribution scenarios (such as in the real world).

This is especially evident when looking at the training curves for TF w/ RHC versus DiffuseLoco shown in Figure 7, where TF w/ RHC overfits significantly to the training dataset and the evaluation loss stays high. Note that the model architecture is kept identical across TF w/ RHC and DiffuseLoco, so only the loss functions are different. In addition, the evaluation loss here is calculated with samples from the same distribution

In comparison, our DiffuseLoco shows more stable and smooth motions, measured by both the stability metric and the magnitude of the body's angular velocity. On average, DiffuseLoco achieves 10.40% lower magnitude of body oscillation over all trials. As a result, the smoother locomotion skill helps DiffuseLoco achieve, on average, 38.97% lower tracking error compared to TF w/ RHC. Based on this observation, we suggest that DDPM-style training is more suitable for imitating locomotion tasks than prior behavior cloning methods.

Fig. 7: Loss curves for training DiffuseLoco and the TF w/ RHC baseline. (a): Training loss. (b): Evaluation loss. We report the mean and standard deviation across three seeds. We find that even though TF w/ RHC achieves a low reconstruction loss in training, its evaluation loss stays higher than DiffuseLoco's and increases at the end. This indicates that TF w/ RHC tends to overfit the offline dataset while DiffuseLoco does not, with the same number of parameters.

VII. ABLATION STUDY ON DESIGN CHOICES

In this section, we further evaluate the design choices used to build the DiffuseLoco policy in simulation and the real world through extensive ablation studies. We use the same experimental setup as the previous section. For brevity of the main content, further ablation studies can be found in Appendix C.

A. Ablation Components

To validate our design choices, we ablate DiffuseLoco with the following critical components and compare them to our real-world benchmark.
• Without Receding Horizon Control (DL w/o RHC): Replace RHC with one-step prediction in an autoregressive manner and keep the diffusion model.
• Without Domain Randomization (DL w/o Rand): Trained on a dataset generated without domain randomization, except for the ground friction coefficient.
• DDIM Inference: We develop two DDIM baselines to investigate how training and inference steps affect performance in locomotion control:
  – 100 training + 10 inference steps (DDIM-100/10)
  – 10 training + 5 inference steps (DDIM-10/5)
  Compared with our DiffuseLoco, DDIM-100/10 has the same number of inference steps, and DDIM-10/5 has the same number of training steps.
• U-Net as the backbone (U-Net): Replace the Transformer with a U-Net as the backbone, adjusted to the same parameter count.

B. Single-step output versus RHC

To isolate the effects of RHC, we test a variant of DiffuseLoco without RHC (DL w/o RHC), finding that it struggles with faster speed goals and exhibits significant jittering behavior, as detailed in Table II. This suggests that single-step token-prediction models like GPT are less suitable for legged locomotion control than diffusion models, which predict sequences of future actions.

C. Sampling Techniques

As discussed earlier, popular diffusion-based frameworks like DDIM often reduce sampling iterations for inference acceleration, trading off output quality for speed, often with
Goal (Task)       Metric          DL w/o RHC      DL w/o Rand    DDIM-100/10    DDIM-10/5      U-Net          DiffuseLoco (Ours)
0.3 m/s Forward   Stability (%)   100             100            100            100            100            100
                  Ev (%)          75.09 ± 18.98   50.45 ± 2.70   56.89 ± 2.43   47.09 ± 2.40   81.31 ± 1.90   33.22 ± 12.48
Turn Right        Stability (%)   100             100            100            100            100            100
                  Ev (%)          18.61 ± 2.40    8.18 ± 3.94    6.47 ± 2.49    7.42 ± 2.90    89.63 ± 3.36   2.22 ± 1.03

TABLE II: Performance ablation study across different ablations and the DiffuseLoco policy in real-world experiments. Stability (higher is better) measures the number of trials in which the robot stays stable and does not fall over. Ev (lower is better) measures the deviation from the desired velocity in percentage. The experiments are conducted with different command settings (left). Each command is repeated non-stop for five trials, and we report the average and standard deviation of the metrics across the five trials.
ten times fewer iterations [72]. While this approach suits tasks like image generation, which tolerate some variance, it underperforms in quadrupedal locomotion control. As shown in Table II, both DDIM-100/10 and DDIM-10/5 exhibit worse stability and higher velocity tracking errors. Noticeably, the 100-training-step, 10-inference-step variant demonstrates limping behavior and fails two trials. Tracking errors for the two variants increase by 50.69% and 42.04%, respectively, compared to DiffuseLoco.

Thus, we believe that the noisier control signals from the DDIM pipeline likely disrupt the control of inherently unstable floating-base dynamic systems like legged robots. An interesting future work direction could be control-specific sampling techniques that accelerate diffusion models without compromising stability and performance.

D. Model Architecture Effects

In addition, we compare against another commonly used architecture in diffusion models, a CNN-based U-Net, as the backbone of DiffuseLoco. Qualitatively, the U-Net policy is shaky and inconsistent, and quantitatively, it has one of the highest velocity tracking errors, with worsened stability due to its shaky actions. We reckon that this is because CNNs are not the best fit for temporal data and also lack separate attention weights for goal conditioning. This is consistent with prior work [12] finding that U-Net underperforms the Transformer, especially in dynamical systems with high action rates.

E. Dataset Effects

Lastly, we explore how dataset characteristics influence the robustness and performance of DiffuseLoco in real-world scenarios. Consistent with previous findings that diversity in training data, such as noise insertion, mitigates compounding error [37], we demonstrate that increasing the variety of dynamics parameters in the simulation environments where we collect data also enhances robustness. As shown in Table II, training DiffuseLoco on a dataset with dynamics randomization leads to a 44.26% increase in both robustness and stability compared to the DL w/o Rand baseline. Specifically, in the challenging 0.7 m/s forward task, DL w/o Rand falls in 3 out of 5 trials. This ablation study points to the potential of altering the dataset, by adding either more diversity (potentially including real-world data) or more fault-recovery behaviors, to further enhance the robustness of DiffuseLoco.

VIII. DISCUSSION AND FUTURE WORK

We have presented DiffuseLoco, a scalable learning framework that learns diverse, agile legged locomotion skills from multi-modal offline datasets and robustly transfers to real-world robots in real time. Leveraging diffusion models to capture the multi-modality in the offline dataset, DiffuseLoco learns a state-of-the-art controller that combines bipedal walking and quadrupedal locomotion skills within one policy and transitions freely among the skills.

A. A Scalable Approach for Locomotion Skills

In this work, we focus on the scalability of learning legged locomotion control. Inspired by the successes of large-scale learning in other robotics tasks, we leverage the most scalable approach, learning from offline datasets, with special attention to the versatility of the data sources and the multi-modality of different skills in a large-scale dataset. With an expressive diffusion model as the backbone, we are able to absorb demonstrations learned with various existing RL algorithms with potentially different observation and action spaces, and effectively execute skills, such as trotting and pacing, that represent different modalities given identical commands. With the five diverse skills presented in this work as a testimony, we show the scalability of DiffuseLoco towards a generalist policy for locomotion control tasks.
B. Benefits over Other Multi-skill Policies

As we show in our thorough real-world benchmark (Sec. VI), DiffuseLoco demonstrates smoother actions and improved stability and velocity tracking in real-world conditions compared to the non-diffusion behavior cloning baselines commonly used in prior works. When not conditioned on explicit skill labels, DiffuseLoco shows better sim-to-real transfer performance for multi-skill locomotion compared to AMP policies, which often face a significant sim-to-real gap due to mode collapse in generator-discriminator methods. DiffuseLoco avoids these issues, providing stable, coherent, and effective control with smooth skill switching and stable execution under consistent commands.

C. Large-scale Offline Dataset for Locomotion

In the experiments, DiffuseLoco demonstrates scalability and robustness by utilizing diverse offline datasets, as discussed in Sec. VII-E. Unlike online learning, which mostly depends on simulation, it offers a practical path toward scalable real-world data collection and learning. Similar to [17], we also hypothesize that DiffuseLoco could adapt to datasets containing different robot morphologies, thus allowing for broader deployment and better generalization. Furthermore, as a popular direction in the manipulation field, integrating vision and language instructions into the goal-conditioning dataset could further enhance DiffuseLoco's versatility and applicability in future works.

IX. ACKNOWLEDGEMENT

This work was supported in part by NSF 2303735 for POSE, in part by NSF 2238346 for CAREER, in part by The AI Institute, and in part by InnoHK of the Government of the Hong Kong Special Administrative Region via the Hong Kong Centre for Logistics Robotics.

REFERENCES

[1] Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.
[2] Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pre-trained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023.
[3] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, 2023.
[4] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deeksha Manjunath, Igor Mordatch, Ofir Nachum, Carolina Parada, Jodilyn Peralta, Emily Perez, Karl Pertsch, Jornell Quiambao, Kanishka Rao, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Kevin Sayed, Jaspiar Singh, Sumedh Sontakke, Austin Stone, Clayton Tan, Huong Tran, Vincent Vanhoucke, Steve Vega, Quan Vuong, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. RT-1: Robotics Transformer for Real-World Control at Scale, 2023.
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[6] Guillermo A Castillo, Bowen Weng, Wei Zhang, and Ayonga Hereid. Robust feedback motion policy design using reinforcement learning on a 3D Digit bipedal robot. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5136–5143. IEEE, 2021.
[7] Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, and Jun Zhu. Score Regularized Policy Optimization through Diffusion Behavior, 2023.
[8] Jianyu Chen, Bodi Yuan, and Masayoshi Tomizuka. Deep imitation learning for autonomous driving in generic urban scenarios with enhanced safety. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2884–2890. IEEE, 2019.
[9] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision Transformer: Reinforcement Learning via Sequence Modeling, 2021.
[10] Lili Chen, Shikhar Bahl, and Deepak Pathak. PlayFusion: Skill acquisition via diffusion from language-annotated play. In Conference on Robot Learning, pages 2012–2029. PMLR, 2023.
[11] Xuxin Cheng, Kexin Shi, Ananye Agarwal, and Deepak Pathak. Extreme parkour with legged robots. arXiv preprint arXiv:2309.14341, 2023.
[12] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action
diffusion. arXiv preprint arXiv:2303.04137, 2023. ence on Computer Vision and Pattern Recognition, pages
[13] Felipe Codevilla, Matthias Müller, Antonio López, 16750–16761, 2023.
Vladlen Koltun, and Alexey Dosovitskiy. End-to-end [25] Xiaoyu Huang, Dhruv Batra, Akshara Rai, and Andrew
driving via conditional imitation learning. In 2018 IEEE Szot. Skill transformer: A monolithic policy for mobile
international conference on robotics and automation manipulation. In Proceedings of the IEEE/CVF Inter-
(ICRA), pages 4693–4700. IEEE, 2018. national Conference on Computer Vision, pages 10852–
[14] Jeremy Dao, Kevin Green, Helei Duan, Alan Fern, 10862, 2023.
and Jonathan Hurst. Sim-to-real learning for bipedal [26] Xiaoyu Huang, Zhongyu Li, Yanzhen Xiang, Yiming
locomotion under unsensed dynamic loads. In 2022 Ni, Yufeng Chi, Yunhao Li, Lizhi Yang, Xue Bin Peng,
International Conference on Robotics and Automation and Koushil Sreenath. Creating a dynamic quadrupedal
(ICRA), pages 10449–10455. IEEE, 2022. robotic goalkeeper with reinforcement learning. In 2023
[15] Ricard Durall, Avraam Chatzimichailidis, Peter Labus, IEEE/RSJ International Conference on Intelligent Robots
and Janis Keuper. Combating mode collapse in gan and Systems (IROS), pages 2715–2722. IEEE, 2023.
training: An empirical analysis using hessian eigenvalues. [27] Jemin Hwangbo, Joonho Lee, A. Dosovitskiy, Dario
arXiv preprint arXiv:2012.09673, 2020. Bellicoso, Vassilios Tsounis, V. Koltun, and M. Hutter.
[16] Alejandro Escontrela, Xue Bin Peng, Wenhao Yu, Learning agile and dynamic motor skills for legged
Tingnan Zhang, Atil Iscen, Ken Goldberg, and Pieter robots. Science Robotics, 4, 2019. doi: 10.1126/
Abbeel. Adversarial motion priors make good substitutes scirobotics.aau5872.
for complex reward functions. In 2022 IEEE/RSJ Inter- [28] Michael Janner, Qiyang Li, and Sergey Levine. Offline
national Conference on Intelligent Robots and Systems reinforcement learning as one big sequence modeling
(IROS), pages 25–32. IEEE, 2022. problem. Advances in neural information processing
[17] Gilbert Feng, Hongbo Zhang, Zhongyu Li, Xue Bin systems, 34:1273–1286, 2021.
Peng, Bhuvan Basireddy, Linzhu Yue, Zhitao Song, Lizhi [29] Michael Janner, Yilun Du, Joshua B. Tenenbaum, and
Yang, Yunhui Liu, Koushil Sreenath, et al. Genloco: Gen- Sergey Levine. Planning with Diffusion for Flexible
eralized locomotion controllers for quadrupedal robots. Behavior Synthesis, 2022.
In Conference on Robot Learning, pages 1893–1903. [30] Gwanghyeon Ji, Juhyeok Mun, Hyeongjun Kim, and
PMLR, 2023. Jemin Hwangbo. Concurrent training of a control policy
[18] Thomas Flayols, Andrea Del Prete, Patrick Wensing, and a state estimator for dynamic and robust legged
Alexis Mifsud, Mehdi Benallegue, and Olivier Stasse. locomotion. IEEE Robotics and Automation Letters, 7
Experimental evaluation of simple estimators for hu- (2):4630–4637, 2022.
manoid robots. In 2017 IEEE-RAS 17th International [31] Ivan Kapelyukh, Vitalis Vosylius, and Edward Johns.
Conference on Humanoid Robotics (Humanoids), pages DALL-E-Bot: Introducing Web-Scale Diffusion Models
889–895. IEEE, 2017. to Robotics. IEEE Robotics and Automation Letters, 8
[19] Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mo- (7):3956–3963, July 2023. ISSN 2377-3774. doi: 10.
bile aloha: Learning bimanual mobile manipulation with 1109/lra.2023.3272516. URL http://dx.doi.org/10.1109/
low-cost whole-body teleoperation. arXiv preprint LRA.2023.3272516.
arXiv:2401.02117, 2024. [32] J. Kaplan, Sam McCandlish, T. Henighan, Tom B.
[20] Yuni Fuchioka, Zhaoming Xie, and Michiel Van de Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec
Panne. Opt-mimic: Imitation of optimized trajectories Radford, Jeff Wu, and Dario Amodei. Scaling Laws for
for dynamic quadruped behaviors. In 2023 IEEE Interna- Neural Language Models. ArXiv, abs/2001.08361, 2020.
tional Conference on Robotics and Automation (ICRA), [33] Sunwoo Kim, Maks Sorokin, Jehee Lee, and Sehoon Ha.
pages 5092–5098. IEEE, 2023. Humanconquad: human motion control of quadrupedal
[21] Huy Ha, Pete Florence, and Shuran Song. Scaling up and robots using deep reinforcement learning. In SIGGRAPH
distilling down: Language-guided robot skill acquisition. Asia 2022 Emerging Technologies, pages 1–2, 2022.
In Conference on Robot Learning, pages 3766–3777. [34] Arnaud Klipfel, Nitish Sontakke, Ren Liu, and Sehoon
PMLR, 2023. Ha. Learning a single policy for diverse behaviors on
[22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising a quadrupedal robot using scalable motion imitation. In
Diffusion Probabilistic Models, 2020. 2023 IEEE/RSJ International Conference on Intelligent
[23] David Hoeller, Nikita Rudin, Dhionis Sako, and Marco Robots and Systems (IROS), pages 2768–2775. IEEE,
Hutter. Anymal parkour: Learning agile navigation for 2023.
quadrupedal robots. Science Robotics, 9(88):eadi7566, [35] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline
2024. reinforcement learning with implicit q-learning. arXiv
[24] Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, preprint arXiv:2110.06169, 2021.
Tengyu Liu, Yixin Zhu, Wei Liang, and Song-Chun Zhu. [36] Aviral Kumar, Aurick Zhou, G. Tucker, and S. Levine.
Diffusion-based generation, optimization, and planning Conservative Q-Learning for Offline Reinforcement
in 3d scenes. In Proceedings of the IEEE/CVF Confer- Learning. ArXiv, abs/2006.04779, 2020.
[37] Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, jciech Zaremba, and Pieter Abbeel. Overcoming ex-
and Ken Goldberg. Dart: Noise injection for robust ploration in reinforcement learning with demonstrations.
imitation learning. In Conference on robot learning, In 2018 IEEE international conference on robotics and
pages 143–156. PMLR, 2017. automation (ICRA), pages 6292–6299. IEEE, 2018.
[38] Sergey Levine, Aviral Kumar, George Tucker, and Justin [51] Mitsuhiko Nakamoto, Yuexiang Zhai, Anikait Singh,
Fu. Offline reinforcement learning: Tutorial, review, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar,
and perspectives on open problems. arXiv preprint and Sergey Levine. Cal-QL: Calibrated Offline RL Pre-
arXiv:2005.01643, 2020. Training for Efficient Online Fine-Tuning. arXiv preprint
[39] Chenhao Li, Sebastian Blaes, Pavel Kolev, Marin Vlastel- arXiv:2303.05479, 2023.
ica, Jonas Frey, and Georg Martius. Versatile skill control [52] Felipe Nuti, Tim Franzmeyer, and João F Henriques.
via self-supervised adversarial imitation of unlabeled Extracting Reward Functions from Diffusion Models.
mixed motions. In 2023 IEEE international conference arXiv preprint arXiv:2306.01804, 2023.
on robotics and automation (ICRA), pages 2944–2950. [53] Octo Model Team, Dibya Ghosh, Homer Walke, Karl
IEEE, 2023. Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey
[40] Chenhao Li, Marin Vlastelica, Sebastian Blaes, Jonas Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You
Frey, Felix Grimminger, and Georg Martius. Learning Liang Tan, Dorsa Sadigh, Chelsea Finn, and Sergey
agile skills via adversarial imitation of rough partial Levine. Octo: An Open-Source Generalist Robot Policy.
demonstrations. In Conference on Robot Learning, pages https://octo-models.github.io, 2023.
342–352. PMLR, 2023. [54] Yunpeng Pan, Ching-An Cheng, Kamil Saigol, Keuntaek
[41] Xiang Li, Varun Belagali, Jinghuan Shang, and Lee, Xinyan Yan, Evangelos Theodorou, and Byron
Michael S. Ryoo. Crossway Diffusion: Improving Boots. Agile autonomous driving using end-to-end deep
Diffusion-based Visuomotor Policy via Self-supervised imitation learning. arXiv preprint arXiv:1709.07174,
Learning, 2024. 2017.
[42] Yunfei Li, Jinhan Li, Wei Fu, and Yi Wu. Learning Agile [55] Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave
Bipedal Motions on a Quadrupedal Robot. arXiv preprint Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcar-
arXiv:2311.05818, 2023. cel Macua, Shan Zheng Tan, Ida Momennejad, Katja
[43] Zhongyu Li, Xuxin Cheng, Xue Bin Peng, Pieter Abbeel, Hofmann, and Sam Devlin. Imitating Human Behaviour
Sergey Levine, Glen Berseth, and Koushil Sreenath. with Diffusion Models, 2023.
Reinforcement learning for robust parameterized loco- [56] X. B. Peng, Erwin Coumans, Tingnan Zhang, T. Lee, Jie
motion control of bipedal robots. In 2021 IEEE Interna- Tan, and S. Levine. Learning Agile Robotic Locomotion
tional Conference on Robotics and Automation (ICRA), Skills by Imitating Animals. ArXiv, abs/2004.00784,
pages 2811–2817. IEEE, 2021. 2020. doi: 10.15607/rss.2020.xvi.064.
[44] Zhongyu Li, Xue Bin Peng, Pieter Abbeel, Sergey [57] Xue Bin Peng, Marcin Andrychowicz, Wojciech
Levine, Glen Berseth, and Koushil Sreenath. Rein- Zaremba, and Pieter Abbeel. Sim-to-real transfer of
forcement Learning for Versatile, Dynamic, and Robust robotic control with dynamics randomization. In 2018
Bipedal Locomotion Control, 2024. IEEE international conference on robotics and automa-
[45] Qiayuan Liao, Zhongyu Li, Akshay Thirugnanam, Jun tion (ICRA), pages 3803–3810. IEEE, 2018.
Zeng, and Koushil Sreenath. Walking in narrow spaces: [58] Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and
Safety-critical locomotion control for quadrupedal robots Angjoo Kanazawa. Amp: Adversarial motion priors for
with duality-based optimization. In 2023 IEEE/RSJ In- stylized physics-based character control. ACM Transac-
ternational Conference on Intelligent Robots and Systems tions on Graphics (ToG), 40(4):1–20, 2021.
(IROS), pages 2723–2730. IEEE, 2023. [59] Carole G. Prevost, Andre Desbiens, and Eric Gagnon.
[46] Kanglin Liu, Wenming Tang, Fei Zhou, and Guoping Extended Kalman Filter for State Estimation and Tra-
Qiu. Spectral regularization for combating mode collapse jectory Prediction of a Moving Object Detected by an
in gans. In Proceedings of the IEEE/CVF international Unmanned Aerial Vehicle. In 2007 American Control
conference on computer vision, pages 6382–6390, 2019. Conference, pages 1805–1810, 2007. doi: 10.1109/ACC.
[47] Gabriel B Margolis, Ge Yang, Kartik Paigwar, Tao Chen, 2007.4282823.
and Pulkit Agrawal. Rapid locomotion via reinforcement [60] Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell,
learning. arXiv preprint arXiv:2205.02824, 2022. Jitendra Malik, and Koushil Sreenath. Learning Hu-
[48] Utkarsh A. Mishra, Shangjie Xue, Yongxin Chen, and manoid Locomotion with Transformers. arXiv preprint
Danfei Xu. Generative Skill Chaining: Long-Horizon arXiv:2303.03381, 2023.
Skill Planning with Diffusion Models, 2023. [61] Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan
[49] Vaishnavh Nagarajan and J Zico Kolter. Gradient descent Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil
GAN optimization is locally stable. Advances in neural Sreenath, and Jitendra Malik. Humanoid Locomotion as
information processing systems, 30, 2017. Next Token Prediction. arXiv preprint arXiv:2402.19469,
[50] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wo- 2024.
[62] Alexander Reske, Jan Carius, Yuntao Ma, Farbod of Highly Dynamic Motions with Rigid and Articulated
Farshidian, and Marco Hutter. Imitation learning from Soft Quadrupeds. arXiv preprint arXiv:2309.09682,
mpc for quadrupedal multi-gait control. In 2021 IEEE 2023.
International Conference on Robotics and Automation [78] Eric Vollenweider, Marko Bjelonic, Victor Klemm,
(ICRA), pages 5014–5020. IEEE, 2021. Nikita Rudin, Joonho Lee, and Marco Hutter. Ad-
[63] Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf vanced skills through multiple adversarial motion priors
Lioutikov. Goal-Conditioned Imitation Learning using in reinforcement learning. In 2023 IEEE International
Score-based Diffusion Policies, 2023. Conference on Robotics and Automation (ICRA), pages
[64] Robin Rombach, Andreas Blattmann, Dominik Lorenz, 5120–5126. IEEE, 2023.
Patrick Esser, and Björn Ommer. High-Resolution Image [79] Bingzheng Wang, Guoqiang Wu, Teng Pang, Yan Zhang,
Synthesis with Latent Diffusion Models, 2022. and Yilong Yin. DiffAIL: Diffusion Adversarial Imitation
[65] Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bag- Learning, 2023.
nell. A Reduction of Imitation Learning and Structured [80] Hsiang-Chun Wang, Shang-Fu Chen, and Shao-Hua Sun.
Prediction to No-Regret Online Learning, 2011. Diffusion Model-Augmented Behavioral Cloning. arXiv
[66] Nikita Rudin, David Hoeller, Philipp Reist, and Marco preprint arXiv:2302.13335, 2023.
Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on Robot Learning, pages 91–100. PMLR, 2022.
[67] Dohoon Ryu and Jong Chul Ye. Pyramidal Denoising Diffusion Probabilistic Models, 2022.
[68] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, 2017.
[69] Yecheng Shao, Yongbin Jin, Xianwei Liu, Weiyan He, Hongtao Wang, and Wei Yang. Learning free gait transition for quadruped robots via phase-guided controller. IEEE Robotics and Automation Letters, 2021.
[70] Yecheng Shao, Yongbin Jin, Xianwei Liu, Weiyan He, Hongtao Wang, and Wei Yang. Learning Free Gait Transition for Quadruped Robots Via Phase-Guided Controller. IEEE Robotics and Automation Letters, 7:1230–1237, 2022. doi: 10.1109/LRA.2021.3136645.
[71] Laura Smith, J Chase Kew, Tianyu Li, Linda Luu, Xue Bin Peng, Sehoon Ha, Jie Tan, and Sergey Levine. Learning and adapting agile locomotion skills by transferring experience. arXiv preprint arXiv:2304.09834, 2023.
[72] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models, 2022.
[73] Zhi Su, Xiaoyu Huang, Daniel Ordoñez-Apraez, Yunfei Li, Zhongyu Li, Qiayuan Liao, Giulio Turrisi, Massimiliano Pontil, Claudio Semini, Yi Wu, et al. Leveraging Symmetry in RL-based Legged Locomotion Control. arXiv preprint arXiv:2403.17320, 2024.
[74] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and A. Gupta. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 843–852, 2017. doi: 10.1109/ICCV.2017.97.
[75] The Linux Foundation. Open neural network exchange. https://onnx.ai/, 2024. Accessed: 2024-01-29.
[76] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.
[77] Francesco Vezzi, Jiatao Ding, Antonin Raffin, Jens Kober, and Cosimo Della Santina. Two-Stage Learning
[81] Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning, 2023.
[82] Eric R Westervelt, Jessy W Grizzle, Christine Chevallereau, Jun Ho Choi, and Benjamin Morris. Feedback control of dynamic bipedal robot locomotion. CRC Press, 2018.
[83] Jinze Wu, Yufei Xue, and Chenkun Qi. Learning multiple gaits within latent space for quadruped robots. arXiv preprint arXiv:2308.03014, 2023.
[84] Zhou Xian, Nikolaos Gkanatsios, Theophile Gervet, Tsung-Wei Ke, and Katerina Fragkiadaki. ChainedDiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation. In Conference on Robot Learning, pages 2323–2339. PMLR, 2023.
[85] Wei Xiao, Tsun-Hsuan Wang, Chuang Gan, and Daniela Rus. SafeDiffuser: Safe Planning with Diffusion Probabilistic Models. arXiv preprint arXiv:2306.00148, 2023.
[86] Haoran Xu, Li Jiang, Li Jianxiong, and Xianyuan Zhan. A policy-guided imitation approach for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:4085–4098, 2022.
[87] Long Yang, Zhixiong Huang, Fenghao Lei, Yucun Zhong, Yiming Yang, Cong Fang, Shiting Wen, Binbin Zhou, and Zhouchen Lin. Policy Representation via Diffusion Probability Model for Reinforcement Learning, 2023.
[88] Ruihan Yang, Huazhe Xu, Yi Wu, and Xiaolong Wang. Multi-task reinforcement learning with soft modularization. Advances in Neural Information Processing Systems, 33:4767–4777, 2020.
[89] Ruihan Yang, Zhuoqun Chen, Jianhan Ma, Chongyi Zheng, Yiyu Chen, Quan Nguyen, and Xiaolong Wang. Generalized animal imitator: Agile locomotion with versatile motion prior. arXiv preprint arXiv:2310.01408, 2023.
[90] Takuma Yoneda, Luzhe Sun, Bradly Stadie, Ge Yang, and Matthew Walter. To the Noise and Back: Diffusion for Shared Autonomy. arXiv preprint arXiv:2302.12244, 2023.
[91] Fangzhou Yu, Ryan Batke, Jeremy Dao, Jonathan Hurst, Kevin Green, and Alan Fern. Dynamic bipedal turning through sim-to-real reinforcement learning. In 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids), pages 903–910. IEEE, 2022.
[92] Chong Zhang, Jiapeng Sheng, Tingguang Li, He Zhang, Cheng Zhou, Qingxu Zhu, Rui Zhao, Yizheng Zhang, and Lei Han. Learning Highly Dynamic Behaviors for Quadrupedal Robots. arXiv preprint arXiv:2402.13473, 2024.
[93] Ziwen Zhuang, Zipeng Fu, Jianren Wang, Christopher Atkeson, Soeren Schwertfeger, Chelsea Finn, and Hang Zhao. Robot parkour learning. arXiv preprint arXiv:2309.05665, 2023.
APPENDIX A
MORE RESULTS ON SKILL TRANSITIONING
Fig. 10: Skill Transitioning: Bounce to Stand with Emergent Intermediate Pacing Skill
steps of actions, but only executes the very first step of actions. This is in contrast to previous work that infers a sequence of actions at a lower frequency and uses interpolation to obtain high-frequency actions [31, 41]. Such a setup allows us to replan rapidly as the robot's state changes quickly, while still taking future steps into account. As we evaluated in Sec. VII-B, using RHC is critical for improving the smoothness and consistency
of a legged locomotion control policy.
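As a rough illustration of this receding horizon loop, consider the following sketch; the object and method names (e.g., policy.denoise_action_sequence, n_pred) are placeholders for illustration, not the actual DiffuseLoco API.

```python
# Minimal sketch of receding horizon control (RHC) for a diffusion policy.
# All names here (policy, robot, n_pred) are illustrative placeholders.

def rhc_control_loop(policy, robot, n_pred=8, control_hz=50):
    """At every control step, predict n_pred future actions but apply only the first."""
    dt = 1.0 / control_hz
    history = robot.init_io_history()           # past states/actions fed to the policy
    while robot.is_running():
        goal = robot.current_command()          # e.g., desired velocity command
        # Denoise a whole action sequence a_{t:t+n_pred} conditioned on history and goal.
        action_seq = policy.denoise_action_sequence(history, goal)  # shape: (n_pred, act_dim)
        robot.apply_action(action_seq[0])       # execute only the very first action
        history = robot.update_io_history(history, action_seq[0])
        robot.sleep(dt)                         # replan at the full control frequency
```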
B. Architecture

The DiffuseLoco policy leverages an encoder-decoder transformer DDPM. First, the robot's past I/O trajectory (s_{t-h-1:t-1}, a_{t-h-2:t-2}) and the given goal sequence g_{t-h-1:t-1} are transformed into separate I/O and goal embeddings by two 2-layer MLP encoders, respectively. Then, we sample noise ϵ^(k) for diffusion time step k with the DDPM scheduler and add it to the ground-truth action a from the offline dataset to produce a noisy action a^k_{t:t+n} = a_{t:t+n} + ϵ^k. The noisy action a^k_{t:t+n} is then passed through an MLP layer to form the action embedding. The noisy action tokens are then passed through 6 Transformer decoder layers, each of which is composed of an 8-head cross-attention layer. Each layer computes the attention weights for the noisy action tokens querying all the state embeddings, the goal embeddings, and the timestep embedding reflecting the current diffusion timestep k. We apply causal attention masks to the state embeddings and the goal embeddings separately. The predicted noise ϵ_θ(a_{t-h-2:t+n}, s_{t-h-1:t-1}, g_{t-h-1:t-1}, k) is then computed from each corresponding output token of the decoder stack. We then supervise the output to predict the added noise with Eqn. 5 in order to find the optimal parameters θ of the denoising model ϵ_θ.
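To make the training objective concrete, below is a minimal, non-authoritative sketch of one noise-prediction training step as described above (the analogue of the objective in Eqn. 5); the denoiser and scheduler objects, shapes, and names are assumptions for illustration, not the released DiffuseLoco code.

```python
# Sketch of one DDPM training step for the action denoiser (names are illustrative).
import torch
import torch.nn.functional as F

def ddpm_training_step(denoiser, scheduler, io_hist, goal_hist, actions, num_diffusion_steps=10):
    """
    denoiser: model predicting the added noise eps_theta(noisy_actions, io_hist, goal_hist, k)
    actions:  ground-truth future action chunk a_{t:t+n}, shape (B, n, act_dim)
    """
    B = actions.shape[0]
    # Sample a diffusion timestep k and Gaussian noise for each sample in the batch.
    k = torch.randint(0, num_diffusion_steps, (B,), device=actions.device)
    eps = torch.randn_like(actions)
    # Corrupt the clean actions according to the noise scheduler.
    noisy_actions = scheduler.add_noise(actions, eps, k)
    # The transformer decoder predicts the added noise, conditioned on I/O history, goals, and k.
    eps_pred = denoiser(noisy_actions, io_hist, goal_hist, k)
    # Supervise the predicted noise against the true noise (MSE objective).
    return F.mse_loss(eps_pred, eps)
```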
C. Hyperparameters

The hyperparameters are summarized in Table III.

APPENDIX C
MORE ABLATION STUDIES ON DESIGN CHOICES

A. Use of Goal-conditioning

Here, we evaluate the impact of goal-conditioning, hypothesizing that it enhances tracking performance and stability. In the DL w/o Goal baseline, we do not add the goal-conditioning encoder; instead, the goal is concatenated with the robot's I/O. Considering the noisy base velocity estimations on real robots, we utilize simulation environments with extensive dynamics randomization to perform a large number of trials and obtain more systematic results. As shown in Fig. 15, DiffuseLoco achieves a 15.4% reduction in linear velocity tracking error and a 14.5% reduction in angular velocity tracking error compared to the DL w/o Goal baseline. Moreover, over 64 trials with identical commands, DL w/o Goal falls over four times, or 6.25% of all trials, whereas DiffuseLoco experiences no failures. This pattern persists in real-world testing, where DL w/o Goal fails one trial in a 0.7 m/s forward test.

Fig. 15: Comparison of failure rates and tracking errors between DL w/o Goal and DiffuseLoco (ours) in simulation. The left y-axis is the metric for Failed Episodes; the right y-axis indicates the tracking error for linear velocity and angular velocity.

These results underscore the importance of goal-conditioning with distinct attention weights for dynamic system control, revealing that the robot's I/O history and goals, governed by physics and arbitrary objectives respectively, should not be merged into one embedding space.
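To make this ablation concrete, the sketch below contrasts the two conditioning schemes; the module names, layer sizes, and dimensions are illustrative assumptions rather than the actual DiffuseLoco implementation.

```python
# Sketch contrasting goal-conditioning via a separate encoder (ours) with
# concatenating the goal into the I/O stream (DL w/o Goal). Names are illustrative.
import torch
import torch.nn as nn

class SeparateGoalEncoding(nn.Module):
    """Goal and I/O history get distinct embeddings, so cross-attention can weight them separately."""
    def __init__(self, io_dim, goal_dim, embed_dim=256):
        super().__init__()
        self.io_encoder = nn.Sequential(nn.Linear(io_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.goal_encoder = nn.Sequential(nn.Linear(goal_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, io_hist, goal_hist):
        return self.io_encoder(io_hist), self.goal_encoder(goal_hist)   # two token streams

class ConcatGoalEncoding(nn.Module):
    """Baseline: the goal is concatenated with the robot's I/O before a single encoder."""
    def __init__(self, io_dim, goal_dim, embed_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(io_dim + goal_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, io_hist, goal_hist):
        return self.encoder(torch.cat([io_hist, goal_hist], dim=-1))    # one merged token stream
```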
APPENDIX D
DETAILS OF THE OFFLINE LOCOMOTION DATASET

Here, we introduce the details of creating the offline locomotion dataset used in this work. We explain the state, goal, and action spaces, followed by a brief introduction to the source policies and the dynamics randomization used to diversify the data. We collect a total of 4 million state-action-goal pairs in the offline dataset for the quadrupedal robot tasks, and 10 million transitions for the bipedal robot tasks.

A. State Space

The state space is the robot's proprioceptive feedback. In the quadrupedal locomotion control case, this consists of the measured motor positions q_m, measured motor velocities q̇_m, base orientation q_{ψ,θ,ϕ}, and base angular velocities q̇_{ψ,θ,ϕ}. Note that we exclude quantities from the estimation of base velocity (q̇_{x,y}) to prevent additional estimation errors.
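For illustration, assembling this proprioceptive state might look like the following sketch; the 12-joint layout and ordering are assumptions for a generic quadruped, not the exact DiffuseLoco format.

```python
# Sketch of assembling the proprioceptive state for a generic 12-joint quadruped.
# The dimensions and ordering are assumptions for illustration only.
import numpy as np

def build_state(motor_pos, motor_vel, base_rpy, base_ang_vel):
    """
    motor_pos, motor_vel: measured joint positions/velocities, shape (12,)
    base_rpy:             base orientation (roll, pitch, yaw), shape (3,)
    base_ang_vel:         base angular velocities, shape (3,)
    Estimated linear base velocity is deliberately left out to avoid estimator error.
    """
    return np.concatenate([motor_pos, motor_vel, base_rpy, base_ang_vel])  # shape (30,)
```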
B. Goal Space

The goal of the locomotion task is the command given to the policy. For quadrupedal robots, the command includes the desired sagittal velocity q_x^d in the range of 0 m/s to 1 m/s, the desired base height from 0.2 m to 0.6 m, and the desired turning velocity q_ψ^d from -1 rad/s to 1 rad/s.
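For illustration only, sampling a quadruped command uniformly within these documented ranges might look like this (the dictionary keys are placeholders):

```python
# Sketch: sample a quadruped goal command within the documented ranges.
import numpy as np

def sample_quadruped_goal(rng=np.random.default_rng()):
    return {
        "sagittal_velocity": rng.uniform(0.0, 1.0),   # m/s
        "base_height":       rng.uniform(0.2, 0.6),   # m
        "turning_velocity":  rng.uniform(-1.0, 1.0),  # rad/s
    }
```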
C. Action Space

The action space is the robot's joint-level commands. In this work, we use the desired motor position q_m^d as the action. This is then used by joint-level PD controllers to compute motor torques τ at a higher frequency.
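The joint-level PD law is standard; as a sketch (the gain values are placeholders, not the actual deployment gains):

```python
# Sketch of the joint-level PD controller that turns desired motor positions into torques.
# Gains are placeholder values, not the robot's actual gains.

KP, KD = 20.0, 0.5  # example proportional/derivative gains

def pd_torque(q_desired, q_measured, qdot_measured):
    """tau = Kp * (q_d - q_m) - Kd * qdot_m, evaluated at a higher rate than the policy."""
    return KP * (q_desired - q_measured) - KD * qdot_measured
```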
D. Source Policy

We obtain the source policies using three different RL methods. We leverage proximal policy optimization (PPO) [68] to optimize each of the source policies, and we train the policies in simulation (Isaac Gym [66]). We evenly distribute the data generated from each of the source skill-specific policies.
1) Adversarial Motion Prior (AMP): For skills trained with AMP, we provide a reference motion retargeted from motion capture data of a dog [56], and incorporate a GAN-style discriminator to encourage the robot to imitate the reference motion without extended reward engineering. The reward of this method is then formulated as a motion imitation term (provided by the discriminator [58]) and a task term (e.g., tracking error).
2) Central Pattern Generator Guidance (CPG): For skills trained with CPG guidance, we follow the formulation in [69] and provide nominal reference motions, namely strictly periodic motions generated by a Central Pattern Generator with phase signals (a minimal sketch of such phase signals is given after the source policy descriptions below). Specifically, for the hopping skill, the phase selections for all legs are 0, and for the bouncing skill, the phase selections are 0 for the front legs and π for the hind legs. The reward is composed of a task term (e.g., velocity tracking error), a motion tracking term (e.g., reference motion tracking error), and smoothing terms (e.g., action rate).
3) Symmetry Augmented RL: We train the bipedal locomotion skill for quadrupedal robots following a symmetry-augmented RL policy [73] to achieve a symmetric gait pattern that is crucial for sim-to-real transfer. Specifically, the data collection process is augmented by the addition of symmetric states and actions. The reward includes a task term (e.g., velocity tracking error), a gait pattern term (e.g., feet clearance height), and smoothing terms (e.g., action rate).

4) Other Methods: Although the source policies used to collect the dataset in this work are all RL-based, our framework is general and can also include data generated by model-based optimal controllers (such as from [45]) and others. The requirement is to align the state and action spaces among the different source policies, and the control frequency of the policies should be kept the same.
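As referenced in the CPG paragraph above, the following is a minimal sketch of phase-offset reference signals; the sinusoidal form, frequency, and amplitude are assumptions for illustration, not the actual CPG formulation of [69].

```python
# Sketch of CPG-style phase signals with per-leg offsets (FL, FR, RL, RR).
# A sinusoid is assumed purely for illustration; [69] defines the actual CPG.
import numpy as np

def cpg_phase_reference(t, frequency_hz=2.0, phase_offsets=(0.0, 0.0, np.pi, np.pi), amplitude=0.2):
    """Return one periodic reference value per leg at time t.
    phase_offsets=(0, 0, 0, 0)    -> hopping (all legs in phase)
    phase_offsets=(0, 0, pi, pi)  -> bouncing (front legs vs. hind legs out of phase)
    """
    phases = 2.0 * np.pi * frequency_hz * t + np.asarray(phase_offsets)
    return amplitude * np.sin(phases)
```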
E. Dynamics Randomization

In order to diversify the training dataset for DiffuseLoco, we also include the same amount of dynamics randomization [57] during the training of the source policies and during the data generation with these policies. Specifically, in each episode in simulation, the dynamics parameters are randomized. These include the motors' PD gains, the mass of the robot's base (up to the weight of the onboard compute), the ground friction, and random changes in base velocity. The randomization ranges are adapted from each source policy's original method.

APPENDIX E
REAL-TIME INFERENCE ACCELERATION

Although the diffusion model targets real-time use, it cannot meet the real-time targets without further tuning. Compared to previous work that uses a Transformer with 2M parameters for locomotion control [60], our model is roughly 3 times larger (6.8M parameters) and needs to be forwarded 10 times for each inference. Thus, additional effort is needed to accelerate the diffusion model on the robot's edge computing device, such as the setup shown in Fig. 17. In this section, we explore several methods to accelerate the inference process of the diffusion model so that it can run in real time onboard.

A. Acceleration Framework

Our DiffuseLoco policy has 6.8 million parameters, which exceeds the cache capacity of most modern mobile processors. Furthermore, a typical consumer-grade central processing unit (CPU) is not optimized for the operators used in transformer networks; the graphics processing unit (GPU) is better suited to the high-dimensional matrix and vector operations involved. To ensure the portability of the setup, we use an accessible NVIDIA mobile GPU as the deployment platform. For real-time deployment, an acceleration pipeline is built into the DiffuseLoco framework to convert and optimize our model for the target compute platforms. The operators of the model are first extracted with ONNX [75]. Then, TensorRT is used to refine the execution graph and compile the resulting execution pipeline onto the target GPU. Through these domain-specific architecture optimizations, the operations and memory access patterns are tuned to utilize the full capability of the GPU. With this approach, the speed of each denoising iteration is increased by about 7x compared to the native PyTorch implementation, and the maximum inference frequency (with 10 denoising iterations) is increased from 17.0 Hz to 116.5 Hz. To showcase the effect of this acceleration approach, we benchmarked the inference frequency of the policy on the hardware platforms we have access to, as shown in Fig. 16.

Fig. 16: Inference frequency of the policy across hardware platforms (RTX 4090, RTX 4060 M, RTX 2080, GTX 1070, GTX TITAN X, Core i9-9900K, Ryzen 7 1700) and runtimes (PyTorch – FP32, ONNX – FP32, TensorRT – FP32).
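As a rough sketch of the first stage of such a pipeline (ONNX export prior to TensorRT compilation), where the input names, shapes, and file path are illustrative assumptions rather than the actual DiffuseLoco export script:

```python
# Sketch: export a (placeholder) denoiser module to ONNX as the first stage of the
# acceleration pipeline; TensorRT (e.g., its trtexec tool) would then compile the
# exported graph for the target GPU.
import torch

def export_policy_to_onnx(policy, io_hist_shape=(1, 8, 30), goal_shape=(1, 8, 3),
                          noisy_action_shape=(1, 4, 12), path="diffuseloco_policy.onnx"):
    """Trace the denoiser with dummy inputs and save an ONNX graph (shapes are assumptions)."""
    dummy_inputs = (
        torch.randn(noisy_action_shape),      # noisy action chunk
        torch.randn(io_hist_shape),           # I/O history
        torch.randn(goal_shape),              # goal history
        torch.zeros(1, dtype=torch.long),     # diffusion timestep k
    )
    torch.onnx.export(policy, dummy_inputs, path, opset_version=17,
                      input_names=["noisy_actions", "io_history", "goals", "timestep"],
                      output_names=["predicted_noise"])
```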