There are two stages of RDM: the warm-up stage and the rolling stage. In the warm-up stage, the model handles the initial boundary condition by generating from white noise, as shown in the first row of Figure 2 (left), and denoises it to produce one clean element and partially denoised future elements in the sliding window, as shown in the bottom row. Once it reaches the temporally correlated noise stage (bottom row of Figure 2, left), RDM takes a few denoising steps for the next-step prediction, shown in Figure 2 (right). This requires the model to train on two tasks, where β controls the training task distribution; for each task, RDM designs an associated function g for calculating the local diffusion time τ_w given τ and the window index w. In addition, we can condition on n clean observations within the sliding window. g is defined for the warm-up and rolling stages as

g_warm-up(τ, w) := max(min(w/W + τ, 1.0), 0.0),   (7)

g_rolling(τ, w) := max(min((w + τ − n)/(W − n), 1.0), 0.0),   (8)

where n and W are application-dependent hyperparameters.
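To make Eqs. (7) and (8) concrete, the sketch below implements both schedules and the resulting two-stage sampler loop. It assumes the Figure 2 convention that τ_w = 0 denotes a clean element and τ_w = 1 pure noise, with the global time τ swept from 1 to 0 during denoising; the denoise callable, helper names, and array shapes are ours, not the paper's API:

import numpy as np

def g_warmup(tau, w, W):
    # Eq. (7): local diffusion time of window slot w in the warm-up stage.
    return float(np.clip(w / W + tau, 0.0, 1.0))

def g_rolling(tau, w, W, n):
    # Eq. (8): local diffusion time in the rolling stage; the first n
    # window slots hold clean observations and stay clipped at 0.
    return float(np.clip((w + tau - n) / (W - n), 0.0, 1.0))

def road_rollout(window, denoise, n_denoise, W, n, num_steps, rng):
    # window: [A, W, 3] array initialized with white noise.
    # denoise: stand-in for one reverse-diffusion update taking the
    # window and the per-slot local diffusion times.
    #
    # Warm-up stage: sweep the global time from all-noise (tau = 1)
    # down to the staircase of local times (tau = 0), Figure 2 (left).
    for tau in np.linspace(1.0, 0.0, n_denoise):
        window = denoise(window, [g_warmup(tau, w, W) for w in range(W)])
    # Rolling stage: a few denoising steps per simulation step move
    # every slot one notch down the staircase, Figure 2 (right).
    rollout = []
    for _ in range(num_steps):
        for tau in np.linspace(1.0, 0.0, n_denoise):
            window = denoise(window, [g_rolling(tau, w, W, n) for w in range(W)])
        rollout.append(window[:, n])  # slot n is now fully denoised
        fresh = rng.standard_normal(window[:, :1].shape)  # new far-future slot
        window = np.concatenate([window[:, 1:], fresh], axis=1)
    return rollout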
Related Work
Traffic Simulation with Diffusion Models
Predicting the motion of road users is a critical task for autonomous vehicle driving or simulation. For this reason, the number of methods which have attempted to model traffic behavior is vast. The literature contains a variety of techniques for modelling the distribution of driving behavior, including mixture models (Chai et al. 2019; Cui et al. 2019; Nayakanti et al. 2023), variational autoencoders (Ścibior et al. 2021; Suo et al. 2021), and generative adversarial networks (Zhao et al. 2019).
Our work builds upon recent methods which model driving behavior using diffusion models. In CTG (Zhong et al. 2023), the authors model the motion of each agent in the scene independently with a Diffuser-based (Janner et al. 2022) diffusion model. The authors of (Chang et al. 2023) also model agent motions via diffusion, with a focus on controllability. By contrast, most other diffusion-based traffic models model entire traffic scenes. This includes MotionDiffuser (Jiang et al. 2023), Scenario Diffusion (Pronovost et al. 2023), and SceneDM (Guo et al. 2023), which all diffuse the joint motion of all agents in the scene.
Figure 2: Rolling Diffusion Model. Columns represent sequence timesteps and rows represent diffusion timesteps. Circles are
shown in white if the corresponding sequence timestep is fully denoised; black if the sequence timestep is pure noise; and grey
if in between. During the denoising process, the SNR for each element in the rolling window depends on the local diffusion
time τw which can be calculated using Eq. (7) or Eq. (8), depending on whether it is in the warm-up or rolling stage.
Our work builds directly on that of DJINN (Niedoba et al. 2024), which utilizes a transformer-based network to generate joint traffic scenarios based on a variable set of agent state observations. Crucially, due to the expensive computational cost of diffusion model sampling, only CTG (Zhong et al. 2023) utilizes its model for closed-loop scenario simulation. Twice per second, they incorporate new state observations and resample trajectories for each agent. By comparison, our method does not require iterative replanning, greatly improving simulation speed.

Methods
Problem Formulation
We refer to the motion of A agents across T discrete times in an environment M as a traffic scenario. Formally, we define the scenario as x ∈ R^{A×T×3}, where we represent the state of each agent a ∈ A at time t ∈ T as the combination of its 2D position and 1D orientation. We introduce a probabilistic planner π_sim which jointly predicts the future states for all agents, conditioned on static map information M and previously observed agent states x_obs ∈ R^{A×t_obs×3}.
A more difficult form of this planning problem is closed-loop traffic simulation. In closed-loop simulation, one agent, known as the ego agent a_ego, is typically controlled by a standalone motion planner π_ego, which may be a black box and which may cause the ego agent to drive very differently from any agent in the training data, making it not amenable to accurate prediction by the traffic scenario planner π_sim. At each time step t, the standalone motion planner π_ego plans the single next step for the ego agent given the entire history of the scenario x_{0:t} and the map M. The closed-loop traffic simulation problem is to model the behavior of every other agent in the scene, including potential interactions with the ego agent. Since the state of the ego agent is neither controllable nor known in advance, the traffic scenario planner π_sim must continually update its plan so that it remains conditioned on the most recent state and actions of the ego agent.

Replanning with a joint prediction model
Our baseline planner relies on a conditional diffusion model p(x_{t_obs:T} | x_{0:t_obs}, M, c), which jointly predicts the scenario for all agents in the scene up to time T given the map M and additional conditioning information c. Although diffusing the joint states of all agents is a flexible way of modelling the distribution of traffic scenarios, the model does not respond to ego agent trajectories which deviate from modelled behavior. To mitigate this, one option is to regenerate the traffic scenario after each simulator step to incorporate new ego agent state observations. We select DJINN (Niedoba et al. 2024) as our conditional diffusion model, and we denote this method of iterative planning as DJINN-MPC, as it resembles a traditional model predictive control loop. This allows the scenario simulation planner π_sim to adjust its predictions at every simulation step in response to the standalone ego agent in the scene.

Diffusion based autoregressive model (Diff-AR)
One key drawback of DJINN-MPC is that we must fully diffuse a new traffic scenario at every simulator step, at significant cost. As an alternative, one can train a diffusion-based autoregressive model as the simulation planner, which factorizes the conditional probability as

p(x_{t_obs:T} | x_{0:t_obs}, M, c) = ∏_{t=t_obs}^{T−1} p(x_t | x^0_{0:t−1}, M, c).

Given the past observations, the model only predicts one subsequent step. In practice, the history of past observations x^0_{0:t−1} is truncated to a fixed length. Compared to the previous method, Diff-AR is slightly more efficient, as it only denoises the single-step future from scratch. However, Diff-AR cannot anticipate other agents' long-term behaviors beyond the immediate next step, which is important for effective planning in many traffic scenarios.
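To make this factorization concrete, a minimal sketch of the resulting rollout loop follows; sample_next_step is a hypothetical stand-in for a full reverse-diffusion pass, not an interface from the paper:

def diff_ar_rollout(x_obs, M, c, T, sample_next_step, history_len=10):
    # Diff-AR: denoise exactly one future step per simulator step.
    # x_obs: list of clean per-timestep scene states, each of shape [A, 3].
    # sample_next_step(history, M, c): assumed to run the full reverse
    # diffusion process from white noise and return one denoised step.
    # DJINN-MPC would instead re-diffuse an entire multi-step plan here
    # and keep only its first step, which is far more expensive.
    history = list(x_obs)
    while len(history) < T:
        context = history[-history_len:]  # truncated history, as noted above
        history.append(sample_next_step(context, M, c))
    return history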
Rolling ahead autoregressive model
We propose a rolling diffusion based model (RoAD) for traffic scenario planning based on RDM. We start by providing an overview of our autoregressive traffic planner, then discuss some details of our design choices for the diffusion process and the model, along with our updated objective function.
We utilize a sliding window of length W, which is much smaller than the scenario length T. This sliding window includes t_obs clean observations for all agents. Within this window, only the (t_obs+1)-th state is fully denoised at each scenario time for all agents, while the remainder of the sequence undergoes partial denoising. At the next simulation step, we then shift the sliding window and repeat the previous process. By focusing on a smaller window and selectively denoising, our approach maintains computational efficiency while preserving the ability to adaptively plan for the immediate future.
We follow the design choices from EDM (Karras et al. 2022) in designing our diffusion process. Given a local diffusion time τ_w during training, our σ_τ is a continuous version of the sampling noise schedule in EDM,

σ_τ = (σ_max^{1/ρ} + τ(σ_min^{1/ρ} − σ_max^{1/ρ}))^ρ,

where we keep the default hyperparameter choices for σ_max, σ_min, and ρ from (Karras et al. 2022). We apply the Heun 2nd-order sampler at prediction time with the same hyperparameters reported in EDM, and we refer the reader to EDM for the detailed denoising algorithms.
As we are interested in modelling a joint traffic planner for all agents in the scene, we diffuse in a global coordinate frame. We adopt the map representation from (Niedoba et al. 2024), where M is represented as an unordered set of polylines, each polyline describing lane centers and normalized in length and scale to match the agent states. Our model, built on a transformer-based architecture, utilizes a feature tensor of shape [A, W, F] to process agent trajectories and map information. It embeds noisy and observed states, temporal indices, and the local diffusion step τ_w into high-dimensional vectors with feature dimension F. We apply per-agent positional embeddings to these feature vectors, which are then fed into a series of transformer blocks that perform self-attention over the time and agent dimensions, as well as cross-attention with the map features.
Rather than denoising the sequence one element at a time as in Eq. (6), our transformer architecture jointly predicts the score for all noisy states in the window. Denote by x^W the sliding window of interest. Our score estimator D_θ takes in x^W_τ, τ, and the map M, along with additional conditioning information c that includes the dimensions of each agent. Our updated objective function is

E_{x^W_0, τ, x^W_τ} [ ω(τ) ‖D_θ(x^W_τ, M, c, τ) − x^W_0‖₂² ].   (9)

Note that x^W_τ is sampled from the RDM forward process defined in Eq. (4), which contains states with noise levels according to τ_w. While the local diffusion time τ_w depends on the window index w and the global diffusion time τ, all agents in the scene have a consistent local diffusion time τ_w. Therefore, our score estimator D_θ takes a vector τ = {τ_w}_{w=0}^{W} to reflect the temporal correlation of the different noise levels in x^W_τ. The weighting term ω is likewise vector-valued: it takes the vector τ as input and assigns a different weight according to each τ_w.
While our rolling ahead autoregressive model is efficient for long traffic scenario planning, the partially denoised future plan affects the reactivity of our model. In traffic simulation, such degradation may cause a higher collision rate with the uncontrolled ego agent in the scene. The reactivity of the model depends on the SNR of the future states. We empirically evaluate the reactivity of our model compared to the AR baseline in our experiments section.

Conditioning Augmentation
We have empirically found that noise conditioning augmentation, as described by (Ho et al. 2022b), is essential for all models operating in an autoregressive manner. This augmentation is critical for autoregressive human motion generation (Yin et al. 2023), cascaded diffusion models for class-conditional generation, and super-resolution video generation (Ho et al. 2022a). Noise conditioning augmentation enhances the model's robustness against generated noise, which serves as observations for subsequent predictions (Ho et al. 2022b), and it mitigates the risk of the model overfitting to its autoregressive nature. In the context of traffic simulation, this augmentation aids in generating smooth trajectories and ensures that the model does not ignore other conditioning factors, such as the presence of other agents and, importantly, the map M of the environment. Previous work on diffusion-based traffic simulation (Zhang et al. 2023; Chang et al. 2023) circumvents the issue of noisy observations by relying on a kinematic model to produce smooth trajectories; however, our autoregressive traffic simulation planner does not require such a kinematic model.
We follow (Ho et al. 2022b) in employing conditioning augmentation for our rolling ahead traffic planner, with an important modification. During training, given a sampled training segment x^W of length W with n observations x_obs within this segment, we apply Gaussian noise augmentation to x_obs, where the noise level σ_τca is sampled uniformly between σ_min and σ_max. Unlike (Ho et al. 2022b), we found that jointly predicting all elements within the sliding window, including the noised observations, is essential for our application. At test time, we apply Gaussian noise with σ_min to our observations for a minimal level of augmentation.
From a broader perspective, conditioning augmentation addresses a well-understood issue in imitation learning with autoregressive-style methods, where at test time the model must condition on samples that it produced earlier. Throughout a roll-out, the distribution of these samples may shift so that they appear out-of-distribution relative to the training data. One solution to this problem allows the model to learn from its own mistakes using a differentiable simulator (Ścibior et al. 2021). In the diffusion model context, though, this requires sampling from the reverse process, which is expensive. We denote this type of augmentation, where the noise originates from the model's predictions, as reverse process conditioning augmentation. Existing work (Ho et al. 2022b) on cascaded diffusion models has achieved comparable performance through both reverse process conditioning augmentation and forward process conditioning augmentation for high-resolution image generation conditioned on a low-resolution image. Therefore, we opt for the more efficient forward process conditioning augmentation approach.
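A minimal sketch of the training-time noising described above, assuming the EDM default hyperparameters σ_min = 0.002, σ_max = 80, and ρ = 7 (Karras et al. 2022); the helper names and array shapes are ours:

import numpy as np

SIGMA_MIN, SIGMA_MAX, RHO = 0.002, 80.0, 7.0  # EDM defaults (Karras et al. 2022)

def sigma_tau(tau):
    # Continuous version of the EDM sampling noise schedule; tau = 0
    # maps to sigma_max and tau = 1 to sigma_min.
    return (SIGMA_MAX ** (1 / RHO)
            + tau * (SIGMA_MIN ** (1 / RHO) - SIGMA_MAX ** (1 / RHO))) ** RHO

def augment_observations(x_win, n, rng):
    # Forward-process conditioning augmentation on a training window.
    # x_win: [A, W, 3] window whose first n sequence slots hold clean
    # observations; only those slots are corrupted, at a level drawn
    # uniformly from [sigma_min, sigma_max]. At test time the noise
    # level is fixed to sigma_min for minimal augmentation.
    sigma_ca = rng.uniform(SIGMA_MIN, SIGMA_MAX)
    x_aug = x_win.copy()
    x_aug[:, :n, :] += sigma_ca * rng.standard_normal(x_aug[:, :n, :].shape)
    return x_aug, sigma_ca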
Experiments
We evaluate our rolling ahead scene generation model (RoAD) on the INTERACTION dataset (Zhan et al. 2019), which contains 16.5 hours of driving records across 11 locations. Our baselines include an autoregressive diffusion model (AR), which takes observations of length 10 and predicts the next-step future for all agents. Another baseline, DJINN, is a scene generation model that takes 10 observations and jointly predicts the next 30 steps at 10 Hz for all agents in a one-shot manner. We have also trained a version of DJINN that predicts 10 future steps ahead jointly for all agents (DJINN-10). As our RoAD model with window size 20 (RoAD-20) predicts 10 partially denoised future steps as well, we believe DJINN-10 provides a reasonable comparison. DJINN-10 (MPC-X) is a variant of DJINN-10 that has been trained with conditioning augmentation and deployed in an MPC style, enabling us to replan after executing X steps of predictions for all agents.
We first compare RoAD with AR, DJINN, and DJINN-10 (MPC) using standard scene-level displacement metrics such as minSceneADE and minSceneFDE to demonstrate the quality of samples generated by RoAD. We then assess the reactivity of DJINN, DJINN-10 (MPC-1), AR, and RoAD with an adversarial agent, which is not controlled by the scene generation model, by measuring the collision rates with the adversarial agent.

Implementation Details We adopt the same transformer architecture from DJINN (Niedoba et al. 2024) for all of our models. We apply a conditional augmentation ratio of 0.2 for AR and DJINN-10 and 0.5 for RoAD, as we found that a higher conditional augmentation ratio for AR and DJINN-10 results in worse performance. We train our RoAD planner with observation length 10 and task ratio β = 0.1.

Evaluation Metrics We measure the accuracy of our generated trajectories with standard displacement metrics. To measure joint motion forecasting quality, we follow (Ngiam et al. 2021) in reporting minSceneADE and minSceneFDE. Both metrics capture the minimum joint displacement error for all agents across 6 joint traffic scenario samples. To measure per-agent motion forecasting performance, we report the miss rate: the rate of agents for which none of the six predicted trajectories has a final displacement error of less than 2 meters. To measure the reactivity of each model, we report the collision rate: the number of collisions divided by the total number of simulated scenarios.

Motion Forecasting We compare RoAD with AR and DJINN-10 on the motion forecasting task using the validation set of the INTERACTION dataset (Zhan et al. 2019). We generate three seconds of driving behavior at 10 Hz, conditioned on one second of observations. We consider the performance of DJINN as the upper bound for this task, since DJINN is trained only at this fixed time horizon and is not an autoregressive model by nature. Following (Niedoba et al. 2024), displacement metrics are calculated by generating 24 samples for each scenario and fitting a 6-component Gaussian mixture model to cover all future modes. DJINN-10 (MPC-1) achieves slightly better results than AR but performs worse than RoAD due to larger accumulated errors from replanning at each simulation step.
RoAD models with window sizes of 15 (RoAD-15) and 20 (RoAD-20) achieved lower displacement metrics than AR models, as RoAD also considers the noisy future steps beyond the next immediate one. Additionally, RoAD models exhibit a slightly lower miss rate. We also observed that RoAD models with larger window sizes demonstrate better displacement metrics. Displacement metrics are one indicator of the quality of the generated samples; we show qualitatively in Figure 3 that RoAD reconstructs the ground truth trajectories, marked in grey, better than AR.

Reactivity The RoAD models efficiently roll out long scenarios by partially denoising future states. However, this limits their ability to adapt to perturbations, such as an agent controlled by a different motion planner while being observed by our model. This is a typical setup in closed-loop simulation. To evaluate this, we test the RoAD models' adaptation to an adversarial agent. We select one agent per scene and control it using its replay log, slowing it down so that it reaches only half its trajectory by the end of the simulation. The simulation runs for 40 time steps at 10 Hz, given initial 10-step observations, which makes the performance of DJINN one-shot a lower bound, since it is blind to the adversarial agent during the simulation.
In total, we select 1,440 scenes from the INTERACTION validation set, focusing on the six locations with the largest number of scenarios. Scenes with a low number of participants are filtered out, as agents in these scenarios are less likely to interact. We take three samples per scenario for each model, then report the average collision rate in Table 2. In addition, we report the prediction time for a single sample on an RTX 2080 Ti GPU in Table 2 for each model to highlight the efficiency of our RoAD models.
DJINN-10 (MPC-1) achieves the lowest collision rate compared to the other models while having the longest total prediction time. AR reduces the collision rate by 3x compared to the lower bound (DJINN). DJINN-10 (MPC-1) achieving better results than AR aligns with our expectations, as DJINN-10 (MPC-1) can look ahead ten steps into the future, whereas AR predicts only one step ahead. Our proposed RoAD-15 performs close to AR, reducing the collision rate by over 2.5x compared to DJINN while requiring half the prediction time of AR and DJINN-10 (MPC-1).
As an ablation, we measured the collision rate for RoAD-20 in the reactivity experiment. We observed that increasing the window size can reduce the prediction time further at the cost of a slightly higher collision rate. Figure S1 in the Supplementary Materials shows an example where RoAD-20 fails to react while RoAD-15 avoids a collision in the same scenario. This trade-off provides practitioners with the flexibility to decide whether reactivity or computational efficiency is more important for their simulation needs.
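For reference, one plausible implementation of the scene-level displacement metrics defined above; the array shapes are our own convention, and (Ngiam et al. 2021) remains the authoritative definition:

import numpy as np

def min_scene_metrics(samples, gt):
    # samples: [K, A, T, 2] predicted (x, y) positions for all agents.
    # gt:      [A, T, 2] ground-truth positions.
    # The displacement error is averaged jointly over agents (and over
    # time for ADE) within each sample, then minimized over samples.
    err = np.linalg.norm(samples - gt[None], axis=-1)  # [K, A, T]
    scene_ade = err.mean(axis=(1, 2))                  # joint ADE per sample
    scene_fde = err[:, :, -1].mean(axis=1)             # joint FDE per sample
    return scene_ade.min(), scene_fde.min()

# e.g. the minimum over K = 6 joint samples, as in the protocol above
rng = np.random.default_rng(0)
ade, fde = min_scene_metrics(rng.normal(size=(6, 4, 30, 2)),
                             rng.normal(size=(4, 30, 2)))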
Figure 3: From top to bottom row: AR, RoAD-20. By looking ahead beyond the subsequent step, the pedestrian marked with a red dot, controlled by the RoAD-20 planner, avoids colliding with the vehicle. Brown circles highlight the interaction region. Grey trajectories denote replay logs and orange trajectories are the full predicted future. This example demonstrates that RoAD-20, with a longer planning horizon compared to AR, can anticipate and mitigate interactions with other agents effectively.
Table 1: Motion forecasting displacement metrics on the INTERACTION validation set (Location: All).

Model Type          minSceneADE   minSceneFDE   Miss Rate
DJINN (One shot)    0.388         1.004         0.049
DJINN-10 (MPC-1)    0.692         1.675         0.166
AR                  0.695         1.670         0.168
RoAD-15             0.673         1.596         0.160
RoAD-20             0.654         1.553         0.142

Table 2: Performance with an adversarial ego agent.

Model               Collision Rate   Prediction time (min)
DJINN               0.052            0.07
DJINN-10 (MPC-1)    0.014            0.69
AR                  0.016            0.68
RoAD-15             0.019            0.34
RoAD-20             0.024            0.20

Table 3: Ablation on conditioning augmentation (CA) across six locations from the INTERACTION dataset.

Metrics        RoAD w/o CA   RoAD w/ CA
minSceneADE    0.930         0.663
minSceneFDE    2.197         1.579
ego minADE     0.693         0.475
Ablation on conditioning augmentation We show the significance of conditioning augmentation by measuring the displacement metrics for two RoAD models trained with the same configuration, one without conditioning augmentation, in Table 3. The displacement errors increase significantly without conditioning augmentation.

Conclusion
In conclusion, we have proposed a rolling diffusion-based traffic scene planning framework that strikes a beneficial compromise between reactivity and computational efficiency. We believe this work addresses a gap in the community by enabling the autoregressive generation of traffic scenarios for all agents jointly, and it offers insights into the crucial role of conditioning augmentation techniques. For future work, we aim to explore test-time conditioning with this model and seek to enhance model performance through flexible conditioning on past observations (Harvey et al. 2022).
Acknowledgment
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada CIFAR AI Chairs Program, Inverted AI, MITACS, and Google. This research was enabled in part by technical support and computational resources provided by the Digital Research Alliance of Canada Compute Canada (alliancecan.ca), the Advanced Research Computing at the University of British Columbia (arc.ubc.ca), and Amazon.
References
Austin, J.; Johnson, D. D.; Ho, J.; Tarlow, D.; and Van Den Berg, R. 2021. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34: 17981–17993.
Chai, Y.; Sapp, B.; Bansal, M.; and Anguelov, D. 2019. MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction. In Kaelbling, L. P.; Kragic, D.; and Sugiura, K., eds., 3rd Annual Conference on Robot Learning, CoRL 2019, Osaka, Japan, October 30 - November 1, 2019, Proceedings, volume 100 of Proceedings of Machine Learning Research, 86–99. PMLR.
Chang, W.-J.; Pittaluga, F.; Tomizuka, M.; Zhan, W.; and Chandraker, M. 2023. Controllable Safety-Critical Closed-loop Traffic Simulation via Guided Diffusion. arXiv preprint arXiv:2401.00391.
Cui, H.; Radosavljevic, V.; Chou, F.; Lin, T.; Nguyen, T.; Huang, T.; Schneider, J.; and Djuric, N. 2019. Multimodal Trajectory Predictions for Autonomous Driving using Deep Convolutional Networks. In International Conference on Robotics and Automation, ICRA 2019, Montreal, QC, Canada, May 20-24, 2019, 2090–2096. IEEE.
Gulino, C.; Fu, J.; Luo, W.; Tucker, G.; Bronstein, E.; Lu, Y.; Harb, J.; Pan, X.; Wang, Y.; Chen, X.; et al. 2024. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research. Advances in Neural Information Processing Systems, 36.
Guo, Z.; Gao, X.; Zhou, J.; Cai, X.; and Shi, B. 2023. SceneDM: Scene-level multi-agent trajectory generation with consistent diffusion models. arXiv preprint arXiv:2311.15736.
Han, B.; Peng, H.; Dong, M.; Ren, Y.; Shen, Y.; and Xu, C. 2024. AMD: Autoregressive Motion Diffusion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 2022–2030.
Harvey, W.; Naderiparizi, S.; Masrani, V.; Weilbach, C.; and Wood, F. 2022. Flexible diffusion modeling of long videos. Advances in Neural Information Processing Systems, 35: 27953–27965.
Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A.; Kingma, D. P.; Poole, B.; Norouzi, M.; Fleet, D. J.; et al. 2022a. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.
Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 6840–6851.
Ho, J.; Saharia, C.; Chan, W.; Fleet, D. J.; Norouzi, M.; and Salimans, T. 2022b. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47): 1–33.
Hoogeboom, E.; Gritsenko, A. A.; Bastings, J.; Poole, B.; Berg, R. v. d.; and Salimans, T. 2021. Autoregressive diffusion models. arXiv preprint arXiv:2110.02037.
Janner, M.; Du, Y.; Tenenbaum, J. B.; and Levine, S. 2022. Planning with Diffusion for Flexible Behavior Synthesis. In Chaudhuri, K.; Jegelka, S.; Song, L.; Szepesvári, C.; Niu, G.; and Sabato, S., eds., International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, 9902–9915. PMLR.
Jiang, C.; Cornman, A.; Park, C.; Sapp, B.; Zhou, Y.; Anguelov, D.; et al. 2023. MotionDiffuser: Controllable multi-agent motion prediction using diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9644–9653.
Karras, T.; Aittala, M.; Aila, T.; and Laine, S. 2022. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35: 26565–26577.
Kingma, D.; Salimans, T.; Poole, B.; and Ho, J. 2021. Variational diffusion models. Advances in Neural Information Processing Systems, 34: 21696–21707.
Liu, Y.; Lioutas, V.; Lavington, J. W.; Niedoba, M.; Sefas, J.; Dabiri, S.; Green, D.; Liang, X.; Zwartsenberg, B.; Ścibior, A.; et al. 2023. Video Killed the HD-Map: Predicting Multi-Agent Behavior Directly From Aerial Images. In 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), 3261–3267. IEEE.
Nayakanti, N.; Al-Rfou, R.; Zhou, A.; Goel, K.; Refaat, K. S.; and Sapp, B. 2023. Wayformer: Motion Forecasting via Simple & Efficient Attention Networks. In IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 - June 2, 2023, 2980–2987. IEEE.
Ngiam, J.; Caine, B.; Vasudevan, V.; Zhang, Z.; Chiang, H.-T. L.; Ling, J.; Roelofs, R.; Bewley, A.; Liu, C.; Venugopal, A.; et al. 2021. Scene transformer: A unified architecture for predicting multiple agent trajectories. arXiv preprint arXiv:2106.08417.
Niedoba, M.; Lavington, J.; Liu, Y.; Lioutas, V.; Sefas, J.; Liang, X.; Green, D.; Dabiri, S.; Zwartsenberg, B.; Scibior, A.; et al. 2024. A Diffusion-Model of Joint Interactive Navigation. Advances in Neural Information Processing Systems, 36.
Pronovost, E.; Ganesina, M. R.; Hendy, N.; Wang, Z.; Morales, A.; Wang, K.; and Roy, N. 2023. Scenario Diffusion: Controllable driving scenario generation with diffusion. Advances in Neural Information Processing Systems, 36: 68873–68894.
Rempe, D.; Philion, J.; Guibas, L. J.; Fidler, S.; and Litany, O. 2022. Generating Useful Accident-Prone Driving Scenarios via a Learned Traffic Prior. In Conference on Computer Vision and Pattern Recognition (CVPR).
Ruhe, D.; Heek, J.; Salimans, T.; and Hoogeboom, E. 2024. Rolling Diffusion Models. arXiv preprint arXiv:2402.09470.
Ścibior, A.; Lioutas, V.; Reda, D.; Bateni, P.; and Wood, F. 2021. Imagining the road ahead: Multi-agent trajectory prediction via differentiable simulation. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 720–725. IEEE.
Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; and Ganguli, S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2256–2265. PMLR.
Song, Y.; Sohl-Dickstein, J.; Kingma, D. P.; Kumar, A.; Ermon, S.; and Poole, B. 2020. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. 2020. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2446–2454.
Suo, S.; Regalado, S.; Casas, S.; and Urtasun, R. 2021. TrafficSim: Learning to simulate realistic multi-agent behaviors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10400–10409.
Tashiro, Y.; Song, J.; Song, Y.; and Ermon, S. 2021. CSDI: Conditional score-based diffusion models for probabilistic time series imputation. Advances in Neural Information Processing Systems, 34: 24804–24816.
Treiber, M.; Hennecke, A.; and Helbing, D. 2000. Congested traffic states in empirical observations and microscopic simulations. Physical Review E, 62(2): 1805.
Uria, B.; Murray, I.; and Larochelle, H. 2014. A deep and tractable density estimator. In International Conference on Machine Learning, 467–475. PMLR.
Wu, T.; Fan, Z.; Liu, X.; Zheng, H.-T.; Gong, Y.; Jiao, J.; Li, J.; Guo, J.; Duan, N.; Chen, W.; et al. 2024. AR-Diffusion: Auto-regressive diffusion model for text generation. Advances in Neural Information Processing Systems, 36.
Xu, D.; Chen, Y.; Ivanovic, B.; and Pavone, M. 2023. BITS: Bi-level imitation for traffic simulation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2929–2936. IEEE.
Yin, W.; Tu, R.; Yin, H.; Kragic, D.; Kjellström, H.; and Björkman, M. 2023. Controllable Motion Synthesis and Reconstruction with Autoregressive Diffusion Models. In 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 1102–1108. IEEE.
Zhan, W.; Sun, L.; Wang, D.; Shi, H.; Clausse, A.; Naumann, M.; Kümmerle, J.; Königshof, H.; Stiller, C.; de La Fortelle, A.; and Tomizuka, M. 2019. INTERACTION Dataset: An INTERnational, Adversarial and Cooperative moTION Dataset in Interactive Driving Scenarios with Semantic Maps. arXiv:1910.03088 [cs, eess].
Zhang, Z.; Liu, R.; Aberman, K.; and Hanocka, R. 2023. TEDi: Temporally-entangled diffusion for long-term motion synthesis. arXiv preprint arXiv:2307.15042.
Zhao, T.; Xu, Y.; Monfort, M.; Choi, W.; Baker, C.; Zhao, Y.; Wang, Y.; and Wu, Y. N. 2019. Multi-agent tensor fusion for contextual trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12126–12134.
Zhong, Z.; Rempe, D.; Xu, D.; Chen, Y.; Veer, S.; Che, T.; Ray, B.; and Pavone, M. 2023. Guided conditional diffusion for controllable traffic simulation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 3560–3566. IEEE.
Supplementary Materials
Introduction
We provide an extended discussion on related work regarding autoregressive diffusion models. We also detail the computational resources and datasets used in our experiments. Furthermore, we present additional results on accumulation errors caused by replanning for DJINN-10, as well as visualizations showcasing the reactivity of the RoAD model with varying window sizes.
Additional Related Work
Autoregressive Diffusion Models (ARDM) (Hoogeboom et al. 2021) introduce an order-agnostic autoregressive diffusion model that combines an order-agnostic autoregressive model (Uria, Murray, and Larochelle 2014) with a discrete diffusion model (Austin et al. 2021). The order-agnostic nature of this model eliminates the need for generating subsequent predictions in a specific order, thereby enabling faster prediction times through parallel sampling. Additionally, relaxing the causal assumption leads to a more efficient per-time-step loss function during training. However, such a model is not suitable for our application due to the sequential nature of traffic simulation. AMD (Han et al. 2024) proposes an auto-regressive motion generation approach for human motion given a text prompt, but unlike the Rolling Diffusion Model, it denoises one clean motion sample at a time, which is slow at prediction time. The Rolling Diffusion Model (RDM) (Ruhe et al. 2024) proposes a sliding window approach targeted at long video generation but does not specifically study its application in a multi-agent system, particularly for closed-loop traffic simulation. We investigate the level of reactivity when applying rolling diffusion models as a traffic scene planner.
Compute resources
We run all our experiments on four NVIDIA V100 GPUs hosted by a cloud provider. We trained our RoAD models for 9 days, amounting to 36 GPU-days. AR and DJINN were also trained for 36 GPU-days. In total, including preliminary runs and ablations, we estimate that the project required roughly 300 GPU-days.
Dataset
We experiment with the INTERACTION dataset (Zhan et al. 2019), which is available for non-commercial use following the guidelines at https://interaction-dataset.com/.

Accumulation Errors
… SNR ratio within the window. Denoising for the next simulation step does not start from Gaussian noise, resulting in lower accumulation errors compared to DJINN-10 (MPC-1).

Table 4: Accumulation errors caused by replanning for DJINN-10.

Metrics        DJINN-10 (MPC-1)   DJINN-10 (MPC-5)
minSceneADE    0.692              0.583
minSceneFDE    1.675              1.351
Miss Rate      0.166              0.091

Additional Visualizations
In Figure 4, we demonstrate the flexibility of our RoAD model by adjusting the sliding window size.