0% found this document useful (0 votes)

17 views17 pages

2407.05679v2

BEVWorld is a novel multimodal world model designed for autonomous driving, which tokenizes various sensor inputs into a unified Bird's Eye View (BEV) latent space for effective environment modeling and future scenario prediction. The model comprises a multi-modal tokenizer that encodes and reconstructs sensor data, and a latent BEV sequence diffusion model that predicts future states based on action conditions. Experimental results demonstrate BEVWorld's effectiveness in generating future scenes and enhancing tasks such as perception and motion prediction in autonomous driving applications.

Uploaded by

soyebn

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

17 views17 pages

2407.05679v2

Uploaded by

soyebn

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 17

BEVWorld: A Multimodal World Model for

Autonomous Driving via Unified BEV Latent Space

Yumeng Zhang Shi Gong Kaixin Xiong Xiaoqing Ye†

Xiao Tan Fan Wang Jizhou Huang† * Hua Wu Haifeng Wang
Baidu Inc., China
{zhangyumeng04,gongshi,yexiaoqing,huangjizhou01}@baidu.com
arXiv:2407.05679v2 [cs.CV] 18 Jul 2024

Abstract
World models are receiving increasing attention in autonomous driving for their
ability to predict potential future scenarios. In this paper, we present BEVWorld, a
novel approach that tokenizes multimodal sensor inputs into a unified and compact
Bird’s Eye View (BEV) latent space for environment modeling. The world model
consists of two parts: the multi-modal tokenizer and the latent BEV sequence
diffusion model. The multi-modal tokenizer first encodes multi-modality informa-
tion and the decoder is able to reconstruct the latent BEV tokens into LiDAR and
image observations by ray-casting rendering in a self-supervised manner. Then
the latent BEV sequence diffusion model predicts future scenarios given action
tokens as conditions. Experiments demonstrate the effectiveness of BEVWorld in
autonomous driving tasks, showcasing its capability in generating future scenes
and benefiting downstream tasks such as perception and motion prediction. Code
will be available at https://github.com/zympsyche/BevWorld.

1 Introduction
Autonomous driving has made significant progress in recent years, but it still faces several challenges.
First, training a reliable autonomous driving system requires a large amount of precisely annotated
data, which is resource-intensive and time-consuming. Thus, exploring how to utilize unlabeled
multimodal sensor data within a self-supervised learning paradigm is crucial. Moreover, a reliable
autonomous driving system requires not only the ability to perceive the environment but also a
comprehensive understanding of environmental information for decision-making.
We claim that the key to addressing these challenges is to construct a multimodal world model for
autonomous driving. By modeling the environment, the world model predicts future states and
behaviors, empowering the autonomous agent to make more sophisticated decisions. Recently, some
world models have demonstrated their practical significance in autonomous driving [12, 42, 40].
However, most methods are based on a single modality, which cannot adapt to current multisensor,
multimodal autonomous driving systems. Due to the heterogeneous nature of multimodal data,
integrating them into a unified generative model and seamlessly adapting to downstream tasks
remains an unresolved issue.
In this paper, we introduce BEVWorld, a multimodal world model that transforms diverse multimodal
data into a unified bird’s-eye-view (BEV) representation and performs action-conditioned future
prediction within this unified space. Our BEVWorld consists of two parts: a multimodal tokenizer
network and a latent BEV sequence diffusion network.
The core capability of the multimodal tokenizer lies in compressing original multimodal sensor
data into a unified BEV latent space. This is achieved by transforming visual information into 3D
†
Corresponding authors.
∗Project lead for end-to-end autonomous driving.
space and aligning visual semantic information with Lidar geometric information in a self-supervised
manner using an auto-encoder structure. To reverse this process and reconstruct the multimodal data,
a 3D volume representation is constructed from the BEV latent to predict high-resolution images and
point clouds using a ray-based rendering technique [39].
The Latent BEV Sequence Diffusion network is designed to predict future frames of images and point
clouds. With the help of a multimodal tokenizer, this task is made easier, allowing for accurate future
BEV predictions. Specifically, we use a diffusion-based method with a spatial-temporal transformer,
which converts sequential noisy BEV latents into clean future BEV predictions based on the action
condition.
To summarize, the main contributions of this paper are:
• We introduced a novel multimodal tokenizer that integrates visual semantics and 3D geome-
try into a unified BEV representation. The quality of the BEV representation is ensured by
innovatively applying a rendering-based method to restore multi-sensor data from BEV. The
effectiveness of the BEV representation is validated through ablation studies, visualizations,
and downstream task experiments.
• We designed a latent diffusion-based world model that enables the synchronous generation
of future multi-view images and point clouds. Extensive experiments on the nuScenes and
Carla datasets showcase the leading future prediction performance of multimodal data.

2 Related Works
2.1 World Model

This part mainly reviews the application of world models in the autonomous driving area, focusing
on scenario generation as well as the planning and control mechanism. If categorized by the key
applications, we divide the sprung-up world model works into two categories. (1) Driving Scene
Generation. The data collection and annotation for autonomous driving are high-cost and sometimes
risky. In contrast, world models find another way to enrich unlimited, varied driving data due to their
intrinsic self-supervised learning paradigms. GAIA-1 [12] adopts multi-modality inputs collected in
the real world to generate diverse driving scenarios based on different prompts (e.g., changing weather,
scenes, traffic participants, vehicle actions) in an autoregressive prediction manner, which shows
its ability of world understanding. ADriver-I [13] combines the multimodal large language model
and a video latent diffusion model to predict future scenes and control signals, which significantly
improves the interpretability of decision-making, indicating the feasibility of the world model as
a fundamental model. MUVO [3] integrates LiDAR point clouds beyond videos to predict future
driving scenes in the representation of images, point clouds, and 3D occupancy. Further, Copilot4D
[42] leverages a discrete diffusion model that operates on BEV tokens to perform 3D point cloud
forecasting and OccWorld [45] adopts a GPT-like generative architecture for 3D semantic occupancy
forecast and motion planning. DriveWorld [27] and UniWorld [26] approach the world model as 4D
scene understanding task for pre-training for downstream tasks. (2) Planning and Control. MILE
[11] is the pioneering work that adopts a model-based imitation learning approach for joint dynamics
future environment and driving policy learning in autonomous driving. DriveDreamer [33] offers a
comprehensive framework to utilize 3D structural information such as HDMap and 3D box to predict
future driving videos and driving actions. Beyond the single front view generation, DriveDreamer-2
[44] further produces multi-view driving videos based on user descriptions. TrafficBots [43] develops
a world model for multimodal motion prediction and end-to-end driving, by facilitating action
prediction from a BEV perspective. Drive-WM [34] generates controllable multiview videos and
applies the world model to safe driving planning to determine the optimal trajectory according to the
image-based rewards.

2.2 Video Diffusion Model

World model can be regarded as a sequence-data generation task, which belongs to the realm of video
prediction. Many early methods [11, 12] adopt VAE [17] and auto-regression [6] to generate future
predictions. However, the VAE suffers from unsatisfactory generation quality, and the auto-regressive
method has the problem of cumulative error. Thus, many researchers switch to study on diffusion-
based future prediction methods [44, 20], which achieves success in the realm of video generation

2
Multi-Modal Tokenizer
!

Camera Ray-casting
Multi-view Image Image reconstruction

Unified BEV Lidar Ray-casting

Representation Voxel
Multi-modal Multi-modal Features
Encoder Decoder

Lidar Lidar reconstruction

Latent BEV Sequence Diffusion

DDIM
Training process T-P T-P T-P T-P
Testing process … … … …
T Transformer Transformer T
T T
noise Condition Condition

T+1 T+1 … … T+1 T+1 … … T+1

… … T-P T T+N … … T-P T T+N …
T+N T+N T+N T+N T+N

clean token noise-added token action token denoised token pure noisy token

Figure 1: An overview of our method BEVWorld. BEVWorld consists of the multi-modal tokenizer
and the latent BEV sequence diffusion model. The tokenizer first encodes the image and Lidar
observations into BEV tokens, then decodes the unified BEV tokens to reconstructed observations by
NeRF rendering strategies. Latent BEV sequence diffusion model predicts future BEV tokens with
corresponding action conditions by a Spatial-Temporal Transformer. The multi-frame future BEV
tokens are obtained by a single inference, avoiding the cumulative errors of auto-regressive methods.

recently and has ability to predict multiple future frames simultaneously. This part mainly reviews
the related methods of video diffusion model.
The standard video diffusion model [10] takes temporal noise as input, and adopts the UNet [29]
with temporal attention to obtain denoised videos. However, this method requires high training costs
and the generation quality needs further improvement. Subsequent methods are mainly improved
along these two directions. In view of the high training cost problem, LVDM[9] and Open-Sora [18]
methods compress the video into a latent space through schemes such as VAE or VideoGPT [37],
which reduces the video capacity in terms of spatial and temporal dimensions. In order to improve the
generation quality of videos, stable video diffusion [2] proposes a multi-stage training strategy, which
adopts image and low-resolution video pretraining to accelerate the model convergence and improve
generation quality. GenAD [38] introduces the causal mask module into UNet to predict plausible
futures following the temporal causality. VDT [24] and Sora [4] replace the traditional UNet with a
spatial-temporal transformer structure. The powerful scale-up capability of the transformer enables
the model to fit the data better and generates more reasonable videos.

3 Method

In this section, we delineate the model structure of BEVWorld. The overall architecture
is illustrated in Figure 1. Given a sequence of multi-view image and Lidar observations
{ot−P , · · · , ot−1 , ot , ot+1 , · · · , ot+N } where ot is the current observation, +/− represent the fu-
ture/past observations and P/N is the number of past/future observations, we aim to predict
{ot+1 , · · · , ot+N } with the condition {ot−P , · · · , ot−1 , ot }. In view of the high computing costs
of learning a world model in original observation space, a multi-modal tokenizer is proposed to
compress the multi-view image and Lidar information into a unified BEV space by frame. The
encoder-decoder structure and the self-supervised reconstruction loss promise proper geometric
and semantic information is well stored in the BEV representation. This design exactly provides
a sufficiently concise representation for the world model and other downstream tasks. Our world
model is designed as a diffusion-based network to avoid the problem of error accumulating as those
in an auto-regressive fashion. It takes the ego motion and {xt−P , · · · , xt−1 , xt }, i.e. the BEV
representation of {ot−P , · · · , ot−1 , ot }, as condition to learn the noise {ϵt+1 , · · · , ϵt+N } added to

3
Image Backbone
Key & Value
"×)×* (" = 256) C " ×)×* (C " = 4)
Deformable-based Compress
Image Features
Transformer

Query Fusion BEV Unified BEV

Features Representation
Point Backbone

Lidar BEV Features

Figure 2: The detailed structure of BEV encoder. The encoder takes as input the multi-view multi-
modality sensor data. Multimodal information is fused using deformable attention, BEV features are
channel-compressed to be compatible with the diffusion models.

{xt+1 , · · · , xt+N } in the training process. In the testing process, a DDIM [32] scheduler is applied
to restore the future BEV token from pure noises. Next we use the decoder of multi-modal tokenizer
to render future multi-view images and Lidar frames out.

3.1 Multi-Modal Tokenizer

Our designed multi-modal tokenizer contains three parts: a BEV encoder network, a BEV Decoder
network and a multi-modal rendering network. The structure of BEV encoder network is illustrated
in the Figure 2. To make the multi-modal network as homogeneous as possible, we adopt the
Swin-Transformer [22] network as the image backbone to extract multi-image features. For Lidar
feature extraction, we first split point cloud into pillars [19] on the BEV space. Then we use the Swin-
Transformer network as the Lidar backbone to extract Lidar BEV features. We fuse the Lidar BEV
features and the multi-view images features with a deformable-based transformer [46]. Specifically,
we sample K(K = 4) points in the height dimension of pillars and project these points onto the
image to sample corresponding image features. The sampled image features are treated as values and
the Lidar BEV features is served as queries in the deformable attention calculation. Considering the
future prediction task requires low-dimension inputs, we further compress the fused BEV feature into
a low-dimensional(C ′ = 4) BEV feature.
For BEV decoder, there is an ambiguity problem when directly using a decoder to restore the images
and Lidar since the fused BEV feature lacks height information. To address this problem, we first
convert BEV tokens into 3D voxel features through stacked layers of upsampling and swin-blocks.
And then we use voxelized NeRF-based ray rendering to restore the multi-view images and Lidar
point cloud.
The multi-modal rendering network can be elegantly segmented into two distinct components, image
reconstruction network and Lidar reconstruction network. For image reconstruction network, we first
get the ray r(t) = o + td, which shooting from the camera center o to the pixel center in direction d.
Then we uniformly sample a set of points {(xi , yi , zi )}N
i=1 along the ray, where Nr (Nr = 150) is
r

the total number of points sampled along a ray. Given a sampled point (xi , yi , zi ), the corresponding
features vi are obtained from the voxel feature according to its position. Then, all the sampled
features in a ray are aggregated as pixel-wise feature descriptor (Eq. 1).
Nr
X i−1
Y
v(r) = wi vi , wi = αi (1 − αj ), αi = σ(MLP(vi )) (1)
i=1 j=1

We traverse all pixels and obtain the 2D feature map V ∈ RHf ×Wf ×Cf of the image. The 2D feature
is converted into the RGB image Ig ∈ RH×W ×3 through a CNN decoder. Three common losses are
added for improving the quality of generated images, perceptual loss [14], GAN loss [8] and L1
loss. Our full objective of image reconstruction is:
Nϕ
X
Lrgb = ∥Ig − It ∥1 + λperc ∥ ϕj (Ig ) − ϕj (It )∥ + λgan Lgan (Ig , It ) (2)
j=1

4
(%!"# , )!"# )
(%!"# , '$"% )

(%!"& , '$"' )
(%!"& , )!"& )

(%! , '$ ) (%! , )! )

Image reconstruction
z y
(%!(& , )!(& )
# !
Lidar reconstruction
"
Lidar O x
Volume
Rendering
Volume Depth
C# ×)# ×*# Rendering
Decoder 3×)$%& ×*$%&
Camera

Figure 3: Left: Details of the multi-view images rendering. Trilinear interpolation is applied to the
series of sampled points along the ray to obtain weight wi and feature vi . {vi } are weighted by {wi }
and summed, respectively, to get the rendered image features, which are concatenated and fed into the
decoder for 8× upsampling, resulting in multi-view RGB images. Right: Details of Lidar rendering.
Trilinear interpolation is also applied to obtain weight wi and depth ti . {ti } are weighted by {wi } and
summed, respectively, to get the final depth of point. Then the point in spherical coordinate system is
transformed to the Cartesian coordinate system to get vanilla Lidar point coordinate.

where It is the ground truth of Ig , ϕj represents the jth layer of pretrained VGG [31] model, and the
definition of Lgan (Ig , It ) can be found in [8].
For Lidar reconstruction network, the ray is defined in the spherical coordinate system with inclination
θ and azimuth ϕ. θ and ϕ are obtained by shooting from the Lidar center to current frame of Lidar point.
We sample the points and get the corresponding features in the same way of image reconstruction.
Since Lidar encodes the depth information, the expected depth Dg (r) of the sampled points are
calculated for Lidar simulation. The depth simulation process and loss function are shown in Eq. 3.
Nr
X
Dg (r) = wi ti , LLidar = ∥Dg (r) − Dt (r)∥1 , (3)
i=1

where ti denotes the depth of sampled point from the Lidar center and Dt (r) is the depth ground
truth calculated by the Lidar observation.
The Cartesian coordinate of point cloud could be calculated by:
(x, y, z) = (Dg (r) sin θ cos ϕ, Dg (r) sin θ sin ϕ, Dg (r) cos θ) (4)

Overall, the multi-modal tokenizer is trained end-to-end with the total loss in Eq. 5:
LTotal = LLidar + Lrgb (5)

3.2 Latent BEV Sequence Diffusion

Most existing world models [42, 12] adopt autoregression strategy to get longer future predictions,
but this method is easily affected by cumulative errors. Instead, we propose latent sequence diffusion
framework, which inputs multiple frames of noise BEV tokens and obtains all future BEV tokens
simultaneously.
The structure of latent sequence diffusion is illustrated in Figure 1. In the training process, the
low-dimensional BEV tokens (xt−P , · · · , xt−1 , xt , xt+1 , · · · , xt+N ) are firstly obtained from the
sensor data. Only BEV encoder in the multi-modal tokenizer is involved in this process and the
parameters of multi-modal tokenizer is frozen. To facilitate the learning of BEV token features
by the world model module, we standardize the input BEV features along the channel dimension
(xt−P , · · · , xt−1 , xt , xt+1 , · · · , xt+N ). Latest history BEV token and current frame BEV token

5
Input tokens Temporal Attn Block Spatial Attn Block Feed Forward Output tokens

#! #" ## Scale
Scale Scale
Action Pointwise
Temporal Multi-Head
C. Casual Attn.
C. Attn. C. Feedforward

Time !! "! !" "" !# "#

Scale Shift Scale Shift Scale Shift
stamp
Layer Norm Layer Norm Layer Norm

Figure 4: The architecture of Spatial-Temporal transformer block.

(xt−P , · · · , xt−1 , xt ) are served as condition tokens while (xt+1 , · · · , xt+N ) are diffused to noisy
BEV tokens (xϵt+1 , · · · , xϵt+N ) with noise {ϵit̂ }t+N
i=t+1 , where t̂ is the timestamp of diffusion process.
The denoising process is carried out with a spatial-temporal transformer containing a sequence
of transformer blocks, the architecture of which is shown in the Figure 4. The input of spatial-
temporal transformer is the concatenation of condition BEV tokens and noisy BEV tokens
(xt−P , · · · , xt−1 , xt , xϵt+1 , · · · , xϵt+N ). These tokens are modulated with action tokens {ai }Ti=T
+N
−P
of vehicle movement and steering, which together form the inputs to spatial-temporal transformer.
More specifically, the input tokens are first passed to temporal attention block for enhancing temporal
smoothness. To avoid time confusion problem, we added the causal mask into temporal attention.
Then, the output of temporal attention block are sent to spatial attention block for accurate details.
The design of spatial attention block follows standard transformer block criterion [24]. Action
token and diffusion timestamp {t̂di }Ti=T +N T +N
−P are concatenated as the condition {ci }i=T −P of diffusion
models and then sent to AdaLN [28] (6) to modulate the token features.
c = concat(a, t̂); γ, β = Linear(c); AdaLN(x̂, γ, β) = LayerNorm(x̂) · (1 + γ) + β (6)
where x̂ is the input features of one transformer block, γ, β is the scale and shift of c.
The output of the Spatial-Temporal transformer is the noise prediction {ϵit̂ (x)}N
i=1 , and the loss is
shown in Eq. 7.
Ldiff = ∥ϵt̂ (x) − ϵt̂ ∥1 . (7)
In the testing process, normalized history frame and current frame BEV tokens (xt−P , · · · , xt−1 , xt )
and pure noisy tokens (ϵt+1 , ϵt+2 , · · · , ϵt+N ) are concatenated as input to world model. The ego
motion token {ai }Ti=T
+N
−P , spanning from moment T − P to T + N , serve as the conditional inputs.
We employ the DDIM [32] schedule to forecast the subsequent BEV tokens. Subsequently, the
denormalized operation is applied to the predicted BEV tokens, which are then fed into the BEV
decoder and rendering network yielding a comprehensive set of predicted multi-sensor data.

4 Experiments
4.1 Dataset

NuScenes [5] NuScenes is a widely used autonomous driving dataset, which comprises multi-modal
data such as multi-view images from 6 cameras and Lidar scans. It includes a total of 700 training
videos and 150 validation videos. Each video includes 20 seconds at a frame rate of 12Hz.
Carla [7] The training data is collected in the open-source CARLA simulator at 2Hz, including 8
towns and 14 kinds of weather. We collect 3M frames with four cameras (1600 × 900) and one Lidar
(32p) for training, and evaluate on the Carla Town05 benchmark, which is the same setting of [30].

4.2 Multi-modal Tokenizer

In this section, we explore the impact of different design decisions in the proposed multi-modal
tokenizer and demonstrate its effectiveness in the downstream tasks. For multi-modal reconstruction
visualization results, please refer to Figure7 and Figure8.

6
4.2.1 Ablation Studies
Various input modalities and output modalities. The proposed multi-modal tokenizer supports
various choice of input and output modalities. We test the influence of different modalities, and
the results are shown in Table 1, where L indicates Lidar modality, C indicates multi-view cameras
modality, and L&C indicates multi-modal modalities. The combination of Lidar and cameras achieves
the best reconstruction performance, which demonstrates that using multi modalities can generate
better BEV features. We find that the PSNR metric is somewhat distorted when comparing ground
truth images and predicted images. This is caused by the mean characteristics of PSNR metric, which
does not evaluate sharpening and blurring well. As shown in Figure 12, though the PSNR of multi
modalities is slightly lower than single camera modality method, the visualization of multi modalities
is better than single camera modality as the FID metric indicates.

Table 1: Ablations of different modalities. Table 2: Ablations of rendering methods.

Input Output FID↓ PSNR↑ Chamfer↓
Method FID↓ PSNR↑ Chamfer↓
C C 19.18 26.95 -
(a) 67.28 9.45 0.24
C L - - 2.67
(b) 5.54 25.68 0.15
L L - - 0.19
L&C L&C 5.54 25.68 0.15

Rendering approaches. To convert from BEV features into multiple sensor data, the main challenge
lies in the varying positions and orientations of different sensors, as well as the differences in imaging
(points and pixels). We compared two types of rendering methods: a) attention-based method,
which implicitly encodes the geometric projection in the model parameters via global attention
mechanism; b) ray-based sampling method, which explicitly utilizes the sensor’s pose information
and imaging geometry. The results of the methods (a) and (b) are presented in Table 2. Method (a)
faces with a significant performance drop in multi-view reconstruction, indicating that our ray-based
sampling approach reduces the difficulty of view transformation, making it easier to achieve training
convergence. Thus we adopt ray-based sampling method for generating multiple sensor data.

4.2.2 Benefit for Downstream Tasks

3D Detection. To verify our proposed method is effective for downstream tasks when used in the
pre-train stage, we conduct experiments on the nuScenes 3D detection benchmark. For the model
structure, in order to maximize the reuse of the structure of our multi-modal tokenizer, the encoder in
the downstream 3D detection task is kept the same with the encoder of the tokenizer described in 3.
We use a BEV encoder attached to the tokenizer encoder for further extracting BEV features. We
design a UNet-style network with the Swin-transformer [22] layers as the BEV encoder. As for the
detection head, we adopt query-based head [21], which contains 500 object queries that searching the
whole BEV feature space and uses hungarian algorithm to match the prediction boxes and the ground
truth boxes. We report both single frame and two frames results. We warp history 0.5s BEV future
to current frame in two frames setting for better velocity estimation. Note that we do not perform
fine-tuning specifically for the detection task all in the interest of preserving the simplicity and clarity
of our setup. For example, the regular detection range is [-60.0m, -60.0m, -5.0m, 60.0m, 60.0m,
3.0m] in the nuScenes dataset while we follow the BEV range of [-80.0m, -80.0m, -4.5m, 80.0m,
80.0m, 4.5m] in the multi-modal reconstruction task, which would result in coarser BEV grids and
lower accuracy. Meanwhile, our experimental design eschew the use of data augmentation techniques
and the layering of point cloud frames. We train 30 epoches on 8 A100 GPUs with a starting
learning rate of 5e−4 that decayed with cosine annealing policy. We mainly focus on the relative
performance gap between training from scratch and use our proposed self-supervised tokenizer
as pre-training model. As demonstrated in Table 3, it is evident that employing our multi-modal
tokenizer as a pre-training model yields significantly enhanced performance across both single and
multi-frame scenarios. Specifically, with a two-frame configuration, we have achieved an impressive
8.4% improvement in the NDS metric and a substantial 13.4% improvement in the mAP metric,
attributable to our multi-modal tokenizer pre-training approach.
Motion Prediction. We further validate the performance of using our method as pre-training model
on the motion prediction task. We attach the motion prediction head to the 3D detection head. The
motion prediction head is stacked of 6 layers of cross attention(CA) and feed-forward network(FFN).

7
For the first layer, the trajectory queries is initialized from the top 200 highest score object queries
selected from the 3D detection head. Then for each layer, the trajectory queries is firstly interacting
with temporal BEV future in CA and further updated by FFN. We reuse the hungarian matching
results in 3D detection head to pair the prediction and ground truth for trajectories. We predict five
possible modes of trajectories and select the one closest to the ground truth for evaluation. For the
training strategy, we train 24 epoches on 8 A100 GPUs with a starting learning rate of 1e−4 . Other
settings are kept the same with the detection configuration. We display the motion prediction results
in Table 3. We observed a decrease of 0.455 meters in minADE and a reduction of 0.749 meters in
minFDE at the two-frames setting when utilizing the tokenizer during the pre-training phase. This
finding confirms the efficacy of self-supervised multi-modal tokenizer pre-training.

Table 3: Comparison of whether use pretrained tokenizer on the nuScenes validation set.
3D Object Detection Motion Prediction
Frames Pretrain
NDS↑ mAP↑ mATE↓ mASE↓ mAOE↓ mAVE↓ mAAE↓ minADE↓ minFDE↓
Single wo 0.366 0.338 0.555 0.290 0.832 1.290 0.357 2.055 3.469
Single w 0.415 0.412 0.497 0.278 0.769 1.275 0.367 1.851 3.153
Two wo 0.392 0.253 0.567 0.308 0.650 0.610 0.212 1.426 2.230
Two w 0.476 0.387 0.507 0.287 0.632 0.502 0.246 0.971 1.481

Table 4: Comparison of generation quality on nuScenes validation dataset.

Methods Multi-view Video Manual Labeling Cond. FID↓ FVD↓

DriveDreamer [33] ✓ ✓ 52.6 452.0
WoVoGen [25] ✓ ✓ ✓ 27.6 417.7
Drive-WM [34] ✓ ✓ ✓ 15.8 122.7
DriveGAN [16] ✓ 73.4 502.3
Drive-WM [34] ✓ ✓ 20.3 212.5
BEVWorld ✓ ✓ 19.0 154.0

Table 5: Comparison with SOTA methods on the nuScenes validation set and Carla dataset. The
suffix * represents the methods adopt classifier-free guidance (CFG) when getting the final results,
and † is the reproduced result. Cham. is the abbreviation of Chamfer Distance.
Dataset Methods Modal PSNR 1s↑ FID 1s↓ Cham. 1s↓ PSNR 3s↑ FID 3s↓ Cham. 3s↓
nuScenes SPFNet [35] Lidar - - 2.24 - - 2.50
nuScenes S2Net [36] Lidar - - 1.70 - - 2.06
nuScenes 4D-Occ [15] Lidar - - 1.41 - - 1.40
nuScenes Copilot4D* [42] Lidar - - 0.36 - - 0.58
nuScenes Copilot4D [42] Lidar - - - - - 1.40
nuScenes BEVWorld Multi 20.85 22.85 0.44 19.67 37.37 0.73
Carla 4D-Occ† [15] Lidar - - 0.27 - - 0.44
Carla BEVWorld Multi 20.71 36.80 0.07 19.12 43.12 0.17

4.3 Latent BEV Sequence Diffusion

In this section, we introduce the training details of latent BEV Sequence diffusion and compare this
method with other related methods.

4.3.1 Training Details.

NuScenes. We adopt a three stage training for future BEV prediction. 1) Next BEV pretraining. The
model predicts the next frame with the {xt−1 , xt } condition. In practice, we adopt sweep data of
nuScenes to reduce the difficulty of temporal feature learning. The model is trained 20000 iters with
a batch size 128. 2) Short Sequence training. The model predicts the N (N = 5) future frames of
sweep data. At this stage, the network can learn how to perform short-term (0.5s) feature reasoning.
The model is trained 20000 iters with a batch size 128. 3) Long Sequence Fine-tuning. The model
predicts the N (N = 6) future frames (3s) of key-frame data with the {xt−2 , xt−1 , xt } condition.

8
The model is trained 30000 iters with a batch size 128. The learning rate of three stages is 5e-4
and the optimizer is AdamW [23]. Note that our method does not introduce classifier-free gudiance
(CFG) strategy in the training process for better integration with downstream tasks, as CFG requires
an additional network inference, which doubles the computational cost.
Carla. The model is fine-tuned 30000 iterations with a nuScenes-pretrained model with a batch size
32. The initial learning rate is 5e-4 and the optimizer is AdamW [23]. CFG strategy is not introduced
in the training process, following the same setting of nuScenes.

4.3.2 Lidar Prediction Quality

NuScenes. We compare the Lidar prediction quality with existing SOTA methods. We follow the
evaluation process of [42] and report the Chamfer 1s/3s results in Table 5, where the metric is
computed within the region of interest: -70m to +70m in both x-axis and y-axis, -4.5m to +4.5m in
z-axis. Our proposed method outperforms SPFNet, S2Net and 4D-Occ in Chamfer metric by a large
margin. When compared to Copilot4D [42], our approach uses less history condition frames and no
CFG schedule setting considering the large memory cost for multi-modal inputs. Our BEVWorld
requires only 3 past frames for 3-second predictions, whereas Copilot4D utilizes 6 frames for the
same duration. Our method demonstrates superior performance, achieving chamfer distance of 0.73
compared to 1.40, in the no CFG schedule setting, ensuring a fair and comparable evaluation.
Carla. We also conducted experiments on the Carla dataset to verify the scalability of our method.
The quantitative results are shown in Table 5. We reproduce the results of 4D-Occ on Carla and
compare it with our method, obtaining similar conclusions to this on the nuScenes dataset. Our
method significantly outperform 4D-Occ in prediction results for both 1-second and 3-second.

4.3.3 Video Generation Quality

NuScenes. We compare the video generation quality with past single-view and multi-view generation
methods. Most of existing methods adopt manual labeling condition, such as layout or object label,
to improve the generation quality. However, using annotations reduces the scalability of the world
model, making it difficult to train with large amounts of unlabeled data. Thus we do not use the
manual annotations as model conditions. The results are shown in Table 4. The proposed method
achieves best FID and FVD performance in methods without using manual labeling condition and
exhibits comparable results with methods using extra conditions. The visual results of Lidar and
video prediction are shown in Figure 5. Furthermore, the generation can be controlled by the action
conditions. We transform the action token into left turn, right turn, speed up and slow down, and the
generated image and Lidar can be generated according to these instructions. The visualization of
controllability are shown in Figure 6.
Carla. The generation quality on Carla is similar to that on nuScenes dataset, which demonstrates
the scalability of our method across different datasets. The quantitative results of video predictions
are shown in Table 4 with 36.80(FID 1s) and 43.12(FID 3s). Qualitative results of video predictions
are shown in the appendix.

4.3.4 Benefit for Planning Tasks

We further validate the effectiveness of the predicted future BEV features from latent diffusion
network for toy downstream open-loop planning task [41] on nuScenes dataset. Note that we do not
use actions of ego car in future frames here and we adopt x0 -parameterization [1] for fast inference.
We adopt four vectors, history trajectory, command, perception and optional future BEV vectors,
as input for planning head. History trajectory vector encodes the ego movement from last frame
to current frame. Command vector refers to the routing command such as turning left or right.
Perception vector is extracted from the object query in the detection head that interacted with all
detection queries. Future BEV vector is obtained from the pooled BEV features from the fixed
diffusion model. When using future BEV vectors, PNC L2 3s metric is decreased from 1.030m to
0.977m, which validates that the predicted BEV from world model is beneficial for planning tasks.

9
Multi-view Image LiDAR
Reference T=0

Pred T=1s

Pred T=3s

Figure 5: The visualization of Lidar and video predictions.

Reference T=0s Turn Right T=2s Turn Left T=2s Speed Up T=2s Slow Down T=2s

Figure 6: The visualization of controllability. Due to space limitations, we only show the results of
the front and rear views for a clearer presentation.

5 Conclusion
We present BEVWorld, an innovative autonomous driving framework that leverages a unified Bird’s
Eye View latent space to construct a multi-modal world model. BEVWorld’s self-supervised learning
paradigm allows it to efficiently process extensive unlabeled multimodal sensor data, leading to a
holistic comprehension of the driving environment. We validate the effectiveness of BEVWorld in
the downstream autonomous driving tasks. Furthermore, BEVWorld achieves satisfactory results
in multi-modal future predictions with latent diffusion network, showcasing its capabilities through
experiments on both real-world(nuScenes) and simulated(carla) datasets. We hope that the work
presented in this paper will stimulate and foster future developments in the domain of world models
for autonomous driving.

References
[1] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured
denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems,

10
34:17981–17993, 2021.

[2] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz,
Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video
diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

[3] Daniel Bogdoll, Yitian Yang, and J Marius Zöllner. Muvo: A multimodal generative world model for
autonomous driving with geometric representations. arXiv preprint arXiv:2311.11762, 2023.

[4] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr,
Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh.
Video generation models as world simulators. 2024. URL https://openai.com/research/
video-generation-models-as-world-simulators.

[5] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan,
Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631,
2020.

[6] Guangyan Chen, Meiling Wang, Yi Yang, Kai Yu, Li Yuan, and Yufeng Yue. Pointgpt: Auto-regressively
generative pre-training from point clouds. Advances in Neural Information Processing Systems, 36, 2024.

[7] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open
urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017.

[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):
139–144, 2020.

[9] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for
high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.

[10] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet.
Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.

[11] Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zachary Murez, Corina Gurau, Hudson Yeo, Alex
Kendall, Roberto Cipolla, and Jamie Shotton. Model-based imitation learning for urban driving. Advances
in Neural Information Processing Systems, 35:20703–20716, 2022.

[12] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton,
and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint
arXiv:2309.17080, 2023.

[13] Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, and Tiancai
Wang. Adriver-i: A general world model for autonomous driving. arXiv preprint arXiv:2311.13549, 2023.

[14] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-
resolution. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands,
October 11-14, 2016, Proceedings, Part II 14, pages 694–711. Springer, 2016.

[15] Tarasha Khurana, Peiyun Hu, David Held, and Deva Ramanan. Point cloud forecasting as a proxy for
4d occupancy forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 1116–1124, 2023.

[16] Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. Drivegan: Towards a controllable
high-quality neural simulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 5820–5829, 2021.

[17] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,
2013.

[18] PKU-Yuan Lab and Tuzhan AI etc. Open-sora-plan, April 2024. URL https://doi.org/10.5281/
zenodo.10948109.

[19] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars:
Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 12697–12705, 2019.

11
[20] Xiaofan Li, Yifu Zhang, and Xiaoqing Ye. Drivingdiffusion: Layout-guided multi-view driving scene
video generation with latent diffusion model. arXiv preprint arXiv:2310.07771, 2023.
[21] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai.
Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal trans-
formers. In European conference on computer vision, pages 1–18. Springer, 2022.
[22] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin
transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF
international conference on computer vision, pages 10012–10022, 2021.
[23] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint
arXiv:1711.05101, 2017.
[24] Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, and Mingyu Ding. Vdt: General-
purpose video diffusion transformers via mask modeling. In The Twelfth International Conference on
Learning Representations, 2023.
[25] Jiachen Lu, Ze Huang, Jiahui Zhang, Zeyu Yang, and Li Zhang. Wovogen: World volume-aware diffusion
for controllable multi-camera driving scene generation. arXiv preprint arXiv:2312.02934, 2023.
[26] Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, and Bin Dai. Uniworld: Autonomous driving pre-training
via world models. arXiv preprint arXiv:2308.07234, 2023.
[27] Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo,
Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous
driving. arXiv preprint arXiv:2405.04390, 2024.
[28] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th
international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241.
Springer, 2015.
[30] Hao Shao, Letian Wang, RuoBing Chen, Hongsheng Li, and Yu Liu. Safety-enhanced autonomous driving
using interpretable sensor fusion transformer. arXiv preprint arXiv:2207.14024, 2022.
[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-
tion. arXiv preprint arXiv:1409.1556, 2014.
[32] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint
arXiv:2010.02502, 2020.
[33] Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, and Jiwen Lu. Drivedreamer: Towards real-world-
driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023.
[34] Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future:
Multiview visual forecasting and planning with world model for autonomous driving. arXiv preprint
arXiv:2311.17918, 2023.
[35] Xinshuo Weng, Jianren Wang, Sergey Levine, Kris Kitani, and Nicholas Rhinehart. Inverting the pose
forecasting pipeline with spf2: Sequential pointcloud forecasting for sequential pose forecasting. In
Conference on robot learning, pages 11–20. PMLR, 2021.
[36] Xinshuo Weng, Junyu Nan, Kuan-Hui Lee, Rowan McAllister, Adrien Gaidon, Nicholas Rhinehart, and
Kris M Kitani. S2net: Stochastic sequential pointcloud forecasting. In European Conference on Computer
Vision, pages 549–564. Springer, 2022.
[37] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae
and transformers, 2021.
[38] Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu,
Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. arXiv preprint
arXiv:2403.09630, 2024.
[39] Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel
Urtasun. Unisim: A neural closed-loop sensor simulator. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 1389–1399, 2023.

12
[40] Zetong Yang, Li Chen, Yanan Sun, and Hongyang Li. Visual point cloud forecasting enables scalable
autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2024.

[41] Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing
Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes.
arXiv preprint arXiv:2305.10430, 2023.

[42] Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, and Raquel Urtasun. Copilot4d: Learning
unsupervised world models for autonomous driving via discrete diffusion. In The Twelfth International
Conference on Learning Representations, 2024.

[43] Zhejun Zhang, Alexander Liniger, Dengxin Dai, Fisher Yu, and Luc Van Gool. Trafficbots: Towards world
models for autonomous driving simulation and motion prediction. In 2023 IEEE International Conference
on Robotics and Automation (ICRA), pages 1522–1529. IEEE, 2023.

[44] Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, and Xingang
Wang. Drivedreamer-2: Llm-enhanced world models for diverse driving video generation. arXiv preprint
arXiv:2403.06845, 2024.

[45] Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld:
Learning a 3d occupancy world model for autonomous driving. arXiv preprint arXiv:2311.16038, 2023.

[46] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable
transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.

13
Appendix

A Qualitative Results
In this section, qualitative results are presented to demonstrate the performance of the proposed
method.

A.1 Tokenizer Reconstructions

The visualization of tokenizer reconstructions are shown in Figure 7 and Figure 8. The proposed
tokenizer can recover the image and Lidar with the unified BEV features.

Reconstruction GT

102

103 demo_2

Figure 7: The visualization of LiDAR and video reconstructions on nuScenes dataset.

A.2 Multi-modal Future Predictions

Diverse generation. The proposed diffusion-based world model can produce high-quality future
predictions with different driving conditions, and both the dynamic and static objects can be generated
properly. The qualitative results are illustrated in Figure 9 and Figure 10.
Controllability. We present more visual results of controllability in Figure 11. The generated images
and Lidar exhibit a high degree of consistency with action, which demonstrates that our world model
has the potential of being a simulator.
PSNR metric. PSNR metric has the problem of being unable to differentiate between blurring and
sharpening. As shown in Figure 12, the image quality of L & C is better the that of C, while the psnr
metric of L & C is worse than that of C.

B Implementation Details
Training details of tokenizer. We trained our model using 32 GPUs, with a batch size of 1 per card.
We used the AdamW optimizer with a learning rate of 5e-4, beta1=0.5, and beta2=0.9, following a

14
Reconstruction GT

Figure 8: The visualization of LiDAR and video reconstructions on Carla dataset.

Multi-view Image Multi-view Image Lidar

Lidar
Reference T=0 Reference T=0

Pred T=1s Pred T=1s

Pred T=3s Pred T=3s

Figure 9: The visualization of LiDAR and future predictions on nuScenes dataset.

15
pred_img_pred_vis_149_gpu0_gt_7.png
pred_img_pred_vis_118_gpu0_gt_7.png
DAR Multi-view Image LiDAR

Reference T=0

Pred T=1s

Pred T=3s

Figure 10: The visualization of LiDAR and future predictions on Carla dataset.

Reference T=0s Turn Right T=2s Turn Left T=2s Speed Up T=2s Slow Down T=2s

Figure 11: More visual results of controllability.

16
C L&C GT

Figure 12: The visualization of C and L & C.

cosine learning rate decay strategy. The multi-task loss function includes a perceptual loss weight
of 0.1, a lidar loss weight of 1.0, and an RGB L1 reconstruction loss weight of 1.0. For the GAN
training, we employed a warm-up strategy, introducing the GAN loss after 30,000 iterations. The
discriminator loss weight was set to 1.0, and the generator loss weight was set to 0.1.

C Broader Impacts
The concept of a world model holds significant relevance and diverse applications within the realm of
autonomous driving. It serves as a versatile tool, functioning as a simulator, a generator of long-tail
data, and a pre-trained model for subsequent tasks. Our proposed method introduces a multi-modal
BEV world model framework, designed to align seamlessly with the multi-sensor configurations
inherent in existing autonomous driving models. Consequently, integrating our approach into current
autonomous driving methodologies stands to yield substantial benefits.

D Limitations
It is widely acknowledged that inferring diffusion models typically demands around 50 steps to attain
denoising results, a process characterized by its sluggishness and computational expense. Regrettably,
we encounter similar challenges. As pioneers in the exploration of constructing a multi-modal world
model, our primary emphasis lies on the generation quality within driving scenes, prioritizing it over
computational overhead. Recognizing the significance of efficiency, we identify the adoption of one-
step diffusion as a crucial direction for future improvement in the proposed method. Regarding the
quality of the generated imagery, we have noticed that dynamic objects within the images sometimes
suffer from blurriness. To address this and further improve their clarity and consistency, a dedicated
module specifically tailored for dynamic objects may be necessary in the future.