OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning
Shihao Wang1∗, Zhiding Yu2†, Xiaohui Jiang1, Shiyi Lan2, Min Shi3∗, Nadine Chang2, Jan Kautz2, Ying Li1, and Jose M. Alvarez2
1 Beijing Institute of Technology, 2 NVIDIA, 3 Huazhong University of Science and Technology
https://github.com/NVlabs/OmniDrive
[Fig. 1: Overview. Left: OmniDrive-Agent, which extends a 2D Q-Former (Q-Former2D) into a Q-Former3D with 3D position encoding over multi-view images, enabling answers about 3D traffic elements. Right: the OmniDrive-nuScenes QA generation pipeline, which combines 3D objects, map elements, multi-view images, traffic rules, and simulated vs. actual trajectories to generate scene description, attention, counterfactual reasoning, and decision making & planning QAs.]
1 Introduction
The recent rapid development of multimodal LLMs (MLLMs) [1,24,31] and their
excellent reasoning capabilities have led to a stream of applications in end-to-end
autonomous driving [6,39,46,52,58]. However, extending their capabilities from 2D
understanding to the intricacies of 3D space remains a crucial hurdle to fully
unlocking their potential in real-world applications. Understanding and navigating
3D space is indispensable for autonomous vehicles (AVs), as it directly affects an
AV's ability to make informed decisions, anticipate future states, and interact
safely with its environment. Although previous works [38,47] have demonstrated
successful applications of LLM-agents in autonomous driving, a holistic and
principled approach is still needed to extend the 2D understanding and reasoning
capabilities of MLLMs to complex 3D scenes, including 3D geometry and spatial relations.
Another open issue is the need to handle multi-view, high-resolution video
input. On one hand, many popular 2D MLLM architectures, such as LLaVA-1.5 [30,31],
can only take 336 × 336 image input due to the limited vision encoder resolution
and LLM token sequence length. Raising these limits is not trivial, as it requires
significantly more compute and memory. On the other hand, handling high-resolution
video input, often from multiple views, is a fundamental requirement for long-range
AV perception and safe decision making. However, unlike many cloud-based services,
real-world industrial autonomous driving applications are mostly on-device and
compute-bound. There is thus a need for an efficient MLLM architecture that
compresses the 3D visual representation before feeding it to the LLM.
Our answer to the above challenges is a novel Q-Former-styled [24] 3D MLLM
architecture, as shown in Fig. 1. Unlike LLaVA, which adopts a self-attention
design, the cross-attention decoder in Q-Former scales better to higher-resolution
input by compressing the visual information into sparse queries. Interestingly, we
notice that the Q-Former architecture shares considerable similarity with the
family of perspective-view models, such as DETR3D [53], PETR(v2) [32,33],
StreamPETR [50] and Far3D [21]. Using sparse 3D queries, these models have
demonstrated considerable advantages over dense bird's-eye view (BEV)
representations, with leading performance [21,50], long-range perception [21], and
the capability to jointly model map elements [55]. The similarity in query-based
decoder architecture allows us to align both worlds by appending 3D position
encoding to the queries, lifting them to 3D, and attending to the multi-view input,
as shown in the left portion of Fig. 1. This process gives the MLLM 3D spatial
understanding with minimal effort and changes, while leveraging pre-trained
knowledge from abundant 2D images.
Besides model architectures, recent driving LLM-agent works also highlight the
importance of benchmarks [10,38,39,43,46,47]. Many of them are presented as
question-answering (QA) datasets to train and benchmark the LLM-agent for either
reasoning or planning. Despite the various QA setups, benchmarks that involve
planning [10,46,47] still resort to an open-loop setting on real-world sessions
(e.g., nuScenes) where expert trajectories are used. Recent studies [26,60],
2 OmniDrive-Agent
As a recap, we aim for a unified 3D MLLM design that 1) leverages 2D MLLM
pre-training knowledge and 2) addresses the high-resolution multi-view input in
autonomous driving. We propose a Q-Former-styled architecture that compresses the
visual features into a fixed number of queries before feeding them to an LLM [24].
Noticing the similarity between Q-Former and query-based 3D perception frameworks,
we align our MLLM architecture with StreamPETR [50], which uses queries to encode
both dynamic objects and static map elements. These queries, together with
additional carrier tokens, serve as a condensed world model to align perception
with reasoning and planning.
2.1 Preliminaries
The Q-Former-based MLLMs consist of a general visual encoder that extracts
single-view image features F_s ∈ R^{C×H×W}, a projector (Q-Former) that serves as
the visual-language alignment module, and a large language model for text
generation. The projector is a stack of transformer decoder layers. The projection
from image features to the textual embedding can be represented as

    \tilde{Q}_t = f_q(Q_t, F_s),    (1)

where Q_t is the initialized text embedding and \tilde{Q}_t is the refined text
embedding, which is sent to the language model to generate the final text output.
The query-based 3D perception models [21,28,29,55] consist of a shared visual
encoder that extracts multi-view image features and a detection head f_d. The head
is based on PETR [32] and utilizes a transformer decoder architecture to
efficiently convert multi-view image features F_m ∈ R^{N×C×H×W} into detection
queries \tilde{Q}_d, which can be formulated as

    \tilde{Q}_d = f_d(Q_d, F_m + P_m),    (2)

where P_m is the 3D position encoding that effectively captures the geometric
relationship between the image views and the 3D domain, and Q_d denotes the
initialized detection queries that gather the multi-view image features.
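To make the structural similarity between Eq. (1) and Eq. (2) concrete, the following minimal PyTorch sketch (our illustration, not the released code; `QueryDecoder`, the tensor shapes, and the layer count are assumptions) instantiates the same transformer-decoder pattern for both the Q-Former projector f_q and the detection head f_d; the only differences are the input features and the optional position encoding added to the keys.

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """Generic query-based decoder: learnable queries attend to flattened image tokens."""
    def __init__(self, dim=256, num_layers=6, num_heads=8):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            for _ in range(num_layers)
        ])

    def forward(self, queries, feats, feat_pos=None):
        # feats: (B, L, C) flattened image features; feat_pos: optional position encoding
        memory = feats if feat_pos is None else feats + feat_pos
        for layer in self.layers:
            queries = layer(queries, memory)  # each layer: self-attn, cross-attn, FFN
        return queries

B, C, H, W, N = 2, 256, 16, 44, 6

# Eq. (1): the Q-Former projector f_q refines queries Q_t with single-view features F_s.
f_q = QueryDecoder()
F_s = torch.randn(B, C, H, W).flatten(2).transpose(1, 2)   # (B, H*W, C)
Q_t = torch.randn(B, 32, C)                                # initialized queries
Q_t_refined = f_q(Q_t, F_s)                                # \tilde{Q}_t

# Eq. (2): the detection head f_d refines detection queries Q_d with multi-view
# features F_m plus the 3D position encoding P_m -- the same decoder pattern.
f_d = QueryDecoder()
F_m = torch.randn(B, N, C, H, W).permute(0, 1, 3, 4, 2).reshape(B, N * H * W, C)
P_m = torch.randn(B, N * H * W, C)                         # assumed 3D position encoding
Q_d = torch.randn(B, 900, C)                               # initialized detection queries
Q_d_refined = f_d(Q_d, F_m, feat_pos=P_m)                  # \tilde{Q}_d
```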
Fig. 2: Overall pipeline of OmniDrive-Agent. The left diagram illustrates the
overall framework of the model: we employ 3D perception tasks to guide Q-Former's
learning. The right diagram depicts the specific structure of Q-Former3D, which
consists of six transformer decoder layers. The attention weights are initialized
from 2D pre-training, and the inputs are multi-view image features. Additionally,
3D position encoding is added in the attention operation, and we introduce temporal
modeling through a memory bank.
It can be observed that the transformer decoder in Q-Former and the sparse
query-based 3D perception models, represented by StreamPETR [50], share highly
similar architecture designs. To enhance the localization ability of MLLMs, we
introduce the 3D position encoding design and the supervision of query-based
perception models.
As shown in Fig. 2, OmniDrive first uses a shared visual encoder to extract
multi-view image features F_m ∈ R^{N×C×H×W}. The extracted features, together with
the position encoding P_m, are fed into Q-Former3D. In Q-Former3D, we initialize
the detection queries and carrier queries and perform self-attention to exchange
their information, which can be summarized by the following formula:
    (Q, K, V) = ([Q_c, Q_d], [Q_c, Q_d], [Q_c, Q_d]),
    \tilde{Q} = \text{Multi-head Attention}(Q, K, V).    (3)

The queries then cross-attend to the multi-view image features, with the 3D
position encoding added to the keys:

    (Q, K, V) = ([Q_c, Q_d], P_m + F_m, F_m),
    \tilde{Q} = \text{Multi-head Attention}(Q, K, V).    (4)
After that, the perception queries are used to predict the categories and
coordinates of the foreground elements. The carrier queries are sent to a
single-layer MLP to align with the dimension of LLM tokens (e.g., 4096 dimensions
in LLaMA [48]) and are further used for text generation, following LLaVA [31].
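The following PyTorch sketch illustrates Eqs. (3)-(4) and the carrier-query projection in a single decoder step. It is a simplified illustration under our own assumptions (one layer without residual connections, norms, or FFN; `QFormer3DBlock`, the query counts, and the output heads are placeholders), not the released implementation.

```python
import torch
import torch.nn as nn

class QFormer3DBlock(nn.Module):
    """One simplified Q-Former3D step: Eq. (3) self-attention, Eq. (4) cross-attention."""
    def __init__(self, dim=256, num_heads=8, llm_dim=4096,
                 num_carrier=256, num_det=900, num_classes=10):
        super().__init__()
        self.num_carrier = num_carrier
        self.carrier = nn.Embedding(num_carrier, dim)     # carrier queries Q_c
        self.det = nn.Embedding(num_det, dim)             # detection queries Q_d
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_llm = nn.Linear(dim, llm_dim)             # single-layer MLP to LLM token dim
        self.cls_head = nn.Linear(dim, num_classes)       # category prediction (supervision)
        self.box_head = nn.Linear(dim, 10)                # box/center prediction (supervision)

    def forward(self, F_m, P_m):
        # F_m, P_m: (B, N*H*W, C) multi-view image features and 3D position encoding
        B = F_m.shape[0]
        q = torch.cat([self.carrier.weight, self.det.weight], dim=0)
        q = q.unsqueeze(0).repeat(B, 1, 1)                # [Q_c, Q_d]
        q, _ = self.self_attn(q, q, q)                    # Eq. (3): exchange information
        q, _ = self.cross_attn(q, F_m + P_m, F_m)         # Eq. (4): keys carry the 3D PE
        q_c, q_d = q[:, :self.num_carrier], q[:, self.num_carrier:]
        llm_tokens = self.to_llm(q_c)                     # carrier queries -> LLM input tokens
        return llm_tokens, self.cls_head(q_d), self.box_head(q_d)
```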
In our model, the carrier queries are responsible for visual-language alignment.
This design enables the carrier queries to benefit from the geometric priors
provided by the 3D position encoding, while also allowing them to exploit the
query-based representations acquired through the 3D perception tasks.
Our approach benefits from multi-task learning and temporal modeling [25,33].
For multi-task learning, we integrate task-specific Q-Former3D modules for each
perception task while employing a uniform initialization strategy (please refer to
Sec. 2.4). In different tasks, the carrier queries gather information about
different traffic elements. In our implementation, we cover tasks such as
centerline construction and 3D object detection. During both training and
inference, the two heads share the same 3D position encoding. For temporal
modeling, we store the perception queries with the top-k classification scores in
a memory bank and propagate them frame by frame [28,59]. The propagated queries
interact with the perception and carrier queries of the current frame through
cross-attention, extending our model to effectively process video input.
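A rough sketch of this memory-bank style temporal modeling is given below. It is our simplification: ego-pose alignment and the exact StreamPETR-style propagation scheme are omitted, and `QueryMemoryBank`, the top-k value, and the frame window are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryMemoryBank:
    """Keep the top-k scoring perception queries of recent frames."""
    def __init__(self, top_k=128, max_frames=4):
        self.top_k, self.max_frames = top_k, max_frames
        self.frames = []                                   # list of (B, top_k, C) tensors

    def update(self, det_queries, cls_scores):
        # det_queries: (B, Nq, C); cls_scores: (B, Nq) max classification score per query
        idx = cls_scores.topk(self.top_k, dim=1).indices   # (B, top_k)
        kept = torch.gather(det_queries, 1,
                            idx.unsqueeze(-1).expand(-1, -1, det_queries.shape[-1]))
        self.frames.append(kept.detach())
        self.frames = self.frames[-self.max_frames:]       # keep a short history

    def memory(self):
        return torch.cat(self.frames, dim=1) if self.frames else None

# Current-frame queries (carrier + perception) cross-attend to the propagated memory.
temporal_attn = nn.MultiheadAttention(256, 8, batch_first=True)
bank = QueryMemoryBank()
bank.update(torch.randn(2, 900, 256), torch.rand(2, 900))  # previous-frame queries/scores
current = torch.randn(2, 1156, 256)                        # e.g. 256 carrier + 900 object queries
memory = bank.memory()
if memory is not None:
    current, _ = temporal_attn(current, memory, memory)
```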
3 OmniDrive-nuScenes
To benchmark driving LLM-agents, we propose OmniDrive-nuScenes, a novel benchmark
built on nuScenes [4] with high-quality visual question-answering (QA) pairs
covering perception, reasoning, and planning in the 3D domain.
OmniDrive-nuScenes features a fully automated, procedural QA generation pipeline
using GPT-4. Similar to LLaVA [31], the proposed pipeline feeds 3D perception
ground truths as context information via the prompt. Traffic rules and planning
simulations are further leveraged as additional inputs, easing the difficulty
GPT-4V faces in comprehending 3D environments. Our benchmark asks long-horizon
questions in the form of attention, counterfactual reasoning, and open-loop
planning. These questions challenge true spatial understanding and planning
capabilities in 3D space, as they require planning simulations over the next few
seconds to obtain the correct answers.
Besides using the above pipeline to curate offline question-answering sessions for
OmniDrive-nuScenes, we further propose a pipeline to generate various types of
grounding questions online. This part can also be viewed as a form of implicit
data augmentation that enhances the 3D spatial understanding and reasoning
capabilities of the models.
trajectory paths. We then select different completion rates and speed targets
(acceleration, deceleration, and speed maintenance) for various lanes to create
simulated trajectories (see the sketch after this paragraph). 2) Generating driving
trajectories based solely on lane centerlines makes it difficult to simulate
scenarios that go out of the drivable area. Therefore, we also perform clustering
on the ego trajectories of the entire nuScenes dataset and select representative
driving paths each time.
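Below is a simplified sketch of how a simulated trajectory could be produced from a lane centerline and a chosen speed profile. The 3-second horizon, 0.5 s waypoint spacing, and acceleration values are our assumptions; the actual pipeline's completion rates and parameters may differ.

```python
import numpy as np

def simulate_trajectory(centerline, v0, profile="keep", horizon=3.0, dt=0.5):
    """centerline: (M, 2) lane polyline in ego coordinates; v0: current speed (m/s)."""
    accel = {"accelerate": 2.0, "keep": 0.0, "decelerate": -2.0}[profile]  # assumed targets
    # Arc-length parameterization of the centerline.
    seg = np.linalg.norm(np.diff(centerline, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    waypoints = []
    for t in np.arange(dt, horizon + 1e-6, dt):
        dist = max(v0 * t + 0.5 * accel * t * t, 0.0)      # distance traveled by time t
        dist = min(dist, s[-1])                            # clip to the end of the lane
        x = np.interp(dist, s, centerline[:, 0])
        y = np.interp(dist, s, centerline[:, 1])
        waypoints.append((x, y))
    return np.array(waypoints)                             # (T, 2) simulated waypoints

lane = np.stack([np.linspace(0, 40, 50), np.zeros(50)], axis=1)
traj = simulate_trajectory(lane, v0=6.0, profile="decelerate")
```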
[Figure: Examples of online generated grounding QAs over multi-view images:
Lane-to-Objects (list the objects on a given lane), 2D Grounding (identify an
object given its 2D box and describe its 3D attributes), and 3D Distance (list the
objects near a given 3D position). Answers report object category, location,
length/width/height, heading angle, and velocity.]
Expert trajectory. This is the log-replay trajectory from nuScenes. The expert
trajectories are classified into different types for high-level decision making.
We also identify an object as "close" if its minimum distance to the trajectory
within the next 3 seconds is smaller than 10 meters. The close objects are then
listed below the expert trajectory.
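The "close object" rule above can be written down directly. The helper below is a small illustrative check of our own (not the benchmark code), assuming the ego trajectory and object centers over the next 3 seconds are given as 2D waypoints.

```python
import numpy as np

def close_objects(ego_traj, object_trajs, thresh=10.0):
    """ego_traj: (T, 2) ego waypoints over the next 3 s;
    object_trajs: dict of object id -> (T, 2) object center positions."""
    close = []
    for obj_id, obj_traj in object_trajs.items():
        # Minimum distance between any ego waypoint and any object position.
        d = np.linalg.norm(ego_traj[:, None, :] - obj_traj[None, :, :], axis=-1)
        if d.min() < thresh:
            close.append((obj_id, float(d.min())))
    return close
```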
In the bottom block of Tab. 1, we describe the different types of QA responses
obtained using the above context information:
Scene description. We directly take the caption (prompt type 1 in Tab. 1) as the
answer for scene description.
Attention. Given the simulated and expert trajectories, we run simulations to
identify close objects. At the same time, we also allow GPT-4 to use its own
common sense to identify threatening traffic elements.
Counterfactual reasoning. Given the simulated trajectories, we check via simulation
whether the trajectories violate traffic rules, such as running a red light or
colliding with other objects or the road boundary (see the sketch after this list).
Decision making and planning. We present the high-level decision making
as well as the expert trajectory and use GPT-4V to reason why this trajectory
is safe, given the previous prompt and response information as context.
General conversation. We also prompt GPT-4 to generate multi-turn dialogues based
on the caption information and image content, involving object counting, color,
relative position, and OCR-type tasks. We found that this approach helps improve
the model's recognition of long-tail objects.
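The sketch below illustrates the counterfactual checks referenced above, using shapely as an assumed geometry backend. The drivable-area polygon, object boxes, stop line, and light state would come from nuScenes ground truth; the exact criteria in the released pipeline may differ.

```python
from shapely.geometry import LineString, Point, Polygon

def check_counterfactual(traj_xy, drivable_area, object_boxes, stop_line=None, light_is_red=False):
    """traj_xy: list of (x, y) simulated waypoints; drivable_area: Polygon;
    object_boxes: list of Polygon; stop_line: optional LineString for the governing light."""
    path = LineString(traj_xy)
    violations = []
    # Drivable area: every waypoint should stay inside the drivable polygon.
    if not all(drivable_area.contains(Point(p)) for p in traj_xy):
        violations.append("leaves drivable area")
    # Collision: the path should not intersect any object box.
    if any(path.intersects(box) for box in object_boxes):
        violations.append("collision")
    # Red light: the path should not cross the stop line while the light is red.
    if light_is_red and stop_line is not None and path.intersects(stop_line):
        violations.append("runs red light")
    return violations
```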
3.3 Metrics
4 Experiment
Our model uses EVA-02-L [14] as the vision encoder. It applies masked image
modeling to distill CLIP [44] and can extract language-aligned vision features.
During the 2D pre-training stage, the training data and strategies, including
batch size, learning rate, and optimizer, are the same as those of LLaVA v1.5 [30].
In the fine-tuning stage, the model is trained with the AdamW [34] optimizer with a
batch size of 16. The learning rate for the projector is 4e-4, while the learning
rate of the visual encoder and the LLM is 5e-4. A cosine annealing policy is
employed for training stability. The models in the ablation study are trained for
6 epochs unless specified otherwise. The numbers of object queries, lane queries,
and carrier queries are set to 900, 300, and 256, respectively.
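For concreteness, the following is a minimal sketch of a fine-tuning optimizer setup matching the numbers above. The module names `projector`, `visual_encoder`, and `llm` and the step count are placeholders; this is not the released training script.

```python
import torch

def build_optimizer(model, epochs=6, steps_per_epoch=1000):
    # Separate learning rates for the projector vs. the visual encoder / LLM.
    param_groups = [
        {"params": model.projector.parameters(), "lr": 4e-4},
        {"params": list(model.visual_encoder.parameters()) +
                   list(model.llm.parameters()), "lr": 5e-4},
    ]
    optimizer = torch.optim.AdamW(param_groups)
    # Cosine annealing over the full fine-tuning schedule.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * steps_per_epoch)
    return optimizer, scheduler
```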
We also explore alternative architectures. Q-Former2D is initialized with 2D
pre-trained weights; it processes the image features individually in the projector
and fuses them in the LLM. The Dense BEV approach uses the LSS method [40,41] to
transform perspective features into a BEV feature map, with temporal modeling
implemented following SOLOFusion [40]. The BEV features are then fed into an MLP
projector and the LLM.
Ablation on counterfactual reasoning (precision P / recall R per rule) and open-loop planning (average collision and intersection rates):

Ablation               | Exp.             | Safe (P/R) | Red Light (P/R) | Collision (P/R) | Drivable Area (P/R) | Col. (%) Avg. | Inter. (%) Avg.
Full Model             | Q-Former3D       | 70.7/49.0  | 57.6/58.3       | 32.3/72.6       | 48.5/58.6           | 3.79          | 4.59
Data                   | No Online        | 69.4/39.4  | 36.2/65.6       | 29.7/69.4       | 48.0/57.8           | 4.93          | 4.02
Architecture           | Q-Former2D       | 71.4/39.3  | 58.3/61.1       | 32.0/66.7       | 44.4/52.8           | 3.98          | 6.03
Architecture           | Dense BEV        | 70.2/17.3  | 48.7/53.6       | 31.1/70.4       | 32.4/56.6           | 4.43          | 8.56
Architecture           | No Temporal      | 67.8/48.4  | 47.0/62.6       | 31.2/63.8       | 46.5/55.3           | 6.07          | 5.83
Perception Supervision | No Lane          | 67.7/57.3  | 58.1/59.6       | 31.0/56.7       | 47.9/56.8           | 4.65          | 8.71
Perception Supervision | No Object & Lane | 69.0/57.8  | 51.3/61.2       | 30.0/53.2       | 45.3/57.1           | 6.77          | 8.43

Ablation on overall counterfactual reasoning and caption metrics:

Ablation               | Exp.             | Counterfactual AP (%) | AR (%) | METEOR↑ | CIDEr↑ | ROUGE↑
Full Model             | Q-Former3D       | 52.3 | 59.6 | 38.0 | 68.6 | 32.6
Data                   | No Online        | 45.8 | 58.1 | 38.2 | 69.0 | 32.7
Architecture           | Q-Former2D       | 51.5 | 55.0 | 38.3 | 67.1 | 32.5
Architecture           | Dense BEV        | 45.6 | 49.5 | 35.6 | 59.5 | 27.8
Architecture           | No Temporal      | 48.1 | 57.5 | 37.9 | 68.4 | 32.6
Perception Supervision | No Lane          | 51.2 | 57.6 | 38.0 | 67.8 | 32.6
Perception Supervision | No Object & Lane | 48.9 | 57.3 | 38.2 | 67.8 | 32.6
Comparison on NuScenes-QA. We also present results on NuScenes-QA [43] in Tab. 4,
where most answers are single-word and related to perception only. Within the
camera modality, our model surpasses BEVDet+MCAN by 1.3% in accuracy, demonstrating
the importance of pre-training. Our model's performance is also comparable to that
of the Lidar-based models.

Table 4: Results on NuScenes-QA [43]. L and C represent Lidar and Camera, respectively.

Model                 | Modality | Acc. (%)
BEVDet+BUTD [43]      | C        | 57.0
BEVDet+MCAN [43]      | C        | 57.9
CenterPoint+BUTD [43] | L        | 58.1
CenterPoint+MCAN [43] | L        | 59.5
OmniDrive             | C        | 59.2
the metrics across all methods. Additionally, we find that the high-level command
also drastically reduces the collision rate and the intersection rate. Previous
methods provided high-level commands based on the relative position of the
ground-truth trajectory, which imposed significant constraints on the network's
regression and thereby reduced the likelihood of deviating too far from the
ground-truth trajectory. We consider this design unreasonable, so we eliminate
this setting in our other experiments.
User What would be the outcome if you decided to move slowly and make a right turn?
Ours If I were to move slowly and make a right turn, following the trajectory [PT, (+2.44,
-0.09), (+4.91, -0.36), (+7.41, -0.84), (+9.91, -1.53), (+12.37, -2.44), (+14.76, -3.55)], it
would result in leaving the drivable area, which is a dangerous and incorrect action. The
trajectory indicates a turn that is too sharp and not aligned with the road layout, leading
to potential collisions or driving off the road. In this scenario, the correct behavior would
be to maintain my lane and follow the road’s curvature, ensuring I stay within the bounds
of the drivable area and adhere to traffic rules.
4.5 Visualization
5 Related Works
6 Conclusion
We address the challenges of end-to-end autonomous driving with LLM-agents by
proposing OmniDrive-Agent and OmniDrive-nuScenes. OmniDrive-Agent adopts a novel
Q-Former3D MLLM architecture that can efficiently handle high-resolution multi-view
videos. Our model design requires only minimal adjustments to leverage 2D
pre-trained knowledge while gaining important 3D spatial understanding. We
additionally provide a novel benchmark for end-to-end autonomous driving that
features counterfactual reasoning alongside 3D spatial awareness and reasoning
tasks. OmniDrive-Agent demonstrates its efficacy by handling high-resolution
multi-view video input and exhibiting excellent scene description and
counterfactual reasoning. The model also yields compelling performance on
open-loop 3D planning.
Limitations. Our method has not been validated on even larger datasets, e.g.,
nuPlan [5]. The simulation of counterfactual outcomes, despite moving beyond single
trajectories, does not yet consider reactions from other agents. This part could be
further developed into a closed-loop setup, which we leave for future work.
References
1. Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K.,
Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model
for few-shot learning. In: NeurIPS (2022) 2, 13
2. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C.,
Zhou, J.: Qwen-VL: A frontier large vision-language model with versatile abili-
ties. arXiv:2308.12966 (2023) 13
3. Banerjee, S., Lavie, A.: METEOR: An automatic metric for mt evaluation with
improved correlation with human judgments. In: ACL workshop (2005) 9
4. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A.,
Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous
driving. In: CVPR (2020) 6, 14
5. Caesar, H., Kabzan, J., Tan, K.S., Fong, W.K., Wolff, E., Lang, A., Fletcher, L.,
Beijbom, O., Omari, S.: NuPlan: A closed-loop ml-based planning benchmark for
autonomous vehicles. arXiv:2106.11810 (2021) 14
6. Chen, L., Sinavski, O., Hünermann, J., Karnsund, A., Willmott, A.J., Birch, D.,
Maund, D., Shotton, J.: Driving with LLMs: Fusing object-level vector modality
for explainable autonomous driving. arXiv:2310.01957 (2023) 2
7. Chen, S., Jiang, B., Gao, H., Liao, B., Xu, Q., Zhang, Q., Huang, C., Liu, W.,
Wang, X.: VADv2: End-to-end vectorized autonomous driving via probabilistic
planning. arXiv:2402.13243 (2024) 13
8. Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D.,
Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al.: PaLI: A jointly-scaled
multilingual language-image model. arXiv:2209.06794 (2022) 13
9. Deruyttere, T., Vandenhende, S., Grujicic, D., Van Gool, L., Moens, M.F.:
Talk2Car: Taking control of your self-driving car. In: EMNLP-IJCNLP (2019) 14
10. Ding, X., Han, J., Xu, H., Liang, X., Zhang, W., Li, X.: Holistic autonomous
driving understanding by bird’s-eye-view injected multi-modal large models.
arXiv:2401.00988 (2024) 2
11. Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: An open
urban driving simulator. In: CoRL (2017) 13, 14
12. Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A.,
Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language
model. arXiv:2303.03378 (2023) 13
13. Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., Chai, Y., Sapp,
B., Qi, C.R., Zhou, Y., Yang, Z., Chouard, A., Sun, P., Ngiam, J., Vasudevan, V.,
McCauley, A., Shlens, J., Anguelov, D.: Large scale interactive motion forecasting
for autonomous driving: The waymo open motion dataset. In: ICCV (2021) 14
14. Fang, Y., Sun, Q., Wang, X., Huang, T., Wang, X., Cao, Y.: EVA-02: A visual
representation for neon genesis. arXiv:2303.11331 (2023) 9
15. Hu, A., Corrado, G., Griffiths, N., Murez, Z., Gurau, C., Yeo, H., Kendall, A.,
Cipolla, R., Shotton, J.: Model-based imitation learning for urban driving. In:
NeurIPS (2022) 13
16. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.:
LoRA: Low-rank adaptation of large language models. arXiv:2106.09685 (2021) 5
17. Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T.,
Wang, W., et al.: Planning-oriented autonomous driving. In: CVPR (2023) 13
18. Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H.,
Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning
with noisy text supervision. In: ICML (2021) 13
19. Jia, X., Wu, P., Chen, L., Xie, J., He, C., Yan, J., Li, H.: Think Twice before
Driving: Towards scalable decoders for end-to-end autonomous driving. In: CVPR
(2023) 13
20. Jiang, B., Chen, S., Xu, Q., Liao, B., Chen, J., Zhou, H., Zhang, Q., Liu, W., Huang,
C., Wang, X.: VAD: Vectorized scene representation for efficient autonomous driv-
ing. arXiv:2303.12077 (2023) 10, 13
21. Jiang, X., Li, S., Liu, Y., Wang, S., Jia, F., Wang, T., Han, L., Zhang, X.: Far3d:
Expanding the horizon for surround-view 3d object detection. arXiv:2308.09616
(2023) 2, 3
22. Kim, J., Misu, T., Chen, Y.T., Tawari, A., Canny, J.: Grounding human-to-vehicle
advice for self-driving vehicles. In: CVPR (2019) 14
23. Kim, J., Rohrbach, A., Darrell, T., Canny, J., Akata, Z.: Textual explanations for
self-driving vehicles. ECCV (2018) 14
24. Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-
training with frozen image encoders and large language models. In: ICML (2023)
2, 3, 5, 13
25. Li, Z., Deng, H., Li, T., Huang, Y., Sima, C., Geng, X., Gao, Y., Wang, W., Li, Y.,
Lu, L.: BEVFormer++ : Improving bevformer for 3d camera-only object detection:
1st place solution for waymo open dataset challenge 2022 (2023) 5
26. Li, Z., Yu, Z., Lan, S., Li, J., Kautz, J., Lu, T., Alvarez, J.M.: Is ego status all you
need for open-loop end-to-end autonomous driving? arXiv:2312.03031 (2023) 2, 9,
10, 11, 13
27. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text
Summarization Branches Out. pp. 74–81 (2004) 9
28. Lin, X., Lin, T., Pei, Z., Huang, L., Su, Z.: Sparse4D: Multi-view 3d object detection
with sparse spatial-temporal fusion. arXiv:2211.10581 (2022) 3, 5
29. Liu, H., Teng, Y., Lu, T., Wang, H., Wang, L.: SparseBEV: High-performance
sparse 3d object detection from multi-camera videos. In: ICCV (2023) 3
30. Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning.
arXiv:2310.03744 (2023) 2, 5, 9
31. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. NeurIPS (2023) 2, 4,
6, 13
32. Liu, Y., Wang, T., Zhang, X., Sun, J.: PETR: Position embedding transformation
for multi-view 3d object detection. arXiv:2203.05625 (2022) 2, 3
33. Liu, Y., Yan, J., Jia, F., Li, S., Gao, Q., Wang, T., Zhang, X., Sun, J.: PETRv2: A
unified framework for 3d perception from multi-camera images. arXiv:2206.01256
(2022) 2, 5
34. Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts.
arXiv:1608.03983 (2016) 9
35. Ma, Y., Cui, C., Cao, X., Ye, W., Liu, P., Lu, J., Abdelraouf, A., Gupta, R., Han,
K., Bera, A., et al.: LaMPilot: An open benchmark dataset for autonomous driving
with language model programs. arXiv:2312.04372 (2023) 14
36. Malla, S., Choi, C., Dwivedi, I., Choi, J.H., Li, J.: DRAMA: Joint risk localization
and captioning in driving. In: WACV (2023) 14
37. Mao, J., Niu, M., Jiang, C., Liang, X., Li, Y., Ye, C., Zhang, W., Li, Z., Yu, J.,
Xu, C., et al.: One million scenes for autonomous driving: Once dataset (2021) 14
38. Marcu, A.M., Chen, L., Hünermann, J., Karnsund, A., Hanotte, B., Chidananda,
P., Nair, S., Badrinarayanan, V., Kendall, A., Shotton, J., et al.: LingoQA: Video
question answering for autonomous driving. arXiv:2312.14115 (2023) 2, 14
39. Nie, M., Peng, R., Wang, C., Cai, X., Han, J., Xu, H., Zhang, L.: Rea-
son2Drive: Towards interpretable and chain-based reasoning for autonomous driv-
ing. arXiv:2312.03661 (2023) 2, 14
40. Park, J., Xu, C., Yang, S., Keutzer, K., Kitani, K., Tomizuka, M., Zhan, W.: Time
will tell: New outlooks and a baseline for temporal multi-view 3d object detection.
arXiv:2210.02443 (2022) 9
41. Philion, J., Fidler, S.: Lift, splat, shoot: Encoding images from arbitrary camera
rigs by implicitly unprojecting to 3d. In: ECCV (2020) 9
42. Prakash, A., Chitta, K., Geiger, A.: Multi-modal fusion transformer for end-to-end
autonomous driving. In: CVPR (2021) 13
43. Qian, T., Chen, J., Zhuo, L., Jiao, Y., Jiang, Y.G.: NuScenes-QA: A multi-
modal visual question answering benchmark for autonomous driving scenario.
arXiv:2305.14836 (2023) 2, 11, 13, 14
44. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry,
G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models
from natural language supervision. In: ICML (2021) 9, 13
45. Sachdeva, E., Agarwal, N., Chundi, S., Roelofs, S., Li, J., Kochenderfer, M., Choi,
C., Dariush, B.: Rank2Tell: A multimodal driving dataset for joint importance
ranking and reasoning. In: WACV (2024) 14
46. Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Luo, P., Geiger, A.,
Li, H.: DriveLM: Driving with graph visual question answering. arXiv:2312.14150
(2023) 2, 14
47. Tian, X., Gu, J., Li, B., Liu, Y., Hu, C., Wang, Y., Zhan, K., Jia, P., Lang, X., Zhao,
H.: DriveVLM: The convergence of autonomous driving and large vision-language
models. arXiv:2402.12289 (2024) 2, 14
48. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bash-
lykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation
and fine-tuned chat models. arXiv:2307.09288 (2023) 4
49. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: Consensus-based image
description evaluation. In: CVPR (2015) 9
50. Wang, S., Liu, Y., Wang, T., Li, Y., Zhang, X.: Exploring object-centric temporal
modeling for efficient multi-view 3d object detection. arXiv:2303.11926 (2023) 2,
3, 4
51. Wang, T.H., Maalouf, A., Xiao, W., Ban, Y., Amini, A., Rosman, G., Karaman,
S., Rus, D.: Drive Anywhere: Generalizable end-to-end autonomous driving with
multi-modal foundation models. arXiv:2310.17642 (2023) 14
52. Wang, W., Xie, J., Hu, C., Zou, H., Fan, J., Tong, W., Wen, Y., Wu, S., Deng,
H., Li, Z., et al.: DriveMLM: Aligning multi-modal large language models with
behavioral planning states for autonomous driving. arXiv:2312.09245 (2023) 2
53. Wang, Y., Vitor Campagnolo, G., Zhang, T., Zhao, H., Solomon, J.: DETR3D: 3d
object detection from multi-view images via 3d-to-2d queries. In: CoRL (2022) 2
54. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou,
D., et al.: Chain-of-thought prompting elicits reasoning in large language models.
NeurIPS (2022) 14
55. Wu, D., Chang, J., Jia, F., Liu, Y., Wang, T., Shen, J.: TopoMLP: A simple yet
strong pipeline for driving topology reasoning. arXiv:2310.06753 (2023) 2, 3
56. Wu, D., Han, W., Wang, T., Liu, Y., Zhang, X., Shen, J.: Language prompt for
autonomous driving. arXiv:2309.04379 (2023) 14
57. Wu, P., Jia, X., Chen, L., Yan, J., Li, H., Qiao, Y.: Trajectory-guided control
prediction for end-to-end autonomous driving: A simple yet strong baseline. In:
NeurIPS (2022) 13
58. Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K.K., Li, Z., Zhao,
H.: DriveGPT4: Interpretable end-to-end autonomous driving via large language
model. arXiv:2310.01412 (2023) 2, 14
59. Yuan, T., Liu, Y., Wang, Y., Wang, Y., Zhao, H.: StreamMapNet: Streaming map-
ping network for vectorized online hd map construction. arXiv:2308.12570 (2023)
5
60. Zhai, J.T., Feng, Z., Du, J., Mao, Y., Liu, J.J., Tan, Z., Zhang, Y., Ye, X., Wang, J.:
Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes.
arXiv:2305.10430 (2023) 2, 13
61. Zhang, Z., Liniger, A., Dai, D., Yu, F., Van Gool, L.: End-to-end urban driving by
imitating a reinforcement learning coach. In: ICCV (2021) 13
62. Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P.,
Welker, S., Wahid, A., et al.: RT-2: Vision-language-action models transfer web
knowledge to robotic control. In: CoRL (2023) 13