
A Survey on Occupancy Perception for Autonomous Driving: The Information Fusion Perspective

Huaiyuan Xu, Junliang Chen, Shiyu Meng, Yi Wang, Lap-Pui Chau (corresponding author)

Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR

arXiv:2405.05173v3 [cs.CV] 21 Jul 2024

Abstract
3D occupancy perception technology aims to observe and understand dense 3D environments for autonomous vehicles. Owing
to its comprehensive perception capability, this technology is emerging as a trend in autonomous driving perception systems, and
is attracting significant attention from both industry and academia. Similar to traditional bird's-eye view (BEV) perception, 3D
occupancy perception has the nature of multi-source input and the necessity for information fusion. However, the difference is
that it captures vertical structures that are ignored by 2D BEV. In this survey, we review the most recent works on 3D occupancy
perception, and provide in-depth analyses of methodologies with various input modalities. Specifically, we summarize general
network pipelines, highlight information fusion techniques, and discuss effective network training. We evaluate and analyze the
occupancy perception performance of the state-of-the-art on the most popular datasets. Furthermore, challenges and future research
directions are discussed. We hope this paper will inspire the community and encourage more research work on 3D occupancy
perception. A comprehensive list of studies in this survey is publicly available in an active repository that continuously collects the
latest work: https://github.com/HuaiyuanXu/3D-Occupancy-Perception.
Keywords: Autonomous Driving, Information Fusion, Occupancy Perception, Multi-Modal Data.

1. Introduction

1.1. Occupancy Perception in Autonomous Driving

Autonomous driving can improve urban traffic efficiency and reduce energy consumption. For reliable and safe autonomous driving, a crucial capability is to understand the surrounding environment, that is, to perceive the observed world. At present, bird's-eye view (BEV) perception is the mainstream perception pattern [1, 2], with the advantages of absolute scale and no occlusion when describing environments. BEV perception provides a unified representation space for multi-source information fusion (e.g., information from diverse viewpoints, modalities, sensors, and time series) and for numerous downstream applications (e.g., explainable decision making and motion planning). However, BEV perception does not capture height information and therefore cannot provide a complete representation of the 3D scene.

Figure 1: Autonomous driving vehicle system. The sensing data from cameras, LiDAR, and radar enable the vehicle to intelligently perceive its surroundings. Subsequently, the intelligent decision module generates control and planning of driving behavior. Occupancy perception surpasses other perception methods based on the perspective view, bird's-eye view, or point clouds in terms of 3D understanding and density.

To address this, occupancy perception was proposed for autonomous driving to capture the dense 3D structure of the real world. This burgeoning perception technology aims to infer the occupied state of each voxel of the voxelized world, and is characterized by a strong generalization capability to open-set objects, irregular-shaped vehicles, and special road structures [3, 4]. Compared with 2D views such as the perspective view and bird's-eye view, occupancy perception is inherently 3D, making it more suitable for 3D downstream tasks such as 3D detection [5, 6], segmentation [4], and tracking [7].

In academia and industry, occupancy perception for holistic 3D scene understanding has a meaningful impact. From the academic perspective, it is challenging to estimate the dense 3D occupancy of the real world from complex input formats encompassing multiple sensors, modalities, and temporal sequences. Moreover, it is valuable to further reason about semantic categories [8], textual descriptions [9], and motion states [10] for occupied voxels, which paves the way toward a more comprehensive understanding of the environment. From the

Figure 2: Chronological overview of 3D occupancy perception (LiDAR-centric, vision-centric, and multi-modal methods for indoor and outdoor scenes, 2017-2024). It can be observed that: (1) research on occupancy has undergone explosive growth since 2023; (2) the predominant trend focuses on vision-centric occupancy, supplemented by LiDAR-centric and multi-modal methods.

industrial perspective, the deployment of a LiDAR kit on each autonomous vehicle is expensive. With cameras as a cheap alternative to LiDAR, vision-centric occupancy perception is a cost-effective solution that reduces the manufacturing cost for vehicle equipment manufacturers.

1.2. Motivation for Information Fusion Research

The gist of occupancy perception lies in comprehending complete and dense 3D scenes, including understanding occluded areas. However, the observation from a single sensor captures only part of the scene. For instance, Fig. 1 intuitively illustrates that a single image or point cloud can provide neither a 3D panorama nor a dense environmental scan. To this end, studying information fusion from multiple sensors [11, 12, 13] and multiple frames [4, 8] facilitates a more comprehensive perception: on the one hand, information fusion expands the spatial range of perception; on the other hand, it densifies scene observation. Besides, integrating multi-frame observations is beneficial for occluded regions, as the same scene is observed from many viewpoints, which offer sufficient scene features for occlusion inference.

Furthermore, in complex outdoor scenarios with varying lighting and weather conditions, stable occupancy perception is paramount, since this stability is crucial for ensuring driving safety. Here, research on multi-modal fusion promotes robust occupancy perception by combining the strengths of different data modalities [11, 12, 14, 15]. For example, LiDAR and radar data are insensitive to illumination changes and can sense the precise depth of the scene. This capability is particularly important during nighttime driving or in scenarios where shadow or glare obscures critical information. Camera data excel at capturing detailed visual texture and are adept at identifying color-based environmental elements (e.g., road signs and traffic lights) and long-distance objects. Therefore, fusing data from LiDAR, radar, and cameras provides a holistic understanding of the environment while remaining robust against adverse environmental changes.

1.3. Contributions

Among perception-related topics, 3D semantic segmentation [16, 17] and 3D object detection [18, 19, 20, 21] have been extensively reviewed. However, these tasks do not facilitate a dense understanding of the environment. BEV perception, which addresses this issue, has also been thoroughly reviewed [1, 2]. Our survey focuses on 3D occupancy perception, which captures the environmental height information overlooked by BEV perception. There are two related reviews: Roldao et al. [22] conducted a literature review on 3D scene completion for both indoor and outdoor scenes; Zhang et al. [23] only reviewed 3D occupancy prediction based on the visual modality. Unlike their work, our survey is tailored to autonomous driving scenarios and extends the existing 3D occupancy surveys by considering more sensor modalities. Moreover, given the multi-source nature of 3D occupancy perception, we provide an in-depth analysis of information fusion techniques for this field. The primary contributions of this survey are three-fold:

• We systematically review the latest research on 3D occupancy perception in the field of autonomous driving, covering motivation analysis, the overall research background, and an in-depth discussion on methodology, evaluation, and challenges.

• We provide a taxonomy of 3D occupancy perception, and elaborate on core methodological issues, including network pipelines, multi-source information fusion, and effective network training.

• We present evaluations for 3D occupancy perception, and offer detailed performance comparisons. Furthermore, current limitations and future research directions are discussed.

The remainder of this paper is structured as follows. Sec. 2 provides a brief background on the history, definitions, and related research domains. Sec. 3 details methodological insights. Sec. 4 conducts performance comparisons and analyses. Finally, future research directions are discussed and the survey is concluded in Sec. 5 and 6, respectively.

2. Background

2.1. A Brief History of Occupancy Perception

Occupancy perception is derived from Occupancy Grid Mapping (OGM) [24], which is a classic topic in mobile robot navigation and aims to generate a grid map from noisy and uncertain measurements. Each grid in this map is assigned a value that scores the probability of the grid space being occupied by obstacles.
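To make the occupancy-grid idea above concrete, the sketch below shows a minimal log-odds occupancy grid update of the kind used in classic OGM. The inverse sensor model increments and grid size are illustrative assumptions, not values taken from any cited method.

```python
import numpy as np

# Minimal log-odds occupancy grid (classic OGM), illustrative values only.
GRID_SHAPE = (100, 100)          # 2D grid for simplicity
log_odds = np.zeros(GRID_SHAPE)  # log-odds 0 corresponds to 0.5 occupancy probability

L_OCC, L_FREE = 0.85, -0.4       # assumed inverse sensor model increments
L_MIN, L_MAX = -4.0, 4.0         # clamp so the map stays updatable

def update_cell(ix: int, iy: int, hit: bool) -> None:
    """Fuse one noisy measurement into a cell by adding log-odds evidence."""
    log_odds[ix, iy] = np.clip(
        log_odds[ix, iy] + (L_OCC if hit else L_FREE), L_MIN, L_MAX
    )

def occupancy_probability() -> np.ndarray:
    """Convert accumulated log-odds back to per-cell occupancy probability."""
    return 1.0 / (1.0 + np.exp(-log_odds))

# Example: a cell observed as occupied twice and free once.
for hit in (True, True, False):
    update_cell(42, 17, hit)
print(occupancy_probability()[42, 17])  # > 0.5, i.e., likely occupied
```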
Semantic occupancy perception originates from SSCNet [25], which predicts the occupied status and semantics of all voxels in an indoor scene from a single image. However, studying occupancy perception in outdoor scenes, as opposed to indoor scenes, is imperative for autonomous driving. MonoScene [26] is a pioneering work on outdoor scene occupancy perception using only a monocular camera. Contemporary with MonoScene, Tesla announced its brand-new camera-only occupancy network at the CVPR 2022 workshop on Autonomous Driving [27]. This network comprehensively understands the 3D environment surrounding a vehicle from surround-view RGB images. Subsequently, occupancy perception has attracted extensive attention, catalyzing a surge in research on occupancy perception for autonomous driving in recent years. The chronological overview in Fig. 2 indicates rapid development in occupancy perception since 2023.

Early approaches to outdoor occupancy perception primarily used LiDAR input to infer 3D occupancy [28, 29, 30]. However, recent methods have shifted towards the more challenging vision-centric 3D occupancy prediction [31, 32, 33, 34]. Presently, the dominant trend in occupancy perception research is vision-centric solutions, complemented by LiDAR-centric methods and multi-modal approaches. Occupancy perception can serve as a unified representation of the 3D physical world within the end-to-end autonomous driving framework [8, 35], followed by downstream applications spanning various driving tasks such as detection, tracking, and planning. The training of occupancy perception networks heavily relies on dense 3D occupancy labels, which has led to the development of diverse street-view occupancy datasets [11, 10, 36, 37]. Recently, taking advantage of the powerful performance of large models, the integration of large models with occupancy perception has shown promise in alleviating the need for cumbersome 3D occupancy labeling [38].

Figure 3: Illustration of voxel-wise representations with and without semantics. The left voxel volume depicts the overall occupancy distribution. The right voxel volume incorporates semantic enrichment, where each voxel is associated with a class estimation.

2.2. Task Definition

Occupancy perception aims to extract voxel-wise representations of observed 3D scenes from multi-source inputs. Specifically, this representation involves discretizing a continuous 3D space W into a grid volume V composed of dense voxels. The state of each voxel is described by the value of {1, 0} or {c_0, · · · , c_n}, as illustrated in Fig. 3:

W ∈ R^3 → V ∈ {0, 1}^{X×Y×Z} or {c_0, · · · , c_n}^{X×Y×Z},   (1)

where 0 and 1 denote the empty and occupied states; c represents semantics; and (X, Y, Z) are the length, width, and height of the voxel volume. This voxelized representation offers two primary advantages: (1) it enables the transformation of unstructured data into a voxel volume, thereby facilitating processing by convolution [39] and Transformer [40] architectures; (2) it provides a flexible and scalable representation for 3D scene understanding, striking an optimal trade-off between spatial granularity and memory consumption.

Multi-source input encompasses signals from multiple sensors, modalities, and frames, including common formats such as images and point clouds. We take the multi-camera images {I_t^1, . . . , I_t^N} and the point cloud P_t of the t-th frame as an input Ω_t = {I_t^1, . . . , I_t^N, P_t}, where N is the number of cameras. The occupancy perception network Φ_O processes information from the t-th frame and the previous k frames, generating the voxel-wise representation V_t of the t-th frame:

V_t = Φ_O (Ω_t, . . . , Ω_{t−k}),   s.t. t − k ≥ 0.   (2)
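As a concrete picture of this voxel-wise representation and multi-source input, the sketch below builds toy tensors for Eq. (1) and Eq. (2). The grid size, class count, camera count, and image resolution are arbitrary assumptions for illustration only.

```python
import torch

# Assumed toy dimensions: voxel volume (X, Y, Z), n+1 semantic classes, N cameras.
X, Y, Z, NUM_CLASSES, NUM_CAMERAS = 200, 200, 16, 17, 6

# Eq. (1): binary occupancy V in {0, 1}^(X*Y*Z), or semantic occupancy in {c_0..c_n}.
binary_occupancy = torch.zeros(X, Y, Z, dtype=torch.uint8)     # 0 = empty, 1 = occupied
semantic_occupancy = torch.zeros(X, Y, Z, dtype=torch.long)    # class index per voxel

# Mark one voxel as occupied with an arbitrary example class id.
binary_occupancy[120, 80, 3] = 1
semantic_occupancy[120, 80, 3] = 4

# Eq. (2): the input of frame t is Omega_t = {I_t^1..I_t^N, P_t};
# the network Phi_O consumes the current frame and the previous k frames.
def make_frame_input(num_points: int = 30000) -> dict:
    images = torch.rand(NUM_CAMERAS, 3, 450, 800)    # surround-view RGB images (toy size)
    point_cloud = torch.rand(num_points, 3) * 50.0   # LiDAR points (x, y, z) in meters
    return {"images": images, "point_cloud": point_cloud}

k = 2
inputs = [make_frame_input() for _ in range(k + 1)]  # Omega_t, ..., Omega_{t-k}
print(binary_occupancy.shape, semantic_occupancy.shape, len(inputs))
```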
2.3. Related Works

2.3.1. Bird's-Eye-View Perception

Bird's-eye-view perception represents the 3D scene on a BEV plane. Specifically, it extracts the feature of each entire pillar in 3D space as the feature of the corresponding BEV grid. This compact representation provides a clear and intuitive depiction of the spatial layout from a top-down perspective. Tesla released its BEV perception-based systematic pipeline [41], which is capable of detecting objects and lane lines in BEV space, for Level 2 highway navigation and smart summoning.

According to the input data, BEV perception is primarily categorized into three groups: BEV camera [42, 43, 44], BEV LiDAR [45, 46], and BEV fusion [47, 48]. Current research predominantly focuses on the BEV camera, the key of which lies in the effective feature conversion from image space to BEV space. To address this challenge, one type of work adopts explicit transformation, which initially estimates the depth of front-view images, then utilizes the camera's intrinsic and extrinsic matrices to map image features into 3D space, and subsequently performs BEV pooling [43, 48, 49]. Conversely, another type of work employs implicit conversion [44, 50], which implicitly models depth through a cross-attention mechanism and extracts BEV features from image features. Remarkably, the performance of camera-based BEV perception in downstream tasks is now on par with that of LiDAR-based methods [49]. In contrast, occupancy perception can be regarded as an extension of BEV perception: it constructs a 3D volumetric space instead of a 2D BEV plane, resulting in a more complete description of the 3D scene.

2.3.2. 3D Semantic Scene Completion

3D semantic scene completion (3D SSC) is the task of simultaneously estimating the geometry and semantics of a 3D environment within a given range from limited observations, which requires imagining the complete 3D content of occluded objects and scenes. From a task-content perspective, 3D semantic scene completion [26, 37, 51, 52, 53] aligns with semantic occupancy perception [12, 32, 54, 55, 56].

Drawing on prior knowledge, humans excel at estimating the geometry and semantics of 3D environments and occluded regions, but this is more challenging for computers and machines [22]. SSCNet [25] first raised the problem of semantic scene completion and tried to address it via a convolutional neural network. Early 3D SSC research mainly dealt with static indoor scenes [25, 57, 58], such as the NYU [59] and SUNCG [25] datasets. After the release of the large-scale outdoor benchmark SemanticKITTI [60], numerous outdoor SSC methods emerged. Among them, MonoScene [26] introduced the first monocular method for outdoor 3D semantic scene completion. It employs 2D-to-3D back projection to lift the 2D image and utilizes consecutive 2D and 3D UNets for semantic scene completion. In recent years, an increasing number of approaches have incorporated multi-camera and temporal information [56, 61, 62] to enhance model comprehension of scenes and reduce completion ambiguity.

2.3.3. 3D Reconstruction from Images

3D reconstruction is a traditional but important topic in the computer vision and robotics communities [63, 64, 65, 66]. The objective of 3D reconstruction from images is to construct a 3D representation of an object or scene based on 2D images captured from one or more viewpoints. Early methods exploited shape-from-shading [67] or structure-from-motion [68]. Afterwards, the neural radiance field (NeRF) [69] introduced a novel paradigm for 3D reconstruction, which learns the density and color fields of 3D scenes, producing results with unprecedented detail and fidelity. However, such performance necessitates substantial training time and rendering resources [70, 71, 72], especially for high-resolution output. Recently, 3D Gaussian splatting (3D GS) [73] has addressed this issue with a paradigm-shifting approach to scene representation and rendering. Specifically, it represents the scene with millions of explicit 3D Gaussian functions, achieving faster and more efficient rendering [74]. 3D reconstruction emphasizes the geometric quality and visual appearance of the scene. In comparison, voxel-wise occupancy perception has lower resolution and visual-appearance requirements, focusing instead on the occupancy distribution and semantic understanding of the scene.

3. Methodologies

Recent occupancy perception methods for autonomous driving and their characteristics are detailed in Tab. 1. The table elaborates on the publication venue, input modality, network design, target task, network training and evaluation, and open-source status of each method. In this section, according to the modality of the input data, we categorize occupancy perception methods into three types: LiDAR-centric occupancy perception, vision-centric occupancy perception, and multi-modal occupancy perception. Additionally, network training strategies and the corresponding loss functions are discussed.

Figure 4: Architecture for LiDAR-centric occupancy perception: solely the 2D branch [75, 79], solely the 3D branch [11, 28, 107], or integrating both 2D and 3D branches [30].

3.1. LiDAR-Centric Occupancy Perception

3.1.1. General Pipeline

LiDAR-centric semantic segmentation [104, 105, 106] only predicts the semantic categories of sparse points. In contrast, LiDAR-centric occupancy perception provides a dense 3D understanding of the environment, which is crucial to autonomous driving systems. For LiDAR sensing, the acquired point clouds have an inherently sparse nature and suffer from occlusion. This requires LiDAR-centric occupancy perception not only to address sparse-to-dense occupancy reasoning for the scene, but also to achieve partial-to-complete estimation of objects [12].

Fig. 4 illustrates the general pipeline of LiDAR-centric occupancy perception. The input point cloud first undergoes voxelization and feature extraction, followed by representation enhancement via an encoder-decoder module. Ultimately, the complete and dense occupancy of the scene is inferred. Specifically, given a point cloud P ∈ R^{N×3}, we generate a series of initial voxels and extract their features. These voxels are distributed in a 3D volume [28, 30, 107, 108], on a 2D BEV plane [30, 75], or on three 2D tri-perspective view planes [79]. This operation constructs the 3D feature volume or 2D feature map, denoted as V_{init-3D} ∈ R^{X×Y×Z×D} and V_{init-2D} ∈ R^{X×Y×D}, respectively, where N represents the number of points, (X, Y, Z) are the length, width, and height, and D is the feature dimension of the voxels. In addition to voxelizing in regular Euclidean space, PointOcc [79] builds tri-perspective 2D feature maps in a cylindrical coordinate system. The cylindrical coordinate system aligns more closely with the spatial distribution of points in the LiDAR point cloud, where points closer to the LiDAR sensor are denser than those at farther distances. Therefore, it is reasonable to use smaller cylindrical voxels for fine-grained modeling of nearby areas. The voxelization and feature extraction of point clouds can be formulated as:

V_{init-2D/3D} = Φ_V (Ψ_V (P)),   (3)

where Ψ_V stands for pillar or cubic voxelization, and Φ_V is a feature extractor that either extracts neural features of voxels (e.g., using PointPillars [109], VoxelNet [110], or an MLP) [75, 79] or directly computes geometric statistics of the points within each voxel (e.g., mean, minimum, and maximum heights) [30, 107]. The encoder and decoder can be various modules that enhance the features.
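A minimal sketch of Eq. (3) is given below, using the statistics-based variant of Φ_V (per-voxel point count and mean/min/max height). The grid resolution and range are illustrative assumptions, learned extractors such as PointPillars or VoxelNet would replace the hand-crafted statistics, and the code assumes a recent PyTorch version that provides Tensor.index_reduce_.

```python
import torch

def voxelize_with_height_stats(points, pc_range=(-50., -50., -5., 50., 50., 3.), grid=(200, 200, 16)):
    """Eq. (3) with a hand-crafted Phi_V: scatter points into cubic voxels (Psi_V) and
    store [point count, mean z, min z, max z] per voxel. points: (N, 3) tensor."""
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    X, Y, Z = grid
    size = points.new_tensor([(x_max - x_min) / X, (y_max - y_min) / Y, (z_max - z_min) / Z])
    idx = torch.floor((points - points.new_tensor([x_min, y_min, z_min])) / size).long()
    valid = ((idx >= 0) & (idx < idx.new_tensor([X, Y, Z]))).all(dim=1)
    idx, z = idx[valid], points[valid][:, 2]
    flat = (idx[:, 0] * Y + idx[:, 1]) * Z + idx[:, 2]            # flattened voxel index

    n_vox = X * Y * Z
    count = torch.zeros(n_vox).index_add_(0, flat, torch.ones_like(z))
    z_sum = torch.zeros(n_vox).index_add_(0, flat, z)
    # Voxels containing no points keep their sentinel heights.
    z_min_v = torch.full((n_vox,), z_max).index_reduce_(0, flat, z, "amin")
    z_max_v = torch.full((n_vox,), z_min).index_reduce_(0, flat, z, "amax")
    mean_z = z_sum / count.clamp(min=1.0)
    feat = torch.stack([count, mean_z, z_min_v, z_max_v], dim=-1)  # (X*Y*Z, D=4)
    return feat.view(X, Y, Z, 4)                                   # V_init-3D

points = torch.rand(30000, 3) * torch.tensor([100., 100., 8.]) + torch.tensor([-50., -50., -5.])
print(voxelize_with_height_stats(points).shape)                    # torch.Size([200, 200, 16, 4])
```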
Table 1: 3D occupancy perception methods for autonomous driving. Columns, from left to right: Method; Venue; Modality; design choices (Feature Format, Multi-Camera, Multi-Frame, Lightweight Design, Head); Task; training (Supervision, Loss); evaluation datasets (SemanticKITTI [60], Occ3D-nuScenes [36], Others); open source (Code, Weight).
LMSCNet [28] 3DV 2020 L BEV - - 2D Conv 3D Conv P Strong CE ✓ - - ✓ ✓
S3CNet [30] CoRL 2020 L BEV+Vol - - Sparse Conv 2D &3D Conv P Strong CE, PA, BCE ✓ - - - -

DIFs [75] T-PAMI 2021 L BEV - - 2D Conv MLP P Strong CE, BCE, SC ✓ - - - -

MonoScene [26] CVPR 2022 C Vol - - - 3D Conv P Strong CE, FP, Aff ✓ - - ✓ ✓

TPVFormer [32] CVPR 2023 C TPV ✓ - TPV Rp MLP P Strong CE, LS ✓ - - ✓ ✓


VoxFormer [33] CVPR 2023 C Vol - ✓ - MLP P Strong CE, BCE, Aff ✓ - - ✓ ✓
OccFormer [55] ICCV 2023 C BEV+Vol - - - Mask Decoder P Strong BCE, MC ✓ - - ✓ ✓
OccNet [8] ICCV 2023 C BEV+Vol ✓ ✓ - MLP P Strong Foc - - OpenOcc [8] ✓ -
SurroundOcc [76] ICCV 2023 C Vol ✓ - - 3D Conv P Strong CE, Aff ✓ - SurroundOcc [76] ✓ ✓
OpenOccupancy [11] ICCV 2023 C+L Vol ✓ - - 3D Conv P Strong CE, LS, Aff - - OpenOccupancy [11] ✓ -
NDC-Scene [77] ICCV 2023 C Vol - - - 3D Conv P Strong CE, FP, Aff ✓ - - ✓ ✓
Occ3D [36] NeurIPS 2023 C Vol ✓ - - MLP P - - - ✓ Occ3D-Waymo [36] - -
POP-3D [9] NeurIPS 2023 C+T+L Vol ✓ - - MLP OP Semi CE, LS, MA - - POP-3D [9] ✓ ✓
OCF [78] arXiv 2023 L BEV/Vol - ✓ - 3D Conv P&F Strong BCE, SI - - OCFBen [78] ✓ -
PointOcc [79] arXiv 2023 L TPV - - TPV Rp MLP P Strong Aff - - OpenOccupancy [11] ✓ ✓
FlashOcc [80] arXiv 2023 C BEV ✓ ✓ 2D Conv 2D Conv P - - - ✓ - ✓ ✓
OccNeRF [38] arXiv 2023 C Vol ✓ ✓ - 3D Conv P Self CE, Pho - ✓ - ✓ ✓

Vampire [81] AAAI 2024 C Vol ✓ - - MLP+T P Weak CE, LS - ✓ - ✓ ✓


FastOcc [82] ICRA 2024 C BEV ✓ - 2D Conv MLP P Strong CE, LS, Foc, BCE, Aff - ✓ - - -
RenderOcc [83] ICRA 2024 C Vol ✓ ✓ - MLP+T P Weak CE, SIL ✓ ✓ - ✓ ✓
MonoOcc [84] ICRA 2024 C Vol - ✓ - MLP P Strong CE, Aff ✓ - - ✓ ✓
COTR [85] CVPR 2024 C Vol ✓ ✓ - Mask Decoder P Strong CE, MC - ✓ - - -
Cam4DOcc [10] CVPR 2024 C Vol ✓ ✓ - 3D Conv P&F Strong CE - - Cam4DOcc [10] ✓ ✓
PanoOcc [4] CVPR 2024 C Vol ✓ ✓ - MLP, DETR PO Strong Foc, LS - ✓ nuScenes [86] ✓ ✓
SelfOcc [87] CVPR 2024 C BEV/TPV ✓ - - MLP+T P Self Pho ✓ ✓ - ✓ ✓
Symphonies [88] CVPR 2024 C Vol - - - 3D Conv P Strong CE, Aff ✓ - SSCBench [37] ✓ ✓
HASSC [89] CVPR 2024 C Vol ✓ ✓ - MLP P Strong CE, Aff, KD ✓ - - - -
SparseOcc [90] CVPR 2024 C Vol ✓ - Sparse Rp Mask Decoder P Strong - ✓ - OpenOccupancy [11] - -
MVBTS [91] CVPR 2024 C Vol ✓ ✓ - MLP+T P Self Pho, KD - - KITTI-360 [92] ✓ -
DriveWorld [7] CVPR 2024 C BEV ✓ ✓ - 3D Conv P Strong CE - - OpenScene [93] - -
Bi-SSC [94] CVPR 2024 C BEV ✓ - - 3D Conv P Strong CE, Aff ✓ - - - -
LowRankOcc [95] CVPR 2024 C Vol - - TRDR Mask Decoder P Strong MC ✓ - SurroundOcc [76] - -
PaSCo [96] CVPR 2024 L Vol - - - Mask Decoder PO Strong CE, LS, MC ✓ - SSCBench [37], Robo3D [97] ✓ ✓
HTCL [98] ECCV 2024 C Vol ✓ ✓ - 3D Conv P Strong CE, Aff ✓ - OpenOccupancy [11] ✓ ✓
OSP [99] ECCV 2024 C Point ✓ - - MLP P Strong CE - ✓ - ✓ ✓
OccGen [100] ECCV 2024 C+L Vol ✓ - - 3D Conv P Strong CE, LS, Aff ✓ - OpenOccupancy [11] - -
Scribble2Scene [101] IJCAI 2024 C Vol - ✓ - MLP P Weak CE, Aff, KD ✓ - SemanticPOSS [102] - -
BRGScene IJCAI 2024 C Vol ✓ - - 3D Conv P Strong CE, BCE, Aff ✓ - - ✓ ✓
Co-Occ [103] RA-L 2024 C+L Vol ✓ - - 3D Conv P Strong CE, LS, Pho ✓ - SurroundOcc [76] - -
OccFusion [12] arXiv 2024 C+L/R BEV+Vol ✓ - - MLP P Strong Foc, LS, Aff - - SurroundOcc [76] - -
HyDRa [14] arXiv 2024 C+R BEV+PV ✓ - - 3D Conv P - - - ✓ - - -
Modality: C - Camera; L - LiDAR; R - Radar; T - Text.
Feature Format: Vol - Volumetric Feature; BEV - Bird’s-Eye View Feature; PV - Perspective View Feature; TPV - Tri-Perspective View Feature; Point - Point Feature.
Lightweight Design: TPV Rp - Tri-Perspective View Representation; Sparse Rp - Sparse Representation; TRDR - Tensor Residual Decomposition and Recovery.
Head: MLP+T - Multi-Layer Perceptron followed by Thresholding.
Task: P - Prediction; F - Forecasting; OP - Open-Vocabulary Prediction; PO - Panoptic Occupancy.
Loss: [Geometric] BCE - Binary Cross Entropy, SIL - Scale-Invariant Logarithmic, SI - Soft-IoU; [Semantic] CE - Cross Entropy, PA - Position Awareness, FP - Frustum Proportion, LS -
Lovasz-Softmax, Foc - Focal; [Semantic and Geometric] Aff - Scene-Class Affinity, MC - Mask Classification; [Consistency] SC - Spatial Consistency, MA - Modality Alignment, Pho -
Photometric Consistency; [Distillation] KD - Knowledge Distillation.

The final 3D occupancy inference involves applying convolution [28, 30, 78] or an MLP [75, 79, 108] on the enhanced features to infer the occupied status {1, 0} of each voxel, and even to estimate its semantic category:

V = f_{Conv/MLP} (ED (V_{init-2D/3D})),   (4)

where ED represents the encoder and decoder.

3.1.2. Information Fusion in LiDAR-Centric Occupancy

Some works directly utilize a single 2D branch to reason about 3D occupancy, such as DIFs [75] and PointOcc [79]. In these approaches, only 2D feature maps instead of 3D feature volumes are required, resulting in reduced computational demand. However, a significant disadvantage is the partial loss of height information. In contrast, the 3D branch does not compress data along any dimension, thereby preserving the complete 3D scene. To enhance memory efficiency in the 3D branch, LMSCNet [28] turns the height dimension into the feature channel dimension (sketched in code after this subsection). This adaptation enables the use of more efficient 2D convolutions instead of the 3D convolutions in the 3D branch. Moreover, integrating information from both 2D and 3D branches can significantly refine occupancy predictions [30].

S3CNet [30] proposes a unique late-fusion strategy for integrating information from the 2D and 3D branches. This fusion strategy involves a dynamic voxel fusion technique that leverages the results of the 2D branch to enhance the density of the output from the 3D branch. Ablation studies report that this straightforward and direct information fusion strategy can yield a 5-12% performance boost in 3D occupancy perception.
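The height-to-channel trick mentioned above can be sketched as follows: the Z dimension of a voxel feature volume is folded into the channel dimension so that cheap 2D convolutions operate on a BEV-shaped tensor. The channel widths below are arbitrary assumptions rather than the exact LMSCNet configuration.

```python
import torch
import torch.nn as nn

class HeightAsChannel2D(nn.Module):
    """Fold the height axis of a voxel volume into channels and apply 2D convs,
    in the spirit of LMSCNet's memory-efficient 2D processing (illustrative sizes)."""
    def __init__(self, z_bins: int, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(z_bins * feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, z_bins, kernel_size=1),  # one occupancy logit per height bin
        )

    def forward(self, voxel_feat: torch.Tensor) -> torch.Tensor:
        # voxel_feat: (B, X, Y, Z, D) -> (B, Z*D, X, Y), i.e., height becomes channels.
        b, x, y, z, d = voxel_feat.shape
        bev = voxel_feat.permute(0, 3, 4, 1, 2).reshape(b, z * d, x, y)
        logits = self.net(bev)                      # (B, Z, X, Y) occupancy logits
        return logits.permute(0, 2, 3, 1)           # back to (B, X, Y, Z)

occ_logits = HeightAsChannel2D(z_bins=16, feat_dim=4)(torch.rand(1, 200, 200, 16, 4))
print(occ_logits.shape)  # torch.Size([1, 200, 200, 16])
```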
Figure 5: Architecture for vision-centric occupancy perception: methods without temporal fusion [31, 32, 36, 38, 76, 81, 82, 83, 87, 116] and methods with temporal fusion [4, 8, 10, 56, 80, 85].

Figure 6: Key components of vision-centric 3D occupancy perception, namely techniques for view transformation (i.e., 2D to 3D), multi-camera information integration (i.e., spatial fusion), and historical information integration (i.e., temporal fusion). (a) 2D-to-3D transformation: the fundamental unit for constructing 3D data from 2D observations, typically by projection [26, 31, 38, 53, 91, 119], back projection [55, 80, 81, 82, 90], or cross attention [4, 36, 76, 113, 118]. (b) Spatial information fusion: in areas where multiple cameras have overlapping fields of view, features from these cameras are fused through averaging [38, 53, 82] or cross attention [4, 32, 76, 113, 120]. (c) Temporal information fusion: historical and current features undergo temporal-spatial alignment and are then fused via convolution [4], cross attention [8, 33, 56, 120], or adaptive mixing [118].

3.2. Vision-Centric Occupancy Perception

3.2.1. General Pipeline

Inspired by Tesla's perception system for its autonomous vehicles [27], vision-centric occupancy perception has garnered significant attention in both industry and academia. Compared to LiDAR-centric methods, vision-centric occupancy perception, which relies solely on camera sensors, represents a current trend. There are three main reasons: (i) cameras are cost-effective for large-scale deployment on vehicles; (ii) RGB images capture rich environmental textures, aiding the understanding of scenes and objects such as traffic signs and lane lines; and (iii) the burgeoning advancement of deep learning technologies makes it possible to achieve 3D occupancy perception from 2D vision. Vision-centric occupancy perception can be divided into monocular solutions [26, 33, 51, 52, 54, 55, 84, 88, 111, 112] and multi-camera solutions [8, 31, 32, 38, 53, 61, 80, 81, 113, 114, 115]. Multi-camera perception, which covers a broader field of view, follows the general pipeline shown in Fig. 5. It begins by extracting front-view feature maps from multi-camera images, followed by a 2D-to-3D transformation, spatial information fusion, and optional temporal information fusion, culminating in an occupancy head that infers the environmental 3D occupancy.

Specifically, the 2D feature map F_2D(u, v) extracted from an RGB image forms the basis of the vision-centric occupancy pipeline. Its extraction leverages a pre-trained image backbone network Φ_F, such as the convolution-based ResNet [39] or the Transformer-based ViT [117]: F_2D(u, v) = Φ_F(I(u, v)), where I denotes the input image and (u, v) are pixel coordinates. Since the front view provides only a 2D perspective, a 2D-to-3D transformation is essential to deduce the depth dimension that the front view lacks, thereby enabling 3D scene perception. The 2D-to-3D transformation is detailed next.

3.2.2. 2D-to-3D Transformation

The transformation is designed to convert front-view features into BEV features [61, 80], TPV features [32], or volumetric features [33, 76, 85] to acquire the depth dimension missing from the front view. Notably, although BEV features are located on the top-view 2D plane, they can encode height information into the channel dimension of the features, thereby representing the 3D scene. The tri-perspective view projects the 3D space onto three orthogonal 2D planes, so that each feature in the 3D space can be represented as a combination of three TPV features. The 2D-to-3D transformation is formulated as F_{BEV/TPV/Vol}(x*, y*, z*) = Φ_T(F_2D(u, v)), where (x, y, z) are coordinates in 3D space, * indicates that the specific dimension may not exist in the BEV or TPV planes, and Φ_T is the conversion from 2D to 3D. This transformation can be categorized into three types, characterized by projection [26, 31, 38, 53], back projection [55, 80, 81, 82], and cross attention [4, 36, 76, 113, 118], respectively. Taking the construction of volumetric features as an example, the process is illustrated in Fig. 6a.

(1) Projection: This establishes a geometric mapping from the feature volume to the feature map. The mapping is achieved by projecting each voxel centroid in 3D space onto the 2D front-view feature map through the perspective projection model Ψ_ρ [121], followed by sampling Ψ_S via bilinear interpolation [26, 31, 38, 53]. This projection process is formulated as:

F_Vol(x, y, z) = Ψ_S (F_2D (Ψ_ρ (x, y, z, K, RT))),   (5)

where K and RT are the intrinsics and extrinsics of the camera. The problem with the projection-based 2D-to-3D transformation is that, along a line of sight, multiple voxels in 3D space correspond to the same location in the front-view feature map. This many-to-one mapping introduces ambiguity in the correspondence between 2D and 3D.
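A minimal sketch of the projection-based transformation in Eq. (5) is shown below: voxel centers are projected with assumed intrinsics and extrinsics, and the image features are bilinearly sampled with grid_sample. All tensor sizes, camera parameters, and the identity extrinsics are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def project_and_sample(feat_2d, voxel_centers, K, cam_from_world, img_hw):
    """Eq. (5): F_Vol(x,y,z) = sample(F_2D, project(x,y,z,K,RT)).
    feat_2d: (1, C, Hf, Wf) image features; voxel_centers: (X, Y, Z, 3) in the world frame."""
    X, Y, Z, _ = voxel_centers.shape
    pts = voxel_centers.reshape(-1, 3)
    pts_h = torch.cat([pts, torch.ones(pts.shape[0], 1)], dim=1)           # homogeneous coords
    cam = (cam_from_world @ pts_h.T).T                                     # world -> camera frame
    in_front = cam[:, 2] > 0.1
    pix = (K @ cam.T).T
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-5)                         # perspective division
    H, W = img_hw
    grid = torch.stack([pix[:, 0] / (W - 1) * 2 - 1,                       # normalize to [-1, 1]
                        pix[:, 1] / (H - 1) * 2 - 1], dim=-1)
    sampled = F.grid_sample(feat_2d, grid.view(1, 1, -1, 2),
                            mode="bilinear", align_corners=True)           # (1, C, 1, X*Y*Z)
    feat = sampled[0, :, 0].T                                              # (X*Y*Z, C)
    feat[~in_front] = 0.0                                                  # drop voxels behind the camera
    return feat.reshape(X, Y, Z, -1)                                       # F_Vol

# Toy example with assumed camera parameters and a coarse 20x20x4 voxel grid.
K = torch.tensor([[300.0, 0.0, 320.0], [0.0, 300.0, 180.0], [0.0, 0.0, 1.0]])
cam_from_world = torch.eye(4)[:3]                                          # identity extrinsics (3x4)
xs, ys, zs = torch.meshgrid(torch.linspace(-10, 10, 20), torch.linspace(-10, 10, 20),
                            torch.linspace(-1, 3, 4), indexing="ij")
centers = torch.stack([xs, ys, zs + 10.0], dim=-1)                         # put the grid in front of the camera
out = project_and_sample(torch.rand(1, 64, 45, 80), centers, K, cam_from_world, (360, 640))
print(out.shape)  # torch.Size([20, 20, 4, 64])
```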
(2) Back Projection: Back projection is the reverse process of projection. Similarly, it also utilizes perspective projection to establish correspondences between 2D and 3D. However, unlike projection, back projection uses the estimated depth d of each pixel to calculate an accurate one-to-one mapping from 2D to 3D:

F_Vol (Ψ_V (Ψ_ρ^{-1} (u, v, d, K, RT))) = F_2D (u, v),   (6)

where Ψ_ρ^{-1} indicates the inverse projection function and Ψ_V is voxelization. Since estimating a single depth value may introduce errors, it is more effective to predict a discrete depth distribution Dis along the optical ray rather than a specific depth for each pixel [55, 80, 81, 82]. That is, F_Vol = F_2D ⊗ Dis, where ⊗ denotes the outer product (a code sketch of this lifting is given at the end of this subsection). This depth distribution-based re-projection, derived from LSS [122], has significant advantages. On the one hand, it can handle uncertainty and ambiguity in depth perception: if the depth of a certain pixel is unclear, the model can express this uncertainty through the depth distribution. On the other hand, this probabilistic method of depth estimation provides greater robustness, particularly in a multi-camera setting. If corresponding pixels in multi-camera images are mapped to the same voxel in 3D space with incorrect depth values, their information might not be integrable. In contrast, estimating depth distributions allows information fusion under depth uncertainty, leading to more robustness and accuracy.

(3) Cross Attention: The cross attention-based transformation aims to let the feature volume and the feature map interact in a learnable manner. Consistent with the attention mechanism [40], each volumetric feature in the 3D feature volume acts as the query, while the key and value come from the 2D feature map. However, employing vanilla cross attention for the 2D-to-3D transformation requires considerable computational expense, as each query must attend to all features in the feature map. To optimize for GPU efficiency, many transformation methods [4, 36, 76, 113, 118] adopt deformable cross attention [123], where the query interacts with selected reference features instead of all features in the feature map, thereby greatly reducing computation. Specifically, for each query, we project its 3D position q onto the 2D feature map according to the given intrinsics and extrinsics, sample some reference features around the projected 2D position p, and then weight and sum the sampled features according to the deformable attention mechanism:

F_Vol(q) = Σ_{i=1}^{N_head} W_i Σ_{j=1}^{N_key} A_{ij} W_{ij} F_2D (p + Δp_{ij}),   (7)

where (W_i, W_{ij}) are learnable weights, A_{ij} denotes attention, p + Δp_{ij} represents the position of the reference feature, and Δp_{ij} indicates a learnable position shift.

Furthermore, there are hybrid transformation methods that combine multiple 2D-to-3D transformation techniques. VoxFormer [33] and SGN [51] initially compute a coarse 3D feature volume by per-pixel depth estimation and back projection, and subsequently refine the feature volume using cross attention. COTR [85] has a similar hybrid transformation to VoxFormer and SGN, but it replaces per-pixel depth estimation with the estimation of depth distributions.

For TPV features, TPVFormer [32] achieves the 2D-to-3D transformation via cross attention. The conversion process differs slightly from that depicted in Fig. 6a, in that the 3D feature volume is replaced by the 2D feature map of a specific perspective among the three views. For BEV features, the conversion from the front view to the bird's-eye view can be achieved by cross attention [61] or by back projection followed by vertical pooling [61, 80].

3.2.3. Information Fusion in Vision-Centric Occupancy

In a multi-camera setting, each camera's front-view feature map describes part of the scene. To comprehensively understand the scene, it is necessary to spatially fuse the information from multiple feature maps. Additionally, objects in the scene might be occluded or in motion; temporally fusing feature maps from multiple frames can help reason about occluded areas and recognize the motion status of objects.

(1) Spatial Information Fusion: Fusing the observations from multiple cameras creates a 3D feature volume with an expanded field of view for scene perception. Within the overlapping area of multi-camera views, a 3D voxel in the feature volume will hit several 2D front-view feature maps after projection. There are two ways to fuse the hit 2D features: averaging [38, 53, 82] and cross attention [4, 32, 76, 113], as illustrated in Fig. 6b. The averaging operation calculates the mean of the multiple features, which simplifies the fusion process and reduces computational costs. However, it assumes an equivalent contribution of the different 2D perspectives to perceiving the 3D scene, which may not always hold, especially when certain views are occluded or blurry.

To address this problem, multi-camera cross attention is used to adaptively fuse information from multiple views. Specifically, its process can be regarded as an extension of Eq. 7 that incorporates more camera views. We redefine the deformable attention function as DA(q, p_i, F_{2D-i}), where q is a query position in 3D space, p_i is its projected position on a specific 2D view, and F_{2D-i} is the corresponding 2D front-view feature map. The multi-camera cross attention process can be formulated as:

F_Vol(q) = (1 / |ν|) Σ_{i∈ν} DA (q, p_i, F_{2D-i}),   (8)

where F_Vol(q) represents the feature of the query position in the 3D feature volume, and ν denotes all hit views.
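To make the depth-distribution-based back projection (F_Vol = F_2D ⊗ Dis, LSS-style) concrete, the sketch below lifts an image feature map into a per-pixel frustum of 3D features via an outer product over depth bins. The subsequent scatter/pooling of frustum features into voxels is omitted, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DepthDistributionLift(nn.Module):
    """LSS-style lifting: predict a discrete depth distribution per pixel and take the
    outer product with the image features (F_Vol = F_2D (x) Dis). Illustrative sizes."""
    def __init__(self, in_channels: int = 64, depth_bins: int = 48, out_channels: int = 64):
        super().__init__()
        self.depth_head = nn.Conv2d(in_channels, depth_bins, kernel_size=1)
        self.feat_head = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feat_2d: torch.Tensor) -> torch.Tensor:
        # feat_2d: (B, C, H, W) front-view features.
        depth_dist = self.depth_head(feat_2d).softmax(dim=1)      # (B, D, H, W), sums to 1 over depth
        context = self.feat_head(feat_2d)                         # (B, C', H, W)
        # Outer product over the depth axis: each pixel spreads its feature along its ray,
        # weighted by how likely each depth bin is.
        frustum = depth_dist.unsqueeze(2) * context.unsqueeze(1)  # (B, D, C', H, W)
        return frustum  # voxel/BEV pooling with camera geometry would follow here

frustum_feat = DepthDistributionLift()(torch.rand(1, 64, 45, 80))
print(frustum_feat.shape)  # torch.Size([1, 48, 64, 45, 80])
```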
(2) Temporal Information Fusion: Recent advances in vision-based BEV perception systems [44, 124, 125] have demonstrated that integrating temporal information can significantly improve perception performance. Similarly, in vision-based occupancy perception, accuracy and reliability can be improved by combining relevant information from historical features and the current perception inputs. The process of temporal information fusion consists of two components: temporal-spatial alignment and feature fusion, as illustrated in Fig. 6c. Temporal-spatial alignment leverages the pose information of the ego vehicle to spatially align the historical features F_{t−k} with the current features. The alignment process is formulated as:

F_{t−k} = Ψ_S (T_{t−k→t} · F_{t−k}),   (9)

where T_{t−k→t} is the transformation matrix that converts frame t − k to the current frame t, involving translation and rotation, and Ψ_S represents feature sampling.

Once the alignment is completed, the historical and current features are fed to a feature fusion module to enhance the representation, especially to strengthen the reasoning ability for occlusion and the ability to recognize moving objects. There are three main streams of feature fusion, namely convolution, cross attention, and adaptive mixing. PanoOcc [4] concatenates the previous features with the current ones, then fuses them using a set of 3D residual convolution blocks. Many occupancy perception methods [22, 33, 56, 84, 120] utilize cross attention for fusion. The process is similar to multi-camera cross attention (refer to Eq. 8), except that 3D-space voxels are projected onto 2D multi-frame feature maps instead of multi-camera feature maps. Moreover, SparseOcc [118] employs adaptive mixing [126] for temporal information fusion (concurrently, two works with the same name SparseOcc [90, 118] explore sparsity in occupancy from different directions). For the query feature of the current frame, SparseOcc samples S_n features from historical frames and aggregates them through adaptive mixing. Specifically, the sampled features are multiplied by a channel mixing matrix W_C and a point mixing matrix W_{S_n}, respectively. These mixing matrices are dynamically generated from the query feature F_q of the current frame:

W_{C/S_n} = Linear (F_q) ∈ R^{C×C} / R^{S_n×S_n}.   (10)

The output of adaptive mixing is flattened, undergoes linear projection, and is then added to the query feature as a residual.

The features resulting from spatial and temporal information fusion are processed by various types of heads to determine 3D occupancy, including convolutional heads, mask decoder heads, linear projection heads, and linear projection with thresholding heads. Convolution-based heads [7, 10, 26, 38, 61, 76, 114] consist of multiple 3D convolutional layers. Mask decoder-based heads [55, 85, 90, 118], inspired by MaskFormer [127] and Mask2Former [128], formalize 3D semantic occupancy prediction as the estimation of a set of binary 3D masks, each associated with a corresponding semantic category. Specifically, they compute per-voxel embeddings and per-query embeddings along with their related semantics; the final occupancy predictions are obtained by calculating the dot product of these two embeddings. Linear projection-based heads [4, 32, 33, 36, 51, 84, 89] apply lightweight MLPs along the feature channel dimension to produce the occupied status and semantics. Furthermore, for the occupancy methods [81, 83, 87, 91, 116] based on NeRF [69], the occupancy head uses two separate MLPs (MLP_σ, MLP_s) to estimate a density volume V_σ and a semantic volume V_S. The occupied voxels are then selected based on a given confidence threshold τ, and their semantic categories are determined from V_S:

V(x, y, z) = argmax (V_S(x, y, z))  if V_σ(x, y, z) ≥ τ,
V(x, y, z) = empty                  if V_σ(x, y, z) < τ,   (11)

where (x, y, z) are 3D coordinates.

Figure 7: Architecture for multi-modal occupancy perception: fusion of information from point clouds and images [11, 12, 15, 103, 100]. Dashed lines signify additional fusion of perspective-view feature maps [14]. ⊙ represents the element-wise product. δ is a learnable weight.

3.3. Multi-Modal Occupancy Perception

3.3.1. General Pipeline

RGB images captured by cameras provide rich and dense semantic information but are sensitive to changes in weather conditions and lack precise geometric details. In contrast, point clouds from LiDAR or radar are robust to weather changes and excel at capturing scene geometry with accurate depth measurements; however, they only produce sparse features. Multi-modal occupancy perception can combine the advantages of multiple modalities and mitigate the limitations of single-modal perception. Fig. 7 illustrates the general pipeline of multi-modal occupancy perception. Most multi-modal methods [11, 12, 15, 103] map 2D image features into 3D space and then fuse them with point cloud features. Moreover, incorporating 2D perspective-view features in the fusion process can further refine the representation [14]. The fused representation is processed by an optional refinement module and an occupancy head, such as 3D convolution or an MLP, to generate the final 3D occupancy predictions. The optional refinement module [100] can be a combination of cross attention, self attention, and diffusion denoising [129].

3.3.2. Information Fusion in Multi-Modal Occupancy

There are three primary multi-modal information fusion techniques to integrate the different modality branches: concatenation, summation, and cross attention.

(1) Concatenation: Inspired by BEVFusion [47, 48], OccFusion [12] combines 3D feature volumes from different modalities by concatenating them along the feature channel and subsequently applying convolutional layers. Similarly, RT3DSO [15] concatenates the intensity values of 3D points and their corresponding 2D image features (via projection), and then feeds the combined data to convolutional layers. However, some voxels in 3D space may contain features from only the point-cloud branch or only the vision branch. To alleviate this problem, CO-Occ [103] introduces the geometric- and semantic-aware fusion (GSFusion) module, which identifies voxels containing both point-cloud and visual information. This module utilizes a K-nearest-neighbors (KNN) search [130] to select the k nearest neighbors of a given position in voxel space within a specific radius. For the i-th non-empty feature F_{L_i} from the point-cloud branch, its nearest visual-branch features are represented as {F_{V_i}^1, · · · , F_{V_i}^k}, and a learnable weight ω_i is acquired by linear projection:

ω_i = Linear (Concat (F_{V_i}^1, · · · , F_{V_i}^k)).   (12)

The resulting LiDAR-vision features are expressed as F_{LV} = Concat (F_V, F_L, F_L · ω), where ω denotes the geometric-semantic weight obtained from the ω_i.

(2) Summation: CONet [11] and OccGen [100] adopt an adaptive fusion module to dynamically integrate the occupancy representations from the camera and LiDAR branches. It leverages 3D convolution to process multiple single-modal representations to determine their fusion weights, and subsequently applies these weights to sum the LiDAR-branch representation and the camera-branch features.

(3) Cross Attention: HyDRa [14] proposes integrating multi-modal information in the perspective-view (PV) and BEV representation spaces. Specifically, the PV image features are improved by the BEV point-cloud features using cross attention. Afterwards, the enhanced PV image features are converted into a BEV visual representation with estimated depth. These BEV visual features are further enhanced by concatenation with BEV point-cloud features, followed by a simple Squeeze-and-Excitation layer [131]. Finally, the enhanced PV image features and the enhanced BEV visual features are fused through cross attention, resulting in the final occupancy representation.

3.4. Network Training

We classify the network training techniques mentioned in the literature based on their type of supervision. The most prevalent type is strongly-supervised learning, while others employ weak, semi, or self supervision for training. This section details these network training techniques and their associated loss functions. The 'Training' column in Tab. 1 offers a concise overview of network training across the various occupancy perception methods.

3.4.1. Training with Strong Supervision

Strongly-supervised learning for occupancy perception involves using occupancy labels to train occupancy networks. Most occupancy perception methods adopt this training manner [4, 10, 26, 28, 32, 55, 76, 82, 84, 85, 108, 114]. The corresponding loss functions can be categorized as: geometric losses, which optimize geometric accuracy; semantic losses, which enhance semantic prediction; combined semantic and geometric losses, which encourage both better semantic and geometric accuracy; consistency losses, which encourage overall consistency; and distillation losses, which transfer knowledge from a teacher model to a student model. Next, we describe them in detail.

Among the geometric losses, the Binary Cross-Entropy (BCE) loss, which distinguishes empty voxels from occupied voxels, is the most commonly used [30, 33, 55, 75, 82]. The BCE loss is formulated as:

L_BCE = − (1 / N_V) Σ_{i=0}^{N_V} [ V̂_i log(V_i) + (1 − V̂_i) log(1 − V_i) ],   (13)

where N_V is the number of voxels in the occupancy volume V. Moreover, there are two other geometric losses: the scale-invariant logarithmic loss [132] and the soft-IoU loss [133]. SimpleOccupancy [31] calculates the logarithmic difference between the predicted and ground-truth depths as the scale-invariant logarithmic loss. This loss relies on logarithmic rather than absolute differences, and therefore offers a degree of scale invariance. OCF [78] employs the soft-IoU loss to better optimize Intersection over Union (IoU) and prediction confidence.

Cross-entropy (CE) loss is the preferred loss for optimizing occupancy semantics [26, 32, 88, 89, 103, 114]. It treats classes as independent entities, and is formally expressed as:

L_CE = − (1 / N_C) Σ_{i=0}^{N_V} Σ_{c=0}^{N_C} ω_c V̂_{ic} log ( e^{V_{ic}} / Σ_{c'} e^{V_{ic'}} ),   (14)

where (V, V̂) are the ground-truth and predicted semantic occupancy with N_C categories, and ω_c is a weight for a specific class c given by the inverse of the class frequency. Notably, the CE loss and BCE loss are also widely used in semantic segmentation [134, 135]. Besides these losses, some occupancy perception methods employ other semantic losses commonly utilized in semantic segmentation tasks [136, 137], such as the Lovasz-Softmax loss [138] and the Focal loss [139]. Furthermore, there are two specialized semantic losses: the frustum proportion loss [26], which provides cues to alleviate occlusion ambiguities from the perspective of the visual frustum, and the position awareness loss [140], which leverages local semantic entropy to encourage sharper semantic and geometric gradients.

The losses that can simultaneously optimize semantics and geometry include the scene-class affinity loss [26] and the mask classification loss [127, 128]. The former optimizes a combination of precision, recall, and specificity from both geometric and semantic perspectives. The latter is typically associated with a mask decoder head [55, 85]. The mask classification loss, originating from MaskFormer [127] and Mask2Former [128], combines a cross-entropy classification loss and a binary mask loss for each predicted mask segment.

The consistency loss and distillation loss correspond to the spatial consistency loss [75] and the Kullback-Leibler (KL) divergence loss [141], respectively. The spatial consistency loss minimizes the Jensen-Shannon divergence of the semantic inference between a given point and some support points in space, thereby enhancing the spatial consistency of semantics. KL divergence, also known as relative entropy, quantifies how one probability distribution deviates from a reference distribution. HASSC [89] adopts the KL divergence loss to encourage the student model to learn more accurate occupancy from the online soft labels provided by the teacher model.

3.4.2. Training with Other Supervisions

Training with strong supervision is straightforward and effective, but requires tedious voxel-wise annotation. In contrast, training with other types of supervision, such as weak, semi, and self supervision, is label-efficient.

(1) Weak Supervision: Occupancy labels are not used, and supervision is derived from alternative labels. For example, point clouds with semantic labels can guide occupancy prediction. Specifically, Vampire [81] and RenderOcc [83] construct density and semantic volumes, which facilitate the inference of the semantic occupancy of the scene and the computation of depth and semantic maps through volumetric rendering. These methods do not employ occupancy labels; instead, they project LiDAR point clouds with semantic labels onto the camera plane to acquire ground-truth depth and semantics, which then supervise network training. Since both strongly-supervised and weakly-supervised learning predict geometric and semantic occupancy, the losses used in strongly-supervised learning, such as the cross-entropy loss, Lovasz-Softmax loss, and scale-invariant logarithmic loss, are also applicable to weakly-supervised learning.

(2) Semi Supervision: Occupancy labels are used but do not cover the complete scene, therefore providing only partial supervision for occupancy network training. POP-3D [9] initially generates occupancy labels by processing LiDAR point clouds, where a voxel is recorded as occupied if it contains at least one LiDAR point and empty otherwise. Given the sparsity and occlusions inherent in LiDAR point clouds, the occupancy labels produced in this manner do not encompass the entire space, meaning that only portions of the scene have their occupancy labelled. POP-3D employs the cross-entropy loss and Lovasz-Softmax loss to supervise network training. Moreover, to establish the cross-modal correspondence between text and 3D occupancy, POP-3D proposes to calculate the L2 mean square error between language-image features and 3D-language features as a modality alignment loss.

(3) Self Supervision: Occupancy perception networks are trained without any labels. To this end, volume rendering [69] provides a self-supervised signal that encourages consistency across different views from temporal and spatial perspectives by minimizing photometric differences. MVBTS [91] computes the photometric difference between the rendered RGB image and the target RGB image. However, several other methods calculate this difference between the warped image (from the source image) and the target image [31, 38, 87], where the depth needed for the warping process is acquired by volumetric rendering. OccNeRF [38] argues that rendered images are not compared because the large scale of outdoor scenes and the few supervising views would make volume rendering networks difficult to converge. Mathematically, the photometric consistency loss [148] combines an L1 loss and an optional structured similarity (SSIM) loss [149] to calculate the reconstruction error between the warped image Î and the target image I:

L_Pho = (α / 2) (1 − SSIM(I, Î)) + (1 − α) ||I − Î||_1,   (15)

where α is a hyperparameter weight. Furthermore, OccNeRF leverages a cross-entropy loss for semantic optimization in a self-supervised manner, where the semantic labels come directly from pre-trained semantic segmentation models, such as the pre-trained open-vocabulary model Grounded-SAM [150, 151, 152].
Table 2: Overview of 3D occupancy datasets with multi-modal sensors. Ann.: Annotation. Occ.: Occupancy. C: Camera. L: LiDAR. R: Radar. D: Depth map.
Flow: 3D occupancy flow. Datasets highlighted in light gray are meta datasets.

Columns: Dataset | Year | Meta Dataset | Sensor Data (Modalities, Scenes, Frames/Clips with Ann., 3D Scans, Images) | Annotation (w/ 3D Occ.?, Classes, w/ Flow?)
KITTI [142] CVPR 2012 - C+L 22 15K Frames 15K 15k ✗ 21 ✓
SemanticKITTI [60] ICCV 2019 KITTI [142] C+L 22 20K Frames 43K 15k ✓ 28 ✗
nuScenes [86] CVPR 2019 - C+L+R 1,000 40K Frames 390K 1.4M ✗ 32 ✗
Waymo [143] CVPR 2020 - C+L 1,150 230K Frames 230K 12M ✗ 23 ✓
KITTI-360 [92] TPAMI 2022 - C+L 11 80K Frames 320K 80K ✗ 19 ✗
MonoScene-SemanticKITTI [26] CVPR 2022 SemanticKITTI [60], KITTI [142] C - 4.6K Clips - - ✓ 19 ✗
MonoScene-NYUv2 [26] CVPR 2022 NYUv2 [59] C+D - 1.4K Clips - - ✓ 10 ✗
SSCBench-KITTI-360 [37] arXiv 2023 KITTI-360 [92] C 9 - - - ✓ 19 ✗
SSCBench-nuScenes [37] arXiv 2023 nuScenes [86] C 850 - - - ✓ 16 ✗
SSCBench-Waymo [37] arXiv 2023 Waymo [143] C 1,000 - - - ✓ 14 ✗
OCFBench-Lyft [78] arXiv 2023 Lyft-Level-5 [144] L 180 22K Frames - - ✓ - ✗
OCFBench-Argoverse [78] arXiv 2023 Argoverse [145] L 89 13K Frames - - ✓ 17 ✗
OCFBench-ApolloScape [78] arXiv 2023 ApolloScape [146] L 52 4K Frames - - ✓ 25 ✗
OCFBench-nuScenes [78] arXiv 2023 nuScenes [86] L - - - - ✓ 16 ✗
SurroundOcc [76] ICCV 2023 nuScenes [86] C 1,000 - - - ✓ 16 ✗
OpenOccupancy [11] ICCV 2023 nuScenes [86] C+L - 34K Frames - - ✓ 16 ✗
OpenOcc [8] ICCV 2023 nuScenes [86] C 850 40K Frames - - ✓ 16 ✗
Occ3D-nuScenes [36] NeurIPS 2024 nuScenes [86] C 900 1K Clips, 40K Frames - - ✓ 16 ✗
Occ3D-Waymo [36] NeurIPS 2024 Waymo [143] C 1,000 1.1K Clips, 200K Frames - - ✓ 14 ✗
Cam4DOcc [10] CVPR 2024 nuScenes [86] + Lyft-Level5 [144] C+L 1,030 51K Frames - - ✓ 2 ✓
OpenScene [93] CVPR 2024 Challenge nuPlan [147] C - 4M Frames 40M - ✓ - ✓

The above datasets are the widely adopted benchmarks for 2D perception algorithms before the popularity of 3D occupancy perception. These datasets also serve as the meta datasets of benchmarks for 3D occupancy based perception algorithms.

3D Occupancy Datasets. The occupancy network proposed by Tesla has led the trend of 3D occupancy based perception for autonomous driving. However, the lack of a publicly available large dataset containing 3D occupancy annotations has hindered the development of 3D occupancy perception. To deal with this dilemma, many researchers have developed 3D occupancy datasets based on meta datasets like nuScenes and Waymo. MonoScene [26], which supports 3D occupancy annotations, is created from the SemanticKITTI (plus KITTI) and NYUv2 [59] datasets. SSCBench [37] is developed based on the KITTI-360, nuScenes, and Waymo datasets with camera inputs. OCFBench [78], built on the Lyft-Level-5 [144], Argoverse [145], ApolloScape [146], and nuScenes datasets, only contains LiDAR inputs. SurroundOcc [76], OpenOccupancy [11], and OpenOcc [8] are developed on the nuScenes dataset. Occ3D [36] contains more annotated frames with 3D occupancy labels (∼40K based on nuScenes and ∼200K based on Waymo). Cam4DOcc [10] and OpenScene [93] are two new datasets that contain large-scale 3D occupancy and 3D occupancy flow annotations. Cam4DOcc is based on the nuScenes plus Lyft-Level-5 datasets, while OpenScene, with ∼4M annotated frames, is built on the very large nuPlan dataset [147].

4.1.2. Metrics

(1) Voxel-level Metrics: Occupancy prediction without semantic consideration is regarded as class-agnostic perception. It focuses solely on understanding spatial geometry, that is, determining whether each voxel in a 3D space is occupied or empty. The common evaluation metric is the voxel-level Intersection-over-Union (IoU), expressed as:

IoU = TP / (TP + FP + FN),  (16)

where TP, FP, and FN represent the number of true positives, false positives, and false negatives. A true positive means that an actual occupied voxel is correctly predicted.

Occupancy prediction that simultaneously infers the occupation status and semantic classification of voxels can be regarded as semantic-geometric perception. In this context, the mean Intersection-over-Union (mIoU) is commonly used as the evaluation metric. The mIoU metric calculates the IoU for each semantic class separately and then averages these IoUs across all classes, excluding the 'empty' class:

mIoU = (1/N_C) Σ_{i=1}^{N_C} TP_i / (TP_i + FP_i + FN_i),  (17)

where TP_i, FP_i, and FN_i are the number of true positives, false positives, and false negatives for a specific semantic category i, and N_C denotes the total number of semantic categories.

(2) Ray-level Metric: Although the voxel-level IoU and mIoU metrics are widely recognized [10, 38, 53, 76, 80, 84, 85, 87], they still have limitations. Due to the unbalanced distribution and occlusion of LiDAR sensing, ground-truth voxel labels from accumulated LiDAR point clouds are imperfect, and the areas not scanned by LiDAR are annotated as empty. Moreover, for thin objects, voxel-level metrics are too strict, as a one-voxel deviation would reduce the IoU values of thin objects to zero. To solve these issues, SparseOcc [118] imitates LiDAR's ray casting and proposes a ray-level mIoU, which evaluates rays to their closest contact surface. This novel mIoU, combined with the mean absolute velocity error (mAVE), is adopted by the occupancy score (OccScore) metric [93]. OccScore overcomes the shortcomings of voxel-level metrics while also evaluating the performance in perceiving object motion in the scene (i.e., occupancy flow).

The formulation of the ray-level mIoU is consistent with Eq. 17 in form but differs in application: it evaluates each query ray rather than each voxel. A query ray is considered a true positive if (i) its predicted class label matches the ground-truth class and (ii) the L1 error between the predicted and ground-truth depths is below a given threshold. The mAVE measures the average velocity error for true positive rays among 8 semantic categories. The final OccScore is calculated as:

OccScore = mIoU × 0.9 + max(1 − mAVE, 0.0) × 0.1.  (18)
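As a concrete reference for how the numbers in the following tables are produced, the NumPy sketch below computes the voxel-level IoU and mIoU of Eqs. (16) and (17), the ray-level true-positive rule, and the OccScore of Eq. (18). It is a simplified illustration under our own assumptions; the array layouts, the depth threshold value, and the function names are not taken from the official benchmark code.

    import numpy as np

    def voxel_iou(pred_occ, gt_occ):
        # Eq. (16): class-agnostic IoU over boolean occupancy grids of identical shape.
        tp = np.logical_and(pred_occ, gt_occ).sum()
        fp = np.logical_and(pred_occ, np.logical_not(gt_occ)).sum()
        fn = np.logical_and(np.logical_not(pred_occ), gt_occ).sum()
        return tp / max(tp + fp + fn, 1)

    def voxel_miou(pred_sem, gt_sem, num_classes):
        # Eq. (17): unweighted mean of per-class IoUs over semantic voxel grids.
        # The 'empty' label is assumed to lie outside range(num_classes).
        ious = []
        for c in range(num_classes):
            tp = np.sum((pred_sem == c) & (gt_sem == c))
            fp = np.sum((pred_sem == c) & (gt_sem != c))
            fn = np.sum((pred_sem != c) & (gt_sem == c))
            ious.append(tp / max(tp + fp + fn, 1))  # absent classes contribute 0 here
        return float(np.mean(ious))

    def ray_true_positive(pred_cls, gt_cls, pred_depth, gt_depth, depth_thr=0.5):
        # Ray-level rule: class match and L1 depth error below a threshold (threshold value assumed).
        return (pred_cls == gt_cls) & (np.abs(pred_depth - gt_depth) < depth_thr)

    def occ_score(ray_miou, mave):
        # Eq. (18): OccScore = 0.9 * ray-level mIoU + 0.1 * max(1 - mAVE, 0).
        return ray_miou * 0.9 + max(1.0 - mave, 0.0) * 0.1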
Table 3: 3D occupancy prediction comparison (%) on the SemanticKITTI test set [60]. Mod.: Modality. C: Camera. L: LiDAR. The IoU evaluates the
performance in geometric occupancy perception, and the mIoU evaluates semantic occupancy perception.

(Per-class columns, left to right: road (15.30%), sidewalk (11.13%), parking (1.12%), other-grnd (0.56%), building (14.1%), car (3.92%), truck (0.16%), bicycle (0.03%), motorcycle (0.03%), other-veh. (0.20%), vegetation (39.3%), trunk (0.51%), terrain (9.17%), person (0.07%), bicyclist (0.07%), motorcyclist (0.05%), fence (3.90%), pole (0.29%), traf.-sign (0.08%); percentages are class frequencies.)
Method Mod. IoU mIoU
S3CNet [30] L 45.60 29.53 42.00 22.50 17.00 7.90 52.20 31.20 6.70 41.50 45.00 16.10 39.50 34.00 21.20 45.90 35.80 16.00 31.30 31.00 24.30
LMSCNet [28] L 56.72 17.62 64.80 34.68 29.02 4.62 38.08 30.89 1.47 0.00 0.00 0.81 41.31 19.89 32.05 0.00 0.00 0.00 21.32 15.01 0.84
JS3C-Net [29] L 56.60 23.75 64.70 39.90 34.90 14.10 39.40 33.30 7.20 14.40 8.80 12.70 43.10 19.60 40.50 8.00 5.10 0.40 30.40 18.90 15.90
DIFs [75] L 58.90 23.56 69.60 44.50 41.80 12.70 41.30 35.40 4.70 3.60 2.70 4.70 43.80 27.40 40.90 2.40 1.00 0.00 30.50 22.10 18.50
OpenOccupancy [11] C+L - 20.42 60.60 36.10 29.00 13.00 38.40 33.80 4.70 3.00 2.20 5.90 41.50 20.50 35.10 0.80 2.30 0.60 26.00 18.70 15.70
Co-Occ [103] C+L - 24.44 72.00 43.50 42.50 10.20 35.10 40.00 6.40 4.40 3.30 8.80 41.20 30.80 40.80 1.60 3.30 0.40 32.70 26.60 20.70
MonoScene [26] C 34.16 11.08 54.70 27.10 24.80 5.70 14.40 18.80 3.30 0.50 0.70 4.40 14.90 2.40 19.50 1.00 1.40 0.40 11.10 3.30 2.10
TPVFormer [32] C 34.25 11.26 55.10 27.20 27.40 6.50 14.80 19.20 3.70 1.00 0.50 2.30 13.90 2.60 20.40 1.10 2.40 0.30 11.00 2.90 1.50
OccFormer [55] C 34.53 12.32 55.90 30.30 31.50 6.50 15.70 21.60 1.20 1.50 1.70 3.20 16.80 3.90 21.30 2.20 1.10 0.20 11.90 3.80 3.70
SurroundOcc [76] C 34.72 11.86 56.90 28.30 30.20 6.80 15.20 20.60 1.40 1.60 1.20 4.40 14.90 3.40 19.30 1.40 2.00 0.10 11.30 3.90 2.40
NDC-Scene [77] C 36.19 12.58 58.12 28.05 25.31 6.53 14.90 19.13 4.77 1.93 2.07 6.69 17.94 3.49 25.01 3.44 2.77 1.64 12.85 4.43 2.96
RenderOcc [83] C - 8.24 43.64 19.10 12.54 0.00 11.59 14.83 2.47 0.42 0.17 1.78 17.61 1.48 20.01 0.94 3.20 0.00 4.71 1.17 0.88
Symphonies [88] C 42.19 15.04 58.40 29.30 26.90 11.70 24.70 23.60 3.20 3.60 2.60 5.60 24.20 10.00 23.10 3.20 1.90 2.00 16.10 7.70 8.00
Scribble2Scene [101] C 42.60 13.33 50.30 27.30 20.60 11.30 23.70 20.10 5.60 2.70 1.60 4.50 23.50 9.60 23.80 1.60 1.80 0.00 13.30 5.60 6.50
HASSC [89] C 42.87 14.38 55.30 29.60 25.90 11.30 23.10 23.00 9.80 1.90 1.50 4.90 24.80 9.80 26.50 1.40 3.00 0.00 14.30 7.00 7.10
BRGScene [114] C 43.34 15.36 61.90 31.20 30.70 10.70 24.20 22.80 8.40 3.40 2.40 6.10 23.80 8.40 27.00 2.90 2.20 0.50 16.50 7.00 7.20
VoxFormer [33] C 44.15 13.35 53.57 26.52 19.69 0.42 19.54 26.54 7.26 1.28 0.56 7.81 26.10 6.10 33.06 1.93 1.97 0.00 7.31 9.15 4.94
MonoOcc [84] C - 15.63 59.10 30.90 27.10 9.80 22.90 23.90 7.20 4.50 2.40 7.70 25.00 9.80 26.10 2.80 4.70 0.60 16.90 7.30 8.40
HTCL [98] C 44.23 17.09 64.40 34.80 33.80 12.40 25.90 27.30 10.80 1.80 2.20 5.40 25.30 10.80 31.20 1.10 3.10 0.90 21.10 9.00 8.30
Bi-SSC [94] C 45.10 16.73 63.40 33.30 31.70 11.20 26.60 25.00 6.80 1.80 1.00 6.80 26.10 10.50 28.90 1.70 3.30 1.00 19.40 9.30 8.40

4.2. Performance

In this subsection, we compare and analyze the performance accuracy and inference speed of various 3D occupancy perception methods. For performance accuracy, we discuss three aspects: overall comparison, modality comparison, and supervision comparison. The evaluation datasets used include SemanticKITTI, Occ3D-nuScenes, and SSCBench-KITTI-360.

4.2.1. Perception Accuracy

SemanticKITTI [60] is the first dataset with 3D occupancy labels for outdoor driving scenes. Occ3D-nuScenes [36] is the dataset used in the CVPR 2023 3D Occupancy Prediction Challenge [157]. These two datasets are currently the most popular. Therefore, we summarize the performance of various 3D occupancy methods that are trained and tested on these datasets, as reported in Tab. 3 and 4. Additionally, we evaluate the performance of 3D occupancy methods on the SSCBench-KITTI-360 dataset, as reported in Tab. 5. These tables classify occupancy methods according to input modalities and supervised learning types, respectively. The best performances are highlighted in bold. Tab. 3 and 5 utilize the IoU and mIoU metrics to evaluate 3D geometric and 3D semantic occupancy perception capabilities. Tab. 4 adopts mIoU and mIoU∗ to assess 3D semantic occupancy perception. Unlike mIoU, the mIoU∗ metric excludes the 'others' and 'other flat' classes and is used by the self-supervised OccNeRF [38]. For fairness, we compare the mIoU∗ of OccNeRF with other self-supervised occupancy methods. Notably, the OccScore metric is used in the CVPR 2024 Autonomous Grand Challenge [158], but it has yet to become widely adopted; thus, we do not summarize occupancy performance with this metric. Below, we compare perception accuracy from three aspects: overall comparison, modality comparison, and supervision comparison.

(1) Overall Comparison. Tab. 3 and 5 show that (i) the IoU scores of occupancy networks are less than 60%, while the mIoU scores are less than 30%. The IoU scores (indicating geometric perception, i.e., ignoring semantics) substantially surpass the mIoU scores. This is because predicting occupancy for some semantic categories is challenging, such as bicycles, motorcycles, persons, bicyclists, motorcyclists, poles, and traffic signs. Each of these classes has a small proportion (under 0.3%) in the dataset, and their small physical sizes make them difficult to observe and detect. Therefore, if the IoU scores of these categories are low, they significantly drag down the overall mIoU value, because the mIoU calculation, which does not account for category frequency, divides the sum of the per-class IoU scores by the number of categories. (ii) A higher IoU does not guarantee a higher mIoU. One possible explanation is that the semantic perception capacity (reflected in mIoU) and the geometric perception capacity (reflected in IoU) of an occupancy network are distinct and not positively correlated.
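To illustrate the frequency-insensitivity of mIoU noted above, the toy computation below (all numbers are invented for illustration and are not taken from Tabs. 3-5) contrasts the unweighted mean of Eq. (17) with a hypothetical frequency-weighted average; a handful of poorly predicted rare classes is enough to pull the unweighted mIoU far below the frequency-weighted score.

    import numpy as np

    # 12 common classes predicted reasonably well, 7 rare small-object classes predicted poorly.
    per_class_iou = np.array([70.0] * 12 + [2.0] * 7)
    class_freq = np.array([0.08] * 12 + [0.001] * 7)   # invented class frequencies
    class_freq = class_freq / class_freq.sum()

    unweighted_miou = per_class_iou.mean()                              # what Eq. (17) reports: ~44.9
    frequency_weighted_iou = float((per_class_iou * class_freq).sum())  # ~69.5

    print(f"unweighted mIoU: {unweighted_miou:.1f}  frequency-weighted IoU: {frequency_weighted_iou:.1f}")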
Table 4: 3D semantic occupancy prediction comparison (%) on the validation set of Occ3D-nuScenes [36]. Sup. represents the supervised learning type.
mIoU∗ is the mean Intersection-over-Union excluding the ’others’ and ’other flat’ classes. For fairness, all compared methods are vision-centric.

(Per-class columns, left to right: others, barrier, bicycle, bus, car, const. veh., motorcycle, pedestrian, traffic cone, trailer, truck, drive. suf., other flat, sidewalk, terrain, manmade, vegetation.)
Method Sup. mIoU mIoU∗
SelfOcc (BEV) [87] Self 6.76 7.66 0.00 0.00 0.00 0.00 9.82 0.00 0.00 0.00 0.00 0.00 6.97 47.03 0.00 18.75 16.58 11.93 3.81
SelfOcc (TPV) [87] Self 7.97 9.03 0.00 0.00 0.00 0.00 10.03 0.00 0.00 0.00 0.00 0.00 7.11 52.96 0.00 23.59 25.16 11.97 4.61
SimpleOcc [31] Self - 7.99 - 0.67 1.18 3.21 7.63 1.02 0.26 1.80 0.26 1.07 2.81 40.44 - 18.30 17.01 13.42 10.84
OccNeRF [38] Self - 10.81 - 0.83 0.82 5.13 12.49 3.50 0.23 3.10 1.84 0.52 3.90 52.62 - 20.81 24.75 18.45 13.19
RenderOcc [83] Weak 23.93 - 5.69 27.56 14.36 19.91 20.56 11.96 12.42 12.14 14.34 20.81 18.94 68.85 33.35 42.01 43.94 17.36 22.61
Vampire [81] Weak 28.33 - 7.48 32.64 16.15 36.73 41.44 16.59 20.64 16.55 15.09 21.02 28.47 67.96 33.73 41.61 40.76 24.53 20.26
OccFormer [55] Strong 21.93 - 5.94 30.29 12.32 34.40 39.17 14.44 16.45 17.22 9.27 13.90 26.36 50.99 30.96 34.66 22.73 6.76 6.97
TPVFormer [32] Strong 27.83 - 7.22 38.90 13.67 40.78 45.90 17.23 19.99 18.85 14.30 26.69 34.17 55.65 35.47 37.55 30.70 19.40 16.78
Occ3D [36] Strong 28.53 - 8.09 39.33 20.56 38.29 42.24 16.93 24.52 22.72 21.05 22.98 31.11 53.33 33.84 37.98 33.23 20.79 18.00
SurroundOcc [76] Strong 38.69 - 9.42 43.61 19.57 47.66 53.77 21.26 22.35 24.48 19.36 32.96 39.06 83.15 43.26 52.35 55.35 43.27 38.02
FastOcc [82] Strong 40.75 - 12.86 46.58 29.93 46.07 54.09 23.74 31.10 30.68 28.52 33.08 39.69 83.33 44.65 53.90 55.46 42.61 36.50
FB-OCC [153] Strong 42.06 - 14.30 49.71 30.00 46.62 51.54 29.30 29.13 29.35 30.48 34.97 39.36 83.07 47.16 55.62 59.88 44.89 39.58
PanoOcc [4] Strong 42.13 - 11.67 50.48 29.64 49.44 55.52 23.29 33.26 30.55 30.99 34.43 42.57 83.31 44.23 54.40 56.04 45.94 40.40
COTR [85] Strong 46.21 - 14.85 53.25 35.19 50.83 57.25 35.36 34.06 33.54 37.14 38.99 44.97 84.46 48.73 57.60 61.08 51.61 46.72

From Tab. 4, it is evident that (i) the mIoU scores of occupancy networks are within 50%, higher than the scores on SemanticKITTI and SSCBench-KITTI-360. For example, the mIoUs of TPVFormer [32] on SemanticKITTI and SSCBench-KITTI-360 are 11.26% and 13.64%, but it reaches 27.83% on Occ3D-nuScenes. OccFormer [55] and SurroundOcc [76] show similar behavior. We consider this might be due to the simpler task setting in Occ3D-nuScenes. On the one hand, Occ3D-nuScenes uses surrounding-view images as input, containing richer scene information compared to SemanticKITTI and SSCBench-KITTI-360, which only utilize monocular or binocular images. On the other hand, Occ3D-nuScenes only calculates mIoU for visible 3D voxels, whereas the other two datasets evaluate both visible and occluded areas, posing greater challenges. (ii) COTR [85] has the best mIoU (46.21%) and also achieves the highest scores in IoU across all categories on Occ3D-nuScenes.

(2) Modality Comparison. The input data modality significantly influences 3D occupancy perception accuracy. Tab. 3 and 5 report the performance of occupancy perception in different modalities. It can be seen that, due to the accurate depth information provided by LiDAR sensing, LiDAR-centric occupancy methods achieve more precise perception with higher IoU and mIoU scores. For example, on the SemanticKITTI dataset, S3CNet [30] has the top mIoU (29.53%) and DIFs [75] achieves the highest IoU (58.90%); on the SSCBench-KITTI-360 dataset, S3CNet achieves the best IoU (53.58%). However, we observe that the multi-modal approaches (e.g., OpenOccupancy [11] and Co-Occ [103]) do not outperform single-modal (i.e., LiDAR-centric or vision-centric) methods, indicating that they have not fully leveraged the benefits of multi-modal fusion and the richness of input data. Therefore, there is considerable potential for further improvement in multi-modal occupancy perception. Moreover, vision-centric occupancy perception has advanced rapidly in recent years. On the SemanticKITTI dataset, the state-of-the-art vision-centric occupancy methods still lag behind LiDAR-centric methods in terms of IoU and mIoU. But notably, the mIoU of the vision-centric CGFormer [156] has surpassed that of LiDAR-centric methods on the SSCBench-KITTI-360 dataset.

(3) Supervision Comparison. The 'Sup.' column of Tab. 4 outlines the supervised learning types used for training occupancy networks. Training with strong supervision, which directly employs 3D occupancy labels, is the most prevalent type. Tab. 4 shows that occupancy networks based on strongly-supervised learning achieve impressive performance. The mIoU scores of FastOcc [82], FB-Occ [153], PanoOcc [4], and COTR [85] are significantly higher (12.42%-38.24% increased mIoU) than those of weakly-supervised or self-supervised methods. This is because the occupancy labels provided by the dataset are carefully annotated with high accuracy and can impose strong constraints on network training. However, annotating these dense occupancy labels is time-consuming and laborious. It is therefore necessary to explore network training based on weak or self supervision to reduce reliance on occupancy labels. Vampire [81] is the best-performing method based on weakly-supervised learning, achieving an mIoU score of 28.33%. It demonstrates that semantic LiDAR point clouds can supervise the training of 3D occupancy networks. However, the collection and annotation of semantic LiDAR point clouds are expensive. SelfOcc [87] and OccNeRF [38] are two representative occupancy works based on self-supervised learning. They utilize volume rendering and photometric consistency to acquire self-supervised signals, proving that a network can learn 3D occupancy perception without any labels. However, their performance remains limited, with SelfOcc achieving an mIoU of 7.97% and OccNeRF an mIoU∗ of 10.81%.

4.2.2. Inference Speed

Recent studies on 3D occupancy perception [82, 118] have begun to consider not only perception accuracy but also inference speed. Based on the data provided by FastOcc [82] and SparseOcc [118], we summarize the inference speeds of 3D occupancy methods, and also report their running platforms, input image sizes, backbone architectures, and occupancy accuracy on the Occ3D-nuScenes dataset, as depicted in Tab. 6.
Table 5: 3D occupancy benchmarking results (%) on the SSCBench-KITTI-360 test set. The best results are in bold. OccFiner (Mono.) indicates that OccFiner
refines the predicted occupancy from MonoScene.

(Per-class columns, left to right: car (2.85%), bicycle (0.01%), motorcycle (0.01%), truck (0.16%), other-veh. (5.75%), person (0.02%), road (14.98%), parking (2.31%), sidewalk (6.43%), other-grnd. (2.05%), building (15.67%), fence (0.96%), vegetation (41.99%), terrain (7.10%), pole (0.22%), traf.-sign (0.06%), other-struct. (4.33%), other-obj. (0.28%); percentages are class frequencies.)
Method IoU mIoU

LiDAR-Centric Methods
SSCNet [25] 53.58 16.95 31.95 0.00 0.17 10.29 0.00 0.07 65.70 17.33 41.24 3.22 44.41 6.77 43.72 28.87 0.78 0.75 8.69 0.67
LMSCNet [28] 47.35 13.65 20.91 0.00 0.00 0.26 0.58 0.00 62.95 13.51 33.51 0.20 43.67 0.33 40.01 26.80 0.00 0.00 3.63 0.00
Vision-Centric Methods
GaussianFormer [154] 35.38 12.92 18.93 1.02 4.62 18.07 7.59 3.35 45.47 10.89 25.03 5.32 28.44 5.68 29.54 8.62 2.99 2.32 9.51 5.14
MonoScene [26] 37.87 12.31 19.34 0.43 0.58 8.02 2.03 0.86 48.35 11.38 28.13 3.32 32.89 3.53 26.15 16.75 6.92 5.67 4.20 3.09
OccFiner (Mono.) [155] 38.51 13.29 20.78 1.08 1.03 9.04 3.58 1.46 53.47 12.55 31.27 4.13 33.75 4.62 26.83 18.67 5.04 4.58 4.05 3.32
VoxFormer [33] 38.76 11.91 17.84 1.16 0.89 4.56 2.06 1.63 47.01 9.67 27.21 2.89 31.18 4.97 28.99 14.69 6.51 6.92 3.79 2.43
TPVFormer [32] 40.22 13.64 21.56 1.09 1.37 8.06 2.57 2.38 52.99 11.99 31.07 3.78 34.83 4.80 30.08 17.52 7.46 5.86 5.48 2.70
OccFormer [55] 40.27 13.81 22.58 0.66 0.26 9.89 3.82 2.77 54.30 13.44 31.53 3.55 36.42 4.80 31.00 19.51 7.77 8.51 6.95 4.60
Symphonies [88] 44.12 18.58 30.02 1.85 5.90 25.07 12.06 8.20 54.94 13.83 32.76 6.93 35.11 8.58 38.33 11.52 14.01 9.57 14.44 11.28
CGFormer [156] 48.07 20.05 29.85 3.42 3.96 17.59 6.79 6.63 63.85 17.15 40.72 5.53 42.73 8.22 38.80 24.94 16.24 17.45 10.18 6.77

Table 6: Inference speed analysis of 3D occupancy perception on the Occ3D-nuScenes [36] dataset. † indicates data from SparseOcc [118]. ‡ means data from FastOcc [82]. R-50 represents ResNet50 [39]. TRT denotes acceleration using the TensorRT SDK [159].

Method GPU Input Size Backbone mIoU(%) FPS(Hz)
BEVDet† [43] A100 704×256 R-50 36.10 2.6
BEVFormer† [44] A100 1600×900 R-101 39.30 3.0
FB-Occ† [153] A100 704×256 R-50 10.30 10.3
SparseOcc [118] A100 704×256 R-50 30.90 12.5
SurroundOcc‡ [76] V100 1600×640 R-101 37.18 2.8
FastOcc [82] V100 1600×640 R-101 40.75 4.5
FastOcc(TRT) [82] V100 1600×640 R-101 40.75 12.8

A practical occupancy method should have high accuracy (mIoU) and fast inference speed (FPS). From Tab. 6, FastOcc achieves a high mIoU (40.75%), comparable to the mIoU of BEVFormer. Notably, FastOcc attains a higher FPS on a lower-performance GPU platform than BEVFormer. Furthermore, after being accelerated by TensorRT [159], the inference speed of FastOcc reaches 12.8 Hz.

5. Challenges and Opportunities

In this section, we explore the challenges and opportunities of 3D occupancy perception in autonomous driving. Occupancy, as a geometric and semantic representation of the 3D world, can facilitate various autonomous driving tasks. We discuss both existing and prospective applications of 3D occupancy, demonstrating its potential in the field of autonomous driving. Furthermore, we discuss the deployment efficiency of occupancy perception on edge devices, the necessity for robustness in complex real-world driving environments, and the path toward generalized 3D occupancy perception.

5.1. Occupancy-based Applications in Autonomous Driving

3D occupancy perception enables a comprehensive understanding of the 3D world and supports various tasks in autonomous driving. Existing occupancy-based applications include segmentation, detection, dynamic perception, world models, and autonomous driving algorithm frameworks. (1) Segmentation: Semantic occupancy perception can essentially be regarded as a 3D semantic segmentation task. (2) Detection: OccupancyM3D [5] and SOGDet [6] are two occupancy-based works that implement 3D object detection. OccupancyM3D first learns occupancy to enhance 3D features, which are then used for 3D detection. SOGDet develops two concurrent tasks, semantic occupancy prediction and 3D object detection, training them simultaneously for mutual enhancement. (3) Dynamic perception: Its goal is to capture dynamic objects and their motion in the surrounding environment, in the form of predicting occupancy flows for dynamic objects. Strongly-supervised Cam4DOcc [10] and self-supervised LOF [160] have demonstrated potential in occupancy flow prediction. (4) World model: It simulates and forecasts the future state of the surrounding environment by observing current and historical data [161]. Pioneering works, according to their input observation data, can be divided into semantic occupancy sequence-based world models (e.g., OccWorld [162] and OccSora [163]), point cloud sequence-based world models (e.g., SCSF [108], UnO [164], PCF [165]), and multi-camera image sequence-based world models (e.g., DriveWorld [7] and Cam4DOcc [10]). However, these works still perform poorly in high-quality long-term forecasting. (5) Autonomous driving algorithm framework: It integrates different sensor inputs into a unified occupancy representation, then applies the occupancy representation to a wide span of driving tasks, such as 3D object detection, online mapping, multi-object tracking, motion prediction, occupancy prediction, and motion planning. Related works include OccNet [8], DriveWorld [7], and UniScene [61].
However, existing occupancy-based applications primarily focus on the perception level, and less on the decision-making level. Given that 3D occupancy is more consistent with the 3D physical world than other perception manners (e.g., bird's-eye view perception and perspective-view perception), we believe that 3D occupancy holds opportunities for broader applications in autonomous driving. At the perception level, it could improve the accuracy of existing place recognition [166, 167], pedestrian detection [168, 169], accident prediction [170], and lane line segmentation [171]. At the decision-making level, it could help safer driving decisions [172] and navigation [173, 174], and provide 3D explainability for driving behaviors.

5.2. Deployment Efficiency

For complex 3D scenes, large amounts of point cloud data or multi-view visual information always need to be processed and analyzed to extract and update occupancy state information. To achieve real-time performance in autonomous driving applications, solutions commonly need to complete their computation within a limited time budget and require efficient data structures and algorithm designs. In general, deploying deep learning algorithms on target edge devices is not an easy task.

Currently, some real-time and deployment-friendly efforts on occupancy tasks have been attempted. For instance, Hou et al. [82] proposed a solution, FastOcc, to accelerate prediction inference speed by adjusting the input resolution, view transformation module, and prediction head. Zhang et al. [175] further lightweighted FlashOcc by decomposing its occupancy network and binarizing it with binarized convolutions. Liu et al. [118] proposed SparseOcc, a sparse occupancy network without any dense 3D features, to minimize computational costs using sparse convolution layers and mask-guided sparse sampling. Tang et al. [90] proposed to adopt sparse latent representations and sparse interpolation operations to avoid information loss and reduce computational complexity. Additionally, Huang et al. recently proposed GaussianFormer [154], which utilizes a series of 3D Gaussians to represent sparse regions of interest in space. GaussianFormer optimizes the geometric and semantic properties of the 3D Gaussians, corresponding to the semantic occupancy of the regions of interest. It achieves accuracy comparable to state-of-the-art methods while using only 17.8%-24.8% of their memory consumption. However, the above-mentioned approaches are still some way from practical deployment in autonomous driving systems. A deployment-efficient occupancy method requires superiority in real-time processing, lightweight design, and accuracy simultaneously.

5.3. Robust 3D Occupancy Perception

In dynamic and unpredictable real-world driving environments, perception robustness is crucial to autonomous vehicle safety. State-of-the-art 3D occupancy models may be vulnerable to out-of-distribution scenes and data, such as changes in lighting and weather, which would introduce visual biases, and input image blurring caused by vehicle movement. Moreover, sensor malfunctions (e.g., loss of frames and camera views) are common [176]. In light of these challenges, studying robust 3D occupancy perception is valuable.

However, research on robust 3D occupancy is limited, primarily due to the scarcity of datasets. Recently, the ICRA 2024 RoboDrive Challenge [177] has provided imperfect scenarios for studying robust 3D occupancy perception.

In terms of network architecture and scene representation, we consider that related works on robust BEV perception [47, 48, 178, 179, 180, 181] could inspire the development of robust occupancy perception. M-BEV [179] proposes a masked view reconstruction module to enhance robustness under various missing-camera cases. GKT [180] employs coarse projection to achieve robust BEV representation. In terms of sensor modality, radar can penetrate small particles such as raindrops, fog, and snowflakes in adverse weather conditions, thus providing reliable detection capability. Radar-centric RadarOcc [182] achieves robust occupancy perception with imaging radar, which not only inherits the robustness of mmWave radar in all lighting and weather conditions, but also has higher vertical resolution than mmWave radar. RadarOcc has demonstrated more accurate 3D occupancy prediction than LiDAR-centric and vision-centric methods in adverse weather. Besides, in most damage scenarios involving natural factors, multi-modal models [47, 48, 181] usually outperform single-modal models, benefiting from the complementary nature of multi-modal inputs. In terms of training strategies, Robo3D [97] distills knowledge from a teacher model with complete point clouds to a student model with imperfect input, enhancing the student model's robustness. Therefore, based on these works, approaches to robust 3D occupancy perception could include, but are not limited to, robust scene representation, multiple modalities, network design, and learning strategies.

5.4. Generalized 3D Occupancy Perception

Although more accurate 3D labels mean higher occupancy prediction performance [183], 3D labels are costly and large-scale 3D annotation of the real world is impractical. The generalization capabilities of existing networks trained on limited 3D-labeled datasets have not been extensively studied. To get rid of the dependence on 3D labels, self-supervised learning represents a potential pathway toward generalized 3D occupancy perception. It learns occupancy perception from a broad range of unlabelled images. However, the performance of current self-supervised occupancy perception [31, 38, 87, 91] is poor. On the Occ3D-nuScenes dataset (see Tab. 4), the top accuracy of self-supervised methods is inferior to that of strongly-supervised methods by a large margin. Moreover, current self-supervised methods require training and evaluation with more data. Thus, enhancing self-supervised generalized 3D occupancy perception is an important future research direction.

Furthermore, current 3D occupancy perception can only recognize a set of predefined object categories, which limits its generalizability and practicality.
Recent advances in large language models (LLMs) [184, 185, 186, 187] and large visual-language models (LVLMs) [188, 189, 190, 191, 192] demonstrate a promising ability for reasoning and visual understanding. Integrating these pre-trained large models has been proven to enhance generalization for perception [9]. POP-3D [9] leverages a powerful pre-trained visual-language model [192] to train its network and achieves open-vocabulary 3D occupancy perception. Therefore, we consider that employing LLMs and LVLMs is a challenge and opportunity for achieving generalized 3D occupancy perception.

6. Conclusion

This paper provided a comprehensive survey of 3D occupancy perception in autonomous driving in recent years. We reviewed and discussed in detail the state-of-the-art LiDAR-centric, vision-centric, and multi-modal perception solutions and highlighted information fusion techniques for this field. To facilitate further research, detailed performance comparisons of existing occupancy methods are provided. Finally, we described some open challenges that could inspire future research directions in the coming years. We hope that this survey can benefit the community, support further development in autonomous driving, and help inexpert readers navigate the field.

Acknowledgment

The research work was conducted in the JC STEM Lab of Machine Learning and Computer Vision funded by The Hong Kong Jockey Club Charities Trust and was partially supported by the Research Grants Council of the Hong Kong SAR, China (Project No. PolyU 15215824).

References

[1] H. Li, C. Sima, J. Dai, W. Wang, L. Lu, H. Wang, J. Zeng, Z. Li, J. Yang, H. Deng, et al., Delving into the devils of bird's-eye-view perception: A review, evaluation and recipe, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[2] Y. Ma, T. Wang, X. Bai, H. Yang, Y. Hou, Y. Wang, Y. Qiao, R. Yang, D. Manocha, X. Zhu, Vision-centric bev perception: A survey, arXiv preprint arXiv:2208.02797 (2022).
[3] Occupancy networks, accessed July 25, 2024. URL https://www.thinkautonomous.ai/blog/occupancy-networks/
[4] Y. Wang, Y. Chen, X. Liao, L. Fan, Z. Zhang, Panoocc: Unified occupancy representation for camera-based 3d panoptic segmentation, arXiv preprint arXiv:2306.10013 (2023).
[5] L. Peng, J. Xu, H. Cheng, Z. Yang, X. Wu, W. Qian, W. Wang, B. Wu, D. Cai, Learning occupancy for monocular 3d object detection, arXiv preprint arXiv:2305.15694 (2023).
[6] Q. Zhou, J. Cao, H. Leng, Y. Yin, Y. Kun, R. Zimmermann, Sogdet: Semantic-occupancy guided multi-view 3d object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 7668–7676.
[7] C. Min, D. Zhao, L. Xiao, J. Zhao, X. Xu, Z. Zhu, L. Jin, J. Li, Y. Guo, J. Xing, et al., Driveworld: 4d pre-trained scene understanding via world models for autonomous driving, arXiv preprint arXiv:2405.04390 (2024).
[8] W. Tong, C. Sima, T. Wang, L. Chen, S. Wu, H. Deng, Y. Gu, L. Lu, P. Luo, D. Lin, et al., Scene as occupancy, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8406–8415.
[9] A. Vobecky, O. Siméoni, D. Hurych, S. Gidaris, A. Bursuc, P. Pérez, J. Sivic, Pop-3d: Open-vocabulary 3d occupancy prediction from images, Advances in Neural Information Processing Systems 36 (2024).
[10] J. Ma, X. Chen, J. Huang, J. Xu, Z. Luo, J. Xu, W. Gu, R. Ai, H. Wang, Cam4docc: Benchmark for camera-only 4d occupancy forecasting in autonomous driving applications, arXiv preprint arXiv:2311.17663 (2023).
[11] X. Wang, Z. Zhu, W. Xu, Y. Zhang, Y. Wei, X. Chi, Y. Ye, D. Du, J. Lu, X. Wang, Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17850–17859.
[12] Z. Ming, J. S. Berrio, M. Shan, S. Worrall, Occfusion: A straightforward and effective multi-sensor fusion framework for 3d occupancy prediction, arXiv preprint arXiv:2403.01644 (2024).
[13] R. Song, C. Liang, H. Cao, Z. Yan, W. Zimmer, M. Gross, A. Festag, A. Knoll, Collaborative semantic occupancy prediction with hybrid feature fusion in connected automated vehicles, arXiv preprint arXiv:2402.07635 (2024).
[14] P. Wolters, J. Gilg, T. Teepe, F. Herzog, A. Laouichi, M. Hofmann, G. Rigoll, Unleashing hydra: Hybrid fusion, depth consistency and radar for unified 3d perception, arXiv preprint arXiv:2403.07746 (2024).
[15] S. Sze, L. Kunze, Real-time 3d semantic occupancy prediction for autonomous vehicles using memory-efficient sparse convolution, arXiv preprint arXiv:2403.08748 (2024).
[16] Y. Xie, J. Tian, X. X. Zhu, Linking points with labels in 3d: A review of point cloud semantic segmentation, IEEE Geoscience and Remote Sensing Magazine 8 (4) (2020) 38–59.
[17] J. Zhang, X. Zhao, Z. Chen, Z. Lu, A review of deep learning-based semantic segmentation for point cloud, IEEE Access 7 (2019) 179118–179133.
[18] X. Ma, W. Ouyang, A. Simonelli, E. Ricci, 3d object detection from images for autonomous driving: a survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[19] J. Mao, S. Shi, X. Wang, H. Li, 3d object detection for autonomous driving: A comprehensive survey, International Journal of Computer Vision 131 (8) (2023) 1909–1963.
[20] L. Wang, X. Zhang, Z. Song, J. Bi, G. Zhang, H. Wei, L. Tang, L. Yang, J. Li, C. Jia, et al., Multi-modal 3d object detection in autonomous driving: A survey and taxonomy, IEEE Transactions on Intelligent Vehicles (2023).
[21] D. Fernandes, A. Silva, R. Névoa, C. Simões, D. Gonzalez, M. Guevara, P. Novais, J. Monteiro, P. Melo-Pinto, Point-cloud based 3d object detection and classification methods for self-driving applications: A survey and taxonomy, Information Fusion 68 (2021) 161–191.
[22] L. Roldao, R. De Charette, A. Verroust-Blondet, 3d semantic scene completion: A survey, International Journal of Computer Vision 130 (8) (2022) 1978–2005.
[23] Y. Zhang, J. Zhang, Z. Wang, J. Xu, D. Huang, Vision-based 3d occupancy prediction in autonomous driving: a review and outlook, arXiv preprint arXiv:2405.02595 (2024).
[24] S. Thrun, Probabilistic robotics, Communications of the ACM 45 (3) (2002) 52–57.
[25] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, T. Funkhouser, Semantic scene completion from a single depth image, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1746–1754.
[26] A.-Q. Cao, R. De Charette, Monoscene: Monocular 3d semantic scene completion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3991–4001.
[27] Workshop on autonomous driving at cvpr 2022, accessed July 25, 2024. URL https://cvpr2022.wad.vision/
[28] L. Roldao, R. de Charette, A. Verroust-Blondet, Lmscnet: Lightweight multiscale 3d semantic completion, in: 2020 International Conference on 3D Vision (3DV), IEEE, 2020, pp. 111–119.
[29] X. Yan, J. Gao, J. Li, R. Zhang, Z. Li, R. Huang, S. Cui, Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 3101–3109.
[30] R. Cheng, C. Agia, Y. Ren, X. Li, L. Bingbing, S3cnet: A sparse semantic scene completion network for lidar point clouds, in: Conference on Robot Learning, PMLR, 2021, pp. 2148–2161.
[31] W. Gan, N. Mo, H. Xu, N. Yokoya, A simple attempt for 3d occupancy
estimation in autonomous driving, arXiv preprint arXiv:2303.10076 [52] J. Yao, J. Zhang, Depthssc: Depth-spatial alignment and dynamic voxel
(2023). resolution for monocular 3d semantic scene completion, arXiv preprint
[32] Y. Huang, W. Zheng, Y. Zhang, J. Zhou, J. Lu, Tri-perspective view arXiv:2311.17084 (2023).
for vision-based 3d semantic occupancy prediction, in: Proceedings of [53] R. Miao, W. Liu, M. Chen, Z. Gong, W. Xu, C. Hu, S. Zhou, Occdepth:
the IEEE/CVF conference on computer vision and pattern recognition, A depth-aware method for 3d semantic scene completion, arXiv preprint
2023, pp. 9223–9232. arXiv:2302.13540 (2023).
[33] Y. Li, Z. Yu, C. Choy, C. Xiao, J. M. Alvarez, S. Fidler, C. Feng, [54] A. N. Ganesh, Soccdpt: Semi-supervised 3d semantic occupancy from
A. Anandkumar, Voxformer: Sparse voxel transformer for camera-based dense prediction transformers trained under memory constraints, arXiv
3d semantic scene completion, in: Proceedings of the IEEE/CVF confer- preprint arXiv:2311.11371 (2023).
ence on computer vision and pattern recognition, 2023, pp. 9087–9098. [55] Y. Zhang, Z. Zhu, D. Du, Occformer: Dual-path transformer for
[34] B. Li, Y. Sun, J. Dong, Z. Zhu, J. Liu, X. Jin, W. Zeng, One at a time: vision-based 3d semantic occupancy prediction, in: Proceedings of the
Progressive multi-step volumetric probability learning for reliable 3d IEEE/CVF International Conference on Computer Vision, 2023, pp.
scene perception, in: Proceedings of the AAAI Conference on Artifi- 9433–9443.
cial Intelligence, Vol. 38, 2024, pp. 3028–3036. [56] S. Silva, S. B. Wannigama, R. Ragel, G. Jayatilaka, S2tpvformer:
[35] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, Spatio-temporal tri-perspective view for temporally coherent 3d seman-
W. Wang, et al., Planning-oriented autonomous driving, in: Proceedings tic occupancy prediction, arXiv preprint arXiv:2401.13785 (2024).
of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- [57] M. Firman, O. Mac Aodha, S. Julier, G. J. Brostow, Structured prediction
tion, 2023, pp. 17853–17862. of unobserved voxels from a single depth image, in: Proceedings of the
[36] X. Tian, T. Jiang, L. Yun, Y. Mao, H. Yang, Y. Wang, Y. Wang, IEEE Conference on Computer Vision and Pattern Recognition, 2016,
H. Zhao, Occ3d: A large-scale 3d occupancy prediction benchmark for pp. 5431–5440.
autonomous driving, Advances in Neural Information Processing Sys- [58] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva,
tems 36 (2024). S. Song, A. Zeng, Y. Zhang, Matterport3d: Learning from rgb-d data in
[37] Y. Li, S. Li, X. Liu, M. Gong, K. Li, N. Chen, Z. Wang, Z. Li, T. Jiang, indoor environments, arXiv preprint arXiv:1709.06158 (2017).
F. Yu, et al., Sscbench: Monocular 3d semantic scene completion bench- [59] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and
mark in street views, arXiv preprint arXiv:2306.09001 2 (2023). support inference from rgbd images, in: Computer Vision–ECCV 2012:
[38] C. Zhang, J. Yan, Y. Wei, J. Li, L. Liu, Y. Tang, Y. Duan, J. Lu, Occnerf: 12th European Conference on Computer Vision, Florence, Italy, October
Self-supervised multi-camera occupancy prediction with neural radiance 7-13, 2012, Proceedings, Part V 12, Springer, 2012, pp. 746–760.
fields, arXiv preprint arXiv:2312.09243 (2023). [60] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss,
[39] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recog- J. Gall, Semantickitti: A dataset for semantic scene understanding of
nition, in: Proceedings of the IEEE conference on computer vision and lidar sequences, in: Proceedings of the IEEE/CVF international confer-
pattern recognition, 2016, pp. 770–778. ence on computer vision, 2019, pp. 9297–9307.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, [61] C. Min, L. Xiao, D. Zhao, Y. Nie, B. Dai, Multi-camera unified pre-
Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural training via 3d scene reconstruction, IEEE Robotics and Automation
information processing systems 30 (2017). Letters (2024).
[41] Tesla ai day, accessed July 25, 2024. [62] C. Lyu, S. Guo, B. Zhou, H. Xiong, H. Zhou, 3dopformer: 3d occupancy
URL https://www.youtube.com/watch?v=j0z4FweCy4M perception from multi-camera images with directional and distance en-
[42] Y. Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, J. Lu, Be- hancement, IEEE Transactions on Intelligent Vehicles (2023).
verse: Unified perception and prediction in birds-eye-view for vision- [63] C. Häne, C. Zach, A. Cohen, M. Pollefeys, Dense semantic 3d recon-
centric autonomous driving, arXiv preprint arXiv:2205.09743 (2022). struction, IEEE transactions on pattern analysis and machine intelli-
[43] J. Huang, G. Huang, Z. Zhu, Y. Ye, D. Du, Bevdet: High-performance gence 39 (9) (2016) 1730–1743.
multi-camera 3d object detection in bird-eye-view, arXiv preprint [64] X. Chen, J. Sun, Y. Xie, H. Bao, X. Zhou, Neuralrecon: Real-time coher-
arXiv:2112.11790 (2021). ent 3d scene reconstruction from monocular video, IEEE Transactions
[44] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, J. Dai, Bev- on Pattern Analysis and Machine Intelligence (2024).
former: Learning bird’s-eye-view representation from multi-camera im- [65] X. Tian, R. Liu, Z. Wang, J. Ma, High quality 3d reconstruction based on
ages via spatiotemporal transformers, in: European conference on com- fusion of polarization imaging and binocular stereo vision, Information
puter vision, Springer, 2022, pp. 1–18. Fusion 77 (2022) 19–28.
[45] B. Yang, W. Luo, R. Urtasun, Pixor: Real-time 3d object detection from [66] P. N. Leite, A. M. Pinto, Fusing heterogeneous tri-dimensional informa-
point clouds, in: Proceedings of the IEEE conference on Computer Vi- tion for reconstructing submerged structures in harsh sub-sea environ-
sion and Pattern Recognition, 2018, pp. 7652–7660. ments, Information Fusion 103 (2024) 102126.
[46] B. Yang, M. Liang, R. Urtasun, Hdnet: Exploiting hd maps for 3d object [67] J.-D. Durou, M. Falcone, M. Sagona, Numerical methods for shape-
detection, in: Conference on Robot Learning, PMLR, 2018, pp. 146– from-shading: A new survey with benchmarks, Computer Vision and
155. Image Understanding 109 (1) (2008) 22–43.
[47] Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. L. Rus, S. Han, Bev- [68] J. L. Schonberger, J.-M. Frahm, Structure-from-motion revisited, in:
fusion: Multi-task multi-sensor fusion with unified bird’s-eye view rep- Proceedings of the IEEE conference on computer vision and pattern
resentation, in: 2023 IEEE international conference on robotics and au- recognition, 2016, pp. 4104–4113.
tomation (ICRA), IEEE, 2023, pp. 2774–2781. [69] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoor-
[48] T. Liang, H. Xie, K. Yu, Z. Xia, Z. Lin, Y. Wang, T. Tang, B. Wang, thi, R. Ng, Nerf: Representing scenes as neural radiance fields for view
Z. Tang, Bevfusion: A simple and robust lidar-camera fusion frame- synthesis, Communications of the ACM 65 (1) (2021) 99–106.
work, Advances in Neural Information Processing Systems 35 (2022) [70] S. J. Garbin, M. Kowalski, M. Johnson, J. Shotton, J. Valentin, Fast-
10421–10434. nerf: High-fidelity neural rendering at 200fps, in: Proceedings of
[49] Y. Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y. Shi, J. Sun, Z. Li, Bevdepth: the IEEE/CVF international conference on computer vision, 2021, pp.
Acquisition of reliable depth for multi-view 3d object detection, in: Pro- 14346–14355.
ceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, [71] C. Reiser, S. Peng, Y. Liao, A. Geiger, Kilonerf: Speeding up neu-
2023, pp. 1477–1485. ral radiance fields with thousands of tiny mlps, in: Proceedings of
[50] Y. Jiang, L. Zhang, Z. Miao, X. Zhu, J. Gao, W. Hu, Y.-G. Jiang, Po- the IEEE/CVF international conference on computer vision, 2021, pp.
larformer: Multi-camera 3d object detection with polar transformer, in: 14335–14345.
Proceedings of the AAAI conference on Artificial Intelligence, Vol. 37, [72] T. Takikawa, J. Litalien, K. Yin, K. Kreis, C. Loop, D. Nowrouzezahrai,
2023, pp. 1042–1050. A. Jacobson, M. McGuire, S. Fidler, Neural geometric level of detail:
[51] J. Mei, Y. Yang, M. Wang, J. Zhu, X. Zhao, J. Ra, L. Li, Y. Liu, Real-time rendering with implicit 3d shapes, in: Proceedings of the
Camera-based 3d semantic scene completion with sparse guidance net- IEEE/CVF Conference on Computer Vision and Pattern Recognition,
work, arXiv preprint arXiv:2312.05752 (2023). 2021, pp. 11358–11367.

[73] B. Kerbl, G. Kopanas, T. Leimkühler, G. Drettakis, 3d gaussian splatting [95] L. Zhao, X. Xu, Z. Wang, Y. Zhang, B. Zhang, W. Zheng, D. Du, J. Zhou,
for real-time radiance field rendering, ACM Transactions on Graphics J. Lu, Lowrankocc: Tensor decomposition and low-rank recovery for
42 (4) (2023) 1–14. vision-based 3d semantic occupancy prediction, in: Proceedings of the
[74] G. Chen, W. Wang, A survey on 3d gaussian splatting, arXiv preprint IEEE/CVF Conference on Computer Vision and Pattern Recognition,
arXiv:2401.03890 (2024). 2024, pp. 9806–9815.
[75] C. B. Rist, D. Emmerichs, M. Enzweiler, D. M. Gavrila, Semantic scene [96] A.-Q. Cao, A. Dai, R. de Charette, Pasco: Urban 3d panoptic scene com-
completion using local deep implicit functions on lidar data, IEEE trans- pletion with uncertainty awareness, in: Proceedings of the IEEE/CVF
actions on pattern analysis and machine intelligence 44 (10) (2021) Conference on Computer Vision and Pattern Recognition, 2024, pp.
7205–7218. 14554–14564.
[76] Y. Wei, L. Zhao, W. Zheng, Z. Zhu, J. Zhou, J. Lu, Surroundocc: Multi- [97] L. Kong, Y. Liu, X. Li, R. Chen, W. Zhang, J. Ren, L. Pan, K. Chen,
camera 3d occupancy prediction for autonomous driving, in: Proceed- Z. Liu, Robo3d: Towards robust and reliable 3d perception against cor-
ings of the IEEE/CVF International Conference on Computer Vision, ruptions, in: Proceedings of the IEEE/CVF International Conference on
2023, pp. 21729–21740. Computer Vision, 2023, pp. 19994–20006.
[77] J. Yao, C. Li, K. Sun, Y. Cai, H. Li, W. Ouyang, H. Li, Ndc-scene: Boost [98] B. Li, J. Deng, W. Zhang, Z. Liang, D. Du, X. Jin, W. Zeng, Hierarchical
monocular 3d semantic scene completion in normalized device coordi- temporal context learning for camera-based semantic scene completion,
nates space, in: 2023 IEEE/CVF International Conference on Computer arXiv preprint arXiv:2407.02077 (2024).
Vision (ICCV), IEEE Computer Society, 2023, pp. 9421–9431. [99] Y. Shi, T. Cheng, Q. Zhang, W. Liu, X. Wang, Occupancy as set of
[78] X. Liu, M. Gong, Q. Fang, H. Xie, Y. Li, H. Zhao, C. Feng, points, arXiv preprint arXiv:2407.04049 (2024).
Lidar-based 4d occupancy completion and forecasting, arXiv preprint [100] G. Wang, Z. Wang, P. Tang, J. Zheng, X. Ren, B. Feng, C. Ma, Occ-
arXiv:2310.11239 (2023). gen: Generative multi-modal 3d occupancy prediction for autonomous
[79] S. Zuo, W. Zheng, Y. Huang, J. Zhou, J. Lu, Pointocc: Cylindrical driving, arXiv preprint arXiv:2404.15014 (2024).
tri-perspective view for point-based 3d semantic occupancy prediction, [101] S. Wang, J. Yu, W. Li, H. Shi, K. Yang, J. Chen, J. Zhu, Label-efficient
arXiv preprint arXiv:2308.16896 (2023). semantic scene completion with scribble annotations, arXiv preprint
[80] Z. Yu, C. Shu, J. Deng, K. Lu, Z. Liu, J. Yu, D. Yang, H. Li, Y. Chen, arXiv:2405.15170 (2024).
Flashocc: Fast and memory-efficient occupancy prediction via channel- [102] Y. Pan, B. Gao, J. Mei, S. Geng, C. Li, H. Zhao, Semanticposs: A point
to-height plugin, arXiv preprint arXiv:2311.12058 (2023). cloud dataset with large quantity of dynamic instances, in: 2020 IEEE
[81] J. Xu, L. Peng, H. Cheng, L. Xia, Q. Zhou, D. Deng, W. Qian, Intelligent Vehicles Symposium (IV), IEEE, 2020, pp. 687–693.
W. Wang, D. Cai, Regulating intermediate 3d features for vision-centric [103] J. Pan, Z. Wang, L. Wang, Co-occ: Coupling explicit feature fusion with
autonomous driving, in: Proceedings of the AAAI Conference on Arti- volume rendering regularization for multi-modal 3d semantic occupancy
ficial Intelligence, Vol. 38, 2024, pp. 6306–6314. prediction, arXiv preprint arXiv:2404.04561 (2024).
[82] J. Hou, X. Li, W. Guan, G. Zhang, D. Feng, Y. Du, X. Xue, J. Pu, Fas- [104] L. Li, H. P. Shum, T. P. Breckon, Less is more: Reducing task and model
tocc: Accelerating 3d occupancy prediction by fusing the 2d bird’s-eye complexity for 3d point cloud semantic segmentation, in: Proceedings
view and perspective view, arXiv preprint arXiv:2403.02710 (2024). of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
[83] M. Pan, J. Liu, R. Zhang, P. Huang, X. Li, L. Liu, S. Zhang, Renderocc: tion, 2023, pp. 9361–9371.
Vision-centric 3d occupancy prediction with 2d rendering supervision, [105] L. Kong, J. Ren, L. Pan, Z. Liu, Lasermix for semi-supervised lidar
arXiv preprint arXiv:2309.09502 (2023). semantic segmentation, in: Proceedings of the IEEE/CVF Conference
[84] Y. Zheng, X. Li, P. Li, Y. Zheng, B. Jin, C. Zhong, X. Long, H. Zhao, on Computer Vision and Pattern Recognition, 2023, pp. 21705–21715.
Q. Zhang, Monoocc: Digging into monocular semantic occupancy pre- [106] P. Tang, H.-M. Xu, C. Ma, Prototransfer: Cross-modal prototype transfer
diction, arXiv preprint arXiv:2403.08766 (2024). for point cloud segmentation, in: Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision, 2023, pp. 3337–3347.
[85] Q. Ma, X. Tan, Y. Qu, L. Ma, Z. Zhang, Y. Xie, Cotr: Compact occupancy transformer for vision-based 3d occupancy prediction, arXiv preprint arXiv:2312.01919 (2023).
[86] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, O. Beijbom, nuscenes: A multimodal dataset for autonomous driving, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11621–11631.
[87] Y. Huang, W. Zheng, B. Zhang, J. Zhou, J. Lu, Selfocc: Self-supervised vision-based 3d occupancy prediction, arXiv preprint arXiv:2311.12754 (2023).
[88] H. Jiang, T. Cheng, N. Gao, H. Zhang, W. Liu, X. Wang, Symphonize 3d semantic scene completion with contextual instance queries, arXiv preprint arXiv:2306.15670 (2023).
[89] S. Wang, J. Yu, W. Li, W. Liu, X. Liu, J. Chen, J. Zhu, Not all voxels are equal: Hardness-aware semantic scene completion with self-distillation, arXiv preprint arXiv:2404.11958 (2024).
[90] P. Tang, Z. Wang, G. Wang, J. Zheng, X. Ren, B. Feng, C. Ma, Sparseocc: Rethinking sparse latent representation for vision-based semantic occupancy prediction, arXiv preprint arXiv:2404.09502 (2024).
[91] K. Han, D. Muhle, F. Wimbauer, D. Cremers, Boosting self-supervision for single-view scene completion via knowledge distillation, arXiv preprint arXiv:2404.07933 (2024).
[92] Y. Liao, J. Xie, A. Geiger, Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (3) (2022) 3292–3310.
[93] Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving, accessed July 25, 2024.
URL https://github.com/OpenDriveLab/OpenScene
[94] Y. Xue, R. Li, F. Wu, Z. Tang, K. Li, M. Duan, Bi-ssc: Geometric-semantic bidirectional fusion for camera-based 3d semantic scene completion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20124–20134.
[107] C. Min, L. Xiao, D. Zhao, Y. Nie, B. Dai, Occupancy-mae: Self-supervised pre-training large-scale lidar point clouds with masked occupancy autoencoders, IEEE Transactions on Intelligent Vehicles (2023).
[108] Z. Wang, Z. Ye, H. Wu, J. Chen, L. Yi, Semantic complete scene forecasting from a 4d dynamic point cloud sequence, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 5867–5875.
[109] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, O. Beijbom, Pointpillars: Fast encoders for object detection from point clouds, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12697–12705.
[110] Y. Zhou, O. Tuzel, Voxelnet: End-to-end learning for point cloud based 3d object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4490–4499.
[111] Z. Tan, Z. Dong, C. Zhang, W. Zhang, H. Ji, H. Li, Ovo: Open-vocabulary occupancy, arXiv preprint arXiv:2305.16133 (2023).
[112] Y. Shi, J. Li, K. Jiang, K. Wang, Y. Wang, M. Yang, D. Yang, Panossc: Exploring monocular panoptic 3d scene reconstruction for autonomous driving, in: 2024 International Conference on 3D Vision (3DV), IEEE, 2024, pp. 1219–1228.
[113] Y. Lu, X. Zhu, T. Wang, Y. Ma, Octreeocc: Efficient and multi-granularity occupancy prediction using octree queries, arXiv preprint arXiv:2312.03774 (2023).
[114] B. Li, Y. Sun, Z. Liang, D. Du, Z. Zhang, X. Wang, Y. Wang, X. Jin, W. Zeng, Bridging stereo geometry and bev representation with reliable mutual interaction for semantic scene completion (2024).
[115] X. Tan, W. Wu, Z. Zhang, C. Fan, Y. Peng, Z. Zhang, Y. Xie, L. Ma, Geocc: Geometrically enhanced 3d occupancy network with implicit-explicit depth fusion and contextual self-supervision, arXiv preprint arXiv:2405.10591 (2024).
[116] S. Boeder, F. Gigengack, B. Risse, Occflownet: Towards self-supervised occupancy estimation via differentiable rendering and occupancy flow, arXiv preprint arXiv:2402.12792 (2024).
[117] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[118] H. Liu, H. Wang, Y. Chen, Z. Yang, J. Zeng, L. Chen, L. Wang, Fully sparse 3d occupancy prediction, arXiv preprint arXiv:2312.17118 (2024).
[119] Z. Ming, J. S. Berrio, M. Shan, S. Worrall, Inversematrixvt3d: An efficient projection matrix-based approach for 3d occupancy prediction, arXiv preprint arXiv:2401.12422 (2024).
[120] J. Li, X. He, C. Zhou, X. Cheng, Y. Wen, D. Zhang, Viewformer: Exploring spatiotemporal modeling for multi-view 3d occupancy perception via view-guided transformers, arXiv preprint arXiv:2405.04299 (2024).
[121] D. Scaramuzza, F. Fraundorfer, Visual odometry [tutorial], IEEE robotics & automation magazine 18 (4) (2011) 80–92.
[122] J. Philion, S. Fidler, Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, Springer, 2020, pp. 194–210.
[123] Z. Xia, X. Pan, S. Song, L. E. Li, G. Huang, Vision transformer with deformable attention, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 4794–4803.
[124] J. Park, C. Xu, S. Yang, K. Keutzer, K. M. Kitani, M. Tomizuka, W. Zhan, Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection, in: The Eleventh International Conference on Learning Representations, 2022.
[125] Y. Liu, J. Yan, F. Jia, S. Li, A. Gao, T. Wang, X. Zhang, Petrv2: A unified framework for 3d perception from multi-camera images, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3262–3272.
[126] H. Liu, Y. Teng, T. Lu, H. Wang, L. Wang, Sparsebev: High-performance sparse 3d object detection from multi-camera videos, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 18580–18590.
[127] B. Cheng, A. G. Schwing, A. Kirillov, Per-pixel classification is not all you need for semantic segmentation, 2021.
[128] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, R. Girdhar, Masked-attention mask transformer for universal image segmentation, 2022.
[129] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, Advances in neural information processing systems 33 (2020) 6840–6851.
[130] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE transactions on information theory 13 (1) (1967) 21–27.
[131] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[132] D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multi-scale deep network, Advances in neural information processing systems 27 (2014).
[133] Y. Huang, Z. Tang, D. Chen, K. Su, C. Chen, Batching soft iou for training semantic segmentation networks, IEEE Signal Processing Letters 27 (2019) 66–70.
[134] J. Chen, W. Lu, Y. Li, L. Shen, J. Duan, Adversarial learning of object-aware activation map for weakly-supervised semantic segmentation, IEEE Transactions on Circuits and Systems for Video Technology 33 (8) (2023) 3935–3946.
[135] J. Chen, X. Zhao, C. Luo, L. Shen, Semformer: semantic guided activation transformer for weakly supervised semantic segmentation, arXiv preprint arXiv:2210.14618 (2022).
[136] Y. Wu, J. Liu, M. Gong, Q. Miao, W. Ma, C. Xu, Joint semantic segmentation using representations of lidar point clouds and camera images, Information Fusion 108 (2024) 102370.
[137] Q. Yan, S. Li, Z. He, X. Zhou, M. Hu, C. Liu, Q. Chen, Decoupling semantic and localization for semantic segmentation via magnitude-aware and phase-sensitive learning, Information Fusion (2024) 102314.
[138] M. Berman, A. R. Triki, M. B. Blaschko, The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4413–4421.
[139] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[140] J. Li, Y. Liu, X. Yuan, C. Zhao, R. Siegwart, I. Reid, C. Cadena, Depth based semantic scene completion with position importance aware loss, IEEE Robotics and Automation Letters 5 (1) (2019) 219–226.
[141] S. Kullback, R. A. Leibler, On information and sufficiency, The annals of mathematical statistics 22 (1) (1951) 79–86.
[142] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? the kitti vision benchmark suite, in: 2012 IEEE conference on computer vision and pattern recognition, IEEE, 2012, pp. 3354–3361.
[143] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., Scalability in perception for autonomous driving: Waymo open dataset, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454.
[144] J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, L. Chen, A. Jain, S. Omari, V. Iglovikov, P. Ondruska, One thousand and one hours: Self-driving motion prediction dataset, in: Conference on Robot Learning, PMLR, 2021, pp. 409–418.
[145] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al., Argoverse: 3d tracking and forecasting with rich maps, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8748–8757.
[146] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, R. Yang, The apolloscape dataset for autonomous driving, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 954–960.
[147] H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. Wolff, A. Lang, L. Fletcher, O. Beijbom, S. Omari, nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles, arXiv preprint arXiv:2106.11810 (2021).
[148] R. Garg, V. K. Bg, G. Carneiro, I. Reid, Unsupervised cnn for single view depth estimation: Geometry to the rescue, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, Springer, 2016, pp. 740–756.
[149] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE transactions on image processing 13 (4) (2004) 600–612.
[150] T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan, et al., Grounded sam: Assembling open-world models for diverse visual tasks, arXiv preprint arXiv:2401.14159 (2024).
[151] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
[152] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, et al., Grounding dino: Marrying dino with grounded pre-training for open-set object detection, arXiv preprint arXiv:2303.05499 (2023).
[153] Z. Li, Z. Yu, D. Austin, M. Fang, S. Lan, J. Kautz, J. M. Alvarez, Fb-occ: 3d occupancy prediction based on forward-backward view transformation, arXiv preprint arXiv:2307.01492 (2023).
[154] Y. Huang, W. Zheng, Y. Zhang, J. Zhou, J. Lu, Gaussianformer: Scene as gaussians for vision-based 3d semantic occupancy prediction, arXiv preprint arXiv:2405.17429 (2024).
[155] H. Shi, S. Wang, J. Zhang, X. Yin, Z. Wang, Z. Zhao, G. Wang, J. Zhu, K. Yang, K. Wang, Occfiner: Offboard occupancy refinement with hybrid propagation, arXiv preprint arXiv:2403.08504 (2024).
[156] Z. Yu, R. Zhang, J. Ying, J. Yu, X. Hu, L. Luo, S. Cao, H. Shen, Context and geometry aware voxel transformer for semantic scene completion, arXiv preprint arXiv:2405.13675 (2024).
[157] Cvpr 2023 3d occupancy prediction challenge, accessed July 25, 2024.
URL https://opendrivelab.com/challenge2023/
[158] Cvpr 2024 autonomous grand challenge, accessed July 25, 2024.
URL https://opendrivelab.com/challenge2024/#occupancy_and_flow
[159] H. Vanholder, Efficient inference with tensorrt, in: GPU Technology Conference, Vol. 1, 2016.
[160] Y. Liu, L. Mou, X. Yu, C. Han, S. Mao, R. Xiong, Y. Wang, Let occ flow: Self-supervised 3d occupancy flow prediction, arXiv preprint arXiv:2407.07587 (2024).
[161] Y. LeCun, A path towards autonomous machine intelligence version 0.9.2, 2022-06-27, Open Review 62 (1) (2022) 1–62.
[162] W. Zheng, W. Chen, Y. Huang, B. Zhang, Y. Duan, J. Lu, Occworld: Learning a 3d occupancy world model for autonomous driving, arXiv preprint arXiv:2311.16038 (2023).
[163] L. Wang, W. Zheng, Y. Ren, H. Jiang, Z. Cui, H. Yu, J. Lu, Occsora: 4d occupancy generation models as world simulators for autonomous driving, arXiv preprint arXiv:2405.20337 (2024).
[164] B. Agro, Q. Sykora, S. Casas, T. Gilles, R. Urtasun, Uno: Unsupervised occupancy fields for perception and forecasting, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14487–14496.
[165] T. Khurana, P. Hu, D. Held, D. Ramanan, Point cloud forecasting as a proxy for 4d occupancy forecasting, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1116–1124.
[166] H. Xu, H. Liu, S. Meng, Y. Sun, A novel place recognition network using visual sequences and lidar point clouds for autonomous vehicles, in: 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), IEEE, 2023, pp. 2862–2867.
[167] H. Xu, H. Liu, S. Huang, Y. Sun, C2l-pr: Cross-modal camera-to-lidar place recognition via modality alignment and orientation voting, IEEE Transactions on Intelligent Vehicles (2024).
[168] D. K. Jain, X. Zhao, G. González-Almagro, C. Gan, K. Kotecha, Multi-modal pedestrian detection using metaheuristics with deep convolutional neural network in crowded scenes, Information Fusion 95 (2023) 401–414.
[169] D. Guan, Y. Cao, J. Yang, Y. Cao, M. Y. Yang, Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection, Information Fusion 50 (2019) 148–157.
[170] T. Wang, S. Kim, J. Wenxuan, E. Xie, C. Ge, J. Chen, Z. Li, P. Luo, Deepaccident: A motion and accident prediction benchmark for v2x autonomous driving, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 5599–5606.
[171] Z. Zou, X. Zhang, H. Liu, Z. Li, A. Hussain, J. Li, A novel multimodal fusion network based on a joint-coding model for lane line segmentation, Information Fusion 80 (2022) 167–178.
[172] Y. Xu, X. Yang, L. Gong, H.-C. Lin, T.-Y. Wu, Y. Li, N. Vasconcelos, Explainable object-induced action decision for autonomous vehicles, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9523–9532.
[173] Y. Zhuang, X. Sun, Y. Li, J. Huai, L. Hua, X. Yang, X. Cao, P. Zhang, Y. Cao, L. Qi, et al., Multi-sensor integrated navigation/positioning systems using data fusion: From analytics-based to learning-based approaches, Information Fusion 95 (2023) 62–90.
[174] S. Li, X. Li, H. Wang, Y. Zhou, Z. Shen, Multi-gnss ppp/ins/vision/lidar tightly integrated system for precise navigation in urban environments, Information Fusion 90 (2023) 218–232.
[175] Z. Zhang, Z. Xu, W. Yang, Q. Liao, J.-H. Xue, Bdc-occ: Binarized deep convolution unit for binarized occupancy network, arXiv preprint arXiv:2405.17037 (2024).
[176] Z. Huang, S. Sun, J. Zhao, L. Mao, Multi-modal policy fusion for end-to-end autonomous driving, Information Fusion 98 (2023) 101834.
[177] Icra 2024 robodrive challenge, accessed July 25, 2024.
URL https://robodrive-24.github.io/#tracks
[178] S. Xie, L. Kong, W. Zhang, J. Ren, L. Pan, K. Chen, Z. Liu, Robobev: Towards robust bird's eye view perception under corruptions, arXiv preprint arXiv:2304.06719 (2023).
[179] S. Chen, Y. Ma, Y. Qiao, Y. Wang, M-bev: Masked bev perception for robust autonomous driving, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 1183–1191.
[180] S. Chen, T. Cheng, X. Wang, W. Meng, Q. Zhang, W. Liu, Efficient and robust 2d-to-bev representation learning via geometry-guided kernel transformer, arXiv preprint arXiv:2206.04584 (2022).
[181] Y. Kim, J. Shin, S. Kim, I.-J. Lee, J. W. Choi, D. Kum, Crn: Camera radar net for accurate, robust, efficient 3d perception, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17615–17626.
[182] F. Ding, X. Wen, Y. Zhu, Y. Li, C. X. Lu, Radarocc: Robust 3d occupancy prediction with 4d imaging radar, arXiv preprint arXiv:2405.14014 (2024).
[183] J. Kälble, S. Wirges, M. Tatarchenko, E. Ilg, Accurate training data for occupancy map prediction in automated driving using evidence theory, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5281–5290.
[184] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, Journal of Machine Learning Research 25 (70) (2024) 1–53.
[185] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al., Judging llm-as-a-judge with mt-bench and chatbot arena, Advances in Neural Information Processing Systems 36 (2024).
[186] Openai chatgpt, accessed July 25, 2024.
URL https://www.openai.com/
[187] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
[188] D. Zhu, J. Chen, X. Shen, X. Li, M. Elhoseiny, Minigpt-4: Enhancing vision-language understanding with advanced large language models, arXiv preprint arXiv:2304.10592 (2023).
[189] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, Advances in neural information processing systems 36 (2024).
[190] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[191] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, S. Hoi, Instructblip: Towards general-purpose vision-language models with instruction tuning, Advances in Neural Information Processing Systems 36 (2024).
[192] C. Zhou, C. C. Loy, B. Dai, Extract free dense labels from clip, in: European Conference on Computer Vision, Springer, 2022, pp. 696–712.