A Survey On Occupancy Perception For Autonomous Driving: The Information Fusion Perspective
Abstract
3D occupancy perception technology aims to observe and understand dense 3D environments for autonomous vehicles. Owing
to its comprehensive perception capability, this technology is emerging as a trend in autonomous driving perception systems, and
is attracting significant attention from both industry and academia. Similar to traditional bird’s-eye view (BEV) perception, 3D
occupancy perception has the nature of multi-source input and the necessity for information fusion. However, the difference is
that it captures vertical structures that are ignored by 2D BEV. In this survey, we review the most recent works on 3D occupancy
perception, and provide in-depth analyses of methodologies with various input modalities. Specifically, we summarize general
network pipelines, highlight information fusion techniques, and discuss effective network training. We evaluate and analyze the
occupancy perception performance of the state-of-the-art on the most popular datasets. Furthermore, challenges and future research
directions are discussed. We hope this paper will inspire the community and encourage more research work on 3D occupancy
perception. A comprehensive list of studies in this survey is publicly available in an active repository that continuously collects the
latest work: https://github.com/HuaiyuanXu/3D-Occupancy-Perception.
Keywords: Autonomous Driving, Information Fusion, Occupancy Perception, Multi-Modal Data.
Figure 2: Chronological overview of 3D occupancy perception. It can be observed that: (1) research on occupancy has undergone explosive growth since 2023;
(2) the predominant trend focuses on vision-centric occupancy, supplemented by LiDAR-centric and multi-modal methods.
dustrial perspective, the deployment of a LiDAR kit on each autonomous vehicle is expensive. With cameras as a cheap alternative to LiDAR, vision-centric occupancy perception is indeed a cost-effective solution that reduces the manufacturing cost for vehicle equipment manufacturers.

1.2. Motivation to Information Fusion Research

The gist of occupancy perception lies in comprehending complete and dense 3D scenes, including understanding occluded areas. However, the observation from a single sensor only captures part of the scene. For instance, Fig. 1 intuitively illustrates that an image or a point cloud cannot provide a 3D panorama or a dense environmental scan. To this end, studying information fusion from multiple sensors [11, 12, 13] and multiple frames [4, 8] will facilitate a more comprehensive perception: on the one hand, information fusion expands the spatial range of perception; on the other hand, it densifies scene observation. Besides, for occluded regions, integrating multi-frame observations is beneficial, as the same scene is observed from a host of viewpoints, which offer sufficient scene features for occlusion inference.

Furthermore, in complex outdoor scenarios with varying lighting and weather conditions, the need for stable occupancy perception is paramount. This stability is crucial for ensuring driving safety. At this point, research on multi-modal fusion will promote robust occupancy perception by combining the strengths of different modalities of data [11, 12, 14, 15]. For example, LiDAR and radar data are insensitive to illumination changes and can sense the precise depth of the scene. This capability is particularly important during nighttime driving or in scenarios where shadow or glare obscures critical information. Camera data excel in capturing detailed visual texture and are adept at identifying color-based environmental elements (e.g., road signs and traffic lights) and long-distance objects. Therefore, the fusion of data from LiDAR, radar, and camera can present a holistic understanding of the environment while remaining robust against adverse environmental changes.

1.3. Contributions

Among perception-related topics, 3D semantic segmentation [16, 17] and 3D object detection [18, 19, 20, 21] have been extensively reviewed. However, these tasks do not facilitate a dense understanding of the environment. BEV perception, which addresses this issue, has also been thoroughly reviewed [1, 2]. Our survey focuses on 3D occupancy perception, which captures the environmental height information overlooked by BEV perception. There are two related reviews: Roldao et al. [22] conducted a literature review on 3D scene completion for both indoor and outdoor scenes; Zhang et al. [23] only reviewed 3D occupancy prediction based on the visual modality. Unlike their work, our survey is tailored to autonomous driving scenarios, and extends the existing 3D occupancy survey by considering more sensor modalities. Moreover, given the multi-source nature of 3D occupancy perception, we provide an in-depth analysis of information fusion techniques for this field. The primary contributions of this survey are three-fold:

• We systematically review the latest research on 3D occupancy perception in the field of autonomous driving, covering motivation analysis, the overall research background, and an in-depth discussion on methodology, evaluation, and challenges.

• We provide a taxonomy of 3D occupancy perception, and elaborate on core methodological issues, including network pipelines, multi-source information fusion, and effective network training.

• We present evaluations for 3D occupancy perception, and offer detailed performance comparisons. Furthermore, current limitations and future research directions are discussed.

The remainder of this paper is structured as follows. Sec. 2 provides a brief background on the history, definitions, and related research domains. Sec. 3 details methodological insights. Sec. 4 conducts performance comparisons and analyses. Finally, future research directions are discussed and the survey is concluded in Sec. 5 and 6, respectively.

2. Background

2.1. A Brief History of Occupancy Perception

Occupancy perception is derived from Occupancy Grid Mapping (OGM) [24], which is a classic topic in mobile robot navigation and aims to generate a grid map from noisy and uncertain measurements. Each grid in this map is assigned a value that scores the probability of the grid space being occupied by obstacles. Semantic occupancy perception originates from SSCNet [25], which predicts the occupied status and semantics
of all voxels in an indoor scene from a single image. However, studying occupancy perception in outdoor scenes is imperative for autonomous driving, as opposed to indoor scenes. MonoScene [26] is a pioneering work of outdoor scene occupancy perception using only a monocular camera. Contempo-

with static indoor scenes [25, 57, 58], such as the NYU [59] and SUNCG [25] datasets. After the release of the large-scale outdoor benchmark SemanticKITTI [60], numerous outdoor SSC
Table 1 (fragment): summary of 3D occupancy perception methods. Columns cover Method, Venue, Modality, Design Choices (Feature Format, Lightweight Design, Multi-Camera, Multi-Frame, Head), Task, Training (Supervision, Loss), Evaluation Datasets (Occ3D-nuScenes [36], SemanticKITTI [60], Others), and Open Source (Code, Weight). The rows visible in this fragment are LMSCNet [28] (3DV 2020; LiDAR; BEV features; 2D Conv design; 3D Conv head; strong supervision with CE loss), S3CNet [30] (CoRL 2020; LiDAR; BEV+Vol features; Sparse Conv design; 2D & 3D Conv head; strong supervision with CE, PA, BCE losses), DIFs [75] (T-PAMI 2021; LiDAR; BEV features; 2D Conv design; MLP head; strong supervision with CE, BCE, SC losses), and MonoScene [26] (CVPR 2022; camera; volumetric features; 3D Conv head; strong supervision with CE, FP, Aff losses).
to infer the occupied status {1, 0} of each voxel, and even estimate its semantic category:

$$V = f_{\mathrm{Conv/MLP}}\big(\mathrm{ED}(V_{\mathrm{init\text{-}2D/3D}})\big), \qquad (4)$$

where ED represents the encoder and decoder.

3.1.2. Information Fusion in LiDAR-Centric Occupancy

Some works directly utilize a single 2D branch to reason about 3D occupancy, such as DIFs [75] and PointOcc [79]. In these approaches, only 2D feature maps instead of 3D feature volumes are required, resulting in reduced computational demand. However, a significant disadvantage is the partial loss of height information. In contrast, the 3D branch does not compress data in any dimension, thereby protecting the complete 3D scene. To enhance memory efficiency in the 3D branch, LMSCNet [28] turns the height dimension into the feature channel dimension. This adaptation facilitates the use of more efficient 2D convolutions compared to 3D convolutions in the 3D branch. Moreover, integrating information from both 2D and 3D branches can significantly refine occupancy predictions [30].

S3CNet [30] proposes a unique late fusion strategy for integrating information from 2D and 3D branches. This fusion strategy involves a dynamic voxel fusion technique that leverages the results of the 2D branch to enhance the density of the output from the 3D branch. Ablation studies report that this straightforward and direct information fusion strategy can yield a 5-12% performance boost in 3D occupancy perception.
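To make the 2D-3D branch fusion above concrete, the following PyTorch sketch broadcasts BEV features along the height axis and adds them to the voxel volume before a light refinement convolution. It illustrates the general idea of fusing a 2D branch with a 3D branch rather than S3CNet's actual dynamic voxel fusion; the module, channel sizes, and tensor layouts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BEVVoxelLateFusion(nn.Module):
    """Fuse a 2D BEV branch with a 3D voxel branch (illustrative sketch only)."""

    def __init__(self, bev_channels: int, voxel_channels: int):
        super().__init__()
        # Project BEV features to the voxel channel width before broadcasting.
        self.align = nn.Conv2d(bev_channels, voxel_channels, kernel_size=1)
        self.refine = nn.Conv3d(voxel_channels, voxel_channels, kernel_size=3, padding=1)

    def forward(self, bev_feat: torch.Tensor, voxel_feat: torch.Tensor) -> torch.Tensor:
        # bev_feat:   (B, C_bev, X, Y)    -- height collapsed into channels
        # voxel_feat: (B, C_vox, X, Y, Z) -- full 3D feature volume
        bev = self.align(bev_feat)        # (B, C_vox, X, Y)
        bev = bev.unsqueeze(-1)           # (B, C_vox, X, Y, 1)
        fused = voxel_feat + bev          # broadcast BEV cues along the height axis
        return self.refine(fused)         # refine the densified volume

# Usage with random tensors (batch 1, 128x128 grid, 16 height bins):
fusion = BEVVoxelLateFusion(bev_channels=64, voxel_channels=32)
out = fusion(torch.rand(1, 64, 128, 128), torch.rand(1, 32, 128, 128, 16))
```

Concatenation along the channel dimension followed by a 3D convolution is an equally common alternative to the additive fusion shown here.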
Figure 5: Architecture for vision-centric occupancy perception: methods without temporal fusion [31, 32, 36, 38, 76, 81, 82, 83, 87, 116]; methods with temporal fusion [4, 8, 10, 56, 80, 85].

Figure 6(b): Spatial information fusion. In areas where multiple cameras have overlapping fields of view, features from these cameras are fused through averaging [38, 53, 82] or cross attention [4, 32, 76, 113, 120].

3.2. Vision-Centric Occupancy Perception

3.2.1. General Pipeline

Inspired by Tesla's technology of the perception system
polation [26, 31, 38, 53]. This projection process is formulated as:

$$F_{Vol}(x, y, z) = \Psi_{S}\big(F_{2D}(\Psi_{\rho}(x, y, z, K, RT))\big), \qquad (5)$$

where K and RT are the intrinsics and extrinsics of the camera. However, the problem with the projection-based 2D-to-3D transformation is that, along the line of sight, multiple voxels in the 3D space correspond to the same location in the front-view feature map. This leads to a many-to-one mapping that introduces ambiguity in the correspondence between 2D and 3D.

(2) Back Projection: Back projection is the reverse process of projection. Similarly, it also utilizes perspective projection to establish correspondences between 2D and 3D. However, unlike projection, back projection uses the estimated depth d of each pixel to calculate an accurate one-to-one mapping from 2D to 3D:

$$F_{Vol}\big(\Psi_{V}(\Psi_{\rho}^{-1}(u, v, d, K, RT))\big) = F_{2D}(u, v), \qquad (6)$$

where $\Psi_{\rho}^{-1}$ indicates the inverse projection function and $\Psi_{V}$ is voxelization. Since estimating the depth value may introduce errors, it is more effective to predict a discrete depth distribution Dis along the optical ray rather than estimating a specific depth for each pixel [55, 80, 81, 82]. That is, $F_{Vol} = F_{2D} \otimes Dis$, where ⊗ denotes the outer product. This depth distribution-based re-projection, derived from LSS [122], has significant advantages. On one hand, it can handle uncertainty and ambiguity in depth perception. For instance, if the depth of a certain pixel is unclear, the model can capture this uncertainty through the depth distribution. On the other hand, this probabilistic method of depth estimation provides greater robustness, particularly in a multi-camera setting. If corresponding pixels in multi-camera images have incorrect depth values and are mapped to the same voxel in the 3D space, their information might be impossible to integrate. In contrast, estimating depth distributions allows for information fusion with depth uncertainty, leading to more robustness and accuracy.
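The depth distribution-based re-projection can be sketched in a few lines: the outer product $F_{Vol} = F_{2D} \otimes Dis$ places each pixel's features at every depth bin, weighted by the predicted probability. The snippet below is a minimal illustration of the lifting step popularized by LSS [122]; the subsequent splatting of frustum features into voxels is omitted, and all shapes and names are assumptions.

```python
import torch

def lift_with_depth_distribution(feat_2d: torch.Tensor, depth_dist: torch.Tensor) -> torch.Tensor:
    """Lift 2D image features into a camera frustum via an outer product.

    feat_2d:    (B, C, H, W)  per-pixel image features
    depth_dist: (B, D, H, W)  per-pixel discrete depth distribution (softmax over D bins)
    returns:    (B, C, D, H, W) frustum features, later splatted/pooled into voxels
    """
    # Outer product over the channel and depth dimensions for every pixel.
    return feat_2d.unsqueeze(2) * depth_dist.unsqueeze(1)

# Example: 6 cameras folded into the batch dimension, 64 channels, 48 depth bins.
frustum = lift_with_depth_distribution(
    torch.rand(6, 64, 32, 88),
    torch.rand(6, 48, 32, 88).softmax(dim=1),
)
```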
(3) Cross Attention: The cross attention-based transformation aims to interact between the feature volume and the feature map in a learnable manner. Consistent with the attention mechanism [40], each volumetric feature in the 3D feature volume acts as the query, and the key and value come from the 2D feature map. However, employing vanilla cross attention for the 2D-to-3D transformation requires considerable computational expense, as each query must attend to all features in the feature map. To optimize for GPU efficiency, many transformation methods [4, 36, 76, 113, 118] adopt deformable cross attention [123], where the query interacts with selected reference features instead of all features in the feature map, therefore greatly reducing computation. Specifically, for each query, we project its 3D position q onto the 2D feature map according to the given intrinsics and extrinsics. We sample some reference features around the projected 2D position p. These sampled features are then weighted and summed according to the deformable attention mechanism:

$$F_{Vol}(q) = \sum_{i=1}^{N_{head}} W_{i} \sum_{j=1}^{N_{key}} A_{ij} W_{ij} F_{2D}(p + \Delta p_{ij}), \qquad (7)$$

where $(W_{i}, W_{ij})$ are learnable weights, $A_{ij}$ denotes attention, $p + \Delta p_{ij}$ represents the position of the reference feature, and $\Delta p_{ij}$ indicates a learnable position shift.

Furthermore, there are some hybrid transformation methods that combine multiple 2D-to-3D transformation techniques. VoxFormer [33] and SGN [51] initially compute a coarse 3D feature volume by per-pixel depth estimation and back projection, and subsequently refine the feature volume using cross attention. COTR [85] has a hybrid transformation similar to VoxFormer and SGN, but it replaces per-pixel depth estimation with estimating depth distributions.

For TPV features, TPVFormer [32] achieves the 2D-to-3D transformation via cross attention. The conversion process differs slightly from that depicted in Fig. 6a, where the 3D feature volume is replaced by a 2D feature map of a specific perspective among the three views. For BEV features, the conversion from the front view to the bird's-eye view can be achieved by cross attention [61] or by back projection followed by vertical pooling [61, 80].

3.2.3. Information Fusion in Vision-Centric Occupancy

In a multi-camera setting, each camera's front-view feature map describes a part of the scene. To comprehensively understand the scene, it is necessary to spatially fuse the information from multiple feature maps. Additionally, objects in the scene might be occluded or in motion. Temporally fusing feature maps of multiple frames can help reason about the occluded areas and recognize the motion status of objects.

(1) Spatial Information Fusion: The fusion of observations from multiple cameras can create a 3D feature volume with an expanded field of view for scene perception. Within the overlapping area of multi-camera views, a 3D voxel in the feature volume will hit several 2D front-view feature maps after projection. There are two ways to fuse the hit 2D features: averaging [38, 53, 82] and cross attention [4, 32, 76, 113], as illustrated in Fig. 6b. The averaging operation calculates the mean of multiple features, which simplifies the fusion process and reduces computational costs. However, it assumes an equivalent contribution of different 2D perspectives to perceiving the 3D scene. This may not always be the case, especially when certain views are occluded or blurry.

To address this problem, multi-camera cross attention is used to adaptively fuse information from multiple views. Specifically, its process can be regarded as an extension of Eq. 7 by incorporating more camera views. We redefine the deformable attention function as $\mathrm{DA}(q, p_{i}, F_{2D\text{-}i})$, where q is a query position in the 3D space, $p_{i}$ is its projection position on a specific 2D view, and $F_{2D\text{-}i}$ is the corresponding 2D front-view feature map. The multi-camera cross attention process can be formulated as:

$$F_{Vol}(q) = \frac{1}{|\nu|} \sum_{i \in \nu} \mathrm{DA}(q, p_{i}, F_{2D\text{-}i}), \qquad (8)$$

where $F_{Vol}(q)$ represents the feature of the query position in the 3D feature volume, and ν denotes all hit views.
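Below is a minimal sketch of the spatial information fusion just described: each 3D query position is projected into every camera with its intrinsics and extrinsics, features are bilinearly sampled from the views it hits, and the samples are averaged following the 1/|ν| summation structure of Eq. 8 (plain averaging stands in for deformable cross attention). All function names, shapes, and coordinate conventions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_multi_camera_features(voxel_xyz, feats_2d, intrinsics, extrinsics, image_size):
    """Average 2D features over all camera views that 'hit' each 3D query point.

    voxel_xyz:  (N, 3)        query positions in the ego frame
    feats_2d:   (V, C, H, W)  per-view front-view feature maps
    intrinsics: (V, 3, 3); extrinsics: (V, 4, 4) ego-to-camera transforms (assumed)
    image_size: (img_h, img_w) of the original images
    """
    img_h, img_w = image_size
    num_views, channels = feats_2d.shape[0], feats_2d.shape[1]
    fused = torch.zeros(voxel_xyz.shape[0], channels)
    hits = torch.zeros(voxel_xyz.shape[0], 1)
    homo = torch.cat([voxel_xyz, torch.ones(voxel_xyz.shape[0], 1)], dim=1)   # (N, 4)

    for v in range(num_views):
        cam = (extrinsics[v] @ homo.T).T[:, :3]          # ego -> camera coordinates
        pix = (intrinsics[v] @ cam.T).T                  # perspective projection
        depth = pix[:, 2:3].clamp(min=1e-5)
        uv = pix[:, :2] / depth                          # pixel coordinates (u, v)
        valid = ((cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < img_w)
                 & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)).float().unsqueeze(1)
        # Normalize to [-1, 1] and bilinearly sample this view's feature map.
        grid = torch.stack([uv[:, 0] / (img_w - 1) * 2 - 1,
                            uv[:, 1] / (img_h - 1) * 2 - 1], dim=-1).view(1, 1, -1, 2)
        sampled = F.grid_sample(feats_2d[v:v + 1], grid, align_corners=True)  # (1, C, 1, N)
        sampled = sampled.squeeze(0).squeeze(1).T                              # (N, C)
        fused += sampled * valid
        hits += valid

    return fused / hits.clamp(min=1.0)   # mean over hit views, as in Eq. 8 with averaging

# Example: 4096 query points, 6 views, 64-channel feature maps from 900x1600 images.
vox_feats = fuse_multi_camera_features(torch.rand(4096, 3), torch.rand(6, 64, 56, 100),
                                        torch.eye(3).expand(6, 3, 3),
                                        torch.eye(4).expand(6, 4, 4), (900, 1600))
```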
(Figure: fusion pipeline for occupancy perception — volumetric or BEV features extracted from the inputs are fused via ① concatenation, ② summation, or ③ cross attention, followed by an optional refinement stage and the occupancy head.)

(2) Temporal Information Fusion: Recent advancements in vision-based BEV perception systems [44, 124, 125] have demonstrated that integrating temporal information can significantly improve perception performance. Similarly, in vision-
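A common way to realize the temporal-spatial alignment mentioned above is to warp history features into the current ego frame using the relative ego pose and then fuse them with the current features. The sketch below does this for BEV features with a 2D affine warp and concatenation-plus-convolution fusion; it is a simplified, hypothetical illustration rather than the scheme of any specific method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBEVFusion(nn.Module):
    """Warp the previous frame's BEV features into the current ego frame and fuse
    them with the current features by concatenation + convolution (minimal sketch;
    real systems also handle unobserved regions and multiple history frames)."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, prev_bev, cur_bev, prev_to_cur):
        # prev_bev, cur_bev: (B, C, H, W) BEV features
        # prev_to_cur: (B, 2, 3) affine transform (rotation + translation on the BEV
        # grid) derived from the ego poses of the two frames.
        grid = F.affine_grid(prev_to_cur, size=cur_bev.shape, align_corners=False)
        aligned_prev = F.grid_sample(prev_bev, grid, align_corners=False)
        return self.fuse(torch.cat([aligned_prev, cur_bev], dim=1))

# Usage: shift the previous BEV map slightly along x before fusing.
fusion = TemporalBEVFusion(64)
out = fusion(torch.rand(1, 64, 128, 128), torch.rand(1, 64, 128, 128),
             torch.tensor([[[1.0, 0.0, 0.05], [0.0, 1.0, 0.0]]]))
```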
3.4.2. Training with Other Supervisions

Training with strong supervision is straightforward and effective, but requires tedious annotation for voxel-wise labels. In contrast, training with other types of supervision, such as weak, semi, and self supervision, is label-efficient.

(1) Weak Supervision: It indicates that occupancy labels are not used, and supervision is derived from alternative labels. For example, point clouds with semantic labels can guide occupancy prediction. Specifically, Vampire [81] and RenderOcc [83] construct density and semantic volumes, which facilitate the inference of semantic occupancy of the scene and the computation of depth and semantic maps through volumetric rendering. These methods do not employ occupancy labels. Alternatively, they project LiDAR point clouds with semantic labels onto the camera plane to acquire ground-truth depth and semantics, which then supervise network training. Since both strongly-supervised and weakly-supervised learning predict geometric and semantic occupancy, the losses used in strongly-supervised learning, such as cross-entropy loss, Lovasz-Softmax loss, and scale-invariant logarithmic loss, are also applicable to weakly-supervised learning.

(2) Semi Supervision: It utilizes occupancy labels that do not cover the complete scene, therefore providing only semi supervision for occupancy network training. POP-3D [9] initially generates occupancy labels by processing LiDAR point clouds, where a voxel is recorded as occupied if it contains at least one LiDAR point, and empty otherwise. Given the sparsity and occlusions inherent in LiDAR point clouds, the occupancy labels produced in this manner do not encompass the entire space, meaning that only portions of the scene have their occupancy labelled. POP-3D employs cross-entropy loss and Lovasz-Softmax loss to supervise network training. Moreover, to establish the cross-modal correspondence between text and 3D occupancy, POP-3D proposes to calculate the L2 mean square error between language-image features and 3D-language features as the modality alignment loss.
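The label-generation step described for POP-3D — marking a voxel as occupied if it contains at least one LiDAR point — can be sketched as a simple voxelization. The ranges, resolution, and function name below are illustrative assumptions, not tied to a specific dataset.

```python
import numpy as np

def occupancy_labels_from_lidar(points: np.ndarray, pc_range, voxel_size):
    """Generate partial occupancy labels from a LiDAR point cloud: a voxel is
    marked occupied (1) if it contains at least one point, otherwise empty (0).
    As noted above, such labels cover only the observed portion of the scene.

    points:     (N, 3) LiDAR points in the ego frame
    pc_range:   (x_min, y_min, z_min, x_max, y_max, z_max)
    voxel_size: (dx, dy, dz)
    """
    mins, maxs = np.array(pc_range[:3]), np.array(pc_range[3:])
    size = np.array(voxel_size)
    grid_shape = np.round((maxs - mins) / size).astype(int)

    occ = np.zeros(grid_shape, dtype=np.uint8)
    in_range = np.all((points >= mins) & (points < maxs), axis=1)
    idx = ((points[in_range] - mins) / size).astype(int)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = 1   # occupied wherever a point falls
    return occ

# Example: a 40 m x 40 m x 6 m volume at 0.4 m resolution.
labels = occupancy_labels_from_lidar(np.random.uniform(-20, 20, (10000, 3)),
                                     (-20, -20, -3, 20, 20, 3), (0.4, 0.4, 0.4))
```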
(3) Self Supervision: It trains occupancy perception networks without any labels. To this end, volume rendering [69] provides a self-supervised signal to encourage consistency across different views from temporal and spatial perspectives, by minimizing photometric differences. MVBTS [91] computes the photometric difference between the rendered RGB

where α is a hyperparameter weight. Furthermore, OccNeRF leverages cross-entropy loss for semantic optimization in a self-supervised manner. The semantic labels directly come from pre-trained semantic segmentation models, such as the pre-trained open-vocabulary model Grounded-SAM [150, 151, 152].

4. Evaluation

In this section, we will provide the performance evaluation of 3D occupancy perception. First, the datasets and metrics commonly used for evaluation are introduced. Subsequently, we offer detailed performance comparisons and discussions on state-of-the-art 3D occupancy perception methods using the most popular datasets.

4.1. Datasets and Metrics

4.1.1. Datasets

There are a variety of datasets to evaluate the performance of occupancy prediction approaches, e.g., the widely used KITTI [142], nuScenes [86], and Waymo [143]. However, most of the datasets only contain 2D semantic segmentation annotations, which is not practical for the training or evaluation of 3D occupancy prediction approaches. To support the benchmarks for 3D occupancy perception, many new datasets such as Monoscene [26], Occ3D [36], and OpenScene [93] are developed based on previous datasets like nuScenes and Waymo. A detailed summary of datasets is provided in Tab. 2.

Table 2: Overview of 3D occupancy datasets with multi-modal sensors. Ann.: Annotation. Occ.: Occupancy. C: Camera. L: LiDAR. R: Radar. D: Depth map. Flow: 3D occupancy flow. Datasets highlighted in light gray are meta datasets.

Traditional Datasets. Before the development of 3D occupancy based algorithms, KITTI [142], SemanticKITTI [60], nuScenes [86], Waymo [143], and KITTI-360 [92] are widely used benchmarks for 2D semantic perception methods. KITTI contains ∼15K annotated frames from ∼15K 3D scans across 22 scenes with camera and LiDAR inputs. SemanticKITTI extends KITTI with more annotated frames (∼20K) from more 3D scans (∼43K). nuScenes collects more 3D scans (∼390K) from 1,000 scenes, resulting in more annotated frames (∼40K), and supports extra radar inputs. Waymo and KITTI-360 are two large datasets with ∼230K and ∼80K annotated frames, respectively, while Waymo contains more scenes (1,000) than KITTI-360 does (only 11). The above
datasets are the widely adopted benchmarks for 2D perception algorithms before the popularity of 3D occupancy perception. These datasets also serve as the meta datasets of benchmarks for 3D occupancy based perception algorithms.

3D Occupancy Datasets. The occupancy network proposed by Tesla has led the trend of 3D occupancy based perception for autonomous driving. However, the lack of a publicly available large dataset containing 3D occupancy annotations brings difficulty to the development of 3D occupancy perception. To deal with this dilemma, many researchers develop 3D occupancy datasets based on meta datasets like nuScenes and Waymo. Monoscene [26], supporting 3D occupancy annotations, is created from the SemanticKITTI, KITTI, and NYUv2 [59] datasets. SSCBench [37] is developed based on the KITTI-360, nuScenes, and Waymo datasets with camera inputs. OCFBench [78], built on the Lyft-Level-5 [144], Argoverse [145], ApolloScape [146], and nuScenes datasets, only contains LiDAR inputs. SurroundOcc [76], OpenOccupancy [11], and OpenOcc [8] are developed on the nuScenes dataset. Occ3D [36] contains more annotated frames with 3D occupancy labels (∼40K based on nuScenes and ∼200K based on Waymo). Cam4DOcc [10] and OpenScene [93] are two new datasets that contain large-scale 3D occupancy and 3D occupancy flow annotations. Cam4DOcc is based on the nuScenes plus Lyft-Level-5 datasets, while OpenScene, with ∼4M annotated frames, is built on the very large dataset nuPlan [147].

4.1.2. Metrics

(1) Voxel-level Metrics: Occupancy prediction without semantic consideration is regarded as class-agnostic perception. It focuses solely on understanding spatial geometry, that is, determining whether each voxel in a 3D space is occupied or empty. The common evaluation metric is the voxel-level Intersection-over-Union (IoU), expressed as:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad (16)$$

where TP, FP, and FN represent the number of true positives, false positives, and false negatives. A true positive means that an actual occupied voxel is correctly predicted.

Occupancy prediction that simultaneously infers the occupation status and semantic classification of voxels can be regarded as semantic-geometric perception. In this context, the mean Intersection-over-Union (mIoU) is commonly used as the evaluation metric. The mIoU metric calculates the IoU for each semantic class separately and then averages these IoUs across all classes, excluding the 'empty' class:

$$\mathrm{mIoU} = \frac{1}{N_{C}} \sum_{i=1}^{N_{C}} \frac{TP_{i}}{TP_{i} + FP_{i} + FN_{i}}, \qquad (17)$$

where $TP_{i}$, $FP_{i}$, and $FN_{i}$ are the number of true positives, false positives, and false negatives for a specific semantic category i, and $N_{C}$ denotes the total number of semantic categories.
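The following is a compact sketch of how the voxel-level metrics of Eqs. 16 and 17 are computed from predicted and ground-truth label volumes; label conventions (e.g., which id marks empty space) vary across benchmarks and are assumed here for illustration.

```python
import numpy as np

def voxel_iou_miou(pred: np.ndarray, gt: np.ndarray, num_classes: int, empty_id: int = 0):
    """Geometric IoU (Eq. 16) over occupied-vs-empty voxels, and semantic mIoU
    (Eq. 17) averaged over classes, with the 'empty' class excluded.
    pred, gt: integer label volumes of the same shape."""
    # Geometric IoU: any non-empty voxel counts as occupied.
    p_occ, g_occ = pred != empty_id, gt != empty_id
    tp = np.logical_and(p_occ, g_occ).sum()
    fp = np.logical_and(p_occ, ~g_occ).sum()
    fn = np.logical_and(~p_occ, g_occ).sum()
    iou = tp / max(tp + fp + fn, 1)

    # Semantic mIoU: per-class IoU averaged over all non-empty classes.
    ious = []
    for c in range(num_classes):
        if c == empty_id:
            continue
        tp_c = np.logical_and(pred == c, gt == c).sum()
        fp_c = np.logical_and(pred == c, gt != c).sum()
        fn_c = np.logical_and(pred != c, gt == c).sum()
        ious.append(tp_c / max(tp_c + fp_c + fn_c, 1))
    return iou, float(np.mean(ious))

# Example on random volumes with 17 semantic classes plus an empty class (id 0).
iou, miou = voxel_iou_miou(np.random.randint(0, 18, (50, 50, 8)),
                           np.random.randint(0, 18, (50, 50, 8)), num_classes=18)
```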
(2) Ray-level Metric: Although voxel-level IoU and mIoU metrics are widely recognized [10, 38, 53, 76, 80, 84, 85, 87], they still have limitations. Due to the unbalanced distribution and occlusion of LiDAR sensing, ground-truth voxel labels from accumulated LiDAR point clouds are imperfect, where the areas not scanned by LiDAR are annotated as empty. Moreover, for thin objects, voxel-level metrics are too strict, as a one-voxel deviation would reduce the IoU values of thin objects to zero. To solve these issues, SparseOcc [118] imitates LiDAR's ray casting and proposes the ray-level mIoU, which evaluates rays to their closest contact surface. This novel mIoU, combined with the mean absolute velocity error (mAVE), is adopted by the occupancy score (OccScore) metric [93]. OccScore overcomes the shortcomings of voxel-level metrics while also evaluating the performance in perceiving object motion in the scene (i.e., occupancy flow).

The formulation of the ray-level mIoU is consistent with Eq. 17 in form but differs in application. The ray-level mIoU evaluates each query ray rather than each voxel. A query ray is considered a true positive if (i) its predicted class label matches the
Table 3: 3D occupancy prediction comparison (%) on the SemanticKITTI test set [60]. Mod.: Modality. C: Camera. L: LiDAR. The IoU evaluates the
performance in geometric occupancy perception, and the mIoU evaluates semantic occupancy perception.
Per-class columns (with class frequency in the dataset): road (15.30%), sidewalk (11.13%), parking (1.12%), other-grnd (0.56%), building (14.1%), car (3.92%), truck (0.16%), bicycle (0.03%), motorcycle (0.03%), other-veh. (0.20%), vegetation (39.3%), trunk (0.51%), terrain (9.17%), person (0.07%), bicyclist (0.07%), motorcyclist (0.05%), fence (3.90%), pole (0.29%), traf.-sign (0.08%). Each row lists Method, Modality, IoU, mIoU, and then the per-class IoUs in this order.
Method Mod. IoU mIoU
S3CNet [30] L 45.60 29.53 42.00 22.50 17.00 7.90 52.20 31.20 6.70 41.50 45.00 16.10 39.50 34.00 21.20 45.90 35.80 16.00 31.30 31.00 24.30
LMSCNet [28] L 56.72 17.62 64.80 34.68 29.02 4.62 38.08 30.89 1.47 0.00 0.00 0.81 41.31 19.89 32.05 0.00 0.00 0.00 21.32 15.01 0.84
JS3C-Net [29] L 56.60 23.75 64.70 39.90 34.90 14.10 39.40 33.30 7.20 14.40 8.80 12.70 43.10 19.60 40.50 8.00 5.10 0.40 30.40 18.90 15.90
DIFs [75] L 58.90 23.56 69.60 44.50 41.80 12.70 41.30 35.40 4.70 3.60 2.70 4.70 43.80 27.40 40.90 2.40 1.00 0.00 30.50 22.10 18.50
OpenOccupancy [11] C+L - 20.42 60.60 36.10 29.00 13.00 38.40 33.80 4.70 3.00 2.20 5.90 41.50 20.50 35.10 0.80 2.30 0.60 26.00 18.70 15.70
Co-Occ [103] C+L - 24.44 72.00 43.50 42.50 10.20 35.10 40.00 6.40 4.40 3.30 8.80 41.20 30.80 40.80 1.60 3.30 0.40 32.70 26.60 20.70
MonoScene [26] C 34.16 11.08 54.70 27.10 24.80 5.70 14.40 18.80 3.30 0.50 0.70 4.40 14.90 2.40 19.50 1.00 1.40 0.40 11.10 3.30 2.10
TPVFormer [32] C 34.25 11.26 55.10 27.20 27.40 6.50 14.80 19.20 3.70 1.00 0.50 2.30 13.90 2.60 20.40 1.10 2.40 0.30 11.00 2.90 1.50
OccFormer [55] C 34.53 12.32 55.90 30.30 31.50 6.50 15.70 21.60 1.20 1.50 1.70 3.20 16.80 3.90 21.30 2.20 1.10 0.20 11.90 3.80 3.70
SurroundOcc [76] C 34.72 11.86 56.90 28.30 30.20 6.80 15.20 20.60 1.40 1.60 1.20 4.40 14.90 3.40 19.30 1.40 2.00 0.10 11.30 3.90 2.40
NDC-Scene [77] C 36.19 12.58 58.12 28.05 25.31 6.53 14.90 19.13 4.77 1.93 2.07 6.69 17.94 3.49 25.01 3.44 2.77 1.64 12.85 4.43 2.96
RenderOcc [83] C - 8.24 43.64 19.10 12.54 0.00 11.59 14.83 2.47 0.42 0.17 1.78 17.61 1.48 20.01 0.94 3.20 0.00 4.71 1.17 0.88
Symphonies [88] C 42.19 15.04 58.40 29.30 26.90 11.70 24.70 23.60 3.20 3.60 2.60 5.60 24.20 10.00 23.10 3.20 1.90 2.00 16.10 7.70 8.00
Scribble2Scene [101] C 42.60 13.33 50.30 27.30 20.60 11.30 23.70 20.10 5.60 2.70 1.60 4.50 23.50 9.60 23.80 1.60 1.80 0.00 13.30 5.60 6.50
HASSC [89] C 42.87 14.38 55.30 29.60 25.90 11.30 23.10 23.00 9.80 1.90 1.50 4.90 24.80 9.80 26.50 1.40 3.00 0.00 14.30 7.00 7.10
BRGScene [114] C 43.34 15.36 61.90 31.20 30.70 10.70 24.20 22.80 8.40 3.40 2.40 6.10 23.80 8.40 27.00 2.90 2.20 0.50 16.50 7.00 7.20
VoxFormer [33] C 44.15 13.35 53.57 26.52 19.69 0.42 19.54 26.54 7.26 1.28 0.56 7.81 26.10 6.10 33.06 1.93 1.97 0.00 7.31 9.15 4.94
MonoOcc [84] C - 15.63 59.10 30.90 27.10 9.80 22.90 23.90 7.20 4.50 2.40 7.70 25.00 9.80 26.10 2.80 4.70 0.60 16.90 7.30 8.40
HTCL [98] C 44.23 17.09 64.40 34.80 33.80 12.40 25.90 27.30 10.80 1.80 2.20 5.40 25.30 10.80 31.20 1.10 3.10 0.90 21.10 9.00 8.30
Bi-SSC [94] C 45.10 16.73 63.40 33.30 31.70 11.20 26.60 25.00 6.80 1.80 1.00 6.80 26.10 10.50 28.90 1.70 3.30 1.00 19.40 9.30 8.40
ground-truth class and (ii) the L1 error between the predicted and ground-truth depths is below a given threshold. The mAVE measures the average velocity error for true positive rays among 8 semantic categories. The final OccScore is calculated as:

$$\mathrm{OccScore} = \mathrm{mIoU} \times 0.9 + \max(1 - \mathrm{mAVE},\ 0.0) \times 0.1. \qquad (18)$$
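Given a precomputed ray-level mIoU and mAVE, Eq. 18 reduces to a one-line combination, sketched below for clarity (inputs are assumed to be already computed; mIoU in [0, 1], mAVE in m/s).

```python
def occ_score(ray_miou: float, mave: float) -> float:
    """OccScore as in Eq. 18: a weighted blend of ray-level mIoU and the
    mean absolute velocity error of true-positive rays."""
    return ray_miou * 0.9 + max(1.0 - mave, 0.0) * 0.1

# Example: a ray-level mIoU of 0.35 and an mAVE of 0.6 m/s give an OccScore of 0.355.
print(occ_score(0.35, 0.6))
```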
4.2. Performance

In this subsection, we will compare and analyze the performance accuracy and inference speed of various 3D occupancy perception methods. For performance accuracy, we discuss three aspects: overall comparison, modality comparison, and supervision comparison. The evaluation datasets used include SemanticKITTI, Occ3D-nuScenes, and SSCBench-KITTI-360.

4.2.1. Perception Accuracy

SemanticKITTI [60] is the first dataset with 3D occupancy labels for outdoor driving scenes. Occ3D-nuScenes [36] is the dataset used in the CVPR 2023 3D Occupancy Prediction Challenge [157]. These two datasets are currently the most popular. Therefore, we summarize the performance of various 3D occupancy methods that are trained and tested on these datasets, as reported in Tab. 3 and 4. Additionally, we evaluate the performance of 3D occupancy methods on the SSCBench-KITTI-360 dataset, as reported in Tab. 5. These tables classify occupancy methods according to input modalities and supervised learning types, respectively. The best performances are highlighted in bold. Tab. 3 and 5 utilize the IoU and mIoU metrics to evaluate 3D geometric and 3D semantic occupancy perception capabilities. Tab. 4 adopts mIoU and mIoU∗ to assess 3D semantic occupancy perception. Unlike mIoU, the mIoU∗ metric excludes the 'others' and 'other flat' classes and is used by the self-supervised OccNeRF [38]. For fairness, we compare the mIoU∗ of OccNeRF with other self-supervised occupancy methods. Notably, the OccScore metric is used in the CVPR 2024 Autonomous Grand Challenge [158], but it has yet to become widely adopted. Thus, we do not summarize occupancy performance with this metric. Below, we compare perception accuracy from three aspects: overall comparison, modality comparison, and supervision comparison.

(1) Overall Comparison. Tab. 3 and 5 show that (i) the IoU scores of occupancy networks are less than 60%, while the mIoU scores are less than 30%. The IoU scores (indicating geometric perception, i.e., ignoring semantics) substantially surpass the mIoU scores. This is because predicting occupancy for some semantic categories is challenging, such as bicycles, motorcycles, persons, bicyclists, motorcyclists, poles, and traffic signs. Each of these classes has a small proportion (under 0.3%) in the dataset, and their small sizes make them difficult to observe and detect. Therefore, if the IoU scores of these categories are low, they significantly impact the overall mIoU value, because the mIoU calculation, which does not account for category frequency, divides the sum of IoU scores of all categories by the number of categories. (ii) A higher IoU does not guarantee a higher mIoU. One possible explanation is that the semantic perception capacity (reflected in mIoU) and the geometric perception capacity (reflected in IoU) of an occupancy network are distinct and not positively correlated.

From Tab. 4, it is evident that (i) the mIoU scores of oc-
Table 4: 3D semantic occupancy prediction comparison (%) on the validation set of Occ3D-nuScenes [36]. Sup. represents the supervised learning type.
mIoU∗ is the mean Intersection-over-Union excluding the ’others’ and ’other flat’ classes. For fairness, all compared methods are vision-centric.
Per-class columns: others, barrier, bicycle, bus, car, const. veh., motorcycle, pedestrian, traffic cone, trailer, truck, drive. suf., other flat, sidewalk, terrain, manmade, vegetation. Each row lists Method, Supervision, mIoU, mIoU∗, and then the per-class IoUs in this order.
Method Sup. mIoU mIoU∗
SelfOcc (BEV) [87] Self 6.76 7.66 0.00 0.00 0.00 0.00 9.82 0.00 0.00 0.00 0.00 0.00 6.97 47.03 0.00 18.75 16.58 11.93 3.81
SelfOcc (TPV) [87] Self 7.97 9.03 0.00 0.00 0.00 0.00 10.03 0.00 0.00 0.00 0.00 0.00 7.11 52.96 0.00 23.59 25.16 11.97 4.61
SimpleOcc [31] Self - 7.99 - 0.67 1.18 3.21 7.63 1.02 0.26 1.80 0.26 1.07 2.81 40.44 - 18.30 17.01 13.42 10.84
OccNeRF [38] Self - 10.81 - 0.83 0.82 5.13 12.49 3.50 0.23 3.10 1.84 0.52 3.90 52.62 - 20.81 24.75 18.45 13.19
RenderOcc [83] Weak 23.93 - 5.69 27.56 14.36 19.91 20.56 11.96 12.42 12.14 14.34 20.81 18.94 68.85 33.35 42.01 43.94 17.36 22.61
Vampire [81] Weak 28.33 - 7.48 32.64 16.15 36.73 41.44 16.59 20.64 16.55 15.09 21.02 28.47 67.96 33.73 41.61 40.76 24.53 20.26
OccFormer [55] Strong 21.93 - 5.94 30.29 12.32 34.40 39.17 14.44 16.45 17.22 9.27 13.90 26.36 50.99 30.96 34.66 22.73 6.76 6.97
TPVFormer [32] Strong 27.83 - 7.22 38.90 13.67 40.78 45.90 17.23 19.99 18.85 14.30 26.69 34.17 55.65 35.47 37.55 30.70 19.40 16.78
Occ3D [36] Strong 28.53 - 8.09 39.33 20.56 38.29 42.24 16.93 24.52 22.72 21.05 22.98 31.11 53.33 33.84 37.98 33.23 20.79 18.00
SurroundOcc [76] Strong 38.69 - 9.42 43.61 19.57 47.66 53.77 21.26 22.35 24.48 19.36 32.96 39.06 83.15 43.26 52.35 55.35 43.27 38.02
FastOcc [82] Strong 40.75 - 12.86 46.58 29.93 46.07 54.09 23.74 31.10 30.68 28.52 33.08 39.69 83.33 44.65 53.90 55.46 42.61 36.50
FB-OCC [153] Strong 42.06 - 14.30 49.71 30.00 46.62 51.54 29.30 29.13 29.35 30.48 34.97 39.36 83.07 47.16 55.62 59.88 44.89 39.58
PanoOcc [4] Strong 42.13 - 11.67 50.48 29.64 49.44 55.52 23.29 33.26 30.55 30.99 34.43 42.57 83.31 44.23 54.40 56.04 45.94 40.40
COTR [85] Strong 46.21 - 14.85 53.25 35.19 50.83 57.25 35.36 34.06 33.54 37.14 38.99 44.97 84.46 48.73 57.60 61.08 51.61 46.72
cupancy networks are within 50%, higher than the scores on SemanticKITTI and SSCBench-KITTI-360. For example, the mIoUs of TPVFormer [32] on SemanticKITTI and SSCBench-KITTI-360 are 11.26% and 13.64%, but it reaches 27.83% on Occ3D-nuScenes. OccFormer [55] and SurroundOcc [76] show similar behavior. We consider this might be due to the simpler task setting in Occ3D-nuScenes. On the one hand, Occ3D-nuScenes uses surrounding-view images as input, containing richer scene information compared to SemanticKITTI and SSCBench-KITTI-360, which only utilize monocular or binocular images. On the other hand, Occ3D-nuScenes only calculates mIoU for visible 3D voxels, whereas the other two datasets evaluate both visible and occluded areas, posing greater challenges. (ii) COTR [85] has the best mIoU (46.21%) and also achieves the highest IoU scores across all categories on Occ3D-nuScenes.

(2) Modality Comparison. The input data modality significantly influences 3D occupancy perception accuracy. Tab. 3 and 5 report the performance of occupancy perception in different modalities. It can be seen that, due to the accurate depth information provided by LiDAR sensing, LiDAR-centric occupancy methods have more precise perception with higher IoU and mIoU scores. For example, on the SemanticKITTI dataset, S3CNet [30] has the top mIoU (29.53%) and DIFs [75] achieves the highest IoU (58.90%); on the SSCBench-KITTI-360 dataset, S3CNet achieves the best IoU (53.58%). However, we observe that the multi-modal approaches (e.g., OpenOccupancy [11] and Co-Occ [103]) do not outperform single-modal (i.e., LiDAR-centric or vision-centric) methods, indicating that they have not fully leveraged the benefits of multi-modal fusion and the richness of input data. Therefore, there is considerable potential for further improvement in multi-modal occupancy perception. Moreover, vision-centric occupancy perception has advanced rapidly in recent years. On the SemanticKITTI dataset, the state-of-the-art vision-centric occupancy methods still lag behind LiDAR-centric methods in terms of IoU and mIoU. Notably, however, the mIoU of the vision-centric CGFormer [156] has surpassed that of LiDAR-centric methods on the SSCBench-KITTI-360 dataset.

(3) Supervision Comparison. The 'Sup.' column of Tab. 4 outlines the supervised learning types used for training occupancy networks. Training with strong supervision, which directly employs 3D occupancy labels, is the most prevalent type. Tab. 4 shows that occupancy networks based on strongly-supervised learning achieve impressive performance. The mIoU scores of FastOcc [82], FB-Occ [153], PanoOcc [4], and COTR [85] are significantly higher (by 12.42%-38.24% mIoU) than those of weakly-supervised or self-supervised methods. This is because occupancy labels provided by the dataset are carefully annotated with high accuracy, and can impose strong constraints on network training. However, annotating these dense occupancy labels is time-consuming and laborious. It is necessary to explore network training based on weak or self supervision to reduce reliance on occupancy labels. Vampire [81] is the best-performing method based on weakly-supervised learning, achieving an mIoU score of 28.33%. It demonstrates that semantic LiDAR point clouds can supervise the training of 3D occupancy networks. However, the collection and annotation of semantic LiDAR point clouds are expensive. SelfOcc [87] and OccNeRF [38] are two representative occupancy works based on self-supervised learning. They utilize volume rendering and photometric consistency to acquire self-supervised signals, proving that a network can learn 3D occupancy perception without any labels. However, their performance remains limited, with SelfOcc achieving an mIoU of 7.97% and OccNeRF an mIoU∗ of 10.81%.

4.2.2. Inference Speed

Recent studies on 3D occupancy perception [82, 118] have begun to consider not only perception accuracy but also inference speed. Using the data provided by FastOcc [82] and SparseOcc [118], we sort out the inference speeds of 3D oc-
Table 5: 3D occupancy benchmarking results (%) on the SSCBench-KITTI-360 test set. The best results are in bold. OccFiner (Mono.) indicates that OccFiner
refines the predicted occupancy from MonoScene.
Per-class columns (with class frequency in the dataset): car (2.85%), bicycle (0.01%), motorcycle (0.01%), truck (0.16%), other-veh. (5.75%), person (0.02%), road (14.98%), parking (2.31%), sidewalk (6.43%), other-grnd. (2.05%), building (15.67%), fence (0.96%), vegetation (41.99%), terrain (7.10%), pole (0.22%), traf.-sign (0.06%), other-struct. (4.33%), other-obj. (0.28%). Each row lists Method, IoU, mIoU, and then the per-class IoUs in this order.
Method IoU mIoU
LiDAR-Centric Methods
SSCNet [25] 53.58 16.95 31.95 0.00 0.17 10.29 0.00 0.07 65.70 17.33 41.24 3.22 44.41 6.77 43.72 28.87 0.78 0.75 8.69 0.67
LMSCNet [28] 47.35 13.65 20.91 0.00 0.00 0.26 0.58 0.00 62.95 13.51 33.51 0.20 43.67 0.33 40.01 26.80 0.00 0.00 3.63 0.00
Vision-Centric Methods
GaussianFormer [154] 35.38 12.92 18.93 1.02 4.62 18.07 7.59 3.35 45.47 10.89 25.03 5.32 28.44 5.68 29.54 8.62 2.99 2.32 9.51 5.14
MonoScene [26] 37.87 12.31 19.34 0.43 0.58 8.02 2.03 0.86 48.35 11.38 28.13 3.32 32.89 3.53 26.15 16.75 6.92 5.67 4.20 3.09
OccFiner (Mono.) [155] 38.51 13.29 20.78 1.08 1.03 9.04 3.58 1.46 53.47 12.55 31.27 4.13 33.75 4.62 26.83 18.67 5.04 4.58 4.05 3.32
VoxFormer [33] 38.76 11.91 17.84 1.16 0.89 4.56 2.06 1.63 47.01 9.67 27.21 2.89 31.18 4.97 28.99 14.69 6.51 6.92 3.79 2.43
TPVFormer [32] 40.22 13.64 21.56 1.09 1.37 8.06 2.57 2.38 52.99 11.99 31.07 3.78 34.83 4.80 30.08 17.52 7.46 5.86 5.48 2.70
OccFormer [55] 40.27 13.81 22.58 0.66 0.26 9.89 3.82 2.77 54.30 13.44 31.53 3.55 36.42 4.80 31.00 19.51 7.77 8.51 6.95 4.60
Symphonies [88] 44.12 18.58 30.02 1.85 5.90 25.07 12.06 8.20 54.94 13.83 32.76 6.93 35.11 8.58 38.33 11.52 14.01 9.57 14.44 11.28
CGFormer [156] 48.07 20.05 29.85 3.42 3.96 17.59 6.79 6.63 63.85 17.15 40.72 5.53 42.73 8.22 38.80 24.94 16.24 17.45 10.18 6.77
Table 6: Inference speed analysis of 3D occupancy perception on the Occ3D-nuScenes [36] dataset. † indicates data from SparseOcc [118]. ‡ means data from FastOcc [82]. R-50 represents ResNet50 [39]. TRT denotes acceleration using the TensorRT SDK [159].

Method GPU Input Size Backbone mIoU(%) FPS(Hz)
BEVDet† [43] A100 704×256 R-50 36.10 2.6
BEVFormer† [44] A100 1600×900 R-101 39.30 3.0
FB-Occ† [153] A100 704×256 R-50 10.30 10.3
SparseOcc [118] A100 704×256 R-50 30.90 12.5
SurroundOcc‡ [76] V100 1600×640 R-101 37.18 2.8
FastOcc [82] V100 1600×640 R-101 40.75 4.5
FastOcc(TRT) [82] V100 1600×640 R-101 40.75 12.8

cupancy methods, and also report their running platforms, input image sizes, backbone architectures, and occupancy accuracy on the Occ3D-nuScenes dataset, as depicted in Tab. 6.

A practical occupancy method should have both high accuracy (mIoU) and fast inference speed (FPS). From Tab. 6, FastOcc achieves a high mIoU (40.75%), comparable to the mIoU of BEVFormer. Notably, FastOcc attains a higher FPS on a lower-performance GPU platform than BEVFormer. Furthermore, after being accelerated by TensorRT [159], the inference speed of FastOcc reaches 12.8 Hz.
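Inference speeds such as those in Tab. 6 are typically measured by timing repeated forward passes with explicit GPU synchronization. The sketch below shows one common way to do this in PyTorch; the model and input are placeholders, and real benchmarks additionally fix input resolution, precision, and batch size.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, sample: torch.Tensor,
                warmup: int = 10, iters: int = 50) -> float:
    """Rough FPS measurement for an occupancy network. Synchronizing CUDA before
    reading the clock is essential; otherwise asynchronous kernel launches make
    the timing meaningless."""
    model.eval()
    for _ in range(warmup):
        model(sample)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(sample)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

# Usage with a stand-in network and a 256x704 input:
print(measure_fps(torch.nn.Conv2d(3, 8, 3).eval(), torch.rand(1, 3, 256, 704)))
```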
5. Challenges and Opportunities

In this section, we explore the challenges and opportunities of 3D occupancy perception in autonomous driving. Occupancy, as a geometric and semantic representation of the 3D world, can facilitate various autonomous driving tasks. We discuss both existing and prospective applications of 3D occupancy, demonstrating its potential in the field of autonomous driving. Furthermore, we discuss the deployment efficiency of occupancy perception on edge devices, the necessity for robustness in complex real-world driving environments, and the path toward generalized 3D occupancy perception.

5.1. Occupancy-based Applications in Autonomous Driving

3D occupancy perception enables a comprehensive understanding of the 3D world and supports various tasks in autonomous driving. Existing occupancy-based applications include segmentation, detection, dynamic perception, world models, and autonomous driving algorithm frameworks. (1) Segmentation: Semantic occupancy perception can essentially be regarded as a 3D semantic segmentation task. (2) Detection: OccupancyM3D [5] and SOGDet [6] are two occupancy-based works that implement 3D object detection. OccupancyM3D first learns occupancy to enhance 3D features, which are then used for 3D detection. SOGDet develops two concurrent tasks, semantic occupancy prediction and 3D object detection, training them simultaneously for mutual enhancement. (3) Dynamic perception: Its goal is to capture dynamic objects and their motion in the surrounding environment, in the form of predicting occupancy flows for dynamic objects. Strongly-supervised Cam4DOcc [10] and self-supervised LOF [160] have demonstrated potential in occupancy flow prediction. (4) World model: It simulates and forecasts the future state of the surrounding environment by observing current and historical data [161]. Pioneering works, according to their input observation data, can be divided into semantic occupancy sequence-based world models (e.g., OccWorld [162] and OccSora [163]), point cloud sequence-based world models (e.g., SCSF [108], UnO [164], PCF [165]), and multi-camera image sequence-based world models (e.g., DriveWorld [7] and Cam4DOcc [10]). However, these works still perform poorly in high-quality long-term forecasting. (5) Autonomous driving algorithm framework: It integrates different sensor inputs into a unified occupancy representation, then applies the occupancy representation to a wide span of driving tasks, such as 3D object detection, online mapping, multi-object tracking, motion pre-
diction, occupancy prediction, and motion planning. Related works include OccNet [8], DriveWorld [7], and UniScene [61].

However, existing occupancy-based applications primarily focus on the perception level, and less on the decision-making level. Given that 3D occupancy is more consistent with the 3D physical world than other perception manners (e.g., bird's-eye view perception and perspective-view perception), we believe that 3D occupancy holds opportunities for broader applications in autonomous driving. At the perception level, it could improve the accuracy of existing place recognition [166, 167], pedestrian detection [168, 169], accident prediction [170], and lane line segmentation [171]. At the decision-making level, it could help safer driving decisions [172] and navigation [173, 174], and provide 3D explainability for driving behaviors.

5.2. Deployment Efficiency

For complex 3D scenes, large amounts of point cloud data or multi-view visual information always need to be processed and analyzed to extract and update occupancy state information. To achieve real-time performance for the autonomous driving application, solutions commonly need to be computationally complete within a limited amount of time and need to have efficient data structures and algorithm designs. In general, deploying deep learning algorithms on target edge devices is not an easy task.

Currently, some real-time and deployment-friendly efforts on occupancy tasks have been attempted. For instance, Hou et al. [82] proposed a solution, FastOcc, to accelerate prediction inference speed by adjusting the input resolution, view transformation module, and prediction head. Zhang et al. [175] further lightweighted FlashOcc by decomposing its occupancy network and binarizing it with binarized convolutions. Liu et al. [118] proposed SparseOcc, a sparse occupancy network without any dense 3D features, to minimize computational costs using sparse convolution layers and mask-guided sparse sampling. Tang et al. [90] proposed to adopt sparse latent representations and sparse interpolation operations to avoid information loss and reduce computational complexity. Additionally, Huang et al. recently proposed GaussianFormer [154], which utilizes a series of 3D Gaussians to represent sparse regions of interest in space. GaussianFormer optimizes the geometric and semantic properties of the 3D Gaussians, corresponding to the semantic occupancy of those regions. It achieves comparable accuracy to state-of-the-art methods using only 17.8%-24.8% of their memory consumption. However, the above-mentioned approaches are still some way from practical deployment in autonomous driving systems. A deployment-efficient occupancy method requires superiority in real-time processing, lightweight design, and accuracy simultaneously.

5.3. Robust 3D Occupancy Perception

In dynamic and unpredictable real-world driving environments, perception robustness is crucial to autonomous vehicle safety. State-of-the-art 3D occupancy models may be vulnerable to out-of-distribution scenes and data, such as changes in lighting and weather, which would introduce visual biases, and input image blurring caused by vehicle movement. Moreover, sensor malfunctions (e.g., loss of frames and camera views) are common [176]. In light of these challenges, studying robust 3D occupancy perception is valuable.

However, research on robust 3D occupancy is limited, primarily due to the scarcity of datasets. Recently, the ICRA 2024 RoboDrive Challenge [177] has provided imperfect scenarios for studying robust 3D occupancy perception.

In terms of network architecture and scene representation, we consider that related works on robust BEV perception [47, 48, 178, 179, 180, 181] could inspire the development of robust occupancy perception. M-BEV [179] proposes a masked view reconstruction module to enhance robustness under various missing-camera cases. GKT [180] employs coarse projection to achieve a robust BEV representation. In terms of sensor modality, radar can penetrate small particles such as raindrops, fog, and snowflakes in adverse weather conditions, thus providing reliable detection capability. Radar-centric RadarOcc [182] achieves robust occupancy perception with imaging radar, which not only inherits the robustness of mmWave radar in all lighting and weather conditions, but also has higher vertical resolution than mmWave radar. RadarOcc has demonstrated more accurate 3D occupancy prediction than LiDAR-centric and vision-centric methods in adverse weather. Besides, in most damage scenarios involving natural factors, multi-modal models [47, 48, 181] usually outperform single-modal models, benefiting from the complementary nature of multi-modal inputs. In terms of training strategies, Robo3D [97] distills knowledge from a teacher model with complete point clouds to a student model with imperfect input, enhancing the student model's robustness. Therefore, based on these works, approaches to robust 3D occupancy perception could include, but are not limited to, robust scene representation, multiple modalities, network design, and learning strategies.

5.4. Generalized 3D Occupancy Perception

Although more accurate 3D labels mean higher occupancy prediction performance [183], 3D labels are costly, and large-scale 3D annotation of the real world is impractical. The generalization capabilities of existing networks trained on limited 3D-labeled datasets have not been extensively studied. To get rid of the dependence on 3D labels, self-supervised learning represents a potential pathway toward generalized 3D occupancy perception. It learns occupancy perception from a broad range of unlabelled images. However, the performance of current self-supervised occupancy perception [31, 38, 87, 91] is poor. On the Occ3D-nuScenes dataset (see Tab. 4), the top accuracy of self-supervised methods is inferior to that of strongly-supervised methods by a large margin. Moreover, current self-supervised methods require training and evaluation with more data. Thus, enhancing self-supervised generalized 3D occupancy perception is an important future research direction.

Furthermore, current 3D occupancy perception can only recognize a set of predefined object categories, which limits its generalizability and practicality. Recent advances in large language models (LLMs) [184, 185, 186, 187] and large visual-
language models (LVLMs) [188, 189, 190, 191, 192] demon- [9] A. Vobecky, O. Siméoni, D. Hurych, S. Gidaris, A. Bursuc, P. Pérez,
strate a promising ability for reasoning and visual understand- J. Sivic, Pop-3d: Open-vocabulary 3d occupancy prediction from im-
ages, Advances in Neural Information Processing Systems 36 (2024).
ing. Integrating these pre-trained large models has been proven [10] J. Ma, X. Chen, J. Huang, J. Xu, Z. Luo, J. Xu, W. Gu, R. Ai, H. Wang,
to enhance generalization for perception [9]. POP-3D [9] lever- Cam4docc: Benchmark for camera-only 4d occupancy forecasting in au-
ages a powerful pre-trained visual-language model [192] to tonomous driving applications, arXiv preprint arXiv:2311.17663 (2023).
train its network and achieves open-vocabulary 3D occupancy [11] X. Wang, Z. Zhu, W. Xu, Y. Zhang, Y. Wei, X. Chi, Y. Ye, D. Du, J. Lu,
X. Wang, Openoccupancy: A large scale benchmark for surrounding
perception. Therefore, we consider that employing LLMs and semantic occupancy perception, in: Proceedings of the IEEE/CVF In-
LVLMs is a challenge and opportunity for achieving general- ternational Conference on Computer Vision, 2023, pp. 17850–17859.
ized 3D occupancy perception. [12] Z. Ming, J. S. Berrio, M. Shan, S. Worrall, Occfusion: A straightforward
and effective multi-sensor fusion framework for 3d occupancy predic-
tion, arXiv preprint arXiv:2403.01644 (2024).
6. Conclusion [13] R. Song, C. Liang, H. Cao, Z. Yan, W. Zimmer, M. Gross, A. Fes-
tag, A. Knoll, Collaborative semantic occupancy prediction with hy-
brid feature fusion in connected automated vehicles, arXiv preprint
This paper provided a comprehensive survey of 3D occu- arXiv:2402.07635 (2024).
pancy perception in autonomous driving in recent years. We [14] P. Wolters, J. Gilg, T. Teepe, F. Herzog, A. Laouichi, M. Hofmann,
reviewed and discussed in detail the state-of-the-art LiDAR- G. Rigoll, Unleashing hydra: Hybrid fusion, depth consistency and radar
centric, vision-centric, and multi-modal perception solutions for unified 3d perception, arXiv preprint arXiv:2403.07746 (2024).
[15] S. Sze, L. Kunze, Real-time 3d semantic occupancy prediction for au-
and highlighted information fusion techniques for this field. tonomous vehicles using memory-efficient sparse convolution, arXiv
To facilitate further research, detailed performance compar- preprint arXiv:2403.08748 (2024).
isons of existing occupancy methods are provided. Finally, we [16] Y. Xie, J. Tian, X. X. Zhu, Linking points with labels in 3d: A review of
described some open challenges that could inspire future research directions in the coming years. We hope that this survey can benefit the community, support further development in autonomous driving, and help readers who are new to the field navigate it.

Acknowledgment

The research work was conducted in the JC STEM Lab of Machine Learning and Computer Vision funded by The Hong Kong Jockey Club Charities Trust and was partially supported by the Research Grants Council of the Hong Kong SAR, China (Project No. PolyU 15215824).

References

[1] H. Li, C. Sima, J. Dai, W. Wang, L. Lu, H. Wang, J. Zeng, Z. Li, J. Yang, H. Deng, et al., Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[2] Y. Ma, T. Wang, X. Bai, H. Yang, Y. Hou, Y. Wang, Y. Qiao, R. Yang, D. Manocha, X. Zhu, Vision-centric bev perception: A survey, arXiv preprint arXiv:2208.02797 (2022).
[3] Occupancy networks, accessed July 25, 2024. URL https://www.thinkautonomous.ai/blog/occupancy-networks/
[4] Y. Wang, Y. Chen, X. Liao, L. Fan, Z. Zhang, Panoocc: Unified occupancy representation for camera-based 3d panoptic segmentation, arXiv preprint arXiv:2306.10013 (2023).
[5] L. Peng, J. Xu, H. Cheng, Z. Yang, X. Wu, W. Qian, W. Wang, B. Wu, D. Cai, Learning occupancy for monocular 3d object detection, arXiv preprint arXiv:2305.15694 (2023).
[6] Q. Zhou, J. Cao, H. Leng, Y. Yin, Y. Kun, R. Zimmermann, Sogdet: Semantic-occupancy guided multi-view 3d object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 7668–7676.
[7] C. Min, D. Zhao, L. Xiao, J. Zhao, X. Xu, Z. Zhu, L. Jin, J. Li, Y. Guo, J. Xing, et al., Driveworld: 4d pre-trained scene understanding via world models for autonomous driving, arXiv preprint arXiv:2405.04390 (2024).
[8] W. Tong, C. Sima, T. Wang, L. Chen, S. Wu, H. Deng, Y. Gu, L. Lu, P. Luo, D. Lin, et al., Scene as occupancy, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8406–8415.
[16] … point cloud semantic segmentation, IEEE Geoscience and remote sensing magazine 8 (4) (2020) 38–59.
[17] J. Zhang, X. Zhao, Z. Chen, Z. Lu, A review of deep learning-based semantic segmentation for point cloud, IEEE access 7 (2019) 179118–179133.
[18] X. Ma, W. Ouyang, A. Simonelli, E. Ricci, 3d object detection from images for autonomous driving: a survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[19] J. Mao, S. Shi, X. Wang, H. Li, 3d object detection for autonomous driving: A comprehensive survey, International Journal of Computer Vision 131 (8) (2023) 1909–1963.
[20] L. Wang, X. Zhang, Z. Song, J. Bi, G. Zhang, H. Wei, L. Tang, L. Yang, J. Li, C. Jia, et al., Multi-modal 3d object detection in autonomous driving: A survey and taxonomy, IEEE Transactions on Intelligent Vehicles (2023).
[21] D. Fernandes, A. Silva, R. Névoa, C. Simões, D. Gonzalez, M. Guevara, P. Novais, J. Monteiro, P. Melo-Pinto, Point-cloud based 3d object detection and classification methods for self-driving applications: A survey and taxonomy, Information Fusion 68 (2021) 161–191.
[22] L. Roldao, R. De Charette, A. Verroust-Blondet, 3d semantic scene completion: A survey, International Journal of Computer Vision 130 (8) (2022) 1978–2005.
[23] Y. Zhang, J. Zhang, Z. Wang, J. Xu, D. Huang, Vision-based 3d occupancy prediction in autonomous driving: a review and outlook, arXiv preprint arXiv:2405.02595 (2024).
[24] S. Thrun, Probabilistic robotics, Communications of the ACM 45 (3) (2002) 52–57.
[25] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, T. Funkhouser, Semantic scene completion from a single depth image, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1746–1754.
[26] A.-Q. Cao, R. De Charette, Monoscene: Monocular 3d semantic scene completion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3991–4001.
[27] Workshop on autonomous driving at cvpr 2022, accessed July 25, 2024. URL https://cvpr2022.wad.vision/
[28] L. Roldao, R. de Charette, A. Verroust-Blondet, Lmscnet: Lightweight multiscale 3d semantic completion, in: 2020 International Conference on 3D Vision (3DV), IEEE, 2020, pp. 111–119.
[29] X. Yan, J. Gao, J. Li, R. Zhang, Z. Li, R. Huang, S. Cui, Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 3101–3109.
[30] R. Cheng, C. Agia, Y. Ren, X. Li, L. Bingbing, S3cnet: A sparse semantic scene completion network for lidar point clouds, in: Conference on Robot Learning, PMLR, 2021, pp. 2148–2161.
[31] W. Gan, N. Mo, H. Xu, N. Yokoya, A simple attempt for 3d occupancy estimation in autonomous driving, arXiv preprint arXiv:2303.10076 (2023).
[32] Y. Huang, W. Zheng, Y. Zhang, J. Zhou, J. Lu, Tri-perspective view for vision-based 3d semantic occupancy prediction, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 9223–9232.
[33] Y. Li, Z. Yu, C. Choy, C. Xiao, J. M. Alvarez, S. Fidler, C. Feng, A. Anandkumar, Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 9087–9098.
[34] B. Li, Y. Sun, J. Dong, Z. Zhu, J. Liu, X. Jin, W. Zeng, One at a time: Progressive multi-step volumetric probability learning for reliable 3d scene perception, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 3028–3036.
[35] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al., Planning-oriented autonomous driving, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17853–17862.
[36] X. Tian, T. Jiang, L. Yun, Y. Mao, H. Yang, Y. Wang, Y. Wang, H. Zhao, Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving, Advances in Neural Information Processing Systems 36 (2024).
[37] Y. Li, S. Li, X. Liu, M. Gong, K. Li, N. Chen, Z. Wang, Z. Li, T. Jiang, F. Yu, et al., Sscbench: Monocular 3d semantic scene completion benchmark in street views, arXiv preprint arXiv:2306.09001 2 (2023).
[38] C. Zhang, J. Yan, Y. Wei, J. Li, L. Liu, Y. Tang, Y. Duan, J. Lu, Occnerf: Self-supervised multi-camera occupancy prediction with neural radiance fields, arXiv preprint arXiv:2312.09243 (2023).
[39] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[41] Tesla ai day, accessed July 25, 2024. URL https://www.youtube.com/watch?v=j0z4FweCy4M
[42] Y. Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, J. Lu, Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving, arXiv preprint arXiv:2205.09743 (2022).
[43] J. Huang, G. Huang, Z. Zhu, Y. Ye, D. Du, Bevdet: High-performance multi-camera 3d object detection in bird-eye-view, arXiv preprint arXiv:2112.11790 (2021).
[44] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, J. Dai, Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers, in: European conference on computer vision, Springer, 2022, pp. 1–18.
[45] B. Yang, W. Luo, R. Urtasun, Pixor: Real-time 3d object detection from point clouds, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 7652–7660.
[46] B. Yang, M. Liang, R. Urtasun, Hdnet: Exploiting hd maps for 3d object detection, in: Conference on Robot Learning, PMLR, 2018, pp. 146–155.
[47] Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. L. Rus, S. Han, Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation, in: 2023 IEEE international conference on robotics and automation (ICRA), IEEE, 2023, pp. 2774–2781.
[48] T. Liang, H. Xie, K. Yu, Z. Xia, Z. Lin, Y. Wang, T. Tang, B. Wang, Z. Tang, Bevfusion: A simple and robust lidar-camera fusion framework, Advances in Neural Information Processing Systems 35 (2022) 10421–10434.
[49] Y. Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y. Shi, J. Sun, Z. Li, Bevdepth: Acquisition of reliable depth for multi-view 3d object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 2023, pp. 1477–1485.
[50] Y. Jiang, L. Zhang, Z. Miao, X. Zhu, J. Gao, W. Hu, Y.-G. Jiang, Polarformer: Multi-camera 3d object detection with polar transformer, in: Proceedings of the AAAI conference on Artificial Intelligence, Vol. 37, 2023, pp. 1042–1050.
[51] J. Mei, Y. Yang, M. Wang, J. Zhu, X. Zhao, J. Ra, L. Li, Y. Liu, Camera-based 3d semantic scene completion with sparse guidance network, arXiv preprint arXiv:2312.05752 (2023).
[52] J. Yao, J. Zhang, Depthssc: Depth-spatial alignment and dynamic voxel resolution for monocular 3d semantic scene completion, arXiv preprint arXiv:2311.17084 (2023).
[53] R. Miao, W. Liu, M. Chen, Z. Gong, W. Xu, C. Hu, S. Zhou, Occdepth: A depth-aware method for 3d semantic scene completion, arXiv preprint arXiv:2302.13540 (2023).
[54] A. N. Ganesh, Soccdpt: Semi-supervised 3d semantic occupancy from dense prediction transformers trained under memory constraints, arXiv preprint arXiv:2311.11371 (2023).
[55] Y. Zhang, Z. Zhu, D. Du, Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9433–9443.
[56] S. Silva, S. B. Wannigama, R. Ragel, G. Jayatilaka, S2tpvformer: Spatio-temporal tri-perspective view for temporally coherent 3d semantic occupancy prediction, arXiv preprint arXiv:2401.13785 (2024).
[57] M. Firman, O. Mac Aodha, S. Julier, G. J. Brostow, Structured prediction of unobserved voxels from a single depth image, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5431–5440.
[58] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, Y. Zhang, Matterport3d: Learning from rgb-d data in indoor environments, arXiv preprint arXiv:1709.06158 (2017).
[59] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from rgbd images, in: Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, Springer, 2012, pp. 746–760.
[60] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, J. Gall, Semantickitti: A dataset for semantic scene understanding of lidar sequences, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9297–9307.
[61] C. Min, L. Xiao, D. Zhao, Y. Nie, B. Dai, Multi-camera unified pre-training via 3d scene reconstruction, IEEE Robotics and Automation Letters (2024).
[62] C. Lyu, S. Guo, B. Zhou, H. Xiong, H. Zhou, 3dopformer: 3d occupancy perception from multi-camera images with directional and distance enhancement, IEEE Transactions on Intelligent Vehicles (2023).
[63] C. Häne, C. Zach, A. Cohen, M. Pollefeys, Dense semantic 3d reconstruction, IEEE transactions on pattern analysis and machine intelligence 39 (9) (2016) 1730–1743.
[64] X. Chen, J. Sun, Y. Xie, H. Bao, X. Zhou, Neuralrecon: Real-time coherent 3d scene reconstruction from monocular video, IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
[65] X. Tian, R. Liu, Z. Wang, J. Ma, High quality 3d reconstruction based on fusion of polarization imaging and binocular stereo vision, Information Fusion 77 (2022) 19–28.
[66] P. N. Leite, A. M. Pinto, Fusing heterogeneous tri-dimensional information for reconstructing submerged structures in harsh sub-sea environments, Information Fusion 103 (2024) 102126.
[67] J.-D. Durou, M. Falcone, M. Sagona, Numerical methods for shape-from-shading: A new survey with benchmarks, Computer Vision and Image Understanding 109 (1) (2008) 22–43.
[68] J. L. Schonberger, J.-M. Frahm, Structure-from-motion revisited, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4104–4113.
[69] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, R. Ng, Nerf: Representing scenes as neural radiance fields for view synthesis, Communications of the ACM 65 (1) (2021) 99–106.
[70] S. J. Garbin, M. Kowalski, M. Johnson, J. Shotton, J. Valentin, Fastnerf: High-fidelity neural rendering at 200fps, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 14346–14355.
[71] C. Reiser, S. Peng, Y. Liao, A. Geiger, Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 14335–14345.
[72] T. Takikawa, J. Litalien, K. Yin, K. Kreis, C. Loop, D. Nowrouzezahrai, A. Jacobson, M. McGuire, S. Fidler, Neural geometric level of detail: Real-time rendering with implicit 3d shapes, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11358–11367.
[73] B. Kerbl, G. Kopanas, T. Leimkühler, G. Drettakis, 3d gaussian splatting for real-time radiance field rendering, ACM Transactions on Graphics 42 (4) (2023) 1–14.
[74] G. Chen, W. Wang, A survey on 3d gaussian splatting, arXiv preprint arXiv:2401.03890 (2024).
[75] C. B. Rist, D. Emmerichs, M. Enzweiler, D. M. Gavrila, Semantic scene completion using local deep implicit functions on lidar data, IEEE transactions on pattern analysis and machine intelligence 44 (10) (2021) 7205–7218.
[76] Y. Wei, L. Zhao, W. Zheng, Z. Zhu, J. Zhou, J. Lu, Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21729–21740.
[77] J. Yao, C. Li, K. Sun, Y. Cai, H. Li, W. Ouyang, H. Li, Ndc-scene: Boost monocular 3d semantic scene completion in normalized device coordinates space, in: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE Computer Society, 2023, pp. 9421–9431.
[78] X. Liu, M. Gong, Q. Fang, H. Xie, Y. Li, H. Zhao, C. Feng, Lidar-based 4d occupancy completion and forecasting, arXiv preprint arXiv:2310.11239 (2023).
[79] S. Zuo, W. Zheng, Y. Huang, J. Zhou, J. Lu, Pointocc: Cylindrical tri-perspective view for point-based 3d semantic occupancy prediction, arXiv preprint arXiv:2308.16896 (2023).
[80] Z. Yu, C. Shu, J. Deng, K. Lu, Z. Liu, J. Yu, D. Yang, H. Li, Y. Chen, Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin, arXiv preprint arXiv:2311.12058 (2023).
[81] J. Xu, L. Peng, H. Cheng, L. Xia, Q. Zhou, D. Deng, W. Qian, W. Wang, D. Cai, Regulating intermediate 3d features for vision-centric autonomous driving, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 6306–6314.
[82] J. Hou, X. Li, W. Guan, G. Zhang, D. Feng, Y. Du, X. Xue, J. Pu, Fastocc: Accelerating 3d occupancy prediction by fusing the 2d bird’s-eye view and perspective view, arXiv preprint arXiv:2403.02710 (2024).
[83] M. Pan, J. Liu, R. Zhang, P. Huang, X. Li, L. Liu, S. Zhang, Renderocc: Vision-centric 3d occupancy prediction with 2d rendering supervision, arXiv preprint arXiv:2309.09502 (2023).
[84] Y. Zheng, X. Li, P. Li, Y. Zheng, B. Jin, C. Zhong, X. Long, H. Zhao, Q. Zhang, Monoocc: Digging into monocular semantic occupancy prediction, arXiv preprint arXiv:2403.08766 (2024).
[85] Q. Ma, X. Tan, Y. Qu, L. Ma, Z. Zhang, Y. Xie, Cotr: Compact occupancy transformer for vision-based 3d occupancy prediction, arXiv preprint arXiv:2312.01919 (2023).
[86] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, O. Beijbom, nuscenes: A multimodal dataset for autonomous driving, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11621–11631.
[87] Y. Huang, W. Zheng, B. Zhang, J. Zhou, J. Lu, Selfocc: Self-supervised vision-based 3d occupancy prediction, arXiv preprint arXiv:2311.12754 (2023).
[88] H. Jiang, T. Cheng, N. Gao, H. Zhang, W. Liu, X. Wang, Symphonize 3d semantic scene completion with contextual instance queries, arXiv preprint arXiv:2306.15670 (2023).
[89] S. Wang, J. Yu, W. Li, W. Liu, X. Liu, J. Chen, J. Zhu, Not all voxels are equal: Hardness-aware semantic scene completion with self-distillation, arXiv preprint arXiv:2404.11958 (2024).
[90] P. Tang, Z. Wang, G. Wang, J. Zheng, X. Ren, B. Feng, C. Ma, Sparseocc: Rethinking sparse latent representation for vision-based semantic occupancy prediction, arXiv preprint arXiv:2404.09502 (2024).
[91] K. Han, D. Muhle, F. Wimbauer, D. Cremers, Boosting self-supervision for single-view scene completion via knowledge distillation, arXiv preprint arXiv:2404.07933 (2024).
[92] Y. Liao, J. Xie, A. Geiger, Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (3) (2022) 3292–3310.
[93] Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving, accessed July 25, 2024. URL https://github.com/OpenDriveLab/OpenScene
[94] Y. Xue, R. Li, F. Wu, Z. Tang, K. Li, M. Duan, Bi-ssc: Geometric-semantic bidirectional fusion for camera-based 3d semantic scene completion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20124–20134.
[95] L. Zhao, X. Xu, Z. Wang, Y. Zhang, B. Zhang, W. Zheng, D. Du, J. Zhou, J. Lu, Lowrankocc: Tensor decomposition and low-rank recovery for vision-based 3d semantic occupancy prediction, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9806–9815.
[96] A.-Q. Cao, A. Dai, R. de Charette, Pasco: Urban 3d panoptic scene completion with uncertainty awareness, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14554–14564.
[97] L. Kong, Y. Liu, X. Li, R. Chen, W. Zhang, J. Ren, L. Pan, K. Chen, Z. Liu, Robo3d: Towards robust and reliable 3d perception against corruptions, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19994–20006.
[98] B. Li, J. Deng, W. Zhang, Z. Liang, D. Du, X. Jin, W. Zeng, Hierarchical temporal context learning for camera-based semantic scene completion, arXiv preprint arXiv:2407.02077 (2024).
[99] Y. Shi, T. Cheng, Q. Zhang, W. Liu, X. Wang, Occupancy as set of points, arXiv preprint arXiv:2407.04049 (2024).
[100] G. Wang, Z. Wang, P. Tang, J. Zheng, X. Ren, B. Feng, C. Ma, Occgen: Generative multi-modal 3d occupancy prediction for autonomous driving, arXiv preprint arXiv:2404.15014 (2024).
[101] S. Wang, J. Yu, W. Li, H. Shi, K. Yang, J. Chen, J. Zhu, Label-efficient semantic scene completion with scribble annotations, arXiv preprint arXiv:2405.15170 (2024).
[102] Y. Pan, B. Gao, J. Mei, S. Geng, C. Li, H. Zhao, Semanticposs: A point cloud dataset with large quantity of dynamic instances, in: 2020 IEEE Intelligent Vehicles Symposium (IV), IEEE, 2020, pp. 687–693.
[103] J. Pan, Z. Wang, L. Wang, Co-occ: Coupling explicit feature fusion with volume rendering regularization for multi-modal 3d semantic occupancy prediction, arXiv preprint arXiv:2404.04561 (2024).
[104] L. Li, H. P. Shum, T. P. Breckon, Less is more: Reducing task and model complexity for 3d point cloud semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9361–9371.
[105] L. Kong, J. Ren, L. Pan, Z. Liu, Lasermix for semi-supervised lidar semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21705–21715.
[106] P. Tang, H.-M. Xu, C. Ma, Prototransfer: Cross-modal prototype transfer for point cloud segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3337–3347.
[107] C. Min, L. Xiao, D. Zhao, Y. Nie, B. Dai, Occupancy-mae: Self-supervised pre-training large-scale lidar point clouds with masked occupancy autoencoders, IEEE Transactions on Intelligent Vehicles (2023).
[108] Z. Wang, Z. Ye, H. Wu, J. Chen, L. Yi, Semantic complete scene forecasting from a 4d dynamic point cloud sequence, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 5867–5875.
[109] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, O. Beijbom, Pointpillars: Fast encoders for object detection from point clouds, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12697–12705.
[110] Y. Zhou, O. Tuzel, Voxelnet: End-to-end learning for point cloud based 3d object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4490–4499.
[111] Z. Tan, Z. Dong, C. Zhang, W. Zhang, H. Ji, H. Li, Ovo: Open-vocabulary occupancy, arXiv preprint arXiv:2305.16133 (2023).
[112] Y. Shi, J. Li, K. Jiang, K. Wang, Y. Wang, M. Yang, D. Yang, Panossc: Exploring monocular panoptic 3d scene reconstruction for autonomous driving, in: 2024 International Conference on 3D Vision (3DV), IEEE, 2024, pp. 1219–1228.
[113] Y. Lu, X. Zhu, T. Wang, Y. Ma, Octreeocc: Efficient and multi-granularity occupancy prediction using octree queries, arXiv preprint arXiv:2312.03774 (2023).
[114] B. Li, Y. Sun, Z. Liang, D. Du, Z. Zhang, X. Wang, Y. Wang, X. Jin, W. Zeng, Bridging stereo geometry and bev representation with reliable mutual interaction for semantic scene completion (2024).
[115] X. Tan, W. Wu, Z. Zhang, C. Fan, Y. Peng, Z. Zhang, Y. Xie, L. Ma, Geocc: Geometrically enhanced 3d occupancy network with implicit-explicit depth fusion and contextual self-supervision, arXiv preprint arXiv:2405.10591 (2024).
[116] S. Boeder, F. Gigengack, B. Risse, Occflownet: Towards self-supervised occupancy estimation via differentiable rendering and occupancy flow, arXiv preprint arXiv:2402.12792 (2024).
[117] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[118] H. Liu, H. Wang, Y. Chen, Z. Yang, J. Zeng, L. Chen, L. Wang, Fully sparse 3d occupancy prediction, arXiv preprint arXiv:2312.17118 (2024).
[119] Z. Ming, J. S. Berrio, M. Shan, S. Worrall, Inversematrixvt3d: An efficient projection matrix-based approach for 3d occupancy prediction, arXiv preprint arXiv:2401.12422 (2024).
[120] J. Li, X. He, C. Zhou, X. Cheng, Y. Wen, D. Zhang, Viewformer: Exploring spatiotemporal modeling for multi-view 3d occupancy perception via view-guided transformers, arXiv preprint arXiv:2405.04299 (2024).
[121] D. Scaramuzza, F. Fraundorfer, Visual odometry [tutorial], IEEE robotics & automation magazine 18 (4) (2011) 80–92.
[122] J. Philion, S. Fidler, Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, Springer, 2020, pp. 194–210.
[123] Z. Xia, X. Pan, S. Song, L. E. Li, G. Huang, Vision transformer with deformable attention, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 4794–4803.
[124] J. Park, C. Xu, S. Yang, K. Keutzer, K. M. Kitani, M. Tomizuka, W. Zhan, Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection, in: The Eleventh International Conference on Learning Representations, 2022.
[125] Y. Liu, J. Yan, F. Jia, S. Li, A. Gao, T. Wang, X. Zhang, Petrv2: A unified framework for 3d perception from multi-camera images, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3262–3272.
[126] H. Liu, Y. Teng, T. Lu, H. Wang, L. Wang, Sparsebev: High-performance sparse 3d object detection from multi-camera videos, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 18580–18590.
[127] B. Cheng, A. G. Schwing, A. Kirillov, Per-pixel classification is not all you need for semantic segmentation, 2021.
[128] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, R. Girdhar, Masked-attention mask transformer for universal image segmentation, 2022.
[129] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, Advances in neural information processing systems 33 (2020) 6840–6851.
[130] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE transactions on information theory 13 (1) (1967) 21–27.
[131] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[132] D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multi-scale deep network, Advances in neural information processing systems 27 (2014).
[133] Y. Huang, Z. Tang, D. Chen, K. Su, C. Chen, Batching soft iou for training semantic segmentation networks, IEEE Signal Processing Letters 27 (2019) 66–70.
[134] J. Chen, W. Lu, Y. Li, L. Shen, J. Duan, Adversarial learning of object-aware activation map for weakly-supervised semantic segmentation, IEEE Transactions on Circuits and Systems for Video Technology 33 (8) (2023) 3935–3946.
[135] J. Chen, X. Zhao, C. Luo, L. Shen, Semformer: semantic guided activation transformer for weakly supervised semantic segmentation, arXiv preprint arXiv:2210.14618 (2022).
[136] Y. Wu, J. Liu, M. Gong, Q. Miao, W. Ma, C. Xu, Joint semantic segmentation using representations of lidar point clouds and camera images, Information Fusion 108 (2024) 102370.
[137] Q. Yan, S. Li, Z. He, X. Zhou, M. Hu, C. Liu, Q. Chen, Decoupling semantic and localization for semantic segmentation via magnitude-aware and phase-sensitive learning, Information Fusion (2024) 102314.
[138] M. Berman, A. R. Triki, M. B. Blaschko, The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4413–4421.
[139] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[140] J. Li, Y. Liu, X. Yuan, C. Zhao, R. Siegwart, I. Reid, C. Cadena, Depth based semantic scene completion with position importance aware loss, IEEE Robotics and Automation Letters 5 (1) (2019) 219–226.
[141] S. Kullback, R. A. Leibler, On information and sufficiency, The annals of mathematical statistics 22 (1) (1951) 79–86.
[142] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? the kitti vision benchmark suite, in: 2012 IEEE conference on computer vision and pattern recognition, IEEE, 2012, pp. 3354–3361.
[143] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., Scalability in perception for autonomous driving: Waymo open dataset, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454.
[144] J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, L. Chen, A. Jain, S. Omari, V. Iglovikov, P. Ondruska, One thousand and one hours: Self-driving motion prediction dataset, in: Conference on Robot Learning, PMLR, 2021, pp. 409–418.
[145] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al., Argoverse: 3d tracking and forecasting with rich maps, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8748–8757.
[146] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, R. Yang, The apolloscape dataset for autonomous driving, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 954–960.
[147] H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. Wolff, A. Lang, L. Fletcher, O. Beijbom, S. Omari, nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles, arXiv preprint arXiv:2106.11810 (2021).
[148] R. Garg, V. K. Bg, G. Carneiro, I. Reid, Unsupervised cnn for single view depth estimation: Geometry to the rescue, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, Springer, 2016, pp. 740–756.
[149] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE transactions on image processing 13 (4) (2004) 600–612.
[150] T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan, et al., Grounded sam: Assembling open-world models for diverse visual tasks, arXiv preprint arXiv:2401.14159 (2024).
[151] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
[152] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, et al., Grounding dino: Marrying dino with grounded pre-training for open-set object detection, arXiv preprint arXiv:2303.05499 (2023).
[153] Z. Li, Z. Yu, D. Austin, M. Fang, S. Lan, J. Kautz, J. M. Alvarez, Fb-occ: 3d occupancy prediction based on forward-backward view transformation, arXiv preprint arXiv:2307.01492 (2023).
[154] Y. Huang, W. Zheng, Y. Zhang, J. Zhou, J. Lu, Gaussianformer: Scene as gaussians for vision-based 3d semantic occupancy prediction, arXiv preprint arXiv:2405.17429 (2024).
[155] H. Shi, S. Wang, J. Zhang, X. Yin, Z. Wang, Z. Zhao, G. Wang, J. Zhu, K. Yang, K. Wang, Occfiner: Offboard occupancy refinement with hybrid propagation, arXiv preprint arXiv:2403.08504 (2024).
[156] Z. Yu, R. Zhang, J. Ying, J. Yu, X. Hu, L. Luo, S. Cao, H. Shen, Context and geometry aware voxel transformer for semantic scene completion, arXiv preprint arXiv:2405.13675 (2024).
[157] Cvpr 2023 3d occupancy prediction challenge, accessed July 25, 2024. URL https://opendrivelab.com/challenge2023/
[158] Cvpr 2024 autonomous grand challenge, accessed July 25, 2024. URL https://opendrivelab.com/challenge2024/#occupancy_and_flow
[159] H. Vanholder, Efficient inference with tensorrt, in: GPU Technology Conference, Vol. 1, 2016.
[160] Y. Liu, L. Mou, X. Yu, C. Han, S. Mao, R. Xiong, Y. Wang, Let occ flow: Self-supervised 3d occupancy flow prediction, arXiv preprint arXiv:2407.07587 (2024).
[161] Y. LeCun, A path towards autonomous machine intelligence version 0.9.2, 2022-06-27, Open Review 62 (1) (2022) 1–62.
[162] W. Zheng, W. Chen, Y. Huang, B. Zhang, Y. Duan, J. Lu, Occworld: Learning a 3d occupancy world model for autonomous driving, arXiv preprint arXiv:2311.16038 (2023).
[163] L. Wang, W. Zheng, Y. Ren, H. Jiang, Z. Cui, H. Yu, J. Lu, Occsora: 4d occupancy generation models as world simulators for autonomous driving, arXiv preprint arXiv:2405.20337 (2024).
[164] B. Agro, Q. Sykora, S. Casas, T. Gilles, R. Urtasun, Uno: Unsupervised occupancy fields for perception and forecasting, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14487–14496.
[165] T. Khurana, P. Hu, D. Held, D. Ramanan, Point cloud forecasting as a proxy for 4d occupancy forecasting, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1116–1124.
[166] H. Xu, H. Liu, S. Meng, Y. Sun, A novel place recognition network using visual sequences and lidar point clouds for autonomous vehicles, in: 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), IEEE, 2023, pp. 2862–2867.
[167] H. Xu, H. Liu, S. Huang, Y. Sun, C2l-pr: Cross-modal camera-to-lidar place recognition via modality alignment and orientation voting, IEEE Transactions on Intelligent Vehicles (2024).
[168] D. K. Jain, X. Zhao, G. González-Almagro, C. Gan, K. Kotecha, Multi-modal pedestrian detection using metaheuristics with deep convolutional neural network in crowded scenes, Information Fusion 95 (2023) 401–414.
[169] D. Guan, Y. Cao, J. Yang, Y. Cao, M. Y. Yang, Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection, Information Fusion 50 (2019) 148–157.
[170] T. Wang, S. Kim, J. Wenxuan, E. Xie, C. Ge, J. Chen, Z. Li, P. Luo, Deepaccident: A motion and accident prediction benchmark for v2x autonomous driving, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 5599–5606.
[171] Z. Zou, X. Zhang, H. Liu, Z. Li, A. Hussain, J. Li, A novel multimodal fusion network based on a joint-coding model for lane line segmentation, Information Fusion 80 (2022) 167–178.
[172] Y. Xu, X. Yang, L. Gong, H.-C. Lin, T.-Y. Wu, Y. Li, N. Vasconcelos, Explainable object-induced action decision for autonomous vehicles, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9523–9532.
[173] Y. Zhuang, X. Sun, Y. Li, J. Huai, L. Hua, X. Yang, X. Cao, P. Zhang, Y. Cao, L. Qi, et al., Multi-sensor integrated navigation/positioning systems using data fusion: From analytics-based to learning-based approaches, Information Fusion 95 (2023) 62–90.
[174] S. Li, X. Li, H. Wang, Y. Zhou, Z. Shen, Multi-gnss ppp/ins/vision/lidar tightly integrated system for precise navigation in urban environments, Information Fusion 90 (2023) 218–232.
[175] Z. Zhang, Z. Xu, W. Yang, Q. Liao, J.-H. Xue, Bdc-occ: Binarized deep convolution unit for binarized occupancy network, arXiv preprint arXiv:2405.17037 (2024).
[176] Z. Huang, S. Sun, J. Zhao, L. Mao, Multi-modal policy fusion for end-to-end autonomous driving, Information Fusion 98 (2023) 101834.
[177] Icra 2024 robodrive challenge, accessed July 25, 2024. URL https://robodrive-24.github.io/#tracks
[178] S. Xie, L. Kong, W. Zhang, J. Ren, L. Pan, K. Chen, Z. Liu, Robobev: Towards robust bird’s eye view perception under corruptions, arXiv preprint arXiv:2304.06719 (2023).
[179] S. Chen, Y. Ma, Y. Qiao, Y. Wang, M-bev: Masked bev perception for robust autonomous driving, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 1183–1191.
[180] S. Chen, T. Cheng, X. Wang, W. Meng, Q. Zhang, W. Liu, Efficient and robust 2d-to-bev representation learning via geometry-guided kernel transformer, arXiv preprint arXiv:2206.04584 (2022).
[181] Y. Kim, J. Shin, S. Kim, I.-J. Lee, J. W. Choi, D. Kum, Crn: Camera radar net for accurate, robust, efficient 3d perception, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17615–17626.
[182] F. Ding, X. Wen, Y. Zhu, Y. Li, C. X. Lu, Radarocc: Robust 3d occupancy prediction with 4d imaging radar, arXiv preprint arXiv:2405.14014 (2024).
[183] J. Kälble, S. Wirges, M. Tatarchenko, E. Ilg, Accurate training data for occupancy map prediction in automated driving using evidence theory, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5281–5290.
[184] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, Journal of Machine Learning Research 25 (70) (2024) 1–53.
[185] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al., Judging llm-as-a-judge with mt-bench and chatbot arena, Advances in Neural Information Processing Systems 36 (2024).
[186] Openai chatgpt, accessed July 25, 2024. URL https://www.openai.com/
[187] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models (2023), arXiv preprint arXiv:2302.13971 (2023).
[188] D. Zhu, J. Chen, X. Shen, X. Li, M. Elhoseiny, Minigpt-4: Enhancing vision-language understanding with advanced large language models, arXiv preprint arXiv:2304.10592 (2023).
[189] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, Advances in neural information processing systems 36 (2024).
[190] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[191] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, S. Hoi, Instructblip: Towards general-purpose vision-language models with instruction tuning, Advances in Neural Information Processing Systems 36 (2024).
[192] C. Zhou, C. C. Loy, B. Dai, Extract free dense labels from clip, in: European Conference on Computer Vision, Springer, 2022, pp. 696–712.