0% found this document useful (0 votes)
32 views

Deep Learning For Image and Point Cloud Fusion in Autonomous Driving: A Review

This document reviews deep learning methods for fusing image and point cloud data from cameras and LiDAR sensors in autonomous vehicles. It begins with an introduction describing the complementary characteristics and challenges of image and point cloud data. It then provides a brief overview of deep learning techniques for each data type individually. The majority of the document involves an in-depth review of camera-LiDAR fusion methods for tasks like depth completion, object detection, semantic segmentation, tracking, and calibration. It compares methods on public datasets and identifies gaps between research and applications to suggest promising future directions.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
32 views

Deep Learning For Image and Point Cloud Fusion in Autonomous Driving: A Review

This document reviews deep learning methods for fusing image and point cloud data from cameras and LiDAR sensors in autonomous vehicles. It begins with an introduction describing the complementary characteristics and challenges of image and point cloud data. It then provides a brief overview of deep learning techniques for each data type individually. The majority of the document involves an in-depth review of camera-LiDAR fusion methods for tasks like depth completion, object detection, semantic segmentation, tracking, and calibration. It compares methods on public datasets and identifies gaps between research and applications to suggest promising future directions.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 19

JOURNAL OF LATEX CLASS FILES 1

Deep Learning for Image and Point Cloud Fusion


in Autonomous Driving: A Review
Yaodong Cui, Student Member, IEEE, Ren Chen, Wenbo Chu, Long Chen, Senior Member, IEEE, Daxin Tian,
Senior Member, IEEE, Ying Li, Dongpu Cao, Member, IEEE

Abstract—Autonomous vehicles were experiencing rapid devel-


opment in the past few years. However, achieving full autonomy
is not a trivial task, due to the nature of the complex and
arXiv:2004.05224v2 [cs.CV] 9 Sep 2020

dynamic driving environment. Therefore, autonomous vehicles


are equipped with a suite of different sensors to ensure robust,
accurate environmental perception. In particular, the camera-
LiDAR fusion is becoming an emerging research theme. However,
so far there has been no critical review that focuses on deep-
learning-based camera-LiDAR fusion methods. To bridge this
gap and motivate future research, this paper devotes to review
recent deep-learning-based data fusion approaches that leverage
both image and point cloud. This review gives a brief overview of
deep learning on image and point cloud data processing. Followed
by in-depth reviews of camera-LiDAR fusion methods in depth Fig. 1. A comparison between image data and point cloud data.
completion, object detection, semantic segmentation, tracking
and online cross-sensor calibration, which are organized based
on their respective fusion levels. Furthermore, we compare these
real-time. To this end, sensor fusion, which leverages mul-
methods on publicly available datasets. Finally, we identified gaps
and over-looked challenges between current academic researches tiple types of sensors with complementary characteristics to
and real-world applications. Based on these observations, we enhance perception and reduce cost, has become an emerging
provide our insights and point out promising research directions. research theme.
In particular, recent deep learning advances have signif-
Index Terms—camera-LiDAR fusion, sensor fusion, depth icantly improved the performance of camera-LiDAR fusion
completion, object detection, semantic segmentation, tracking, algorithms. Cameras and LiDARs have complementary char-
deep learning, acteristics, which make camera-LiDAR fusion models more
effective and popular compared with other sensor fusion
configurations (radar-camera, LiDAR-radar, etc.,). To be more
I. I NTRODUCTION
specific, vision-based perception systems achieve satisfactory
Recent breakthroughs in deep learning and sensor technolo- performance at low-cost, often outperforming human experts
gies have motivated rapid development of autonomous driving [4] [5]. However, a mono-camera perception system cannot
technology, which could potentially improve road safety, traffic provide reliable 3D geometry, which is essential for au-
efficiency and personal mobility [1] [2] [3]. However, technical tonomous driving [6] [7]. On the other hand, stereo cameras
challenges and the cost of exteroceptive sensors have con- can provide 3D geometry, but do so at high computational cost
strained current applications of autonomous driving systems and struggle in high-occlusion and textureless environments
to confined and controlled environments in small quantities. [8] [9] [10]. Furthermore, camera base perception systems
One critical challenge is to obtain an adequately accurate struggle with complex or poor lighting conditions, which limit
understanding of the vehicles 3D surrounding environment in their all-weather capabilities [11]. Contrarily, LiDAR can pro-
vide high-precision 3D geometry and is invariant to ambient
Y. Cui, R. Chen, Y. Li, D. Cao are with the Waterloo CogDrive light. However, mobile LiDARs are limited by low-resolution
Lab, Department of Mechanical and Mechatronics Engineering, Univer-
sity of Waterloo, 200 University Ave West, Waterloo ON, N2L3G1
(ranging from 16 to 128 channels), low-refresh rates (10Hz),
Canada. e-mail: [email protected], [email protected], severe weather conditions (heavy rain, fog and snow) and
[email protected], [email protected] (Corresponding author: D. high cost. To mitigate these challenges, many works combined
Cao.)
W. Chu is with the China Intelligent and Connected Vehicles (Beijing) Re-
these two complementary sensors and demonstrated significant
search Institute Co., Ltd., Beijing, 100176, China. e-mail: chuwenbo@china- performance advantages than a-modal approaches. Therefore,
icv.cn. this paper focuses on reviewing current deep learning fusion
L. Chen is with School of Data and Computer Science, Sun Yat-sen
University, Zhuhai 519082, China, and also with Waytous Inc., Qingdao
strategies for camera-LiDAR fusion.
266109, China. e-mail: [email protected] Camera-LiDAR fusion is not a trivial task. First of all,
D. Tian is with Beijing Advanced Innovation Center for Big Data and cameras record the real-world by projecting it to the image
Brain Computing, Beijing Key Laboratory for Cooperative Vehicle Infras-
tructure Systems and Safety Control, School of Transportation Science and plane, whereas the point cloud preserves the 3D geometry.
Engineering, Beihang University, Beijing 100191, China. Furthermore, in terms of data structure, the point cloud is
JOURNAL OF LATEX CLASS FILES 2

Fig. 2. Tasks related to image and point cloud fusion based perception and their corresponding sections.

irregular, orderless and continuous, while the image is regular, II. A B RIEF R EVIEW OF D EEP L EARNING
ordered and discrete. These characteristics differences between
A. Deep Learning on Image
the point cloud and the image lead to different feature extrac-
tion methodologies. In Figure 1, a comparison between the Convolutional Neural Networks (CNNs) are one of the
characteristics of image and point are shown. most efficient and powerful deep learning models for image
processing and understanding. Compared to the Multi-Layer-
Previous reviews [12] [13] on deep learning methods for Perceptron (MLP), the CNN is shift-invariant, contains fewer
multi-modal data fusion covered a broad range of sensors, in- weights and exploits hierarchical patterns, making it highly
cluding radars, cameras, LiDARs, Ultrasonics, IMU, Odome- efficient for image semantic extraction. Hidden layers of a
ters, the GNSS and HD Maps. This paper focuses on camera- CNN consist of a hierarchy of convolutional layers, batch nor-
LiDAR fusion only and therefore is able to present more malization layers, activation layers and pooling layers, which
detailed reviews on individual methods. Furthermore, we cover are trained end-to-end. This hierarchical structure extracts
a broader range of perception related topics (depth completion, image features with increasing abstract levels and receptive
dynamic and stationary object detection, semantic segmen- elds, enabling the learning of high-level semantics.
tation, tracking and online cross-sensor calibration) that are
interconnected and are not fully included in the previous
reviews [13]. The contribution of this paper is summarized B. Deep Learning on Point Cloud
as the following: The point cloud is a set of data points, which are LiDAR’s
measurements of the detected object’s surface. In terms of data
structure, the point cloud is sparse, irregular, orderless and
• To the best of our knowledge, this paper is the first survey
continuous. Point cloud encodes information in 3D structures
focusing on deep learning based image and point cloud
and in per-point features (reflective intensities, color, normal,
fusion approaches in autonomous driving, including depth
etc.,), which is invariant to scale, rigid transformation and
completion, dynamic and stationary object detection, se-
permutation. These characteristics made feature extractions on
mantic segmentation, tracking and online cross-sensor
the point cloud challenging for existing deep learning models,
calibration.
which require the modifications of existing models or develop-
• This paper organizes and reviews methods based on their
ing new models. Therefore, this section focuses on introducing
fusion methodologies. Furthermore, this paper presented
common methodologies for point cloud processing.
the most up-to-date (2014-2020) overviews and perfor-
mance comparisons of the state-of-the-art camera-LiDAR 1) Volumetric representation based: The volumetric rep-
fusion methods. resentation partitions the point cloud into a fixed-resolution
• This paper raises overlooked open questions, such as 3D grid, where features of each grid/voxel are hand-crafted
open-set detection and sensor-agnostic framework, that or learned. This representation is compatible with standard
are critical for the real-world deployment of autonomous 3D convolution [14] [15] [16]. Several techniques have been
driving technology. Moreover, summaries of trends and proposed in [17] to reduce over-fittings, orientation sensitivity
possible research directions on open challenges are pre- and capture internal structures of objects. However, the volu-
sented. metric representation loses spatial resolution and fine-grained
3D geometry during voxelization which limits its performance.
Furthermore, attempts to increase its spatial resolution (denser
This paper first provides a brief overview of deep learning voxels) would cause computation and memory footprint to
methods on image and point cloud data in Section II. In grow cubically, making it unscalable.
Sections III to VIII, reviews on camera-LiDAR based depth 2) Index/Tree representation based: To alleviate constraints
completion, dynamic object detection, stationary object detec- between high spatial resolution and computational costs,
tion, semantic segmentation, object tracking and online sensor adapted-resolution partition methods that leverage tree-like
calibration are presented respectively. Trends, open challenges data structures, such as kd-tree [18] [19], octree [20] [21]
and promising directions are discussed in Section VII. Finally, [22] are proposed. By dividing the point cloud into a series
a summary is given in Section VIII. Figure 2 presents the of unbalanced trees, regions can be partitioned based on their
overall structures of this survey and the corresponding topics. point densities. This allows regions with lower point densities
JOURNAL OF LATEX CLASS FILES 3

to have lower resolutions, which reduce unnecessary computa-


tion and memory footprint. Point features are extracted along
with the pre-built tree structure.
3) 2D views representation based: 2D views/multi-views
are generated by projecting the point cloud to multiple 2D
view planes. These rendered multi-view images can be pro-
cessed by standard 2D convolutions and features from these
views are aggregated via view-pooling layers [23]. Thus, Fig. 3. Timeline of depth completion models and their corresponding fusion
levels.
the permutation-invariant problem is solved by transforming
point cloud to images and the translation-invariant is achieved
by aggregating features from different views. Qi et al. [17] Unlike PointNet-based methods, the spatial relationship be-
combined the volumetric representation with multi-views gen- tween points is explicitly modelled in point-wise convolutions.
erated via sphere rendering. Unfortunately, 2D views methods Point-wise convolutions aim to generalize the standard 2D
lose the 3D geometry information during the view rendering discrete convolution to the continuous 3D space. The main
and struggle with per-point label prediction [19]. challenge is to replace the discrete weight filter in standard
4) Graph representation based: Point clouds can be repre- convolution with a continuous weight function. This continu-
sented as graphs and convolution-like operation can be imple- ous weight function is approximated using MLPs in PointConv
mented on graphs in the spatial or spectral domain [24] [25] [40] and correlation functions in KPConv [38] and PCNN [33].
[26]. For graph-convolution in the spatial domain, operations More specifically, the PCNN [33] defines convolution kernels
are carried out by MLPs on spatially neighbouring points. as 3D points with weights. A Gaussian correlation function
Spectral-domain graph-convolutions extend convolutions as that takes the coordinates of the kernel point and input point
spectral filtering on graphs through the Laplacian Spectrum is used to calculate the weighting matrix at any given 3D
[27] [28] [29]. coordinates. The KPConv [38] follows this idea but instead
5) Point representation based: Point representation based uses a linear correlation function. Furthermore, KPConvs [38]
methods consume the point cloud without transforming it are applied on local point patches hierarchically, which are
into an intermediate data representation. Early works in this similar to the concepts of standard CNNs. This general point-
direction employ shared Multi-Layer Perceptrons (MLPs) to wise convolution F at an input point x ∈ R3 in 3D continuous
process point cloud [30] [31] , while recent works concentrated space is defined as:
on defining specialized convolution operations for points [32] X
(F ∗ h)(x) = h (xi − x) fi (2)
[33] [34] [35] [36] [37] [38].
xi ∈Nx
One of the pioneering works of direct learning on point
clouds is the PointNet [31] [30], which employs an indepen- where h is the per-point kernel function which calculates the
dent T-Net module to align point clouds and shared MLPs weighting matrix given the coordinates of input points and
to process individual points for per-point feature extraction. kernel points. xi and fi are the ith neighboring points of x
The computation complexity of the PointNet increases linearly and their corresponding features (intensity, color etc.,). Nx
with the number of inputs, making it more scalable com- are all the neighboring points of the input point x, which are
pared with volumetric based methods. To achieve permutation determined using KNN or radius neighborhoods.
invariant, point-wise features are extracted by shared MLPs
which are identical for all points. These features are aggregated III. D EPTH C OMPLETION
by symmetric operations (i.e. max-pooling), which are also Depth Completion aims to up-sample sparse irregular depth
permutation invariant. The feature extraction process of the to dense regular depth, which facilitates the down-stream
PointNet is defined as: perception module. Depth completion can reduce the drastic
uneven distributions of points in a LiDAR scan. For instance,
g ({x1 , . . . , xn }) ≈ fsym (h (x1 ) , . . . , h (xn )) (1) far-away objects represented by a hand full of points are
up-sampled to match their closer counterparts. To achieve
where x represents input points, h represents the per-point
this, high-resolution images are often employed to guide the
feature extraction function (i.e. shared MLPs), fsym represents
3D depth up-sampling. The depth completion task can be
a symmetric function (i.e. max-polling) and g is a general
represented as:
function that we want to approximate.
However, the PointNet fails to extract local inter-point w∗ = arg min L(f (x; w), G) (3)
geometry at different levels. To mitigate this challenge, Qi w

et al. [30] extended the PointNet to extract features from where the network f (.) parametrized by w, predicts the ground
different levels by grouping points into multiple sets and apply truth G, given the input x. The loss function is represented as
PointNets locally. To reduce the computational and memory L(·, ·).
cost of the PointNet++ [30], the RandLA-Net [39] stacked Figure 3 gives the timeline of depth completion models
the random point sampling modules and attention-based local and their corresponding fusion levels. The comparative results
feature aggregation modules hierarchically to progressively of depth completion models on the KITTI depth completion
increase receptive field while maintaining high efficiency. benchmark [41] are listed in Table I.
JOURNAL OF LATEX CLASS FILES 4

TABLE I
C OMPARATIVE RESULTS ON KITTI DEPTH COMPLETION BENCHMARK [41]. ’LL’,’S’ , ’SS’,’U’ STAND FOR LEARNING SCHEME , SUPERVISED ,
SELF - SUPERVISED AND UNSUPERVISED RESPECTIVELY. ’RGB’,’ S D’,’C O ’, ’S’, STAND FOR RGB IMAGE DATA , SPARSE DEPTH , CONFIDENCE AND
STEREO DISPARITY. ’ I RMSE’,’ I MAE’, ’RMSE’,’MAE’ STAND FOR ROOT MEAN SQUARED ERROR OF THE INVERSE DEPTH [1/ KM ], MEAN ABSOLUTE
ERROR OF THE INVERSE DEPTH [1/ KM ], ROOT MEAN SQUARED ERROR [ MM ], MEAN ABSOLUTE ERROR [ MM ].

Fusion Run Model


Methods LL Input Models Hardware iRMSE iMAE RMSE MAE
Level Time Size
S RGB+sD Sparse2Dense [42] Tesla P100 0.08s 12M 4.07 1.57 1299.85 350.32
Signal SS RGB+sD Sparse2Dense++ [43] Tesla V100 0.08s 22M 2.80 1.21 814.73 249.95
Level S RGB+sD CSPN [44] Titan X 1s 25M 2.93 1.15 1019.64 279.46
Mono-
S RGB+sD CSPN++ [45] Tesla P40 0.2s - 2.07 0.90 743.69 209.28
LiDAR
S RGB+sD Spade-RGBsD [46] N/A - - 2.17 0.95 917.64 234.81
Fusion Feature
S RGB+sD+Co NConv-CNN [47] Tesla V100 0.02s 0.5M 2.60 1.03 829.98 233.26
Level
S RGB+sD GuideNet [48] GTX 1080Ti 0.14s - 2.25 0.99 736.24 218.83
Multi U RGB+sD RGBguide [49] Tesla V100 0.02s 0.35M 2.19 0.93 772.87 215.02
S sD+S HDE-Net [9] Titan X 0.05s 4.2M - - - -
Stereo- Feature
U sD+S LidarStereoNet [50] Titan X - - - - - -
LiDAR Level
SS sD+S LiStereo [51] Titan X - - 2.19 1.10 832.16 283.91

A. Mono Camera and LiDAR fusion in parallel and combine them in a shared decoder. Furthermore,
normalized convolutions are used to process highly sparse
The idea behind image-guided depth completion is that
depth and to propagate confidence. Valada et al. [55] extended
dense RGB/color information contains relevant 3D geometry.
one-stage feature-level fusion to multiple-stages of varying
Therefore, images can be leveraged as a reference for depth
depth of the network. Similarly, GuideNet [48] fuse image
up-sampling.
features to sparse depth features at different stages of the
1) Signal-level fusion: In 2018, Ma et al. [42] presented a
encoder to guide the up-sampling of sparse depths, which
ResNet [52] based autoencoder network that leverages RGB-
achieves top performance in the KITTI depth completion
D images (i.e. images concatenated with sparse depth maps)
benchmark. The constrain of these approaches is the lack of
to predict dense depth maps. However, this method requires
large-scale datasets that have dense depth ground truth.
pixel-level depth ground truth, which is difficult to obtain. To
3) Multi-level fusion: Van Gansbeke et al. [49] further com-
solve this issue, Ma et al. [43] presented a model-based self-
bines signal-level fusion and feature-level fusion in an image-
supervised framework that only requires a sequence of images
guided depth completion network. The network consists of a
and sparse depth images for training. This self-supervision is
global and a local branch to process RGB-D data and depth
achieved by employing sparse depth constrain, photometric
data in parallel before fusing them based on the confidence
loss and smoothness loss. However, this approach assumes
maps.
objects to be stationary. Furthermore, the resulting depth
output is blurry and input depth may not be preserved.
To generate a sharp dense depth map in real-time, Cheng B. Stereo Cameras and LiDAR fusion
et al. [44] fed RGB-D images to a convolutional spatial Compared with the RGB image, dense depth disparity from
propagation network (CSPN). This CSPN aims to extract the stereo cameras contains richer ground truth 3D geometry.
image-dependent affinity matrix directly, producing signifi- On the other hand, LiDAR depth is sparse but of higher
cantly better results in key measurements with lesser run-time. accuracy. These complementary characteristics enable stereo-
In CSPN++, Cheng et al. [45] proposed to dynamically select LiDAR fusion based depth completion models to produce
convolutional kernel sizes and iterations to reduce computa- a more accurate dense depth. However, it is worth noting
tion. Furthermore, CSPN++ employs weighted assembling to that stereo cameras have limited range and struggles in high-
boost its performance. occlusion, texture-less environments, making them less ideal
2) Feature-level fusion: Jaritz et al. [46] presented an for autonomous driving.
autoencoder network that can either perform depth comple- 1) Feature-level fusion: One of the pioneering works is
tion or semantic segmentation from sparse depth maps and from Park et al. [9], in which high-precision dense disparity
images without applying validity masks. Images and sparse map is computed from dense stereo disparity and point cloud
depth maps are first processed by two parallel NASNet-based using a two-stage CNN. The first stage of the CNN takes
encoders [53] before fusing them into the shared decoder. This LiDAR and stereo disparity to produce a fused disparity. In
approach can achieve decent performance with very sparse the second stage, this fused disparity and left RGB image is
depth inputs (8-channel LiDAR). Wang et al. [54] designed fused in the feature space to predict the final high-precision
an integrable module (PnP) that leverages the sparse depth disparity. Finally, the 3D scene is reconstructed from this high-
map to improve the performance of existing image-based precision disparity. The bottleneck of this approach is the
depth prediction networks. This PnP module leverages gradient lack of large-scale annotated stereo-LiDAR datasets. The Li-
calculated from sparse depth to update the intermediate feature darStereoNet [50] averted this difficulty with an unsupervised
map produced by the existing depth prediction network. El- learning scheme, which employs image warping/photometric
desokey et al. [47] presented a framework for unguided depth loss, sparse depth loss, smoothness loss and plane fitting
completion that processes images and very sparse depth maps loss for end-to-end training. Furthermore, the introduction of
JOURNAL OF LATEX CLASS FILES 5

Fig. 4. Timeline of 3D object detection networks and their corresponding fusion levels.

’feedback loop’ makes the LidarStereoNet robust against noisy One of the early works of result-level fusion is the F-
point cloud and sensor misalignment. Similarly, Zhang et al. PointNets [57], where 2D bounding boxes are first generated
[51] presented a self-supervised scheme for depth completion. from images and projected to the 3D space. The resulting
The loss function consists of sparse depth, photometric and projected frustum proposals are fed into a PointNet[31] based
smoothness loss. detector for 3D object detection. Du et al. [58] extended the 2D
to 3D proposals generation stage with an additional proposal
IV. DYNAMIC O BJECT D ETECTION refinement stage, which further reduces unnecessary compu-
tation on the background point. During this refinement stage,
Object detection (3D) aims to locate, classify and estimate a model fitting based method is used to filter out background
oriented bounding boxes in the 3D space. This section is points inside the seed region. Finally, the filtered points are fed
devoted to dynamic object detection, which includes common into the bbox regression network. The RoarNet [59] followed
dynamic road objects (car, pedestrian, cyclist, etc.,). There are a similar idea, but instead employs a neural network for the
two main approaches for object detection: sequential and one- proposal refinement stage. Multiple 3D cylinder proposals are
step. Sequential based models consist of a proposal stage and a first generated based on each 2D bbox using the geometric
3D bounding box (bbox) regression stage in the chronological agreement search [60], which results in smaller but more
order. In the proposal stage, regions that may contain objects precise frustum proposals then the F-pointNet [57]. These
of interest are proposed. In the bbox regression stage, these initial cylinder proposals are then processed by a PointNet [30]
proposals are classified based on the region-wise features based header network for the final refinement. To summarize,
extracted from 3D geometry. However, the performance of these approaches assume each seed region only contains one
sequential fusion is limited by each stage. On the other hand, object of interest, which is however not true for crowded
one-step models consist of one stage, where 2D and 3D data scenes and small objects like pedestrians.
are processed in a parallel manner.
One possible solution towards the fore-mentioned issues
The timeline of 3D object detection networks and typical
is to replace the 2D object detector with 2D semantic seg-
model architectures are shown in Figures 4 & 5. Table II
mentation and region-wise seed proposal with point-wise seed
presents the comparative results of 3D object detection models
proposals. Intensive Point-based Object Detector (IPOD) [61]
on the KITTI 3D Object Detection benchmark [56]. Table III
by Yang et al. is a work in this direction. In the first step,
summarizes and compares dynamic object detection models.
2D semantic segmentation is used to filter out back-ground
points. This is achieved by projecting points to the image plane
A. 2D Proposal Based Sequential Models and associated point with 2D semantic labels. The resulting
A 2D proposal based sequential model attempts to utilize 2D foreground point cloud retains the context information and
image semantics in the proposal stage, which takes advantage fine-grained location, which is essential for the region-proposal
of off-the-shelf image processing models. Specifically, these and bbox regression. In the following point-wise proposal
methods leverage the image object detector to generate 2D generation and bbox regression stage, two PointNet++ [30]
region proposals, which are projected to the 3D space as detec- based networks are used for proposal feature extraction and
tion seeds. There are two projection approaches to translate 2D bbox prediction. In addition, a novel criterion called PointsIoU
proposals to 3D. The first one is projecting bounding boxes in is proposed to speed up training and inference. This approach
the image plane to the point cloud, which results in a frustum has yielded significant performance advantages over other
shaped 3D search space. The second method projects the point state-of-the-art approaches in scenes with high occlusion or
cloud to the image plane, which results in the point cloud with many objects.
point-wise 2D semantics. 2) Multi-level Fusion: Another possible direction of im-
1) Result-level Fusion: The intuition behind result-level provement is to combine result level fusion with feature
fusion is to use off-the-shelf 2D object detectors to limit the level fusion, where one such work is PointFusion [62]. The
3D search space for 3D object detection, which significantly PointFusion first utilizes an existing 2D object detector to
reduces computation and improve run-time. However, since generate 2D bboxes. These bboxes is used to select corre-
the whole pipeline depends on the results of the 2D object sponding points, via projecting points to the image plane and
detector, it suffers from the limitations of the image-based locate points that pass through the bboxs. Finally, a ResNet
detector. [52] and a PointNet [31] based network combines image and
JOURNAL OF LATEX CLASS FILES 6

point cloud features to estimate 3D objects. In this approach,


image features and point cloud features are fused per-proposal
for final object detection in 3D, which facilitates 3D bbox
regression. However, its proposal stage is still amodal. In
SIFRNet [63], frustum proposals are first generated from an
image. Point cloud features in these frustum proposals are then
combined with their corresponding image features for final
3D bbox regression. To achieve scale-invariant, the PointSIFT
[64] is incorporated into the network. Additionally, the SENet
module is used to suppress less informative features.
3) Feature-level Fusion: Early attempts [75] [76] of mul-
timodal fusion are done in pixel-wise, where 3D geometry is
converted to image format or appended as additional channels
of an image. The intuition is to project 3D geometry onto the
image plane and leverage mature image processing methods
for feature extraction. The resulting output is also on the
image plane, which is not ideal to locate objects in the 3D
space. In 2014, Gupta et al. proposed DepthRCNN [75],an R-
CNN [77] based architecture for 2D object detection, instance
and semantic segmentation. It encodes 3D geometry from
Microsoft Kinect camera in image’s RGB channels, which are
Horizontal disparity, Height above ground, and Angle with
gravity (HHA). Gupta et al. extended Depth-RCNN [78] in
2015 for 3D object detection by aligning 3D CAD models,
yielding significant performance improvement. In 2016, Gupta
Fig. 5. A comparison between three typical model architectures for dynamic
et al. developed a novel technique for supervised knowledge object detection.
transfer between networks trained on image data and unseen
paired image modality (depth image) [76]. In 2016, Schlosser
et al. [79] further exploited learning RGB-HHA representa- of 2D to 3D proposal transformation greatly limits the 3D
tions on 2D CNNs for pedestrian detection. However, the HHA search space for 3D object detection. Common methods for
data are generated from the LiDAR’s depth instead of a depth 3D proposal generation includes the multi-view approach and
camera. The authors also noticed that better results can be the point cloud voxelization approach.
achieved if the fusion of RGB and HHA happens at deeper Multi-view based approach exploits the point cloud’s bird’s
layers of the network. eye view (BEV) representation for 3D proposal generation.
The resolution mismatch between dense RGB and sparse The BEV is the preferred viewpoint because it avoids occlu-
depth means only a small portion of pixels have corresponding sions and retains the raw information of objects’ orientation
points. Therefore, to directly append RGB information to and x, y coordinates. These orientation and x, y coordinates
points leads to the loss of most texture information, rendering information is critical for 3D object detection while mak-
the fusion pointless. To mitigate this challenge, PointPainting ing coordinate transformation between BEV and other views
[66] extract high-level image semantic before the per-point straight-forward.
fusion. To be more specific, PointPainting [66] follows the idea Point cloud voxelization transforms the continuous irregular
of projecting points to 2D semantic maps in [61]. But instead data structure to a discrete regular data structure. This makes
of using 2D semantics to filter non-object points, 2D semantics it possible to apply standard 3D discrete convolution and
is simply appended to point clouds as additional channels. leverage existing network structures to process point cloud.
The authors argued that this technique made PointPainting The drawback is the loss of some spatial resolution, which
flexible as it enables any point cloud networks to be applied on might contain fine-grained 3D structure information.
this fused data. To demonstrate this flexibility, the fused point 1) Feature-level fusion: One of the pioneering and most
cloud is fed into multiple existing point cloud detectors, which important works in generating 3D proposals from BEV rep-
are based on the PointRCNN [80], the VoxelNet [14] and resentations is MV3D [67]. MV3D generate 3D proposals
the PointPillar [81]. However, this would lead to the coupling on pixelized top-down LiDAR feature map (height, density
between image and LiDAR models. This requires the LiDAR and intensity). These 3D candidates are then projected to the
model to be re-trained when the image model changes, which LiDAR front view and image plane to extract and fuse region-
reduces the overall reliability and increases the development wise features for bbox regression. The fusion happens at the
cost. region-of-interest (ROI) level via ROI pooling. The ROIviews
of views is defined as:
B. 3D Proposal Based Sequential Model
In a 3D proposal based sequential model, 3D proposals ROIviews = T3D→views (p3D ) , views ∈ {BV, FV, RGB}
are directly generated from 2D or 3D data. The elimination (4)
JOURNAL OF LATEX CLASS FILES 7

TABLE II
C OMPARATIVE RESULTS ON KITTI 3D O BJECT D ETECTION BENCHMARK ( MODERATE DIFFICULTIES )[56].T HE I O U REQUIREMENT FOR CAR ,
PEDESTRIANS AND CYCLISTS IS 70%, 50% AND 50% RESPECTIVELY.’FL’,’FT’,’RT’ ’MS’,’PCR’ STAND FOR FUSION LEVEL , FUSION TYPE , RUN - TIME ,
MODEL SIZE ( NUMBER OF PARAMETERS ) AND POINT CLOUD REPRESENTATION RESPECTIVELY.

Cars Pedestrians Cyclists


Methods FL FT PCR Models Hardware RT MS
AP3D % AP3D % AP3D %
Points F-PointNet [57] GTX1080 0.17s 7M 69.79 42.15 56.12
Points F-ConvNet [65] GTX1080Ti 0.47s 30M 76.39 43.38 65.07
2D Result ROI Voxels PC-CNN [58] - - 276M 73.79 - -
Proposal Points RoarNet [59] GTX1080Ti 0.1s - 73.04 - -
based Points IPOD [61] GTX1080Ti 0.2s 24M 72.57 44.68 53.46
Feature Point Multiple PointPainting [66] GTX1080 0.4s - 71.70 40.97 63.78
Sequential
ROI Points PointFusion [62] GTX1080 1.3s 6M 63.00 28.04 29.42
Models Multi
Point Points SIFRNet [63] - - - 72.05 60.85 60.34
ROI 2D views MV3D [67] Titan X 0.36s 240M 63.63 - -
ROI 2D views AVOD-FPN [68] Titan Xp 0.08s 38M 71.76 42.27 50.55
3D
Feature ROI 2D views SCANet [69] GTX1080Ti 0.17s - 68.12 37.93 53.38
Proposal
Point 2D views ContFuse [70] GTX1080 0.06s - 68.78 - -
based
Point Voxels MVX-Net [71] - - - 77.43 - -
Pixel 2D views BEVF [72] - - - - - 45.00
Multi Point 2D views MMF [73] GTX1080 0.08s - 77.43 - -
One-Step Models Feature Pixel 2D views LaserNet++ [74] TITAN Xp 0.04s - - - -

where T3D→views represents the transformation function that SCA can capture multi-scale contextual information, whereas
project point cloud p3D from 3D space to bird’s eye view ESU recovers the spatial information.
(BEV), front view (FV) and the image plane (RGB). The ROI- One of the problems in object-centric fusion methods
pooling R to obtain feature vector fviews is defined as: [68][67] is the loss of fine-grained geometric information
during ROI-pooling. The ContFuse [70] by Liang et al. tackles
fviews = R (x, ROIviews ) , views ∈ {BV, FV, RGB} (5) this information lost with point-wise fusion. This point-wise
fusion is achieved with continuous convolutions [83] fusion
There are a few drawbacks of the MV3D. First, generating layers that bridge image and point cloud features of different
3D proposals on BEV assumes that all objects of interest scales at multiple stages in the network. This is achieved by
are captured without occlusions from this view-point. This first extracting K-nearest neighbour points for each pixel in
assumption does not hold well for small object instances, such the BEV representation of point cloud. These points are then
as pedestrians and bicyclists, which can be fully occluded projected to the image plane to retrieve related image features.
by other large objects in the point cloud. Secondly, spatial Finally, the fused feature vector is weighted according to their
information of small object instances is lost during the down- geometry offset to the target ’pixel’ before feeding into MLPs.
sample of feature maps caused by consecutive convolution However, point-wise fusion might fail to take full advantage
operations. Thirdly, object-centric fusion combines feature of high-resolution images when the LiDAR points are sparse.
maps of image and point clouds through ROI-pooling, which In [73] Liang et al. further extended point-wise fusion by
spoils fine-grained geometric information in the process. It is combining multiple fusion methodologies, such as signal-
also worth noting that redundant proposals lead to repetitive level fusion (RGB-D), feature-level fusion, multi-view and
computation in the bbox regression stage. To mitigate these depth completion. In particular, depth completion upsamples
challenges, multiple methods have been put forward to im- sparse depth map using image information to generate a
prove MV3D. dense pseudo point cloud. This up-sampling process alleviates
To improve the detection of small objects, the Aggregate the sparse point-wise fusion problem, which facilitates the
View Object Detection network (AVOD) [68] first improved learning of cross-modality representations. Furthermore, the
the proposal stage in MV3D [67] with feature maps from both authors argued that multiple complementary tasks (ground
BEV point cloud and image. Furthermore, an auto-encoder estimation, depth completion and 2D/3D object detection)
architecture is employed to up-sample the final feature maps could assists the network achieve better overall performance.
to its original size. This alleviates the problem that small However, point-wise/pixel-wise fusion leads to the ’feature
objects might get down-sampled to one ’pixel’ with con- blurring’ problem. This ’feature blurring’ happens when one
secutive convolution operations. The proposed feature fusion point in the point cloud is associated with multiple pixels in
Region Proposal Network (RPN) first extracts equal-length the image or the other way around, which confound the data
feature vectors from multiple modalities (BEV point cloud fusion. Similarly, wang et al. [72] replace the ROI-pooling
and image) with crop and resize operations. Followed by a in MV3D [67] with sparse non-homogeneous pooling, which
1 × 1 convolution operation for feature space dimensionality enables effective fusion between feature maps from multiple
reduction, which can reduce computational cost and boost modalities.
up speed. Lu et al.[69] also utilized an encoder-decoder MVX-Net [71] presented by Sindagi et al. introduced two
based proposal network with Spatial-Channel Attention (SCA) methods that fuse image and point cloud data point-wise or
module and Extension Spatial Upsample (ESU) module. The voxel-wise. Both methods employ a pre-trained 2D CNN for
JOURNAL OF LATEX CLASS FILES 8

TABLE III
A SUMMARIZATION AND COMPARISON BETWEEN DYNAMIC OBJECT DETECTION MODELS

Methods: Key features: Advantages: Disadvantages:


Frustum based 1.Leverage image object detector to generate 2D 1. Due to sequential result-level fusion, the over-
proposals and project them to form frustum 3D 3D search space is limited all performance is limited by the image detector.
[57] [65] [58]
search spaces for 3D object detection. using 2D results to reduce 2. Redundant information from multimodal sen-
[59] [61] [63]
2.Result-level fusion / Multi-level fusion computational cost. sors is not leveraged.
[62]
Solve the resolution mis- 1.Image and LiDAR models are highly coupled,
1.Fuse high-level image semantics point-wise
match problem between which reduce overall reliability and increase
Point-fusion and perform 3D object detection in the fused
dense RGB and sparse development cost.
based [66] point cloud.
depth via fusing high-level 2.3D search space is not limited which leads to
2.Feature-level fusion.
image semantics to points. high computational cost.
1.Assume all objects are visible in LiDAR’s
Enable the use of standard BEV, which is often not the case.
Multi-view 1.Generate 3D proposals from BEV and perform
2D convolution and off- 2.Spatial information of small object instances is
based [67] [68] 3D bboxes regression on these proposals.
the-shelf models, making lost during consecutive convolution operations
[69] [70] [73] 2.Feature-level fusion.
it much more scalable. 3.ROI fusion spoils fine-grained geometric in-
formation.
1. Spatial resolution and fine-grained 3D geom-
1.Fuse image semantics voxel-wise and perform
etry information are lost during voxelization.
Voxel-based 3D bboxes regression using voxel-based net- Compatible with standard 2. Computation and memory footprint to grow
[71] [82] work. 3D convolution. cubically with the resolution, making it unscal-
2.Feature-level fusion.
able.

image feature extraction and a VoxelNet [14] based network Meyer et al. [74] extended the LaserNet [85] to multi-task
to estimate objects from the fused point cloud. In the point- and multimodal network, performing 3D object detection and
wise fusion method, the point cloud is first projected to image 3D semantic segmentation on fused image and LiDAR data.
feature space to extract image features before voxelization Two CNN process depth-image (generated from point cloud)
and processed by VoxelNet. The voxel-wise fusion method and front-view image in a parallel manner and fuse them via
first voxelized the point cloud before projecting non-empty projecting points to the image plane to associate corresponding
voxels to the image feature space for voxel/region-wise feature image features. This feature map is fed into the LaserNet
extraction. These voxel-wise features are only appended to to predict per-point distributions of the bounding box and
their corresponding voxels at a later stage of the VoxelNet. combine them for final 3D proposals. This method is highly
MVX-Net achieved state-of-the-art results and out-performed efficient while achieving state-of-the-art performance.
other LiDAR-based methods on the KITTI benchmark while
lowering false positives and false negatives rate compared to V. S TATIONARY ROAD O BJECT D ETECTION
[14]. This section focuses on reviewing recent advances in
The simplest means to combine the voxelized point cloud camera-LiDAR fusion based stationary road object detection
and image is to append RGB information as additional chan- methods. Stationary road objects can be categorized into
nels of a voxel. In a 2014 paper by Song et al. [82] 3D object on-road objects (e.g. road surfaces and road markings) and
detection is achieved by sliding a 3D detection window on the off-road objects (e.g. traffic signs). On-road and off-road
voxelized point cloud. The classification is performed by an objects provide regulations, warning bans and guidance for
ensemble of Exemplar-SVMs. In this work, color information autonomous vehicles.
is appended to voxels by projection. Song et al. further In Figure 6 & 7, typical model architecture in lane/road
extended this idea with 3D discrete convolutional neural detection and traffic sign recognition (TSR) are compared.
networks [84]. In the first stage, the voxelized point cloud Table IV presents the comparative results of different models
(generated from RGB-D data) is first processed by Multi-scale on the KITTI road benchmark [56] and gives summarization
3D RPN for 3D proposal generation. These candidates are and comparison between these models.
then classified by joint Object Recognition Network (ORN),
which takes both image and voxelized point cloud as inputs.
However, the volumetric representation introduces boundary A. Lane/Road detection
artifacts and spoils fine-grained local geometry. Secondly, the Existing surveys [93] [94] [95] have presented detailed
resolution mismatch between image and voxelized point cloud reviews on traditional multimodal road detection methods.
makes fusion inefficient. These methods [96] [97] [98] [99] mainly rely on vision for
road/lane detection while utilizing LiDAR for the curb fitting
C. One-step Models and obstacle masking. Therefore, this section focuses on recent
One-step models perform proposal generation and bbox advances in deep learning based fusion strategies for road
regression in a single stage. By fusing the proposal and bbox extraction.
regression stage into one-step, these models are often more Deep leaning based road detection methods can be grouped
computationally efficient. This makes them more well-suited into BEV-based or front-camera-view-based. BEV-based meth-
for real-time applications on mobile computational platforms. ods [86] [88] [89] [87] project LiDAR depth and images to
JOURNAL OF LATEX CLASS FILES 9

TABLE IV
C OMPARATIVE RESULTS ON KITTI ROAD BENCHMARK [56] WITH SUMMARIZATION AND COMPARISON BETWEEN LANE / ROAD DETECTION
MODELS .’M AX F (%)’ STANDS FOR M AXIMUM F1- MEASURE ON KITTI U RBAN M ARKED ROAD TEST SET. ’FL’,’MS’ STAND FOR F USION L EVEL AND
MODEL PARAMETER SIZE .

Method: Inputs: FL: MaxF MS Highlights: Disadvantages:


BEV-based Lane/Road Detection
BEV Image + Predict dense BEV height map from sparse Cannot distinguish between differ-
DMLD [86] Feature - 50M
BEV Point Cloud depth which improves performance ent lane types
Fused BEV Oc- Signal-level fusion reduces model complex- Loss of dense RGB/texture infor-
EDUNet [87] Signal 93.81 143M
cupation Grid ity mation in early fusion
BEV Image +
TSF-FCN [88] Feature 94.86 2M Relatively moderate computational cost Relatively moderate performance
BEV Point Cloud
BEV Image + Multi-stage fusion at different network
MSRF [89] Feature 96.15 - Relatively high computational cost
BEV Point Cloud depth improves performance
Camera-view based Lane/Road Detection
LidCamNet Image + Dense Explore and compared multiple fusion Rely on image for depth up-
Multi 96.03 3M
[90] Depth Image strategies for road detection sampling
Image + Dense Progressive fusion strategy improves perfor-
PLARD [91] Feature 97.05 - Relatively high computational cost
Depth Image mance
Image + Depth Spherical coordinate transformation reduces Road detection in 2D space rather
SpheSeg [92] Feature 96.94 1.4M
Image input size and improves speed than 3D space

Fig. 7. A typical fusion-based traffic-sign recognition pipeline.

by a U-net based road segmentation network. However, the


signal-level fusion between dense RGB and sparse depth leads
to the loss of dense texture information due to the low grid
resolution.
Front-camera-view-based methods [90] [91] [92] project
LiDAR depth to the image plane to extract road surface,
which suffers from accuracy loss in the translation of 2D
to 3D boundaries. The LCNet [90] compared signal-level
fusion (early fusion) and feature level fusion (late and cross
Fig. 6. Some typical model architectures and fusion methods for road/lane
fusion) for road detection, which finds the cross fusion is
detection the best performing fusion strategy. Similar to [88], PLARD
[91] fuses image and point cloud features progressively in
multiple stages. Lee et al. [92] focused on improving speed
BEV for road detection, which retains the objects’ original x, y via a spherical coordinate transformation scheme that reduces
coordinates and orientation. In [86], the dense BEV height the input size. This transformed camera and LiDAR data are
estimation is predicted from the point cloud using a CNN, further processed by a SegNet based semantic segmentation
which is then fused with the BEV image for accurate lane network.
detection. However, this method cannot distinguish different
lane types. Similarly, Lv et al. [88] also utilized the BEV
LiDAR grid map and the BEV image but instead processed B. Traffic sign recognition
them in a parallel manner. Yu et al. [89] proposed a multi-stage In LiDAR scans, traffic-signs are highly distinguishable due
fusion strategy (MSRF) that combines image-depth features at to its retro-reflective property, but the lack of dense texture
different network levels, which significantly improves its per- makes it difficult to classify. On the contrary, traffic sign
formance. However, this strategy also relatively increases its image patches can be easily classified. However, it is difficult
computational cost. Wulff et al. [87] used signal-level fusion for vision-based TSR system to locate these traffic-signs in
to generate a fused BEV occupation grid, which is processed the 3D space. Therefore, various studies have proposed to
JOURNAL OF LATEX CLASS FILES 10

A. 2D Semantic Segmentation

1) Feature-level fusion: Sparse & Dense [46] presented a


NASNet [53] based auto-encoder network that can be used
for 2D semantic segmentation or depth completion leveraging
image and sparse depths. The image and corresponding sparse
depth map are processed by two parallel encoders before fused
into the shared decoder. Valada et al. [55] employed a multi-
stage feature-level fusion of varying depth to facilitate seman-
Fig. 8. Timeline of 3D semantic segmentation networks and their corre-
sponding fusion levels. tic segmentation. Caltagirone et al.[90] utilized up-sampled
depth-image and image for 2D semantic segmentation. This
dense depth-image is up-sampled using sparse depth-images
utilize both camera and LiDAR for TSR. Existing reviews [93] (from point cloud) and images [108]. The best performing
[100] have comprehensively covered traditional traffic sign cross-fusion model processes dense depth-image and image
recognition methods and part of the deep learning methods. data in two parallel CNN branches with skip-connections
Hence, this section presents a brief overview of traditional in between and fuses the two feature maps in the final
traffic sign recognition methods and mostly focuses on recent convolution layer.
advances. In a typical TSR fusion pipeline [101] [102] [103]
[104] [105], traffic-signs are first located in the LiDAR scan
based on its retro-reflective property. These 3D positions of
detected traffic-signs are then projected to the image plane B. 3D Semantic Segmentation
to generate traffic-signs patches, which are fed into an image
classifier for classification. This TSR fusion pipeline is shown 1) Feature-level fusion: Dai et al.[109] presented 3DMV,
in Fig. 7. a multi-view network for 3D semantic segmentation which
For methods that employ the typical TSR fusion pipeline, fuse image semantic and point features in voxelized point
the main difference is on the classifier. These classifiers cloud. Image features are extracted by 2D CNNs from multiple
include deep Boltzmann machine (DBMs) based hierarchical aligned images and projected back to the 3D space. These
classifier[102], SVMs [101] and DNN [104]. To summarize, multi-view image features are max-pooled voxel-wise and
these methods all employ result-level fusion and a hierarchical fused with 3D geometry before feeding into the 3D CNNs
object detection model. They assume traffic-signs are visible for per-voxel semantic prediction. 3DMV out-performed other
in the LiDAR scan, which sometimes is not the case due voxel-based approaches on ScanNet [110] benchmark. How-
to the occlusion. Furthermore, this pipeline is limited by the ever, the performance of voxel-based approaches is determined
detection range of mobile LiDARs. by the voxel-resolution and hindered by voxel boundary arti-
To mitigate these challenges, Deng et al. [106] combined facts.
image and point cloud to generate a colorized point cloud for To alleviate problems caused by point cloud voxelization,
both traffic-sign detection and classification. In addition, the Chiang et al. [111] proposed a point-based semantic segmenta-
3D geometrical properties of detected traffic-signs are utilized tion framework (UPF) that also enables efficient representation
to reduce false-positives. In [107] traffic-signs are detected learning of image features, geometrical structures and global
based on prior knowledge, which includes road geometry in- context priors. Features of rendered multi-view images are
formation and traffic-sign geometry information. The detected extracted using a semantic segmentation network and projected
traffic-signs patches are classified by a GaussianBernoulli to 3D space for point-wise feature fusion. This fused point
DBMs model. Following this ideal, Guan et al. [105] further cloud is processed by two PointNet++ [30] based encoders to
improved the traffic sign recognition part using a convolutional extract local and global features before feeding into a decoder
capsule network. To summarize, these methods improve the for per-point semantic label prediction. Similarly, Multi-View
traffic sign detection stage with multimodal data and prior PointNet (MVPNet) [112] fused multi-view images semantics
knowledge. However, prior knowledge is often region-specific, and 3D geometry to predict per-point semantic labels.
which makes it difficult to generalize to other parts of the Permutohedral lattice representation is an alternative ap-
world. proach for multimodal data fusion and processing. Sparse Lat-
tice Networks (SPLATNet) by Su et al. [113] employed sparse
bilateral convolution to achieve spatial-aware representation
VI. S EMANTIC S EGMENTATION
learning and multimodal (image and point cloud) reasoning. In
This section reviews existing camera-LiDAR fusion meth- this approach, point cloud features are interpolated onto a dl -
ods for 2D semantic segmentation, 3D semantic segmentation dimensional permutohedral lattice, where bilateral convolution
and instance segmentation. 2D/3D semantic segmentation aims is applied. The results are interpolated back onto the point
to predict per-pixel and per-point class labels, while instance cloud. Image features are extracted from multi-view images
segmentation also cares about individual instances. Figure 8 using a CNN and projected to the 3D lattice space to be
& 9 present a timeline of 3D semantic segmentation networks combined with 3D features. This fused feature map is further
and typical model architectures. processed by CNN to predict the per-point label.
JOURNAL OF LATEX CLASS FILES 11

Fig. 10. A comparison between Detection-Based Tracking (DBT) and


Detection-Free Tracking (DFT) approaches.
Fig. 9. Some typical model architectures and fusion methods for semantic
segmentation
these semantic features to cluster points into instances. This
approach is mainly constraint by its dependence on semantic
C. Instance Segmentation features from BEV, which could introduce occlusions from
In essence, instance segmentation aims to perform semantic sensor displacements.
segmentation and object detection jointly. It extends the se-
mantic segmentation task by discriminating against individual VII. O BJECTS T RACKING
instances within a class, which makes it more challenging. Multiple object tracking (MOT) aims to maintain objects
1) Proposal based: Hou et al. presented 3D-SIS [114], identities and track their location across data frames (over
a two-stage 3D CNN that performs voxel-wise 3D instance time), which is indispensable for the decision making of
segmentation on multi-view images and RGB-D scan data. In autonomous vehicles. To this end, this section reviews camera-
the 3D detection stage, multi-view image features are extracted LiDAR fusion based object tracking methods. Based on object
and down-sampled using ENet [115] based network. This initialization methods, MOT algorithms can be catalogued into
down-sample process tackles the mismatch problem between Detection-Based Tracking (DBT) and Detection-Free Tracking
a high-resolution image feature map and a low-resolution (DFT) frameworks. DBT or Tracking-by-Detection framework
voxelized point cloud feature map. These down-sampled image leverages a sequence of object hypotheses and higher-level
feature maps are projected back to 3D voxel space and append cues produced by an object detector to track objects. In DBT,
to the corresponding 3D geometry features, which are then fed objects are tracked via data (detections sequence) associa-
into a 3D CNN to predict object classes and 3D bbox poses. tion or Multiple Hypothesis Tracking. On the contrary, the
In the 3D mask stage, a 3D CNN takes images, point cloud DFT framework is based on finite set statistics (FISST) for
features and 3D object detection results to predict per-voxel state estimation. Common methods include Multi-target Multi-
instance labels. Bernoulli (MeMBer) filter and Probability Hypothesis Density
Narita et al. [116] extended 2D panoptic segmentation
(PHD) filter.
to perform scene reconstruction, 3D semantic segmentation
Table V presents the performance of different models on
and 3D instance segmentation jointly on RGB images and
the KITTI multi-object tracking benchmark (car) [56]. Figure
depth images. This approach takes RGB and depth frames as
10 provides a comparison between DBT and DFT approaches.
inputs for instance and 2D semantic segmentation networks.
To track labels between frames, these frame-wise predicted
panoptic annotations and corresponding depth are referenced A. Detection-Based Tracking (DBT)
by associating and integrating to the volumetric map. In the The tracking-by-Detection framework consists of two
final step, a fully connected conditional random field (CRF) stages. In the first stage, objects of interest are detected. The
is employed to fine-tune the outputs. However, this approach second stage associates these objects over time and formulates
does not support dynamic scenes and are vulnerable to long- them into trajectories, which are formulated as linear pro-
term post drift. grams. Frossard et al. [119] presented an end-to-end trainable
2) Proposal-free based: Elich et al. [117] presented 3D- tracking-by-detection framework comprise of multiple inde-
BEVIS, a framework that performs 3D semantic and instance pendent networks that leverage both image and point cloud.
segmentation tasks jointly using the clustering method on This framework performs object detection, proposal matching
points aggregated with 2D semantics. 3D-BEVIS first extract and scoring, linear optimization consecutively. To achieve end-
global semantic scores map and instance feature map from 2D to-end learning, detection and matching are formulated via a
BEV representation (RGB and height-above-ground). These deep structured model (DSM). Zhang et al. [120] presented
two semantic maps are propagated to points using a graph a sensor-agnostic framework, which employs a loss-coupling
neural network. Finally, the mean shift algorithm [118] use scheme for image and point cloud fusion. Similar to [119], the
JOURNAL OF LATEX CLASS FILES 12

TABLE V
C OMPARATIVE RESULTS ON KITTI MULTI - OBJECT TRACKING BENCHMARK ( CAR ) [56]. MOTA STANDS FOR MULTIPLE OBJECT TRACKING ACCURACY.
MOPT STANDS FOR MULTIPLE OBJECT TRACKING PRECISION . MT STANDS FOR M OSTLY TRACKED . ML STANDS FOR M OSTLY L OST. IDS STANDS FOR
NUMBER OF ID SWITCHES . FRAG STANDS FOR NUMBER OF F RAGMENTS .

Methods Data-association Models Hardware Run-time MOTA(%) MOTP(%) MT(%) ML(%) IDS FRAG
min-cost ow DSM [119] GTX1080Ti 0.1s 76.15 83.42 60.00 8.31 296 868
DBT min-cost ow mmMOT [120] GTX1080Ti 0.02s 84.77 85.21 73.23 2.77 284 753
Hungarian MOTSFusion [121] GTX1080Ti 0.44s 84.83 85.21 73.08 2.77 275 759
DFT RFS Complexer-YOLO [122] GTX1080Ti 0.01s 75.70 78.46 58.00 5.08 1186 2092

framework consists of three stages, object detection, adjacency A. Classical Online Calibration
estimation and linear optimization. In the object detection Online calibration methods estimate extrinsic in natural
stage, image and point cloud features are extracted via a VGG- settings without a calibration target. Many studies [124] [125]
16 [123] and a PointNet [30] in parallel and fused via a robust [126] [127] find extrinsic by maximizing mutual information
fusion module. The robust fusion module is designed to work (MI) (raw intensity value or edge intensity) between differ-
with both a-modal and multi-modal inputs. The adjacency ent modalities. However, MI-based methods are not robust
estimation stage extends min-cost flow to multi-modality via against texture-rich environments, large de-calibrations and
adjacent matrix learning. Finally, an optimal path is computed occlusions caused by sensor displacements. Alternatively, the
from the min-cost flow graph. LiDAR-enabled visual-odometry based method [128] uses the
Tracking and 3D reconstruct tasks can be performed jointly. camera’s ego-motion to estimate and evaluate camera-LiDAR
Extending this idea, Luiten et al. [121] leveraged 3D recon- extrinsic parameters. Nevertheless, [128] still struggles with
struction to improve tracking, making tracking robust against large de-calibrations and cannot run in real-time.
complete occlusion. The proposed MOTSFusion consists of
two stages. In the first stage, detected objects are associate B. DL-based Online Calibration
with spatial-temporal tracklets. These tracklets are matched To mitigate the aforementioned challenges, Schneider et
and merged into trajectories using the Hungarian algorithm. al. [129] designed a real-time capable CNN (RegNet) to
Furthermore, MOTSFusion can work with LiDAR mono and estimate extrinsic, which is trained on randomly decalibrated
stereo depth. data. The proposed RegNet extracts image and depth feature
in two parallel branches and concatenates them to produce
a fused feature map. This fused feature map is fed into a
B. Detection-Free Tracking (DFT) stack of Network in Network (NiN) modules and two fully
In DFT objects are manually initialized and tracked via connected layers for feature matching and global regression.
filtering based methods. The complexer-YOLO [122] is a However, the RegNet is agnostic towards the sensor’s intrinsic
real-time framework for decoupled 3D object detection and parameters and needs to be retrained once these intrinsic
tracking on image and point cloud data. In the 3D object changes. To solve this problem, the CalibNet [130] learns to
detection phase, 2D semantics are extracted and fused point- minimize geometric and photometric inconsistencies between
wise to the point cloud. This semantic point cloud is voxelized the miscalibrated and target depth in a self-supervised man-
and fed into a 3D complex-YOLO for 3D object detection. ner. Because intrinsics are only used during the 3D spatial
To speed up the training process, IoU is replaced by a novel transformer, the CalibNet can be applied to any intrinsically
metric called Scale-Rotation-Translation score (SRTs), which calibrated cameras. Nevertheless, deep learning based cross-
evaluates 3 DoFs of the bounding box position. Multi-object sensor calibration methods are computationally expensive.
tracking is decoupled from the detection, and the inference
IX. T RENDS , O PEN C HALLENGES AND P ROMISING
is achieved via Labeled Multi-Bernoulli Random Finite Sets
D IRECTIONS
Filter (LMB RFS).
The perception module in a driverless car is responsible for
obtaining and understanding its surrounding scenes. Its down-
VIII. O NLINE C ROSS -S ENSOR C ALIBRATION stream modules, such as planning, decision making and self-
localization, depend on its outputs. Therefore, its performance
One of the preconditions of the camera-LiDAR fusion and reliability are the prerequisites for the competence of the
pipeline is a flawless registration/calibration between sensors, entire driverless system. To this end, LiDAR and camera fusion
which can be difficult to satisfy. The calibration parameters be- is applied to improve the performance and reliability of the
tween sensors change constantly due to mechanical vibration perception system, making driverless vehicles more capable
and heat fluctuation. As most fusion methods are extremely in understanding complex scenes (e.g. urban traffic, extreme
sensitive towards calibration errors, this could significantly weather condition and so on). Consequently, in this section,
cripple their performance and reliability. Furthermore, offline we summarize overall trends and discuss open challenges
calibration is a troublesome and time-consuming procedure. and potential influencing factors in this regard. As shown in
Therefore, studies on online automatic cross-sensor calibration Table VI, we focus on improving the performance of fusion
have significant practical benefits. methodology and the robustness of the fusion pipeline.
JOURNAL OF LATEX CLASS FILES 13

TABLE VI
O PEN CHALLENGES RELATED TO PERFORMANCE IMPROVEMENT, RELIABILITY ENHANCEMENT.

Open Challenges: Possible solutions / directions:


Performance-related Open Research Questions
What should be the data representation of fused data? Point Representation + Point Convolution
How to encode Temporal Context? RNN/LSTM + Generative Models
What should be the learning scheme? Unsupervised + Weakly-supervised learning
When to use deep learning methods? On applications with explicit objectives that can be verified objectively
Reliability-related Open Research Questions
How to mitigate camera-LiDAR coupling? Sensor-agnostic framework
How to improve all-weather / Lighting conditions? Datasets with complex weather / lighting conditions
How to handle adversarial attacks and corner cases? Cross modality verification
How to solve open-set object detection? Testing protocol, metrics + new framework
How to balance speed-accuracy trade-offs? Models developed with scalability in mind

From the methods reviewed above, we observed some convolutions has great potentials in camera-LiDAR fusion
general trends among the image and point cloud fusion ap- studies.
proaches, which are summarized as the following: 2) How to encode temporal context?: Most current deep
• 2D to 3D: Under the progressing of 3D feature extraction learning based perception systems tend to overlook temporal
methods, to locate, track and segment objects in 3D space context. This leads to numerous problems, such as point
has become a heated area of research. cloud deformation from low refresh rate and incorrect time-
• Single-task to multi-tasks: Some recent works [73] [122] synchronization between sensors. These problems cause mis-
combined multiple complementary tasks, such as object matches between images, point cloud and actual environment.
detection, semantic segmentation and depth completion Therefore, it is vital to incorporate temporal context into the
to achieve better overall performance and reduce compu- perception systems.
tational costs. In the context of autonomous driving, temporal context
• Signal-level to multi-level fusion: Early works often can be incorporated using RNN or LSTM models. In [131],
leverage signal-level fusion where 3D geometry is trans- an LSTM auto-encoder is employed to estimate future states
lated to the image plane to leverage off-the-shelf image of surrounding vehicles and adjust the planned trajectory
processing models, while recent models try to fuse image accordingly, which helps autonomous vehicles run smoother
and LiDAR in multi-level (e.g. early fusion, late fusion) and more stable. In [121] temporal context is leveraged to
and temporal context encoding. estimate ego-motion, which benefits later task-related header
networks. Furthermore, the temporal context could benefit
online self-calibration via a visual-odometry based method
A. Performance-related Open Research Questions
[128]. Following this trend, the mismatches caused by LiDAR
1) What should be the data representation of fused data?: low refresh rate could be solved by encoding temporal context
The choosing of data representations of the fused data plays and generative models.
a fundamental role in designing any data fusion algorithms. 3) What should be the learning scheme?: Most current
Current data representation for image and point cloud fusion camera-LiDAR fusion methods rely on supervised learning,
includes: which requires large annotated datasets. However, annotating
• Image Representation: Append 3D geometry as addi- images and point clouds is expensive and time-consuming.
tional channels of the image. The image-based repre- This limits the size of the current multi-modal dataset and the
sentation enables off-the-shelf image processing models. performance of supervised learning methods.
However, the results are also limited in the 2D image The answer to this problem is unsupervised and weakly-
plane, which is less ideal for autonomous driving. supervised learning frameworks. Some recent studies have
• Point Representation: Append RGB signal/features as shown great potentials in this regards [43] [50] [132] [24]
additional channels of the point cloud. However, the [101]. Following this trend, future researches in unsupervised
mismatch of resolution between high-resolution images and weakly-supervised learning fusion frameworks could al-
and low-resolution point cloud leads to inefficiency. low the networks to be trained on large unlabeled/coarse-
• Intermediate data representations: Translate image and labelled dataset and leads to better performance.
point cloud features/signal into an intermediate data rep- 4) When to use deep learning methods?: Recent advances
resentation, such as voxelized point cloud [82]. However, in deep learning techniques have accelerated the develop-
voxel-based methods suffer from bad scalability. ment of autonomous driving technology. In many aspects,
Many recent works for point cloud processing have con- however, traditional methods are still indispensable in current
centrated on defining explicit point convolution operations autonomous driving systems. Compared with deep learning
[32] [62] [33] [35] [36] [37] [38], which have shown great methods, traditional methods have better interpretability and
potentials. These point convolutions are better suited to extract consume significantly less computational resources. The abil-
fine-grained per-point and local geometry. Therefore, we be- ity to track back a decision is crucial for the decision making
lieve the point representation of fused data coupled with point and planning system of an autonomous vehicle. Nevertheless,
JOURNAL OF LATEX CLASS FILES 14

current deep learning algorithms are not back-traceable, mak- 4) How to solve open-set object detection?: Open-set ob-
ing them unfit for these applications. Apart from this black- ject detection is a scenario where an object detector is tested on
box dilemma, traditional algorithms are also preferred for their instances from unknown/unseen classes. The open-set problem
real-time capability. is critical for an autonomous vehicle, as it operates in un-
To summarize, we believe deep learning methods should be constrained environments with infinite categories of objects.
applied to applications that have explicit objectives that can Current datasets often use a background class for any objects
be verified objectively. that are not interested. However, no dataset can include all the
unwanted object categories in a background class. Therefore,
the behaviour of an object detector in an open-set setting is
B. Reliability-related Open Research Questions highly uncertain, which is less ideal for autonomous driving.
The lack of open-set object detection awareness, testing
1) How to mitigate camera-LiDAR coupling?: From an
protocol and metrics leads to little explicit evaluations of the
engineering perspective, redundancy design in an autonomous
open-set performance in current object detection studies. These
vehicle is crucial for its safety. Although fusing LiDAR
challenges have been discussed and investigated in a recent
and camera improves perception performance, it also comes
study by Dhamija et al. [133], where a novel open-set protocol
with the problem of signal coupling. If one of the signal
and metrics are proposed. The authors proposed an additional
paths suddenly failed, the whole pipeline could break down
mixed unknown category, which contains known ’background’
and cripple down-stream modules. This is unacceptable for
objects and unknown/unseen objects. Based on this protocol,
autonomous driving systems, which require robust perception
current methods are tested on a test set with a mixed unknown
pipelines.
category generated from a combination of existing datasets.
To solve this problem, we should develop a sensor-agnostic In another recent study on the point cloud, Wong et al.
framework. For instance, we can adopt multiple fusion mod- [134] proposed a technique that maps unwanted objects from
ules with different sensor inputs. Furthermore, we can employ different categories into a category-agnostic embedding space
a multi-path fusion module that take asynchronous multi- for clustering.
modal data. However, the best solution is still open for study. The open-set challenges are essential for deploying deep
2) How to improve all-weather/Lighting conditions?: Au- learning-based perception systems in the real-world. And it
tonomous vehicles need to work in all weather and lighting needs more effort and attention from the whole research
conditions. However, current datasets and methods are mostly community (dataset and methods with emphasis on unknown
focused on scenes with good lighting and weather conditions. objects, test protocols and metrics, etc.,).
This leads to bad performances in the real-world, where 5) How to balance speed-accuracy trade-offs?: The pro-
illumination and weather conditions are more complex. cessing of multiple high-resolution images and large-scale
The first step towards this problem is developing more point clouds put substantial stress on existing mobile comput-
datasets that contain a wide range of lighting and weather ing platforms. This sometimes causes frame-drop, which could
conditions. In addition, methods that employ multi-modal data seriously degrade the performance of the perception system.
to tackle complex lighting and weather conditions require More generally, it leads to high-power consumption and low
further investigation. reliability. Therefore, it is important to balance the speed and
3) How to handle adversarial attacks and corner cases?: accuracy of a model in real-world deployments.
Adversarial attacks targeted at camera-based perception sys- There are studies that attempt to detect the frame-drop.
tems have proven to be effective. This poses a grave danger In [135], Imre et al. proposed a multi-camera frame-drop
for the autonomous vehicle, as it operates in safety-critical detection algorithm that leverages multiple segments (broken
environments. It may be difficult to identify attacks explicitly line) fitting on camera pairs. However, frame-drop detection
designed for certain sensory modality. However, perception only solves half of the problem. The hard-part is to prevent
results can be verified across different modalities. In this the performance degradation caused by the frame-drop. Re-
context, research on utilizing the 3D geometry and images cent advances in generative models have demonstrated great
to jointly identify these attacks can be further explored. potentials to predict missing frames in video sequences [136],
As self-driving cars operate in an unpredictable open en- which could be leveraged in autonomous driving for filling the
vironment with infinite possibilities, it is critical to consider missing frames in the image and the point cloud pipelines.
corner and edge cases in the design of the perception pipeline. However, we believe the most effective way to solve the
The perception system should anticipate unseen and unusual frame-drop problem is to prevent it by reducing the hard-ware
obstacles, strange behaviours and extreme weather. For in- workload. This can be achieved by carefully balancing the
stance, the image of cyclists printed on a large vehicle and speed and accuracy of a model [137].
people wearing costumes. These corner cases are often very To achieve this, deep learning models should be able to scale
difficult to handle using only the camera or LiDAR pipeline. down its computational cost, while maintaining acceptable
However, leveraging data from multi-modalities to identify performance. This scalability is often achieved by reducing
these corner cases could prove to be more effective and reliable the number of inputs (points, pixels, voxels) or the depth of
than from a-modal sensors. Further researches in this direction the network. From previous studies [30] [138] [38], point-
could greatly benefit the safety and commercialization of based and multi-view based fusion methods are more scalable
autonomous driving technology. compared to voxel-based methods.
JOURNAL OF LATEX CLASS FILES 15

X. C ONCLUSION [14] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud
based 3d object detection,” 2018 IEEE/CVF Conference on Computer
This paper presented an in-depth review of the most recent Vision and Pattern Recognition, Jun 2018. [Online]. Available:
progress on deep learning models for point cloud and image http://dx.doi.org/10.1109/cvpr.2018.00472
[15] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural net-
fusion in the context of autonomous driving. Specifically, this work for real-time object recognition,” in 2015 IEEE/RSJ International
review organizes methods based on their fusion methodolo- Conference on Intelligent Robots and Systems (IROS), Sep. 2015, pp.
gies and covers topics in depth completion, dynamic and 922–928.
[16] Y. Li, L. Ma, Z. Zhong, D. Cao, and J. Li, “Tgnet: Geometric graph cnn
stationary object detection, semantic segmentation, tracking on 3-d point cloud segmentation,” IEEE Transactions on Geoscience
and online cross-sensor calibration. Furthermore, performance and Remote Sensing, vol. 58, no. 5, pp. 3588–3600, 2020.
comparisons on the publicly available dataset, highlights and [17] C. R. Qi, H. Su, M. NieBner, A. Dai, M. Yan, and L. J.
Guibas, “Volumetric and multi-view cnns for object classification
advantages/disadvantages of models are presented in tables. on 3d data,” 2016 IEEE Conference on Computer Vision and
Typical model architectures are shown in figures. Finally, we Pattern Recognition (CVPR), Jun 2016. [Online]. Available: http:
summarized general trends and discussed open challenges and //dx.doi.org/10.1109/CVPR.2016.609
[18] W. Zeng and T. Gevers, “3dcontextnet: Kd tree guided hierarchical
possible future directions. This survey also raised awareness learning of point clouds using local and global contextual cues,” in
and provided insights on questions that are overlooked by the Proceedings of the European Conference on Computer Vision (ECCV),
research community but troubles real-world deployment of the 2018, pp. 0–0.
[19] R. Klokov and V. Lempitsky, “Escape from cells: Deep kd-networks
autonomous driving technology. for the recognition of 3d point cloud models,” 2017 IEEE International
Conference on Computer Vision (ICCV), Oct 2017. [Online]. Available:
http://dx.doi.org/10.1109/ICCV.2017.99
R EFERENCES [20] G. Riegler, A. Osman Ulusoy, and A. Geiger, “Octnet: Learning deep
3d representations at high resolutions,” in Proceedings of the IEEE
[1] F. Duarte, “Self-driving cars: A city perspective,” Science Robotics, Conference on Computer Vision and Pattern Recognition, 2017, pp.
vol. 4, no. 28, 2019. [Online]. Available: https://robotics.sciencemag. 3577–3586.
org/content/4/28/eaav9843 [21] D. C. Garcia, T. A. Fonseca, R. U. Ferreira, and R. L. de Queiroz, “Ge-
[2] J. Guo, U. Kurup, and M. Shah, “Is it safe to drive? an overview of ometry coding for dynamic voxelized point clouds using octrees and
factors, metrics, and datasets for driveability assessment in autonomous multiple contexts,” IEEE Transactions on Image Processing, vol. 29,
driving,” IEEE Transactions on Intelligent Transportation Systems, pp. pp. 313–322, 2020.
1–17, 2019. [22] H. Lei, N. Akhtar, and A. Mian, “Octree guided cnn with spherical
[3] Y. E. Bigman and K. Gray, “Life and death decisions of autonomous kernels for 3d point clouds,” in Proceedings of the IEEE Conference
vehicles,” Nature, vol. 579, no. 7797, pp. E1–E2, 2020. on Computer Vision and Pattern Recognition, 2019, pp. 9631–9640.
[4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den [23] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view
Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, convolutional neural networks for 3d shape recognition,” in Proceed-
M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, ings of the IEEE international conference on computer vision, 2015,
I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, pp. 945–953.
and D. Hassabis, “Mastering the game of go with deep neural networks [24] S. Chen, C. Duan, Y. Yang, D. Li, C. Feng, and D. Tian, “Deep
and tree search,” Nature, vol. 529, pp. 484–489, 2016. unsupervised learning of 3d point clouds via graph topology inference
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. and filtering,” IEEE Transactions on Image Processing, vol. 29, pp.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, 3183–3198, 2020.
S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, [25] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on
D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through graph-structured data,” 2015.
deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, [26] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters
Feb. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14236 in convolutional neural networks on graphs,” in Proceedings of the
[6] P. Huang, M. Cheng, Y. Chen, H. Luo, C. Wang, and J. Li, “Traffic sign IEEE conference on computer vision and pattern recognition, 2017,
occlusion detection using mobile laser scanning point clouds,” IEEE pp. 3693–3702.
Transactions on Intelligent Transportation Systems, vol. 18, no. 9, pp. [27] D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and
2364–2376, 2017. P. Vandergheynst, “Learning class-specific descriptors for deformable
[7] L. Chen, Q. Zou, Z. Pan, D. Lai, L. Zhu, Z. Hou, J. Wang, and D. Cao, shapes using localized spectral convolutional networks,” Computer
“Surrounding vehicle detection using an fpga panoramic camera and Graphics Forum, vol. 34, no. 5, pp. 13–23, 2015. [Online]. Available:
deep cnns,” IEEE Transactions on Intelligent Transportation Systems, https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12693
pp. 1–13, 2019. [28] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun, “Spectral networks and
[8] S. Kim, H. Kim, W. Yoo, and K. Huh, “Sensor fusion algorithm design locally connected networks on graphs,” in International Conference on
in detecting vehicles using laser scanner and stereo vision,” IEEE Learning Representations (ICLR2014), CBLS, April 2014, 2014.
Transactions on Intelligent Transportation Systems, vol. 17, no. 4, pp. [29] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional
1072–1084, 2016. neural networks on graphs with fast localized spectral filtering,” in
[9] K. Park, S. Kim, and K. Sohn, “High-precision depth estimation using Advances in neural information processing systems, 2016, pp. 3844–
uncalibrated lidar and stereo fusion,” IEEE Transactions on Intelligent 3852.
Transportation Systems, vol. 21, no. 1, pp. 321–335, Jan 2020. [30] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical
[10] J. Zhu, L. Fan, W. Tian, L. Chen, D. Cao, and F. Wang, “Toward the feature learning on point sets in a metric space,” 2017.
ghosting phenomenon in a stereo-based map with a collaborative rgb-d [31] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet:
repair,” IEEE Transactions on Intelligent Transportation Systems, pp. Deep learning on point sets for 3d classification and segmentation,”
1–11, 2019. 2017 IEEE Conference on Computer Vision and Pattern Recognition
[11] J. Wang and L. Zhou, “Traffic light recognition with high dynamic (CVPR), Jul 2017. [Online]. Available: http://dx.doi.org/10.1109/
range imaging and deep learning,” IEEE Transactions on Intelligent CVPR.2017.16
Transportation Systems, vol. 20, no. 4, pp. 1341–1352, 2019. [32] B.-S. Hua, M.-K. Tran, and S.-K. Yeung, “Pointwise convolutional
[12] Z. Wang, Y. Wu, and Q. Niu, “Multi-sensor fusion in automated neural networks,” in Proceedings of the IEEE Conference on Computer
driving: A survey,” IEEE Access, 2019. Vision and Pattern Recognition, 2018, pp. 984–993.
[13] D. Feng, C. Haase-Schutz, L. Rosenbaum, H. Hertlein, C. Glaser, [33] M. Atzmon, H. Maron, and Y. Lipman, “Point convolutional
F. Timm, W. Wiesbeck, and K. Dietmayer, “Deep multi-modal neural networks by extension operators,” ACM Transactions on
object detection and semantic segmentation for autonomous driving: Graphics, vol. 37, no. 4, p. 112, Aug 2018. [Online]. Available:
Datasets, methods, and challenges,” IEEE Transactions on Intelligent http://dx.doi.org/10.1145/3197517.3201301
Transportation Systems, p. 120, 2020. [Online]. Available: http: [34] L. Ma, Y. Li, J. Li, W. Tan, Y. Yu, and M. A. Chapman, “Multi-scale
//dx.doi.org/10.1109/tits.2020.2972974 point-wise convolutional neural networks for 3d object segmentation
JOURNAL OF LATEX CLASS FILES 16

from lidar point clouds in large-scale environments,” IEEE Transac- Conference on Computer Vision and Pattern Recognition, Jun 2018.
tions on Intelligent Transportation Systems, pp. 1–16, 2019. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2018.00907
[35] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, [54] T. Wang, F. Wang, J. Lin, Y. Tsai, W. Chiu, and M. Sun, “Plug-and-
“Pointcnn: Convolution on x-transformed points,” in Advances play: Improve depth prediction via sparse data propagation,” in 2019
in Neural Information Processing Systems 31, S. Bengio, International Conference on Robotics and Automation (ICRA), May
H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, 2019, pp. 5880–5886.
and R. Garnett, Eds. Curran Associates, Inc., 2018, [55] A. Valada, R. Mohan, and W. Burgard, “Self-supervised model
pp. 820–830. [Online]. Available: http://papers.nips.cc/paper/ adaptation for multimodal semantic segmentation,” International
7362-pointcnn-convolution-on-x-transformed-points.pdf Journal of Computer Vision, Jul 2019. [Online]. Available: http:
[36] S. Lan, R. Yu, G. Yu, and L. S. Davis, “Modeling local geometric //dx.doi.org/10.1007/s11263-019-01188-y
structure of 3d point clouds using geo-cnn,” 2019 IEEE/CVF [56] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous
Conference on Computer Vision and Pattern Recognition (CVPR), Jun driving? the kitti vision benchmark suite,” in Conference on Computer
2019. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2019.00109 Vision and Pattern Recognition (CVPR), 2012.
[37] F. Groh, P. Wieschollek, and H. P. A. Lensch, “Flex-convolution,” [57] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets
Lecture Notes in Computer Science, p. 105122, 2019. [Online]. for 3d object detection from rgb-d data,” 2018 IEEE/CVF Conference
Available: http://dx.doi.org/10.1007/978-3-030-20887-5 7 on Computer Vision and Pattern Recognition, Jun 2018. [Online].
[38] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, Available: http://dx.doi.org/10.1109/CVPR.2018.00102
and L. Guibas, “Kpconv: Flexible and deformable convolution [58] X. Du, M. H. Ang, S. Karaman, and D. Rus, “A general pipeline
for point clouds,” 2019 IEEE/CVF International Conference on for 3d detection of vehicles,” 2018 IEEE International Conference
Computer Vision (ICCV), Oct 2019. [Online]. Available: http: on Robotics and Automation (ICRA), May 2018. [Online]. Available:
//dx.doi.org/10.1109/ICCV.2019.00651 http://dx.doi.org/10.1109/ICRA.2018.8461232
[39] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and [59] K. Shin, Y. P. Kwon, and M. Tomizuka, “Roarnet: A robust 3d object
A. Markham, “Randla-net: Efficient semantic segmentation of large- detection based on region approximation refinement,” 2019 IEEE
scale point clouds,” in Proceedings of the IEEE/CVF Conference on Intelligent Vehicles Symposium (IV), Jun 2019. [Online]. Available:
Computer Vision and Pattern Recognition, 2020, pp. 11 108–11 117. http://dx.doi.org/10.1109/IVS.2019.8813895
[40] W. Wu, Z. Qi, and L. Fuxin, “Pointconv: Deep convolutional networks [60] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, “3d bounding
on 3d point clouds,” 2019 IEEE/CVF Conference on Computer Vision box estimation using deep learning and geometry,” 2017 IEEE
and Pattern Recognition (CVPR), Jun 2019. [Online]. Available: Conference on Computer Vision and Pattern Recognition (CVPR), Jul
http://dx.doi.org/10.1109/CVPR.2019.00985 2017. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2017.597
[41] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, [61] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Ipod: Intensive point-
“Sparsity invariant cnns,” in International Conference on 3D Vision based object detector for point cloud,” 2018.
(3DV), 2017. [62] D. Xu, D. Anguelov, and A. Jain, “Pointfusion: Deep sensor
[42] F. Ma and S. Karaman, “Sparse-to-dense: Depth prediction from fusion for 3d bounding box estimation,” 2018 IEEE/CVF Conference
sparse depth samples and a single image,” 2018 IEEE International on Computer Vision and Pattern Recognition, Jun 2018. [Online].
Conference on Robotics and Automation (ICRA), May 2018. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2018.00033
Available: http://dx.doi.org/10.1109/ICRA.2018.8460184 [63] X. Zhao, Z. Liu, R. Hu, and K. Huang, “3d object detection using
[43] F. Ma, G. V. Cavalheiro, and S. Karaman, “Self-supervised scale invariant and feature reweighting networks,” in Proceedings of the
sparse-to-dense: Self-supervised depth completion from lidar and AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9267–
monocular camera,” 2019 International Conference on Robotics 9274.
and Automation (ICRA), May 2019. [Online]. Available: http: [64] M. Jiang, Y. Wu, T. Zhao, Z. Zhao, and C. Lu, “Pointsift: A sift-like
//dx.doi.org/10.1109/ICRA.2019.8793637 network module for 3d point cloud semantic segmentation,” 2018.
[44] X. Cheng, P. Wang, and R. Yang, “Depth estimation via affinity [65] Z. Wang and K. Jia, “Frustum convnet: Sliding frustums to
learned with convolutional spatial propagation network,” Lecture aggregate local point-wise features for amodal,” 2019 IEEE/RSJ
Notes in Computer Science, p. 108125, 2018. [Online]. Available: International Conference on Intelligent Robots and Systems (IROS),
http://dx.doi.org/10.1007/978-3-030-01270-0 7 Nov 2019. [Online]. Available: http://dx.doi.org/10.1109/IROS40897.
[45] X. Cheng, P. Wang, G. Chenye, and R. Yang, “Cspn++: Learning 2019.8968513
context and resource aware convolutional spatial propagation networks [66] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, “Pointpainting:
for depth completion,” Thirty-Fourth AAAI Conference on Artificial Sequential fusion for 3d object detection,” 2019.
Intelligence (AAAI-20), 2020. [67] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object
[46] M. Jaritz, R. D. Charette, E. Wirbel, X. Perrotton, and F. Nashashibi, detection network for autonomous driving,” 2017 IEEE Conference on
“Sparse and dense data with cnns: Depth completion and semantic Computer Vision and Pattern Recognition (CVPR), Jul 2017. [Online].
segmentation,” 2018 International Conference on 3D Vision (3DV), Sep Available: http://dx.doi.org/10.1109/CVPR.2017.691
2018. [Online]. Available: http://dx.doi.org/10.1109/3DV.2018.00017 [68] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander,
[47] A. Eldesokey, M. Felsberg, and F. S. Khan, “Confidence propagation “Joint 3d proposal generation and object detection from view
through cnns for guided sparse depth regression,” IEEE Transactions aggregation,” 2018 IEEE/RSJ International Conference on Intelligent
on Pattern Analysis and Machine Intelligence, p. 11, 2019. [Online]. Robots and Systems (IROS), Oct 2018. [Online]. Available: http:
Available: http://dx.doi.org/10.1109/TPAMI.2019.2929170 //dx.doi.org/10.1109/IROS.2018.8594049
[48] J. Tang, F.-P. Tian, W. Feng, J. Li, and P. Tan, “Learning guided [69] H. Lu, X. Chen, G. Zhang, Q. Zhou, Y. Ma, and Y. Zhao, “Scanet:
convolutional network for depth completion,” 2019. Spatial-channel attention network for 3d object detection,” in ICASSP
[49] W. Van Gansbeke, D. Neven, B. De Brabandere, and L. Van Gool, 2019 - 2019 IEEE International Conference on Acoustics, Speech and
“Sparse and noisy lidar completion with rgb guidance and Signal Processing (ICASSP), May 2019, pp. 1992–1996.
uncertainty,” 2019 16th International Conference on Machine [70] M. Liang, B. Yang, S. Wang, and R. Urtasun, “Deep continuous fusion
Vision Applications (MVA), May 2019. [Online]. Available: http: for multi-sensor 3d object detection,” in The European Conference on
//dx.doi.org/10.23919/MVA.2019.8757939 Computer Vision (ECCV), September 2018.
[50] X. Cheng, Y. Zhong, Y. Dai, P. Ji, and H. Li, “Noise-aware [71] V. A. Sindagi, Y. Zhou, and O. Tuzel, “Mvx-net: Multimodal
unsupervised deep lidar-stereo fusion,” 2019 IEEE/CVF Conference voxelnet for 3d object detection,” 2019 International Conference on
on Computer Vision and Pattern Recognition (CVPR), Jun 2019. Robotics and Automation (ICRA), May 2019. [Online]. Available:
[Online]. Available: http://dx.doi.org/10.1109/CVPR.2019.00650 http://dx.doi.org/10.1109/ICRA.2019.8794195
[51] J. Zhang, M. S. Ramanagopalg, R. Vasudevan, and M. Johnson- [72] Z. Wang, W. Zhan, and M. Tomizuka, “Fusing birds eye view lidar
Roberson, “Listereo: Generate dense depth maps from lidar and stereo point cloud and front view camera image for 3d object detection,” in
imagery,” 2019. 2018 IEEE Intelligent Vehicles Symposium (IV), June 2018, pp. 1–6.
[52] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning [73] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, “Multi-task multi-
for image recognition,” 2016 IEEE Conference on Computer Vision sensor fusion for 3d object detection,” in The IEEE Conference on
and Pattern Recognition (CVPR), Jun 2016. [Online]. Available: Computer Vision and Pattern Recognition (CVPR), June 2019.
http://dx.doi.org/10.1109/CVPR.2016.90 [74] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-
[53] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable Gonzalez, “Sensor fusion for joint 3d object detection and semantic
architectures for scalable image recognition,” 2018 IEEE/CVF segmentation,” 2019.
JOURNAL OF LATEX CLASS FILES 17

[75] S. Gupta, R. Girshick, P. Arbelez, and J. Malik, “Learning rich integration, assessment, and perspectives on acp-based parallel vision,”
features from rgb-d images for object detection and segmentation,” IEEE/CAA Journal of Automatica Sinica, vol. 5, no. 3, pp. 645–661,
Lecture Notes in Computer Science, p. 345360, 2014. [Online]. 2018.
Available: http://dx.doi.org/10.1007/978-3-319-10584-0 23 [96] A. S. Huang, D. Moore, M. Antone, E. Olson, and S. Teller, “Finding
[76] S. Gupta, J. Hoffman, and J. Malik, “Cross modal distillation for multiple lanes in urban road networks with vision and lidar,” Au-
supervision transfer,” 2016 IEEE Conference on Computer Vision tonomous Robots, vol. 26, no. 2-3, pp. 103–122, 2009.
and Pattern Recognition (CVPR), Jun 2016. [Online]. Available: [97] P. Y. Shinzato, D. F. Wolf, and C. Stiller, “Road terrain detection:
http://dx.doi.org/10.1109/CVPR.2016.309 Avoiding common obstacle detection assumptions using sensor fusion,”
[77] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature in 2014 IEEE Intelligent Vehicles Symposium Proceedings, 2014, pp.
hierarchies for accurate object detection and semantic segmentation,” 687–692.
2014 IEEE Conference on Computer Vision and Pattern Recognition, [98] L. Xiao, R. Wang, B. Dai, Y. Fang, D. Liu, and T. Wu, “Hybrid
Jun 2014. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2014.81 conditional random field based camera-lidar fusion for road detection,”
[78] S. Gupta, P. Arbelez, R. Girshick, and J. Malik, “Aligning 3d models Information Sciences, vol. 432, pp. 543–558, 2018.
to rgb-d images of cluttered scenes,” in 2015 IEEE Conference on [99] L. Xiao, B. Dai, D. Liu, T. Hu, and T. Wu, “Crf based road detection
Computer Vision and Pattern Recognition (CVPR), June 2015, pp. with multi-sensor fusion,” in 2015 IEEE Intelligent Vehicles Symposium
4731–4740. (IV), 2015, pp. 192–198.
[79] J. Schlosser, C. K. Chow, and Z. Kira, “Fusing lidar and images [100] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund, “Vision-based
for pedestrian detection using convolutional neural networks,” in 2016 traffic sign detection and analysis for intelligent driver assistance
IEEE International Conference on Robotics and Automation (ICRA), systems: Perspectives and survey,” IEEE Transactions on Intelligent
May 2016, pp. 2198–2205. Transportation Systems, vol. 13, no. 4, pp. 1484–1497, 2012.
[80] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation [101] M. Tan, B. Wang, Z. Wu, J. Wang, and G. Pan, “Weakly supervised
and detection from point cloud,” 2019 IEEE/CVF Conference on metric learning for traffic sign recognition in a lidar-equipped vehicle,”
Computer Vision and Pattern Recognition (CVPR), Jun 2019. [Online]. IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 5,
Available: http://dx.doi.org/10.1109/CVPR.2019.00086 pp. 1415–1427, 2016.
[81] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and [102] Y. Yu, J. Li, C. Wen, H. Guan, H. Luo, and C. Wang, “Bag-of-
O. Beijbom, “Pointpillars: Fast encoders for object detection from visual-phrases and hierarchical deep models for traffic sign detection
point clouds,” 2019 IEEE/CVF Conference on Computer Vision and recognition in mobile laser scanning data,” ISPRS journal of
and Pattern Recognition (CVPR), Jun 2019. [Online]. Available: photogrammetry and remote sensing, vol. 113, pp. 106–123, 2016.
http://dx.doi.org/10.1109/CVPR.2019.01298 [103] M. Soiln, B. Riveiro, J. Martnez-Snchez, and P. Arias, “Traffic
[82] S. Song and J. Xiao, “Sliding shapes for 3d object detection in depth sign detection in mls acquired point clouds for geometric and
images,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, image-based semantic inventory,” ISPRS Journal of Photogrammetry
B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International and Remote Sensing, vol. 114, pp. 92 – 101, 2016. [Online]. Available:
Publishing, 2014, pp. 634–651. http://www.sciencedirect.com/science/article/pii/S0924271616000368
[83] S. Wang, S. Suo, W.-C. Ma, A. Pokrovsky, and R. Urtasun, “Deep
[104] A. Arcos-Garcia, M. Soilán, J. A. Alvarez-Garcia, and B. Riveiro,
parametric continuous convolutional neural networks,” in Proceedings
“Exploiting synergies of mobile mapping sensors and deep learning
of the IEEE Conference on Computer Vision and Pattern Recognition,
for traffic sign recognition systems,” Expert Systems with Applications,
2018, pp. 2589–2597.
vol. 89, pp. 286–295, 2017.
[84] S. Song and J. Xiao, “Deep sliding shapes for amodal 3d object
[105] H. Guan, Y. Yu, D. Peng, Y. Zang, J. Lu, A. Li, and J. Li, “A
detection in rgb-d images,” in Proceedings of the IEEE Conference
convolutional capsule network for traffic-sign recognition using mobile
on Computer Vision and Pattern Recognition, 2016, pp. 808–816.
lidar data with digital images,” IEEE Geoscience and Remote Sensing
[85] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K.
Letters, vol. 17, no. 6, pp. 1067–1071, 2020.
Wellington, “Lasernet: An efficient probabilistic 3d object detector
for autonomous driving,” 2019 IEEE/CVF Conference on Computer [106] Z. Deng and L. Zhou, “Detection and recognition of traffic planar ob-
Vision and Pattern Recognition (CVPR), Jun 2019. [Online]. Available: jects using colorized laser scan and perspective distortion rectification,”
http://dx.doi.org/10.1109/CVPR.2019.01296 IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 5,
[86] M. Bai, G. Mattyus, N. Homayounfar, S. Wang, S. K. Lakshmikanth, pp. 1485–1495, 2018.
and R. Urtasun, “Deep multi-sensor lane detection,” in 2018 IEEE/RSJ [107] H. Guan, W. Yan, Y. Yu, L. Zhong, and D. Li, “Robust traffic-sign
International Conference on Intelligent Robots and Systems (IROS), detection and classification using mobile lidar data with digital images,”
2018, pp. 3102–3109. IEEE Journal of Selected Topics in Applied Earth Observations and
[87] F. Wulff, B. Schufele, O. Sawade, D. Becker, B. Henke, and I. Radusch, Remote Sensing, vol. 11, no. 5, pp. 1715–1724, 2018.
“Early fusion of camera and lidar for robust road detection based on [108] C. Premebida, J. Carreira, J. Batista, and U. Nunes, “Pedestrian
u-net fcn,” in 2018 IEEE Intelligent Vehicles Symposium (IV), 2018, detection combining rgb and dense lidar data,” 09 2014, pp. 4112–
pp. 1426–1431. 4117.
[88] X. Lv, Z. Liu, J. Xin, and N. Zheng, “A novel approach for detecting [109] A. Dai and M. Niener, “3dmv: Joint 3d-multi-view prediction
road based on two-stream fusion fully convolutional network,” in 2018 for 3d semantic scene segmentation,” Lecture Notes in Computer
IEEE Intelligent Vehicles Symposium (IV), 2018, pp. 1464–1469. Science, p. 458474, 2018. [Online]. Available: http://dx.doi.org/10.
[89] D. Yu, H. Xiong, Q. Xu, J. Wang, and K. Li, “Multi-stage residual fu- 1007/978-3-030-01249-6 28
sion network for lidar-camera road detection,” in 2019 IEEE Intelligent [110] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser,
Vehicles Symposium (IV), 2019, pp. 2323–2328. and M. NieBner, “Scannet: Richly-annotated 3d reconstructions
[90] L. Caltagirone, M. Bellone, L. Svensson, and M. Wahde, “Lidarcamera of indoor scenes,” 2017 IEEE Conference on Computer Vision
fusion for road detection using fully convolutional neural networks,” and Pattern Recognition (CVPR), Jul 2017. [Online]. Available:
Robotics and Autonomous Systems, vol. 111, p. 125131, Jan 2019. http://dx.doi.org/10.1109/CVPR.2017.261
[Online]. Available: http://dx.doi.org/10.1016/j.robot.2018.11.002 [111] H.-Y. Chiang, Y.-L. Lin, Y.-C. Liu, and W. H. Hsu, “A unified
[91] Z. Chen, J. Zhang, and D. Tao, “Progressive lidar adaptation for road point-based framework for 3d segmentation,” 2019 International
detection,” IEEE/CAA Journal of Automatica Sinica, vol. 6, no. 3, pp. Conference on 3D Vision (3DV), Sep 2019. [Online]. Available:
693–702, 2019. http://dx.doi.org/10.1109/3DV.2019.00026
[92] J. Lee and T. Park, “Fast road detection by cnn-based camera-lidar [112] M. Jaritz, J. Gu, and H. Su, “Multi-view pointnet for 3d scene
fusion and spherical coordinate transformation,” IEEE Transactions on understanding,” 2019.
Intelligent Transportation Systems, pp. 1–9, 2020. [113] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-
[93] L. Ma, Y. Li, J. Li, C. Wang, R. Wang, and M. A. Chapman, “Mobile H. Yang, and J. Kautz, “Splatnet: Sparse lattice networks for
laser scanned point-clouds for road object detection and extraction: A point cloud processing,” 2018 IEEE/CVF Conference on Computer
review,” Remote Sensing, vol. 10, no. 10, p. 1531, 2018. Vision and Pattern Recognition, Jun 2018. [Online]. Available:
[94] S. P. Narote, P. N. Bhujbal, A. S. Narote, and D. M. Dhane, “A review http://dx.doi.org/10.1109/CVPR.2018.00268
of recent advances in lane detection and departure warning system,” [114] J. Hou, A. Dai, and M. NieBner, “3d-sis: 3d semantic instance
Pattern Recognition, vol. 73, pp. 216–234, 2018. segmentation of rgb-d scans,” 2019 IEEE/CVF Conference on
[95] Y. Xing, C. Lv, L. Chen, H. Wang, H. Wang, D. Cao, E. Velenis, Computer Vision and Pattern Recognition (CVPR), Jun 2019.
and F.-Y. Wang, “Advances in vision-based lane detection: algorithms, [Online]. Available: http://dx.doi.org/10.1109/CVPR.2019.00455
JOURNAL OF LATEX CLASS FILES 18

[115] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep [136] Y. Li, D. Roblek, and M. Tagliasacchi, “From here to there:
neural network architecture for real-time semantic segmentation,” 2016. Video inbetweening using direct 3d convolutions,” arXiv preprint
[116] G. Narita, T. Seno, T. Ishikawa, and Y. Kaji, “Panopticfusion: arXiv:1905.10240, 2019.
Online volumetric semantic mapping at the level of stuff and [137] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi,
things,” 2019 IEEE/RSJ International Conference on Intelligent I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and et al.,
Robots and Systems (IROS), Nov 2019. [Online]. Available: “Speed/accuracy trade-offs for modern convolutional object detectors,”
http://dx.doi.org/10.1109/IROS40897.2019.8967890 2017 IEEE Conference on Computer Vision and Pattern Recognition
[117] C. Elich, F. Engelmann, T. Kontogianni, and B. Leibe, “3d birds- (CVPR), Jul 2017. [Online]. Available: http://dx.doi.org/10.1109/
eye-view instance segmentation,” Pattern Recognition, p. 4861, 2019. CVPR.2017.351
[Online]. Available: http://dx.doi.org/10.1007/978-3-030-33676-9 4 [138] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d
[118] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward shapenets: A deep representation for volumetric shapes,” 2015 IEEE
feature space analysis,” IEEE Transactions on pattern analysis and Conference on Computer Vision and Pattern Recognition (CVPR),
machine intelligence, vol. 24, no. 5, pp. 603–619, 2002. Jun 2015. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2015.
[119] D. Frossard and R. Urtasun, “End-to-end learning of multi-sensor 7298801
3d tracking by detection,” 2018 IEEE International Conference on
Robotics and Automation (ICRA), May 2018. [Online]. Available:
http://dx.doi.org/10.1109/ICRA.2018.8462884
[120] W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy, “Robust
multi-modality multi-object tracking,” 2019 IEEE/CVF International Yaodong Cui received the B.S. degree in automation
Conference on Computer Vision (ICCV), Oct 2019. [Online]. Available: from Chang’an University, Xi’an, China, in 2017,
http://dx.doi.org/10.1109/ICCV.2019.00245 received the M.Sc. degree in Systems, Control and
[121] J. Luiten, T. Fischer, and B. Leibe, “Track to reconstruct and Signal Processing from the University of Southamp-
reconstruct to track,” IEEE Robotics and Automation Letters, ton, Southampton, UK, in 2019. He is currently
vol. 5, no. 2, p. 18031810, Apr 2020. [Online]. Available: working toward a Ph.D. degree with the Waterloo
http://dx.doi.org/10.1109/LRA.2020.2969183 Cognitive Autonomous Driving (CogDrive) Lab, De-
[122] M. Simon, K. Amende, A. Kraus, J. Honer, T. Smann, H. Kaulbersch, partment of Mechanical Engineering, University of
S. Milz, and H. M. Gross, “Complexer-yolo: Real-time 3d object Waterloo, Waterloo, Canada. His research interests
detection and tracking on semantic point clouds,” 2019. include sensor fusion, perception for the intelligent
[123] K. Simonyan and A. Zisserman, “Very deep convolutional networks vehicle, driver emotion detection.
for large-scale image recognition,” in International Conference on
Learning Representations, 2015.
[124] A. Napier, P. Corke, and P. Newman, “Cross-calibration of push-broom
2d lidars and cameras in natural scenes,” in 2013 IEEE International
Conference on Robotics and Automation, 2013, pp. 3679–3684. Ren Chen received the B.S. degree in electronic
[125] Z. Taylor and J. Nieto, “Automatic calibration of lidar and camera science and technology from Huazhong University
images using normalized mutual information,” in Robotics and Au- of Science and Technology, Wuhan, China. received
tomation (ICRA), 2013 IEEE International Conference on, 2013. the M.Sc. degree in Computer Science from China
[126] G. Pandey, J. R. McBride, S. Savarese, and R. M. Eustice, “Automatic Academy of Aerospace Science and Engineering,
extrinsic calibration of vision and lidar by maximizing mutual infor- Beijing, China. He worked at Institude of Deep
mation,” Journal of Field Robotics, vol. 32, no. 5, pp. 696–722, 2015. Learning, Baidu and Autonomous Driving Lab, Ten-
[127] M. Miled, B. Soheilian, E. Habets, and B. Vallet, “Hybrid online cent. His research interests include computer vision,
mobile laser scanner calibration through image alignment by mutual perception for robotics, realtime dense 3D recon-
information,” ISPRS Annals of the Photogrammetry, Remote Sensing struction and robotic motion planning.
and Spatial Information Sciences, vol. 3, p. 25, 2016.
[128] H.-J. Chien, R. Klette, N. Schneider, and U. Franke, “Visual odometry
driven online calibration for monocular lidar-camera systems,” in 2016
23rd International conference on pattern recognition (ICPR). IEEE,
2016, pp. 2848–2853.
[129] N. Schneider, F. Piewak, C. Stiller, and U. Franke, “Regnet: Wenbo Chu received the B.S. degree in automo-
Multimodal sensor registration using deep neural networks,” 2017 tive engineering from Tsinghua University, Beijing,
IEEE Intelligent Vehicles Symposium (IV), Jun 2017. [Online]. China, in 2008, the M.S. degree in automotive
Available: http://dx.doi.org/10.1109/IVS.2017.7995968 engineering from RWTH Aachen, Germany, and
[130] G. Iyer, R. KarnikRam, K. M. Jatavallabhula, and K. M. Krishna, “Cal- the Ph.D. degree in mechanical engineering from
ibnet: Geometrically supervised extrinsic calibration using 3d spatial Tsinghua University, in 2014. He is currently a
transformer networks,” 2018 IEEE/RSJ International Conference on Research Fellow of China Intelligent and Connected
Intelligent Robots and Systems (IROS), pp. 1110–1117, 2018. Vehicles Research Institute Company, Ltd., Beijing,
[131] S. H. Park, B. Kim, C. M. Kang, C. C. Chung, and J. W. which is also known as the National Innovation
Choi, “Sequence-to-sequence prediction of vehicle trajectory via lstm Center of Intelligent and Connected Vehicles. His
encoder-decoder architecture,” in 2018 IEEE Intelligent Vehicles Sym- research interests include intelligent connected ve-
posium (IV), 2018, pp. 1672–1678. hicles, new energy vehicles, and vehicle dynamics and control
[132] J. Weng, N. Ahuja, and T. S. Huang, “Cresceptron: a self-organizing
neural network which grows adaptively,” in [Proceedings 1992] IJCNN
International Joint Conference on Neural Networks, vol. 1, June 1992,
pp. 576–581 vol.1.
[133] A. R. Dhamija, M. Gnther, J. Ventura, and T. E. Boult, “The over- Long Chen received the B.Sc. degree in commu-
looked elephant of object detection: Open set,” in 2020 IEEE Winter nication engineering and the Ph.D. degree in signal
Conference on Applications of Computer Vision (WACV), 2020, pp. and information processing from Wuhan University,
1010–1019. Wuhan, China.He is currently an Associate Professor
[134] K. Wong, S. Wang, M. Ren, M. Liang, and R. Urtasun, “Identifying with School of Data and Computer Science, Sun
unknown instances for autonomous driving,” in Conference on Robot Yat-sen University, Guangzhou, China. His areas of
Learning, 2020, pp. 384–393. interest include autonomous driving, robotics, artifi-
cial intelligence where he has contributed more than
[135] E. Imre, J. Guillemaut, and A. Hilton, “Through-the-lens multi-camera
70 publications. He serves as an Associate Editor
synchronisation and frame-drop detection for 3d reconstruction,” in
for IEEE Transactions on Intelligent Transportation
2012 Second International Conference on 3D Imaging, Modeling,
Systems.
Processing, Visualization Transmission, 2012, pp. 395–402.
JOURNAL OF LATEX CLASS FILES 19

Daxin Tian is currently a professor in the School


of Transportation Science and Engineering, Bei-
hang University, Beijing, China. He is IEEE Senior
Member, IEEE Intelligent Transportation Systems
Society Member, and IEEE Vehicular Technology
Society Member, etc. His current research interests
include mobile computing, intelligent transportation
systems, vehicular ad hoc networks, and swarm
intelligent.

Ying Li received the M.Sc. degree in remote sens-


ing from Wuhan University, China, in 2017. She
is currently pursuing the Ph.D. degree with the
Mobile Sensing and Geodata Science Laboratory,
Department of Geography and Environmental Man-
agement, University of Waterloo, Canada. Her re-
search interests include autonomous driving, mobile
laser scanning, intelligent processing of point clouds,
geometric and semantic modeling, and augmented
reality.

Dongpu Cao received the Ph.D. degree from Con-


cordia University, Canada, in 2008. He is the Canada
Research Chair in Driver Cognition and Automated
Driving, and currently an Associate Professor and
Director of Waterloo Cognitive Autonomous Driving
(CogDrive) Lab at University of Waterloo, Canada.
His current research focuses on driver cognition,
automated driving and cognitive autonomous driv-
ing. He has contributed more than 200 papers and 3
books. He received the SAE Arch T. Colwell Merit
Award in 2012, IEEE VTS 2020 Best Vehicular
Electronics Paper Award and three Best Paper Awards from the ASME
and IEEE conferences. Prof. Cao serves as Deputy Editor-in-Chief for IET
INTELLIGENT TRANSPORT SYSTEMS JOURNAL, and an Associate
Editor for IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE
TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS,
IEEE/ASME TRANSACTIONS ON MECHATRONICS, IEEE TRANSAC-
TIONS ON INDUSTRIAL ELECTRONICS, IEEE/CAA JOURNAL OF
AUTOMATICA SINICA, IEEE TRANSACTIONS ON COMPUTATIONAL
SOCIAL SYSTEMS, and ASME JOURNAL OF DYNAMIC SYSTEMS,
MEASUREMENT AND CONTROL. He was a Guest Editor for VEHICLE
SYSTEM DYNAMICS, IEEE TRANSACTIONS ON SMC: SYSTEMS and
IEEE INTERNET OF THINGS JOURNAL. He serves on the SAE Vehicle
Dynamics Standards Committee and acts as the Co-Chair of IEEE ITSS
Technical Committee on Cooperative Driving. Prof. Cao is an IEEE VTS
Distinguished Lecturer

You might also like