Deep Learning For Image and Point Cloud Fusion in Autonomous Driving: A Review
Fig. 2. Tasks related to image and point cloud fusion based perception and their corresponding sections.
irregular, orderless and continuous, while the image is regular, ordered and discrete. These characteristic differences between the point cloud and the image lead to different feature extraction methodologies. Figure 1 compares the characteristics of images and point clouds.

Previous reviews [12] [13] on deep learning methods for multi-modal data fusion covered a broad range of sensors, including radars, cameras, LiDARs, ultrasonic sensors, IMUs, odometers, GNSS and HD maps. This paper focuses on camera-LiDAR fusion only and is therefore able to present more detailed reviews of individual methods. Furthermore, we cover a broader range of perception-related topics (depth completion, dynamic and stationary object detection, semantic segmentation, tracking and online cross-sensor calibration) that are interconnected and are not fully included in the previous reviews [13]. The contributions of this paper are summarized as follows:

• To the best of our knowledge, this paper is the first survey focusing on deep learning based image and point cloud fusion approaches in autonomous driving, including depth completion, dynamic and stationary object detection, semantic segmentation, tracking and online cross-sensor calibration.
• This paper organizes and reviews methods based on their fusion methodologies. Furthermore, it presents the most up-to-date (2014-2020) overviews and performance comparisons of state-of-the-art camera-LiDAR fusion methods.
• This paper raises overlooked open questions, such as open-set detection and sensor-agnostic frameworks, that are critical for the real-world deployment of autonomous driving technology. Moreover, summaries of trends and possible research directions on open challenges are presented.

This paper first provides a brief overview of deep learning methods on image and point cloud data in Section II. In Sections III to VIII, reviews on camera-LiDAR based depth completion, dynamic object detection, stationary object detection, semantic segmentation, object tracking and online sensor calibration are presented respectively. Trends, open challenges and promising directions are discussed in Section IX. Finally, a summary is given in Section X. Figure 2 presents the overall structure of this survey and the corresponding topics.

II. A BRIEF REVIEW OF DEEP LEARNING

A. Deep Learning on Image

Convolutional Neural Networks (CNNs) are one of the most efficient and powerful deep learning models for image processing and understanding. Compared to the Multi-Layer Perceptron (MLP), the CNN is shift-invariant, contains fewer weights and exploits hierarchical patterns, making it highly efficient for image semantic extraction. Hidden layers of a CNN consist of a hierarchy of convolutional layers, batch normalization layers, activation layers and pooling layers, which are trained end-to-end. This hierarchical structure extracts image features with increasing abstraction levels and receptive fields, enabling the learning of high-level semantics.

B. Deep Learning on Point Cloud

The point cloud is a set of data points, which are the LiDAR's measurements of the detected object's surface. In terms of data structure, the point cloud is sparse, irregular, orderless and continuous. The point cloud encodes information in 3D structures and in per-point features (reflective intensities, color, normals, etc.), which are invariant to scale, rigid transformation and permutation. These characteristics make feature extraction on the point cloud challenging for existing deep learning models and require modifying existing models or developing new ones. Therefore, this section focuses on introducing common methodologies for point cloud processing.

1) Volumetric representation based: The volumetric representation partitions the point cloud into a fixed-resolution 3D grid, where the features of each grid cell/voxel are hand-crafted or learned. This representation is compatible with standard 3D convolution [14] [15] [16]. Several techniques have been proposed in [17] to reduce over-fitting and orientation sensitivity and to capture the internal structures of objects. However, the volumetric representation loses spatial resolution and fine-grained 3D geometry during voxelization, which limits its performance. Furthermore, attempts to increase its spatial resolution (denser voxels) cause the computation and memory footprint to grow cubically, making it unscalable (see the minimal voxelization sketch below).

2) Index/tree representation based: To alleviate the trade-off between high spatial resolution and computational cost, adapted-resolution partition methods that leverage tree-like data structures, such as the kd-tree [18] [19] and the octree [20] [21] [22], have been proposed. By dividing the point cloud into a series of unbalanced trees, regions can be partitioned based on their point densities. This allows regions with lower point densities to be represented at coarser resolutions, reducing unnecessary computation and memory use.
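To make the volumetric representation concrete, the following minimal sketch (NumPy only; the grid range, voxel resolution and the occupancy/mean-intensity features are illustrative assumptions, not taken from any specific reviewed method) assigns points to a fixed-resolution grid. Note how the dense grid's memory grows cubically with resolution, as discussed above.

```python
import numpy as np

def voxelize(points, intensities, grid_range=(-40.0, 40.0), resolution=0.5):
    """Minimal fixed-resolution voxelization of a LiDAR point cloud.

    points: (N, 3) array of x, y, z coordinates in meters.
    intensities: (N,) per-point reflectance values.
    Returns a dense (D, D, D, 2) grid holding [occupancy, mean intensity].
    """
    lo, hi = grid_range
    dim = int(round((hi - lo) / resolution))           # voxels per axis (cubic growth!)
    grid = np.zeros((dim, dim, dim, 2), dtype=np.float32)
    counts = np.zeros((dim, dim, dim), dtype=np.int32)

    # Map each point to its voxel index and drop points outside the grid.
    idx = np.floor((points - lo) / resolution).astype(int)
    keep = np.all((idx >= 0) & (idx < dim), axis=1)
    idx, inten = idx[keep], intensities[keep]

    for (i, j, k), r in zip(idx, inten):               # accumulate per-voxel statistics
        grid[i, j, k, 0] = 1.0                         # occupancy
        grid[i, j, k, 1] += r                          # sum of intensities
        counts[i, j, k] += 1

    nonzero = counts > 0
    grid[..., 1][nonzero] /= counts[nonzero]           # mean intensity per occupied voxel
    return grid

pts = np.random.uniform(-40, 40, size=(2048, 3)).astype(np.float32)
refl = np.random.rand(2048).astype(np.float32)
vox = voxelize(pts, refl)
print(vox.shape, int(vox[..., 0].sum()))               # grid shape and #occupied voxels
```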
Qi et al. [30] extended the PointNet to extract features from different levels by grouping points into multiple sets and applying PointNets locally. To reduce the computational and memory cost of the PointNet++ [30], the RandLA-Net [39] stacks random point sampling modules and attention-based local feature aggregation modules hierarchically to progressively increase the receptive field while maintaining high efficiency (a minimal sketch of this sample-and-group pattern is given below).

III. DEPTH COMPLETION

In general, depth completion can be formulated as learning a network f(·), parametrized by w, that predicts the ground truth G given the input x, under a loss function L(·, ·). Figure 3 gives the timeline of depth completion models and their corresponding fusion levels. The comparative results of depth completion models on the KITTI depth completion benchmark [41] are listed in Table I.
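The hierarchical point feature learning described above repeatedly samples a subset of points and aggregates features from each sample's local neighbourhood. The sketch below (NumPy only; function and argument names are illustrative, random sampling is used as in RandLA-Net [39] while PointNet++ [30] instead uses farthest-point sampling, and the learned MLP/attention pooling is only indicated by a comment) shows the sample-and-group step.

```python
import numpy as np

def sample_and_group(points, feats, n_samples, k):
    """Randomly sample n_samples points and gather their k-NN neighborhoods.

    A minimal sketch of the sample-and-group pattern used by hierarchical
    point networks; names and sizes are illustrative only.
    """
    n = points.shape[0]
    idx = np.random.choice(n, size=n_samples, replace=False)   # random point sampling
    centers = points[idx]                                       # (n_samples, 3)
    # Pairwise distances between sampled centers and all points.
    d = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=-1)
    knn = np.argsort(d, axis=1)[:, :k]                          # (n_samples, k) neighbor indices
    grouped_xyz = points[knn] - centers[:, None, :]             # local coordinates
    grouped_feats = feats[knn]                                  # neighbor features
    # A shared MLP plus max/attention pooling would aggregate each group next.
    return centers, np.concatenate([grouped_xyz, grouped_feats], axis=-1)

pts = np.random.rand(1024, 3).astype(np.float32)
fts = np.random.rand(1024, 8).astype(np.float32)
centers, groups = sample_and_group(pts, fts, n_samples=256, k=16)
print(centers.shape, groups.shape)   # (256, 3) (256, 16, 11)
```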
TABLE I
Comparative results on the KITTI depth completion benchmark [41]. 'LL', 'S', 'SS', 'U' stand for learning scheme, supervised, self-supervised and unsupervised respectively. 'RGB', 'sD', 'Co', 'S' stand for RGB image data, sparse depth, confidence and stereo disparity. 'iRMSE', 'iMAE', 'RMSE', 'MAE' stand for root mean squared error of the inverse depth [1/km], mean absolute error of the inverse depth [1/km], root mean squared error [mm] and mean absolute error [mm].
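Most models reviewed in this section consume an 'RGB-D' input in which the LiDAR scan is rendered as a sparse depth map aligned with the camera image. The sketch below illustrates this standard projection step; the intrinsic matrix K, the LiDAR-to-camera extrinsic transform T and the image size are illustrative placeholders, not values from any dataset.

```python
import numpy as np

def lidar_to_sparse_depth(points, K, T, height, width):
    """Project LiDAR points into the image plane to build a sparse depth map.

    points: (N, 3) LiDAR coordinates; K: (3, 3) camera intrinsics;
    T: (4, 4) LiDAR-to-camera extrinsic transform.
    Returns an (H, W) depth map with zeros where no point projects.
    """
    # Transform points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    cam = (T @ pts_h.T).T[:, :3]
    cam = cam[cam[:, 2] > 0]                     # keep points in front of the camera

    # Perspective projection with the pinhole model.
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = cam[:, 2]

    depth = np.zeros((height, width), dtype=np.float32)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    # If several points hit the same pixel, keep the closest one.
    for ui, vi, zi in sorted(zip(u[inside], v[inside], z[inside]), key=lambda t: -t[2]):
        depth[vi, ui] = zi
    return depth

K = np.array([[721.5, 0, 609.6], [0, 721.5, 172.9], [0, 0, 1.0]])   # placeholder intrinsics
T = np.eye(4)                                                       # placeholder extrinsics
pts = np.random.uniform([-10, -2, 2], [10, 2, 60], size=(5000, 3))
sparse = lidar_to_sparse_depth(pts, K, T, height=375, width=1242)
print(f"{(sparse > 0).sum()} valid depth pixels out of {sparse.size}")
```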
A. Mono Camera and LiDAR Fusion

The idea behind image-guided depth completion is that dense RGB/color information contains relevant 3D geometry. Therefore, images can be leveraged as a reference for depth up-sampling.

1) Signal-level fusion: In 2018, Ma et al. [42] presented a ResNet [52] based autoencoder network that leverages RGB-D images (i.e. images concatenated with sparse depth maps) to predict dense depth maps. However, this method requires pixel-level depth ground truth, which is difficult to obtain. To solve this issue, Ma et al. [43] presented a model-based self-supervised framework that only requires a sequence of images and sparse depth images for training. This self-supervision is achieved by employing a sparse depth constraint, a photometric loss and a smoothness loss. However, this approach assumes objects to be stationary. Furthermore, the resulting depth output is blurry and the input depth may not be preserved.

To generate a sharp dense depth map in real-time, Cheng et al. [44] fed RGB-D images to a convolutional spatial propagation network (CSPN). The CSPN aims to extract the image-dependent affinity matrix directly, producing significantly better results on key metrics with less run-time. In CSPN++, Cheng et al. [45] proposed to dynamically select convolutional kernel sizes and iterations to reduce computation. Furthermore, CSPN++ employs weighted assembling to boost its performance.

2) Feature-level fusion: Jaritz et al. [46] presented an autoencoder network that can either perform depth completion or semantic segmentation from sparse depth maps and images without applying validity masks. Images and sparse depth maps are first processed by two parallel NASNet-based encoders [53] before being fused into a shared decoder. This approach can achieve decent performance with very sparse depth inputs (8-channel LiDAR). Wang et al. [54] designed an integrable module (PnP) that leverages the sparse depth map to improve the performance of existing image-based depth prediction networks. The PnP module leverages gradients calculated from the sparse depth to update the intermediate feature map produced by the existing depth prediction network. Eldesokey et al. [47] presented a framework for unguided depth completion that processes images and very sparse depth maps in parallel and combines them in a shared decoder. Furthermore, normalized convolutions are used to process the highly sparse depth and to propagate confidence. Valada et al. [55] extended one-stage feature-level fusion to multiple stages at varying depths of the network. Similarly, GuideNet [48] fuses image features into sparse depth features at different stages of the encoder to guide the up-sampling of sparse depths, which achieves top performance on the KITTI depth completion benchmark. The constraint of these approaches is the lack of large-scale datasets with dense depth ground truth.

3) Multi-level fusion: Van Gansbeke et al. [49] further combine signal-level fusion and feature-level fusion in an image-guided depth completion network. The network consists of a global and a local branch that process RGB-D data and depth data in parallel before fusing them based on confidence maps.

B. Stereo Cameras and LiDAR Fusion

Compared with the RGB image, the dense disparity from stereo cameras contains richer 3D geometry. On the other hand, LiDAR depth is sparse but of higher accuracy. These complementary characteristics enable stereo-LiDAR fusion based depth completion models to produce more accurate dense depth. However, it is worth noting that stereo cameras have limited range and struggle in high-occlusion, texture-less environments, making them less ideal for autonomous driving.

1) Feature-level fusion: One of the pioneering works is from Park et al. [9], in which a high-precision dense disparity map is computed from dense stereo disparity and the point cloud using a two-stage CNN. The first stage of the CNN takes LiDAR and stereo disparity to produce a fused disparity. In the second stage, this fused disparity and the left RGB image are fused in the feature space to predict the final high-precision disparity. Finally, the 3D scene is reconstructed from this high-precision disparity. The bottleneck of this approach is the lack of large-scale annotated stereo-LiDAR datasets. The LidarStereoNet [50] averted this difficulty with an unsupervised learning scheme, which employs an image warping/photometric loss, a sparse depth loss, a smoothness loss and a plane fitting loss for end-to-end training. Furthermore, the introduction of a 'feedback loop' makes the LidarStereoNet robust against noisy point clouds and sensor misalignment. Similarly, Zhang et al. [51] presented a self-supervised scheme for depth completion. The loss function consists of sparse depth, photometric and smoothness losses.

IV. DYNAMIC OBJECT DETECTION

3D object detection aims to locate, classify and estimate oriented bounding boxes in 3D space. This section is devoted to dynamic object detection, which includes common dynamic road objects (cars, pedestrians, cyclists, etc.). There are two main approaches for object detection: sequential and one-step. Sequential models consist of a proposal stage and a 3D bounding box (bbox) regression stage in chronological order. In the proposal stage, regions that may contain objects of interest are proposed. In the bbox regression stage, these proposals are classified based on region-wise features extracted from 3D geometry. However, the performance of sequential fusion is limited by each stage. On the other hand, one-step models consist of one stage, in which 2D and 3D data are processed in a parallel manner.

The timeline of 3D object detection networks and typical model architectures are shown in Figures 4 & 5. Table II presents the comparative results of 3D object detection models on the KITTI 3D Object Detection benchmark [56]. Table III summarizes and compares dynamic object detection models.

Fig. 4. Timeline of 3D object detection networks and their corresponding fusion levels.

A. 2D Proposal Based Sequential Models

A 2D proposal based sequential model attempts to utilize image semantics in the proposal stage, which takes advantage of off-the-shelf image processing models. Specifically, these methods leverage an image object detector to generate 2D region proposals, which are projected to 3D space as detection seeds. There are two projection approaches to translate 2D proposals to 3D. The first one projects bounding boxes in the image plane to the point cloud, which results in a frustum-shaped 3D search space. The second one projects the point cloud to the image plane, which results in a point cloud with point-wise 2D semantics.

1) Result-level Fusion: The intuition behind result-level fusion is to use off-the-shelf 2D object detectors to limit the 3D search space for 3D object detection, which significantly reduces computation and improves run-time. However, since the whole pipeline depends on the results of the 2D object detector, it suffers from the limitations of the image-based detector.

One of the early works of result-level fusion is the F-PointNets [57], where 2D bounding boxes are first generated from images and projected to 3D space. The resulting frustum proposals are fed into a PointNet [31] based detector for 3D object detection. Du et al. [58] extended the 2D-to-3D proposal generation stage with an additional proposal refinement stage, which further reduces unnecessary computation on background points. During this refinement stage, a model-fitting based method is used to filter out background points inside the seed region. Finally, the filtered points are fed into the bbox regression network. The RoarNet [59] follows a similar idea, but instead employs a neural network for the proposal refinement stage. Multiple 3D cylinder proposals are first generated for each 2D bbox using the geometric agreement search [60], which results in smaller but more precise frustum proposals than the F-PointNet [57]. These initial cylinder proposals are then processed by a PointNet [30] based header network for the final refinement. To summarize, these approaches assume each seed region contains only one object of interest, which does not hold for crowded scenes and small objects like pedestrians.

One possible solution to the aforementioned issues is to replace the 2D object detector with 2D semantic segmentation and the region-wise seed proposals with point-wise seed proposals. The Intensive Point-based Object Detector (IPOD) [61] by Yang et al. is a work in this direction. In the first step, 2D semantic segmentation is used to filter out background points. This is achieved by projecting points to the image plane and associating points with 2D semantic labels. The resulting foreground point cloud retains the context information and fine-grained location, which is essential for region proposal and bbox regression. In the following point-wise proposal generation and bbox regression stages, two PointNet++ [30] based networks are used for proposal feature extraction and bbox prediction. In addition, a novel criterion called PointsIoU is proposed to speed up training and inference. This approach has yielded significant performance advantages over other state-of-the-art approaches in scenes with high occlusion or many objects.

2) Multi-level Fusion: Another possible direction of improvement is to combine result-level fusion with feature-level fusion, and one such work is PointFusion [62]. PointFusion first utilizes an existing 2D object detector to generate 2D bboxes. These bboxes are used to select the corresponding points by projecting points to the image plane and keeping those that fall inside the bboxes. Finally, a ResNet [52] and PointNet [31] based network combines the image and point cloud features to estimate the 3D bounding box.
TABLE II
Comparative results on the KITTI 3D Object Detection benchmark (moderate difficulty) [56]. The IoU requirement for cars, pedestrians and cyclists is 70%, 50% and 50% respectively. 'FL', 'FT', 'RT', 'MS', 'PCR' stand for fusion level, fusion type, run-time, model size (number of parameters) and point cloud representation respectively.
For MV3D [67], T_{3D→views} denotes the transformation that projects the point cloud p_{3D} from 3D space to the bird's eye view (BEV), the front view (FV) and the image plane (RGB). The ROI-pooling R used to obtain the feature vector f_{views} is defined as:

f_{views} = R(x, ROI_{views}),   views ∈ {BV, FV, RGB}.   (5)

There are a few drawbacks of MV3D. First, generating 3D proposals on the BEV assumes that all objects of interest are captured without occlusion from this view-point. This assumption does not hold well for small object instances, such as pedestrians and bicyclists, which can be fully occluded by other large objects in the point cloud. Secondly, the spatial information of small object instances is lost during the down-sampling of feature maps caused by consecutive convolution operations. Thirdly, object-centric fusion combines feature maps of the image and point cloud through ROI-pooling, which spoils fine-grained geometric information in the process. It is also worth noting that redundant proposals lead to repetitive computation in the bbox regression stage. To mitigate these challenges, multiple methods have been put forward to improve MV3D.

To improve the detection of small objects, the Aggregate View Object Detection network (AVOD) [68] first improved the proposal stage of MV3D [67] with feature maps from both the BEV point cloud and the image. Furthermore, an auto-encoder architecture is employed to up-sample the final feature maps to their original size. This alleviates the problem that small objects might be down-sampled to one 'pixel' by consecutive convolution operations. The proposed feature fusion Region Proposal Network (RPN) first extracts equal-length feature vectors from multiple modalities (BEV point cloud and image) with crop-and-resize operations, followed by a 1 × 1 convolution for feature space dimensionality reduction, which reduces computational cost and boosts speed. Lu et al. [69] also utilized an encoder-decoder based proposal network with a Spatial-Channel Attention (SCA) module and an Extension Spatial Upsample (ESU) module. The SCA can capture multi-scale contextual information, whereas the ESU recovers the spatial information.

One of the problems of object-centric fusion methods [68] [67] is the loss of fine-grained geometric information during ROI-pooling. The ContFuse [70] by Liang et al. tackles this information loss with point-wise fusion. This point-wise fusion is achieved with continuous convolution [83] fusion layers that bridge image and point cloud features of different scales at multiple stages of the network. This is achieved by first extracting the K-nearest neighbour points for each pixel in the BEV representation of the point cloud. These points are then projected to the image plane to retrieve the related image features. Finally, the fused feature vector is weighted according to its geometric offset to the target 'pixel' before being fed into MLPs. However, point-wise fusion might fail to take full advantage of high-resolution images when the LiDAR points are sparse. In [73], Liang et al. further extended point-wise fusion by combining multiple fusion methodologies, such as signal-level fusion (RGB-D), feature-level fusion, multi-view fusion and depth completion. In particular, depth completion up-samples the sparse depth map using image information to generate a dense pseudo point cloud. This up-sampling process alleviates the sparsity problem of point-wise fusion, which facilitates the learning of cross-modality representations. Furthermore, the authors argued that multiple complementary tasks (ground estimation, depth completion and 2D/3D object detection) could assist the network to achieve better overall performance. However, point-wise/pixel-wise fusion leads to the 'feature blurring' problem. This 'feature blurring' happens when one point in the point cloud is associated with multiple pixels in the image or the other way around, which confounds the data fusion. Similarly, Wang et al. [72] replace the ROI-pooling in MV3D [67] with sparse non-homogeneous pooling, which enables effective fusion between feature maps from multiple modalities.

MVX-Net [71] presented by Sindagi et al. introduced two methods that fuse image and point cloud data point-wise or voxel-wise. Both methods employ a pre-trained 2D CNN for image feature extraction and a VoxelNet [14] based network to estimate objects from the fused point cloud. In the point-wise fusion method, the point cloud is first projected to the image feature space to extract image features before voxelization and processing by VoxelNet. The voxel-wise fusion method first voxelizes the point cloud before projecting non-empty voxels to the image feature space for voxel/region-wise feature extraction. These voxel-wise features are only appended to their corresponding voxels at a later stage of VoxelNet. MVX-Net achieved state-of-the-art results and out-performed other LiDAR-based methods on the KITTI benchmark, while lowering false positive and false negative rates compared to [14].

TABLE III
A summarization and comparison of dynamic object detection models.

The simplest means to combine the voxelized point cloud and the image is to append RGB information as additional channels of a voxel. In a 2014 paper by Song et al. [82], 3D object detection is achieved by sliding a 3D detection window over the voxelized point cloud. The classification is performed by an ensemble of Exemplar-SVMs. In this work, color information is appended to voxels by projection. Song et al. further extended this idea with 3D discrete convolutional neural networks [84]. In the first stage, the voxelized point cloud (generated from RGB-D data) is processed by a multi-scale 3D RPN for 3D proposal generation. These candidates are then classified by a joint Object Recognition Network (ORN), which takes both the image and the voxelized point cloud as inputs. However, the volumetric representation introduces boundary artifacts and spoils fine-grained local geometry. Secondly, the resolution mismatch between the image and the voxelized point cloud makes fusion inefficient.

C. One-step Models

One-step models perform proposal generation and bbox regression in a single stage. By fusing the proposal and bbox regression stages into one step, these models are often more computationally efficient. This makes them better suited for real-time applications on mobile computational platforms.

Meyer et al. [74] extended the LaserNet [85] into a multi-task and multi-modal network, performing 3D object detection and 3D semantic segmentation on fused image and LiDAR data. Two CNNs process the depth image (generated from the point cloud) and the front-view camera image in a parallel manner and fuse them by projecting points to the image plane to associate the corresponding image features. This feature map is fed into the LaserNet to predict per-point distributions of the bounding box, which are combined for the final 3D proposals. This method is highly efficient while achieving state-of-the-art performance.

V. STATIONARY ROAD OBJECT DETECTION

This section focuses on reviewing recent advances in camera-LiDAR fusion based stationary road object detection methods. Stationary road objects can be categorized into on-road objects (e.g. road surfaces and road markings) and off-road objects (e.g. traffic signs). On-road and off-road objects provide regulations, warnings, bans and guidance for autonomous vehicles.

In Figures 6 & 7, typical model architectures for lane/road detection and traffic sign recognition (TSR) are compared. Table IV presents the comparative results of different models on the KITTI road benchmark [56] and gives a summarization and comparison of these models.

A. Lane/Road Detection

Existing surveys [93] [94] [95] have presented detailed reviews on traditional multi-modal road detection methods. These methods [96] [97] [98] [99] mainly rely on vision for road/lane detection while utilizing LiDAR for curb fitting and obstacle masking. Therefore, this section focuses on recent advances in deep learning based fusion strategies for road extraction.

Deep learning based road detection methods can be grouped into BEV-based and front-camera-view-based methods. BEV-based methods [86] [88] [89] [87] project LiDAR depth and images into the bird's eye view for road detection.
TABLE IV
Comparative results on the KITTI road benchmark [56] with a summarization and comparison of lane/road detection models. 'MaxF (%)' stands for the maximum F1-measure on the KITTI Urban Marked road test set. 'FL', 'MS' stand for fusion level and model parameter size.
A. 2D Semantic Segmentation
TABLE V
Comparative results on the KITTI multi-object tracking benchmark (car) [56]. MOTA stands for multiple object tracking accuracy. MOTP stands for multiple object tracking precision. MT stands for mostly tracked. ML stands for mostly lost. IDS stands for the number of ID switches. FRAG stands for the number of fragments.
Methods | Data-association | Models | Hardware | Run-time | MOTA(%) | MOTP(%) | MT(%) | ML(%) | IDS | FRAG
DBT | min-cost flow | DSM [119] | GTX1080Ti | 0.1s | 76.15 | 83.42 | 60.00 | 8.31 | 296 | 868
DBT | min-cost flow | mmMOT [120] | GTX1080Ti | 0.02s | 84.77 | 85.21 | 73.23 | 2.77 | 284 | 753
DBT | Hungarian | MOTSFusion [121] | GTX1080Ti | 0.44s | 84.83 | 85.21 | 73.08 | 2.77 | 275 | 759
DFT | RFS | Complexer-YOLO [122] | GTX1080Ti | 0.01s | 75.70 | 78.46 | 58.00 | 5.08 | 1186 | 2092
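As a concrete illustration of the data-association column in Table V, the sketch below uses SciPy's Hungarian solver to match new detections to existing tracks; the Euclidean cost between box centroids is an illustrative choice, not the learned affinity actually used by MOTSFusion [121] or mmMOT [120].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_centroids, det_centroids, max_cost=2.0):
    """Match detections to tracks with the Hungarian algorithm.

    Returns (matches, unmatched_tracks, unmatched_detections); pairs whose
    cost exceeds max_cost are treated as unmatched.
    """
    cost = np.linalg.norm(track_centroids[:, None, :] - det_centroids[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)           # optimal one-to-one assignment
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [t for t in range(len(track_centroids)) if t not in matched_t]
    unmatched_dets = [d for d in range(len(det_centroids)) if d not in matched_d]
    return matches, unmatched_tracks, unmatched_dets

tracks = np.array([[0.0, 0.0, 0.0], [10.0, 5.0, 0.0]])                 # previous-frame centroids
dets = np.array([[0.4, 0.1, 0.0], [30.0, 2.0, 0.0], [10.2, 5.3, 0.0]]) # current detections
print(associate(tracks, dets))   # ([(0, 0), (1, 2)], [], [1])
```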
The framework consists of three stages: object detection, adjacency estimation and linear optimization. In the object detection stage, image and point cloud features are extracted by a VGG-16 [123] and a PointNet [30] in parallel and fused via a robust fusion module. The robust fusion module is designed to work with both a-modal and multi-modal inputs. The adjacency estimation stage extends min-cost flow to multi-modality via adjacency matrix learning. Finally, an optimal path is computed from the min-cost flow graph.

Tracking and 3D reconstruction tasks can be performed jointly. Extending this idea, Luiten et al. [121] leveraged 3D reconstruction to improve tracking, making tracking robust against complete occlusion. The proposed MOTSFusion consists of two stages. In the first stage, detected objects are associated with spatio-temporal tracklets. These tracklets are then matched and merged into trajectories using the Hungarian algorithm. Furthermore, MOTSFusion can work with LiDAR, mono and stereo depth.

B. Detection-Free Tracking (DFT)

In DFT, objects are manually initialized and tracked via filtering based methods. The Complexer-YOLO [122] is a real-time framework for decoupled 3D object detection and tracking on image and point cloud data. In the 3D object detection phase, 2D semantics are extracted and fused point-wise into the point cloud. This semantic point cloud is voxelized and fed into a 3D Complex-YOLO for 3D object detection. To speed up the training process, IoU is replaced by a novel metric called the Scale-Rotation-Translation score (SRTs), which evaluates 3 DoFs of the bounding box position. Multi-object tracking is decoupled from the detection, and the inference is achieved via a Labeled Multi-Bernoulli Random Finite Sets Filter (LMB RFS).

VIII. ONLINE CROSS-SENSOR CALIBRATION

One of the preconditions of the camera-LiDAR fusion pipeline is a flawless registration/calibration between sensors, which can be difficult to satisfy. The calibration parameters between sensors change constantly due to mechanical vibration and heat fluctuation. As most fusion methods are extremely sensitive to calibration errors, this can significantly cripple their performance and reliability. Furthermore, offline calibration is a troublesome and time-consuming procedure. Therefore, studies on online automatic cross-sensor calibration have significant practical benefits.

A. Classical Online Calibration

Online calibration methods estimate the extrinsics in natural settings without a calibration target. Many studies [124] [125] [126] [127] find the extrinsics by maximizing the mutual information (MI) (of raw intensity values or edge intensities) between different modalities. However, MI-based methods are not robust against texture-rich environments, large de-calibrations and occlusions caused by sensor displacements. Alternatively, the LiDAR-enabled visual-odometry based method [128] uses the camera's ego-motion to estimate and evaluate the camera-LiDAR extrinsic parameters. Nevertheless, [128] still struggles with large de-calibrations and cannot run in real-time.

B. DL-based Online Calibration

To mitigate the aforementioned challenges, Schneider et al. [129] designed a real-time capable CNN (RegNet) to estimate the extrinsics, which is trained on randomly de-calibrated data. The proposed RegNet extracts image and depth features in two parallel branches and concatenates them to produce a fused feature map. This fused feature map is fed into a stack of Network in Network (NiN) modules and two fully connected layers for feature matching and global regression. However, the RegNet is agnostic towards the sensor's intrinsic parameters and needs to be retrained once these intrinsics change. To solve this problem, the CalibNet [130] learns to minimize the geometric and photometric inconsistencies between the mis-calibrated and target depth in a self-supervised manner. Because intrinsics are only used in the 3D spatial transformer, the CalibNet can be applied to any intrinsically calibrated camera. Nevertheless, deep learning based cross-sensor calibration methods are computationally expensive.

IX. TRENDS, OPEN CHALLENGES AND PROMISING DIRECTIONS

The perception module in a driverless car is responsible for obtaining and understanding its surrounding scenes. Its downstream modules, such as planning, decision making and self-localization, depend on its outputs. Therefore, its performance and reliability are prerequisites for the competence of the entire driverless system. To this end, LiDAR and camera fusion is applied to improve the performance and reliability of the perception system, making driverless vehicles more capable of understanding complex scenes (e.g. urban traffic, extreme weather conditions and so on). Consequently, in this section, we summarize overall trends and discuss open challenges and potential influencing factors in this regard. As shown in Table VI, we focus on improving the performance of the fusion methodology and the robustness of the fusion pipeline.
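To make the 'randomly de-calibrated data' used to train RegNet-style models (Section VIII-B above) concrete, the sketch below perturbs a nominal LiDAR-to-camera extrinsic with a small random rotation and translation; the perturbation magnitudes and the identity nominal transform are illustrative assumptions.

```python
import numpy as np

def random_decalibration(T_nominal, max_angle_deg=5.0, max_trans_m=0.1, rng=None):
    """Apply a random SE(3) perturbation to a nominal extrinsic matrix.

    Networks such as RegNet are trained to regress this perturbation back,
    given the image and the depth map mis-projected with the perturbed extrinsic.
    """
    rng = np.random.default_rng() if rng is None else rng
    angles = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg, size=3))
    trans = rng.uniform(-max_trans_m, max_trans_m, size=3)

    cx, cy, cz = np.cos(angles)
    sx, sy, sz = np.sin(angles)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])

    dT = np.eye(4)
    dT[:3, :3] = Rz @ Ry @ Rx                 # small random rotation
    dT[:3, 3] = trans                         # small random translation
    return dT @ T_nominal, dT                 # perturbed extrinsic and ground-truth offset

T_nominal = np.eye(4)                         # placeholder LiDAR-to-camera extrinsic
T_perturbed, dT = random_decalibration(T_nominal, rng=np.random.default_rng(0))
print(np.round(dT, 3))                        # the de-calibration the network must recover
```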
TABLE VI
Open challenges related to performance improvement and reliability enhancement.
From the methods reviewed above, we observed some general trends among image and point cloud fusion approaches, which are summarized as follows:

• 2D to 3D: With the progress of 3D feature extraction methods, locating, tracking and segmenting objects in 3D space has become a heated area of research.
• Single-task to multi-task: Some recent works [73] [122] combined multiple complementary tasks, such as object detection, semantic segmentation and depth completion, to achieve better overall performance and reduce computational costs.
• Signal-level to multi-level fusion: Early works often leverage signal-level fusion, where 3D geometry is translated to the image plane to exploit off-the-shelf image processing models, while recent models try to fuse image and LiDAR at multiple levels (e.g. early fusion, late fusion) and with temporal context encoding.

A. Performance-related Open Research Questions

1) What should be the data representation of the fused data?: The choice of data representation for the fused data plays a fundamental role in designing any data fusion algorithm. Current data representations for image and point cloud fusion include:

• Image representation: Append 3D geometry as additional channels of the image. The image-based representation enables off-the-shelf image processing models. However, the results are also limited to the 2D image plane, which is less ideal for autonomous driving.
• Point representation: Append RGB signals/features as additional channels of the point cloud. However, the mismatch in resolution between high-resolution images and the low-resolution point cloud leads to inefficiency.
• Intermediate data representations: Translate image and point cloud features/signals into an intermediate data representation, such as a voxelized point cloud [82]. However, voxel-based methods suffer from poor scalability.

Many recent works on point cloud processing have concentrated on defining explicit point convolution operations [32] [62] [33] [35] [36] [37] [38], which have shown great potential. These point convolutions are better suited to extract fine-grained per-point and local geometry. Therefore, we believe the point representation of fused data coupled with point convolutions has great potential in camera-LiDAR fusion studies.

2) How to encode temporal context?: Most current deep learning based perception systems tend to overlook temporal context. This leads to numerous problems, such as point cloud deformation from the low refresh rate and incorrect time-synchronization between sensors. These problems cause mismatches between images, the point cloud and the actual environment. Therefore, it is vital to incorporate temporal context into perception systems.

In the context of autonomous driving, temporal context can be incorporated using RNN or LSTM models. In [131], an LSTM auto-encoder is employed to estimate the future states of surrounding vehicles and adjust the planned trajectory accordingly, which helps autonomous vehicles run more smoothly and stably. In [121], temporal context is leveraged to estimate ego-motion, which benefits later task-related header networks. Furthermore, temporal context could benefit online self-calibration via a visual-odometry based method [128]. Following this trend, the mismatches caused by the low LiDAR refresh rate could be solved by encoding temporal context and using generative models.

3) What should be the learning scheme?: Most current camera-LiDAR fusion methods rely on supervised learning, which requires large annotated datasets. However, annotating images and point clouds is expensive and time-consuming. This limits the size of current multi-modal datasets and the performance of supervised learning methods.

The answer to this problem is unsupervised and weakly-supervised learning frameworks. Some recent studies have shown great potential in this regard [43] [50] [132] [24] [101]. Following this trend, future research on unsupervised and weakly-supervised learning fusion frameworks could allow networks to be trained on large unlabeled/coarsely-labelled datasets and lead to better performance.

4) When to use deep learning methods?: Recent advances in deep learning techniques have accelerated the development of autonomous driving technology. In many aspects, however, traditional methods are still indispensable in current autonomous driving systems. Compared with deep learning methods, traditional methods have better interpretability and consume significantly fewer computational resources. The ability to trace back a decision is crucial for the decision making and planning system of an autonomous vehicle. Nevertheless, current deep learning algorithms are not back-traceable, making them unfit for these applications. Apart from this black-box dilemma, traditional algorithms are also preferred for their real-time capability.

To summarize, we believe deep learning methods should be applied to applications that have explicit objectives that can be verified objectively.

B. Reliability-related Open Research Questions

1) How to mitigate camera-LiDAR coupling?: From an engineering perspective, redundancy in the design of an autonomous vehicle is crucial for its safety. Although fusing LiDAR and camera improves perception performance, it also comes with the problem of signal coupling. If one of the signal paths suddenly fails, the whole pipeline could break down and cripple downstream modules. This is unacceptable for autonomous driving systems, which require robust perception pipelines.

To solve this problem, we should develop sensor-agnostic frameworks. For instance, we can adopt multiple fusion modules with different sensor inputs. Furthermore, we can employ a multi-path fusion module that takes asynchronous multi-modal data. However, the best solution is still open for study.

2) How to improve performance under all weather/lighting conditions?: Autonomous vehicles need to work in all weather and lighting conditions. However, current datasets and methods are mostly focused on scenes with good lighting and weather conditions. This leads to poor performance in the real world, where illumination and weather conditions are more complex.

The first step towards this problem is developing more datasets that contain a wide range of lighting and weather conditions. In addition, methods that employ multi-modal data to tackle complex lighting and weather conditions require further investigation.

3) How to handle adversarial attacks and corner cases?: Adversarial attacks targeted at camera-based perception systems have proven to be effective. This poses a grave danger to the autonomous vehicle, as it operates in safety-critical environments. It may be difficult to identify attacks explicitly designed for a certain sensory modality. However, perception results can be verified across different modalities. In this context, research on utilizing 3D geometry and images to jointly identify these attacks can be further explored.

As self-driving cars operate in an unpredictable open environment with infinite possibilities, it is critical to consider corner and edge cases in the design of the perception pipeline. The perception system should anticipate unseen and unusual obstacles, strange behaviours and extreme weather, for instance, images of cyclists printed on a large vehicle and people wearing costumes. These corner cases are often very difficult to handle using only the camera or the LiDAR pipeline. However, leveraging data from multiple modalities to identify these corner cases could prove to be more effective and reliable than relying on a-modal sensors. Further research in this direction could greatly benefit the safety and commercialization of autonomous driving technology.

4) How to solve open-set object detection?: Open-set object detection is a scenario where an object detector is tested on instances from unknown/unseen classes. The open-set problem is critical for an autonomous vehicle, as it operates in unconstrained environments with infinite categories of objects. Current datasets often use a background class for any objects that are not of interest. However, no dataset can include all the unwanted object categories in a background class. Therefore, the behaviour of an object detector in an open-set setting is highly uncertain, which is less ideal for autonomous driving.

The lack of open-set object detection awareness, testing protocols and metrics leads to few explicit evaluations of open-set performance in current object detection studies. These challenges have been discussed and investigated in a recent study by Dhamija et al. [133], where a novel open-set protocol and metrics are proposed. The authors proposed an additional mixed unknown category, which contains known 'background' objects and unknown/unseen objects. Based on this protocol, current methods are tested on a test set with a mixed unknown category generated from a combination of existing datasets. In another recent study on the point cloud, Wong et al. [134] proposed a technique that maps unwanted objects from different categories into a category-agnostic embedding space for clustering.

The open-set challenges are essential for deploying deep learning based perception systems in the real world, and they need more effort and attention from the whole research community (datasets and methods with an emphasis on unknown objects, test protocols and metrics, etc.).

5) How to balance speed-accuracy trade-offs?: The processing of multiple high-resolution images and large-scale point clouds puts substantial stress on existing mobile computing platforms. This sometimes causes frame drops, which can seriously degrade the performance of the perception system. More generally, it leads to high power consumption and low reliability. Therefore, it is important to balance the speed and accuracy of a model in real-world deployments.

There are studies that attempt to detect frame drops. In [135], Imre et al. proposed a multi-camera frame-drop detection algorithm that leverages multiple-segment (broken-line) fitting on camera pairs. However, frame-drop detection only solves half of the problem. The hard part is to prevent the performance degradation caused by the frame drop. Recent advances in generative models have demonstrated great potential for predicting missing frames in video sequences [136], which could be leveraged in autonomous driving to fill in missing frames in the image and point cloud pipelines. However, we believe the most effective way to solve the frame-drop problem is to prevent it by reducing the hardware workload. This can be achieved by carefully balancing the speed and accuracy of a model [137].

To achieve this, deep learning models should be able to scale down their computational cost while maintaining acceptable performance. This scalability is often achieved by reducing the number of inputs (points, pixels, voxels) or the depth of the network. From previous studies [30] [138] [38], point-based and multi-view based fusion methods are more scalable compared to voxel-based methods.
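Returning to the data-representation question in Section IX-A.1, the 'point representation of fused data' amounts to decorating each LiDAR point with image features gathered at its projected pixel location. The sketch below uses PyTorch's grid_sample for the gathering step; the feature map, projection results and tensor shapes are illustrative placeholders rather than part of any reviewed model.

```python
import torch
import torch.nn.functional as F

def decorate_points(img_feats, pixel_uv, point_feats):
    """Append bilinearly-sampled image features to per-point features.

    img_feats: (1, C, H, W) image feature map.
    pixel_uv:  (P, 2) projected pixel coordinates (u, v) of the P points.
    point_feats: (P, D) existing per-point features (e.g. geometry, intensity).
    Returns (P, D + C) fused point features.
    """
    _, c, h, w = img_feats.shape
    # Normalize pixel coordinates to [-1, 1] as required by grid_sample (x, y order).
    grid = torch.empty(1, 1, pixel_uv.shape[0], 2)
    grid[..., 0] = pixel_uv[:, 0] / (w - 1) * 2 - 1
    grid[..., 1] = pixel_uv[:, 1] / (h - 1) * 2 - 1
    sampled = F.grid_sample(img_feats, grid, mode="bilinear", align_corners=True)
    sampled = sampled.view(c, -1).t()                 # (P, C) image features per point
    return torch.cat([point_feats, sampled], dim=1)   # (P, D + C)

feat_map = torch.randn(1, 16, 48, 156)                  # placeholder CNN feature map
uv = torch.rand(2048, 2) * torch.tensor([155.0, 47.0])  # placeholder projected pixels
pts_feat = torch.randn(2048, 4)                         # e.g. x, y, z, intensity
fused = decorate_points(feat_map, uv, pts_feat)
print(fused.shape)                                      # torch.Size([2048, 20])
```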
X. CONCLUSION

This paper presented an in-depth review of the most recent progress in deep learning models for point cloud and image fusion in the context of autonomous driving. Specifically, this review organizes methods based on their fusion methodologies and covers topics in depth completion, dynamic and stationary object detection, semantic segmentation, tracking and online cross-sensor calibration. Furthermore, performance comparisons on publicly available datasets, highlights and advantages/disadvantages of models are presented in tables, and typical model architectures are shown in figures. Finally, we summarized general trends and discussed open challenges and possible future directions. This survey also raised awareness of, and provided insights on, questions that are overlooked by the research community but trouble the real-world deployment of autonomous driving technology.

REFERENCES

[1] F. Duarte, "Self-driving cars: A city perspective," Science Robotics, vol. 4, no. 28, 2019. [Online]. Available: https://robotics.sciencemag.org/content/4/28/eaav9843
[2] J. Guo, U. Kurup, and M. Shah, "Is it safe to drive? An overview of factors, metrics, and datasets for driveability assessment in autonomous driving," IEEE Transactions on Intelligent Transportation Systems, pp. 1-17, 2019.
[3] Y. E. Bigman and K. Gray, "Life and death decisions of autonomous vehicles," Nature, vol. 579, no. 7797, pp. E1-E2, 2020.
[4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484-489, 2016.
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14236
[6] P. Huang, M. Cheng, Y. Chen, H. Luo, C. Wang, and J. Li, "Traffic sign occlusion detection using mobile laser scanning point clouds," IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 9, pp. 2364-2376, 2017.
[7] L. Chen, Q. Zou, Z. Pan, D. Lai, L. Zhu, Z. Hou, J. Wang, and D. Cao, "Surrounding vehicle detection using an FPGA panoramic camera and deep CNNs," IEEE Transactions on Intelligent Transportation Systems, pp. 1-13, 2019.
[8] S. Kim, H. Kim, W. Yoo, and K. Huh, "Sensor fusion algorithm design in detecting vehicles using laser scanner and stereo vision," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 4, pp. 1072-1084, 2016.
[9] K. Park, S. Kim, and K. Sohn, "High-precision depth estimation using uncalibrated LiDAR and stereo fusion," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 1, pp. 321-335, Jan 2020.
[10] J. Zhu, L. Fan, W. Tian, L. Chen, D. Cao, and F. Wang, "Toward the ghosting phenomenon in a stereo-based map with a collaborative RGB-D repair," IEEE Transactions on Intelligent Transportation Systems, pp. 1-11, 2019.
[11] J. Wang and L. Zhou, "Traffic light recognition with high dynamic range imaging and deep learning," IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 4, pp. 1341-1352, 2019.
[12] Z. Wang, Y. Wu, and Q. Niu, "Multi-sensor fusion in automated driving: A survey," IEEE Access, 2019.
[13] D. Feng, C. Haase-Schutz, L. Rosenbaum, H. Hertlein, C. Glaser, F. Timm, W. Wiesbeck, and K. Dietmayer, "Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges," IEEE Transactions on Intelligent Transportation Systems, pp. 1-20, 2020. [Online]. Available: http://dx.doi.org/10.1109/tits.2020.2972974
[14] Y. Zhou and O. Tuzel, "VoxelNet: End-to-end learning for point cloud based 3D object detection," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018. [Online]. Available: http://dx.doi.org/10.1109/cvpr.2018.00472
[15] D. Maturana and S. Scherer, "VoxNet: A 3D convolutional neural network for real-time object recognition," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sep. 2015, pp. 922-928.
[16] Y. Li, L. Ma, Z. Zhong, D. Cao, and J. Li, "TGNet: Geometric graph CNN on 3-D point cloud segmentation," IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 5, pp. 3588-3600, 2020.
[17] C. R. Qi, H. Su, M. Niessner, A. Dai, M. Yan, and L. J. Guibas, "Volumetric and multi-view CNNs for object classification on 3D data," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2016.609
[18] W. Zeng and T. Gevers, "3DContextNet: K-d tree guided hierarchical learning of point clouds using local and global contextual cues," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[19] R. Klokov and V. Lempitsky, "Escape from cells: Deep kd-networks for the recognition of 3D point cloud models," 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017. [Online]. Available: http://dx.doi.org/10.1109/ICCV.2017.99
[20] G. Riegler, A. Osman Ulusoy, and A. Geiger, "OctNet: Learning deep 3D representations at high resolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3577-3586.
[21] D. C. Garcia, T. A. Fonseca, R. U. Ferreira, and R. L. de Queiroz, "Geometry coding for dynamic voxelized point clouds using octrees and multiple contexts," IEEE Transactions on Image Processing, vol. 29, pp. 313-322, 2020.
[22] H. Lei, N. Akhtar, and A. Mian, "Octree guided CNN with spherical kernels for 3D point clouds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9631-9640.
[23] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, "Multi-view convolutional neural networks for 3D shape recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 945-953.
[24] S. Chen, C. Duan, Y. Yang, D. Li, C. Feng, and D. Tian, "Deep unsupervised learning of 3D point clouds via graph topology inference and filtering," IEEE Transactions on Image Processing, vol. 29, pp. 3183-3198, 2020.
[25] M. Henaff, J. Bruna, and Y. LeCun, "Deep convolutional networks on graph-structured data," 2015.
[26] M. Simonovsky and N. Komodakis, "Dynamic edge-conditioned filters in convolutional neural networks on graphs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3693-3702.
[27] D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and P. Vandergheynst, "Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks," Computer Graphics Forum, vol. 34, no. 5, pp. 13-23, 2015. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12693
[28] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," in International Conference on Learning Representations (ICLR 2014), 2014.
[29] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, 2016, pp. 3844-3852.
[30] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," 2017.
[31] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2017.16
[32] B.-S. Hua, M.-K. Tran, and S.-K. Yeung, "Pointwise convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 984-993.
[33] M. Atzmon, H. Maron, and Y. Lipman, "Point convolutional neural networks by extension operators," ACM Transactions on Graphics, vol. 37, no. 4, p. 112, Aug 2018. [Online]. Available: http://dx.doi.org/10.1145/3197517.3201301
[34] L. Ma, Y. Li, J. Li, W. Tan, Y. Yu, and M. A. Chapman, "Multi-scale point-wise convolutional neural networks for 3D object segmentation from LiDAR point clouds in large-scale environments," IEEE Transactions on Intelligent Transportation Systems, pp. 1-16, 2019.
[35] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, "PointCNN: Convolution on X-transformed points," in Advances in Neural Information Processing Systems 31, 2018, pp. 820-830. [Online]. Available: http://papers.nips.cc/paper/7362-pointcnn-convolution-on-x-transformed-points.pdf
[36] S. Lan, R. Yu, G. Yu, and L. S. Davis, "Modeling local geometric structure of 3D point clouds using Geo-CNN," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2019.00109
[37] F. Groh, P. Wieschollek, and H. P. A. Lensch, "Flex-convolution," Lecture Notes in Computer Science, pp. 105-122, 2019. [Online]. Available: http://dx.doi.org/10.1007/978-3-030-20887-5_7
[38] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. Guibas, "KPConv: Flexible and deformable convolution for point clouds," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2019. [Online]. Available: http://dx.doi.org/10.1109/ICCV.2019.00651
[39] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham, "RandLA-Net: Efficient semantic segmentation of large-scale point clouds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11108-11117.
[40] W. Wu, Z. Qi, and L. Fuxin, "PointConv: Deep convolutional networks on 3D point clouds," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2019.00985
[41] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, "Sparsity invariant CNNs," in International Conference on 3D Vision (3DV), 2017.
[42] F. Ma and S. Karaman, "Sparse-to-dense: Depth prediction from sparse depth samples and a single image," 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018. [Online]. Available: http://dx.doi.org/10.1109/ICRA.2018.8460184
[43] F. Ma, G. V. Cavalheiro, and S. Karaman, "Self-supervised sparse-to-dense: Self-supervised depth completion from LiDAR and monocular camera," 2019 International Conference on Robotics and Automation (ICRA), May 2019. [Online]. Available: http://dx.doi.org/10.1109/ICRA.2019.8793637
[44] X. Cheng, P. Wang, and R. Yang, "Depth estimation via affinity learned with convolutional spatial propagation network," Lecture Notes in Computer Science, pp. 108-125, 2018. [Online]. Available: http://dx.doi.org/10.1007/978-3-030-01270-0_7
[45] X. Cheng, P. Wang, G. Chenye, and R. Yang, "CSPN++: Learning context and resource aware convolutional spatial propagation networks for depth completion," Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), 2020.
[46] M. Jaritz, R. D. Charette, E. Wirbel, X. Perrotton, and F. Nashashibi, "Sparse and dense data with CNNs: Depth completion and semantic segmentation," 2018 International Conference on 3D Vision (3DV), Sep 2018. [Online]. Available: http://dx.doi.org/10.1109/3DV.2018.00017
[47] A. Eldesokey, M. Felsberg, and F. S. Khan, "Confidence propagation through CNNs for guided sparse depth regression," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2019.2929170
[48] J. Tang, F.-P. Tian, W. Feng, J. Li, and P. Tan, "Learning guided convolutional network for depth completion," 2019.
[49] W. Van Gansbeke, D. Neven, B. De Brabandere, and L. Van Gool, "Sparse and noisy LiDAR completion with RGB guidance and uncertainty," 2019 16th International Conference on Machine Vision Applications (MVA), May 2019. [Online]. Available: http://dx.doi.org/10.23919/MVA.2019.8757939
[50] X. Cheng, Y. Zhong, Y. Dai, P. Ji, and H. Li, "Noise-aware unsupervised deep LiDAR-stereo fusion," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2019.00650
[51] J. Zhang, M. S. Ramanagopal, R. Vasudevan, and M. Johnson-Roberson, "LiStereo: Generate dense depth maps from LiDAR and stereo imagery," 2019.
[52] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[53] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2018.00907
[54] T. Wang, F. Wang, J. Lin, Y. Tsai, W. Chiu, and M. Sun, "Plug-and-play: Improve depth prediction via sparse data propagation," in 2019 International Conference on Robotics and Automation (ICRA), May 2019, pp. 5880-5886.
[55] A. Valada, R. Mohan, and W. Burgard, "Self-supervised model adaptation for multimodal semantic segmentation," International Journal of Computer Vision, Jul 2019. [Online]. Available: http://dx.doi.org/10.1007/s11263-019-01188-y
[56] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[57] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum PointNets for 3D object detection from RGB-D data," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2018.00102
[58] X. Du, M. H. Ang, S. Karaman, and D. Rus, "A general pipeline for 3D detection of vehicles," 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018. [Online]. Available: http://dx.doi.org/10.1109/ICRA.2018.8461232
[59] K. Shin, Y. P. Kwon, and M. Tomizuka, "RoarNet: A robust 3D object detection based on region approximation refinement," 2019 IEEE Intelligent Vehicles Symposium (IV), Jun 2019. [Online]. Available: http://dx.doi.org/10.1109/IVS.2019.8813895
[60] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, "3D bounding box estimation using deep learning and geometry," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2017.597
[61] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, "IPOD: Intensive point-based object detector for point cloud," 2018.
[62] D. Xu, D. Anguelov, and A. Jain, "PointFusion: Deep sensor fusion for 3D bounding box estimation," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2018.00033
[63] X. Zhao, Z. Liu, R. Hu, and K. Huang, "3D object detection using scale invariant and feature reweighting networks," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9267-9274.
[64] M. Jiang, Y. Wu, T. Zhao, Z. Zhao, and C. Lu, "PointSIFT: A SIFT-like network module for 3D point cloud semantic segmentation," 2018.
[65] Z. Wang and K. Jia, "Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3D object detection," 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nov 2019. [Online]. Available: http://dx.doi.org/10.1109/IROS40897.2019.8968513
[66] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, "PointPainting: Sequential fusion for 3D object detection," 2019.
[67] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2017.691
[68] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, "Joint 3D proposal generation and object detection from view aggregation," 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2018. [Online]. Available: http://dx.doi.org/10.1109/IROS.2018.8594049
[69] H. Lu, X. Chen, G. Zhang, Q. Zhou, Y. Ma, and Y. Zhao, "SCANet: Spatial-channel attention network for 3D object detection," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 1992-1996.
[70] M. Liang, B. Yang, S. Wang, and R. Urtasun, "Deep continuous fusion for multi-sensor 3D object detection," in The European Conference on Computer Vision (ECCV), September 2018.
[71] V. A. Sindagi, Y. Zhou, and O. Tuzel, "MVX-Net: Multimodal VoxelNet for 3D object detection," 2019 International Conference on Robotics and Automation (ICRA), May 2019. [Online]. Available: http://dx.doi.org/10.1109/ICRA.2019.8794195
[72] Z. Wang, W. Zhan, and M. Tomizuka, "Fusing bird's eye view LiDAR point cloud and front view camera image for 3D object detection," in 2018 IEEE Intelligent Vehicles Symposium (IV), June 2018, pp. 1-6.
[73] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, "Multi-task multi-sensor fusion for 3D object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
for image recognition,” 2016 IEEE Conference on Computer Vision sensor fusion for 3d object detection,” in The IEEE Conference on
and Pattern Recognition (CVPR), Jun 2016. [Online]. Available: Computer Vision and Pattern Recognition (CVPR), June 2019.
http://dx.doi.org/10.1109/CVPR.2016.90 [74] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-
[53] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable Gonzalez, “Sensor fusion for joint 3d object detection and semantic
architectures for scalable image recognition,” 2018 IEEE/CVF segmentation,” 2019.
Yaodong Cui received the B.S. degree in automation from Chang'an University, Xi'an, China, in 2017, and the M.Sc. degree in Systems, Control and Signal Processing from the University of Southampton, Southampton, UK, in 2019. He is currently working toward a Ph.D. degree with the Waterloo Cognitive Autonomous Driving (CogDrive) Lab, Department of Mechanical Engineering, University of Waterloo, Waterloo, Canada. His research interests include sensor fusion, perception for intelligent vehicles, and driver emotion detection.

Ren Chen received the B.S. degree in electronic science and technology from Huazhong University of Science and Technology, Wuhan, China, and the M.Sc. degree in Computer Science from the China Academy of Aerospace Science and Engineering, Beijing, China. He has worked at the Institute of Deep Learning, Baidu, and the Autonomous Driving Lab, Tencent. His research interests include computer vision, perception for robotics, real-time dense 3D reconstruction, and robotic motion planning.

Wenbo Chu received the B.S. degree in automotive engineering from Tsinghua University, Beijing, China, in 2008, the M.S. degree in automotive engineering from RWTH Aachen, Germany, and the Ph.D. degree in mechanical engineering from Tsinghua University in 2014. He is currently a Research Fellow at China Intelligent and Connected Vehicles Research Institute Company, Ltd., Beijing, also known as the National Innovation Center of Intelligent and Connected Vehicles. His research interests include intelligent connected vehicles, new energy vehicles, and vehicle dynamics and control.

Long Chen received the B.Sc. degree in communication engineering and the Ph.D. degree in signal and information processing from Wuhan University, Wuhan, China. He is currently an Associate Professor with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. His areas of interest include autonomous driving, robotics, and artificial intelligence, where he has contributed more than 70 publications. He serves as an Associate Editor for the IEEE Transactions on Intelligent Transportation Systems.