driving are driven by deep learning. In order to achieve robust
"how to fuse" remain open. This review paper attempts to sys-

Fig. 2: Average precision (AP) vs. runtime. Visualized are deep learning approaches that use LiDAR, camera, or both as inputs for car detection on the KITTI bird's eye view test dataset. Moderate APs are summarized. The results are mainly based on the KITTI leader-board [6] (visited on Apr. 20, 2019). On the leader-board only the published methods are considered, unless mentioned otherwise. (The plot compares, among others, PointPillars, AVOD, AVOD-FPN, PointRCNN, MV3D, MV3D (LiDAR), VoxelNet, PIXOR, A3DODWTDA, A3DODWTDA (image), F-PC_CNN, TopNet-HighRes, TopNet-DecayRate, 3D-FCN (inference time > 5 s), BirdNet, Pseudo-LiDAR, MonoFusion, OFT-NET, and 3D-SSMFCNN, grouped into LiDAR, LiDAR+Camera, and Camera approaches; the axes are average precision versus reported runtime in ms.)

When developing methods for deep multi-modal object detection or semantic segmentation, it is important to consider the input data: Are there any multi-modal datasets available and how is the data labeled (cf. Tab. II)? Do the datasets cover diverse driving scenarios (cf. Sec. VI-A1)? Is the data of high quality (cf. Sec. VI-A2)? Additionally, we need to answer several important questions on designing the neural network architecture: Which modalities should be combined via fusion, and how to represent and process them properly ("What to fuse", cf. Sec. VI-B1)? Which fusion operations and methods can be used ("How to fuse", cf. Sec. VI-B2)? Which stage of feature representation is optimal for fusion ("When to fuse", cf. Sec. VI-B2)?

A. Related Works

Despite the fact that many methods have been proposed for deep multi-modal perception in autonomous driving, there is no published summary examining available multi-modal datasets, and there is no guideline for network architecture design. Yin et al. [7] summarize 27 datasets for autonomous driving that were published between 2006 and 2016, including datasets recorded with a single camera alone or with multiple sensors. However, many new multi-modal datasets have been released since 2016, and it is worth summarizing them. Ramachandram et al. [8] provide an overview on deep multi-modal learning, and mention its applications in diverse research fields, such as robotic grasping and human action recognition. Janai et al. [9] conduct a comprehensive summary on computer vision problems for autonomous driving, such as scene flow and scene reconstruction. Recently, Arnold et al. [10] survey the 3D object detection problem in autonomous driving. They summarize methods based on monocular images or point clouds, and briefly mention some works that fuse vision camera and LiDAR information.

B. Contributions

To the best of our knowledge, there is no survey that focuses on deep multi-modal object detection (2D or 3D) and semantic segmentation for autonomous driving, which makes it difficult for beginners to enter this research field. Our review paper attempts to narrow this gap by conducting a summary of newly-published datasets (2013-2019) and fusion methodologies for deep multi-modal perception in autonomous driving, as well as by discussing the remaining challenges and open questions.

We first provide background information on multi-modal sensors, test vehicles, and modern deep learning approaches in object detection and semantic segmentation in Sec. II. We then summarize multi-modal datasets and perception problems in Sec. III and Sec. IV, respectively. Sec. V summarizes the fusion methodologies regarding "what to fuse", "when to fuse" and "how to fuse". Sec. VI discusses challenges and open questions when developing deep multi-modal perception systems in order to fulfill the requirements of "accuracy", "robustness" and "real-time", with a focus on data preparation and fusion methodology. We highlight the importance of data diversity, temporal and spatial alignment, and labeling efficiency for multi-modal data preparation. We also highlight the lack of research on fusing Radar signals, as well as the importance of developing fusion methodologies that tackle open dataset problems or increase network robustness. Sec. VII concludes this work. In addition, we provide an interactive online platform for navigating topics and methods for each reference. The platform can be found here: https://boschresearch.github.io/multimodalperception/.

II. BACKGROUND

This section provides the background information for deep multi-modal perception in autonomous driving. First, we briefly summarize typical automotive sensors, their sensing modalities, and some vehicles for test and research purposes. Next, we introduce deep object detection and semantic segmentation. Since deep learning has most commonly been applied to image-based signals, here we mainly discuss image-based methods. We will introduce other methods that process LiDAR and Radar data in Sec. V-A. For a more comprehensive overview on object detection and semantic segmentation, we refer the interested reader to the review papers [11], [12]. For a complete review of computer vision problems in autonomous driving (e.g. optical flow, scene reconstruction, motion estimation), cf. [9].

A. Sensing Modalities for Autonomous Driving

1) Visual and Thermal Cameras: Images captured by visual and thermal cameras can provide detailed texture information of a vehicle's surroundings. While visual cameras are sensitive to lighting and weather conditions, thermal cameras are more robust to daytime/nighttime changes, as they detect infrared radiation that relates to heat from objects. However, both types of cameras cannot directly provide depth information.
B. Recording Conditions

(Figure: normalized proportion and number of image frames for the classes Car (1.4M), Person, and Cyclist.)

Even though the KITTI dataset [75] is widely used for autonomous driving research, the diversity of its recording conditions is relatively low: it is recorded in Karlsruhe - a
needs to be estimated. Therefore, accurate depth information provided by LiDAR sensors is highly beneficial. In this regard, some papers including [98], [102]-[105], [107], [113], [115] combine RGB camera images and LiDAR point clouds for 3D object detection. In addition, Liang et al. [116] propose a multi-task learning network to aid 3D object detection. The auxiliary tasks include camera depth completion, ground plane estimation, and 2D object detection. How to represent the modalities properly is discussed in Sec. V-A.

3) What to detect: Complex driving scenarios often contain different types of road users. Among them, cars, cyclists, and pedestrians are highly relevant to autonomous driving. In this regard, [98], [99], [106], [108], [110] employ multi-modal neural networks for car detection; [101], [108], [109], [117]-[120] focus on detecting non-motorized road users (pedestrians or cyclists); [61], [91], [100], [102]-[105], [111], [115], [116] detect both.

B. Deep Multi-modal Semantic Segmentation

Compared to the object detection problem summarized in Sec. IV-A, there are fewer works on multi-modal semantic segmentation: [92], [119], [124] employ RGB and thermal images, [61] fuses RGB images and depth images from a stereo camera, [125]-[127] combine RGB, thermal, and depth images for semantic segmentation in diverse environments such as forests, [123] fuses RGB images and LiDAR point clouds for off-road terrain segmentation and [128]-[132] for road segmentation. Apart from the above-mentioned works for semantic segmentation on the 2D image plane, [125], [133] deal with 3D segmentation on LiDAR points.

V. METHODOLOGY

When designing a deep neural network for multi-modal perception, three questions need to be addressed - What to fuse: what sensing modalities should be fused, and how to represent and process them in an appropriate way; How to fuse: what fusion operations should be utilized; When to fuse: at which stage of feature representation in a neural network should the sensing modalities be combined. In this section, we summarize existing methodologies based on these three aspects.

A. What to Fuse

LiDARs and cameras (visual cameras, thermal cameras) are the most common sensors for multi-modal perception in the literature. While the interest in processing Radar signals via deep learning is growing, only a few papers discuss deep multi-modal perception with Radar for autonomous driving (e.g. [134]). Therefore, we focus on several ways to represent and process LiDAR point clouds and camera images separately, and discuss how to combine them together. In addition, we briefly summarize Radar perception using deep learning.

1) LiDAR Point Clouds: LiDAR point clouds provide both depth and reflectance information of the environment. The depth information of a point p can be encoded by its Cartesian coordinates [x, y, z], distance $\sqrt{x^2 + y^2 + z^2}$, density, or HHA features (Horizontal disparity, Height, Angle) [66], or any other 3D coordinate system. The reflectance information is given by intensity.

There are mainly three ways to process point clouds. One way is by discretizing the 3D space into 3D voxels and assigning the points to the voxels (e.g. [29], [113], [135]-[137]). In this way, the rich 3D shape information of the driving environment can be preserved. However, this method results in many empty voxels, as the LiDAR points are usually sparse and irregular. Processing the sparse data via clustering (e.g. [100], [106]-[108]) or 3D CNNs (e.g. [29], [136]) is usually very time-consuming and infeasible for online autonomous driving. Zhou et al. [135] propose a voxel feature encoding (VFE) layer to process the LiDAR points efficiently for 3D object detection. They report an inference time of 225 ms on the KITTI dataset. Yan et al. [138] add several sparse convolutional layers after the VFE to convert the sparse voxel data into 2D images, and then perform 3D object detection on them. Unlike the common convolution operation, the sparse convolution only computes on the locations associated with input points. In this way, they save a lot of computational cost, achieving an inference time of only 25 ms.

The second way is to directly learn over 3D LiDAR points in a continuous vector space without voxelization. PointNet [139] and its improved version PointNet++ [140] propose to predict individual features for each point and aggregate the features from several points via max pooling. This method was first introduced for 3D object recognition and later extended by Qi et al. [105], Xu et al. [104] and Shin et al. [141] to 3D object detection in combination with RGB images. Furthermore, Wang et al. [142] propose a new learnable operator called Parametric Continuous Convolution to aggregate points via a weighted sum, and Li et al. [143] propose to learn a χ transformation before applying the transformed point cloud features to a standard CNN. They are tested on semantic segmentation or LiDAR motion estimation tasks.

A third way to represent 3D point clouds is by projecting them onto 2D grid-based feature maps so that they can be processed via 2D convolutional layers. In the following, we distinguish among the spherical map, the camera-plane map (CPM), as well as the bird's eye view (BEV) map. Fig. 6 illustrates different LiDAR representations in 2D.
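Before describing these maps in more detail, the following minimal NumPy sketch (not code from any cited method) shows how such grid-based representations can be produced from a point cloud; the ranges, resolutions, and channel choices are illustrative assumptions.

```python
import numpy as np

def to_bev_map(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0), res=0.1):
    """Rasterize [x, y, z, intensity] points into a 2-channel bird's eye view (BEV) grid."""
    x, y, z, intensity = points.T
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z, intensity = x[keep], y[keep], z[keep], intensity[keep]
    rows = ((x - x_range[0]) / res).astype(int)
    cols = ((y - y_range[0]) / res).astype(int)
    h = int(round((x_range[1] - x_range[0]) / res))
    w = int(round((y_range[1] - y_range[0]) / res))
    bev = np.zeros((2, h, w), dtype=np.float32)   # channel 0: max height, channel 1: max intensity
    np.maximum.at(bev[0], (rows, cols), z)        # cells without points simply stay 0 in this toy version
    np.maximum.at(bev[1], (rows, cols), intensity)
    return bev

def to_spherical_map(points, h=64, w=512, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project points onto an azimuth/zenith grid (a dense front-view range image)."""
    x, y, z, intensity = points.T
    depth = np.linalg.norm(points[:, :3], axis=1) + 1e-6
    azimuth = np.arctan2(y, x)
    zenith = np.arcsin(z / depth)
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    u = (((1.0 - (azimuth + np.pi) / (2 * np.pi)) * w).astype(int)) % w
    v = ((fov_up - zenith) / (fov_up - fov_down) * h).clip(0, h - 1).astype(int)
    sph = np.zeros((2, h, w), dtype=np.float32)   # channel 0: range, channel 1: intensity
    sph[0, v, u] = depth
    sph[1, v, u] = intensity
    return sph
```

Once a point cloud is rasterized this way, fusing it with camera features reduces to ordinary 2D feature-map fusion; producing a camera-plane map would additionally require the LiDAR-to-camera calibration matrix.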
A spherical map is obtained by projecting each 3D point onto a sphere, characterized by azimuth and zenith angles. It has the advantage of representing each 3D point in a dense and compact way, making it a suitable representation for point cloud segmentation (e.g. [51]). However, the size of the representation can be different from camera images. Therefore, it is difficult to fuse them at an early stage. A CPM can be produced by projecting the 3D points into the camera coordinate system, provided the calibration matrix. A CPM can be directly fused with camera images, as their sizes are the same. However, this representation leaves many pixels empty. Therefore, many methods have been proposed to up-sample such a sparse feature map, e.g. mean average [111], nearest neighbors [144], or bilateral filter [145]. Compared to the above-mentioned feature maps which encode LiDAR information in the front-view, a BEV map avoids occlusion
Fig. 8: An illustration of early fusion, late fusion, and several middle fusion methods.
the layer from which intermediate features begin to be fused. The middle fusion can be executed at this layer only once:

$f_L = G_L \cdots G_{l+1}\big(G_l^{M_i} \cdots G_1^{M_i}(f_0^{M_i}) \oplus G_l^{M_j} \cdots G_1^{M_j}(f_0^{M_j})\big)$.  (4)

Alternatively, they can be fused hierarchically, such as by deep fusion [98], [162]:

$f_{l+1} = f_l^{M_i} \oplus f_l^{M_j}$, $\quad f_{k+1} = G_k^{M_i}(f_k) \oplus G_k^{M_j}(f_k)$, $\quad \forall k: k \in \{l+1, \cdots, L\}$,  (5)

or "short-cut fusion" [92]:

$f_{l+1} = f_l^{M_i} \oplus f_l^{M_j}$, $\quad f_{k+1} = f_k \oplus f_{k^\star}^{M_i} \oplus f_{k^\star}^{M_j}$, $\quad \forall k: k \in \{l+1, \cdots, L\}$; $\ \exists k^\star: k^\star \in \{1, \cdots, l-1\}$.  (6)

Although the middle fusion approach is highly flexible, it is not easy to find the "optimal" way to fuse intermediate layers given a specific network architecture. We will discuss this challenge in detail in Sec. VI-B3.
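As a concrete reading of Eqs. (4) and (5), the following PyTorch sketch contrasts a single middle fusion with deep fusion. The block definitions, channel sizes, and the choice of concatenation or addition for the join operator ⊕ are illustrative assumptions rather than an implementation of any cited network, and both inputs are assumed to share the same spatial resolution (e.g. an RGB image and a 2-channel camera-plane LiDAR map).

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    """A stand-in for one feature extraction stage G_k."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU())

class MiddleFusionOnce(nn.Module):
    """Eq. (4): modality-specific layers 1..l, a single join, then shared layers l+1..L."""
    def __init__(self, c_img=3, c_lidar=2, c=32, l=2, L=4):
        super().__init__()
        self.branch_i = nn.Sequential(block(c_img, c), *[block(c, c) for _ in range(l - 1)])
        self.branch_j = nn.Sequential(block(c_lidar, c), *[block(c, c) for _ in range(l - 1)])
        self.shared = nn.Sequential(block(2 * c, c), *[block(c, c) for _ in range(L - l - 1)])

    def forward(self, img, lidar):
        fused = torch.cat([self.branch_i(img), self.branch_j(lidar)], dim=1)  # join ⊕ as concatenation
        return self.shared(fused)

class DeepFusion(nn.Module):
    """Eq. (5): after the first join, two parallel G_k stacks whose outputs are re-fused at every layer."""
    def __init__(self, c_img=3, c_lidar=2, c=32, l=2, L=4):
        super().__init__()
        self.branch_i = nn.Sequential(block(c_img, c), *[block(c, c) for _ in range(l - 1)])
        self.branch_j = nn.Sequential(block(c_lidar, c), *[block(c, c) for _ in range(l - 1)])
        self.g_i = nn.ModuleList([block(c, c) for _ in range(L - l)])
        self.g_j = nn.ModuleList([block(c, c) for _ in range(L - l)])

    def forward(self, img, lidar):
        f = self.branch_i(img) + self.branch_j(lidar)   # f_{l+1} = f_l^{Mi} ⊕ f_l^{Mj}, here ⊕ = addition
        for g_i, g_j in zip(self.g_i, self.g_j):
            f = g_i(f) + g_j(f)                         # f_{k+1} = G_k^{Mi}(f_k) ⊕ G_k^{Mj}(f_k)
        return f
```

Short-cut fusion as in Eq. (6) would additionally re-inject the modality-specific features from an earlier layer k* at every subsequent fusion step.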
4) Fusion in Object Detection Networks: Modern multi-modal object detection networks usually follow either the two-stage pipeline (RCNN [35], Fast-RCNN [37], Faster-RCNN [41]) or the one-stage pipeline (YOLO [44] and SSD [45]), as explained in detail in Sec. II-C. This offers a variety of alternatives for network fusion. For instance, the sensing modalities can be fused to generate region proposals for a two-stage object detector. The regional multi-modal features for each proposal can be fused as well. Ku et al. [103] propose AVOD, an object detection network that fuses RGB images and LiDAR BEV images both in the region proposal network and the header network. Kim et al. [109] ensemble the region proposals that are produced by LiDAR depth images and RGB images separately. The joint region proposals are then fed to a convolutional network for final object detection. Chen et al. [98] use LiDAR BEV maps to generate region proposals. For each ROI, the regional features from the LiDAR BEV maps are fused with those from the LiDAR front-view maps as well as camera images via deep fusion. Compared to object detections from LiDAR point clouds, camera images have been well investigated, with larger labeled datasets and better 2D detection performance. Therefore, it is straightforward to exploit the predictions from well-trained image detectors when doing camera-LiDAR fusion. In this regard, [104], [105], [107] propose to utilize a pre-trained image detector to produce 2D bounding boxes, which build frustums in LiDAR point clouds. Then, they use these point clouds within the frustums for 3D object detection. Fig. 9 shows some exemplary fusion architectures for two-stage object detection networks. Tab. III summarizes the methodologies for multi-modal object detection.

5) Fusion Operation and Fusion Scheme: Based on the papers that we have reviewed, feature concatenation is the most common operation, especially at early and middle stages. Element-wise averaging and addition operations are additionally used for middle fusion. Ensemble and Mixture of Experts approaches are often used for middle to decision level fusion.

VI. CHALLENGES AND OPEN QUESTIONS

As discussed in the Introduction (cf. Sec. I), developing deep multi-modal perception systems is especially challenging for autonomous driving because it has high requirements in accuracy, robustness, and real-time performance. The predictions from object detection or semantic segmentation are usually transferred to other modules such as maneuver prediction and decision making. A reliable perception system is the prerequisite for a driverless car to run safely in uncontrolled and complex driving environments. In Sec. III and Sec. V we have summarized the multi-modal datasets and fusion methodologies.
Fig. 9: Exemplary fusion architectures for two-stage object detection networks. (a) MV3D [98]; (b) AVOD [103]; (c) Frustum PointNet [105]; (d) Ensemble Proposals [109]. (The diagrams show how RGB camera images, point cloud segmentation, 2D ROI generation, fusion or proposal ensembling, projection and pooling, and 2D/3D object detection stages are arranged in each architecture.)
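As a minimal illustration of the frustum idea behind architecture (c), the following NumPy sketch keeps only the LiDAR points whose image projections fall inside a 2D detection box. The projection matrix P (assumed to map LiDAR coordinates to pixels, i.e. to already include the LiDAR-to-camera extrinsics) and the box format are assumptions for illustration, not the interface of Frustum PointNet.

```python
import numpy as np

def points_in_frustum(points_xyz, P, box):
    """Return the LiDAR points (N, 3) whose projections lie inside a 2D box [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = box
    homog = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # (N, 4) homogeneous coordinates
    uvw = homog @ P.T                                                   # project onto the image plane
    in_front = uvw[:, 2] > 0                                            # keep points in front of the camera
    u = uvw[:, 0] / np.clip(uvw[:, 2], 1e-6, None)
    v = uvw[:, 1] / np.clip(uvw[:, 2], 1e-6, None)
    inside = (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2) & in_front
    return points_xyz[inside]
```

The cropped point set can then be passed to a point-based 3D detector, which is the essence of the frustum-based pipelines described above.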
Data quality
  Challenges: • Labeling errors. • Spatial and temporal misalignment of different sensors.
  Open questions: • Teaching network robustness with erroneous and noisy labels. • Integrating prior knowledge in networks. • Developing methods (e.g. using deep learning) to automatically register sensors.

Fusion methodology - "What to fuse"
  Challenges: • Too few sensing modalities are fused. • Lack of studies for different feature representations.
  Open questions: • Fusing multiple sensors with the same modality. • Fusing more sensing modalities, e.g. Radar, Ultrasonic, V2X communication. • Fusing with physical models and prior knowledge, also possible in the multi-task learning scheme. • Comparing different feature representations w.r.t. informativeness and computational costs.

Fusion methodology - "How to fuse"
  Challenges: • Lack of uncertainty quantification for each sensor channel. • Too simple fusion operations.
  Open questions: • Uncertainty estimation via e.g. Bayesian neural networks (BNN). • Propagating uncertainties to other modules, such as tracking and motion planning. • Anomaly detection by generative models. • Developing fusion operations that are suitable for network pruning and compression.

Fusion methodology - "When to fuse"
  Challenges: • Fusion architecture is often designed by empirical results; no guideline for optimal fusion architecture design. • Lack of study for accuracy/speed or memory/robustness trade-offs.
  Open questions: • Optimal fusion architecture search. • Incorporating requirements of computation time or memory as a regularization term. • Using visual analytics tools to find the optimal fusion architecture.

Others - Evaluation metrics
  Challenges: • Current metrics focus on comparing networks' accuracy.
  Open questions: • Metrics to quantify the networks' robustness should be developed and adapted to multi-modal perception problems.

Others - More network architectures
  Challenges: • Current networks lack temporal cues and cannot guarantee prediction consistency over time. • They are designed mainly for modular autonomous driving.
  Open questions: • Using Recurrent Neural Networks (RNN) for sequential perception. • Multi-modal end-to-end learning or multi-modal direct perception.
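To make the uncertainty-estimation entry under "How to fuse" above more tangible, here is a minimal PyTorch sketch of Monte Carlo dropout, one common approximation to a Bayesian neural network. The two-input model(img, lidar) interface and the sample count are assumptions for illustration, not a method prescribed by the reviewed papers.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_predict(model, img, lidar, n_samples=20):
    """Run several stochastic forward passes and return mean class probabilities and their spread."""
    model.train()                      # keep dropout active; in practice, batch-norm layers should stay frozen
    probs = torch.stack([F.softmax(model(img, lidar), dim=-1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)           # averaged class probabilities
    std = probs.std(dim=0)             # disagreement across samples as an uncertainty proxy
    entropy = -(mean * mean.clamp_min(1e-9).log()).sum(dim=-1)   # predictive entropy per input
    return mean, std, entropy
```

Such per-modality or per-output uncertainties could then be weighted during fusion or propagated downstream, e.g. to tracking and motion planning.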
Correspondingly, in this section we discuss the remaining challenges and open questions for multi-modal data

Fig. 11: (a) An illustration of the influence of label quality on the performance of an object detection network [196]. The network is trained on labels which are incrementally disturbed. The performance is measured by mAP normalized to the performance trained on the undisturbed dataset. The network is much more robust against random labeling errors (drawn from a Gaussian distribution with variance σ) than against biased labeling (all labels shifted by σ), cf. [194], [195]. (b) An illustration of the random labeling noises and labeling biases (all bounding boxes are shifted in the upper-right direction).

research topic to leverage lifelong learning [187] to update the multi-modal perception network with continual data collection.

2) Data Quality and Alignment: Besides data diversity and the size of the training dataset, data quality significantly affects the performance of a deep multi-modal perception system as well. Training data is usually labeled by human annotators to ensure high labeling quality. However, humans are also prone to making errors. Fig. 11 shows two different errors in the labeling process when training an object detection network. The network is much more robust against labeling errors when they are randomly distributed, compared to biased labeling from the use of a deterministic pre-labeling. Training networks with erroneous labels is further studied in [188]-[191]. The impact of weak or erroneous labels on the performance of deep learning based semantic segmentation is investigated in [192], [193]. The influence of labelling errors on the accuracy of object detection is discussed in [194], [195].
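The two disturbance types compared in Fig. 11 can be simulated with a few lines of NumPy; the box format [x1, y1, x2, y2] in pixels and the noise scale are assumptions for illustration.

```python
import numpy as np

def perturb_labels(boxes, sigma=10.0, mode="random", rng=None):
    """Disturb ground-truth boxes either by per-coordinate Gaussian noise or by a fixed offset."""
    rng = np.random.default_rng() if rng is None else rng
    boxes = np.asarray(boxes, dtype=np.float64)
    if mode == "random":
        # zero-mean random noise: each coordinate moves independently, errors average out over the dataset
        offset = rng.normal(0.0, sigma, size=boxes.shape)
    else:
        # systematic bias, e.g. from a deterministic pre-labeling tool:
        # every box is shifted by sigma towards the upper-right corner (+x, -y in image coordinates)
        offset = np.tile([sigma, -sigma, sigma, -sigma], (boxes.shape[0], 1))
    return boxes + offset
```

Training on the "random" variant degrades mAP far more gracefully than training on the biased variant, which is the effect visualized in Fig. 11(a).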
Well-calibrated sensors are the prerequisite for accurate and robust multi-modal perception systems. However, the sensor setup is usually not perfect. Temporal and spatial sensing misalignments might occur while recording the training data or deploying the perception modules. This could cause severe errors in training datasets and degrade the performance of networks, especially for those which are designed to implicitly learn the sensor alignment (e.g. networks that fuse LiDAR BEV feature maps and front-view camera images, cf. Sec. V-A3). Interestingly, several works propose to calibrate sensors by deep neural networks: Giering et al. [197] discretize the spatial misalignments between LiDAR and visual camera into nine classes, and build a network to classify the misalignment taking LiDAR and RGB images as inputs; Schneider et al. [198] propose to fully regress the extrinsic calibration parameters between LiDAR and visual camera by deep learning. Several multi-modal CNN networks are trained on different de-calibration ranges to iteratively refine the calibration output. In this way, the feature extraction, feature matching, and global optimization problems for sensor registration could be solved

B. Fusion Methodology

1) What to Fuse: Most reviewed methods combine RGB images with thermal images or LiDAR 3D points. The networks are trained and evaluated on open datasets such as KITTI [6] and KAIST Pedestrian [93]. These methods do not specifically focus on sensor redundancy, e.g. installing multiple cameras on a driverless car to increase the reliability of perception systems even when some sensors are defective. How to fuse the sensing information from multiple sensors (e.g. RGB images from multiple cameras) is an important open question.

Another challenge is how to represent and process different sensing modalities appropriately before feeding them into a fusion network. For instance, many approaches exist to represent LiDAR point clouds, including 3D voxels, 2D BEV maps, spherical maps, as well as sparse or dense depth maps (more details cf. Sec. V-A). However, only Pfeuffer et al. [111] have studied the pros and cons of several LiDAR front-view representations. We expect more works to compare different 3D point representation methods.

In addition, there are very few studies for fusing LiDAR and camera outputs with signals from other sources such as Radars, ultrasonics or V2X communication. Radar data differs from LiDAR data, and it requires different network architectures and fusion schemes. So far, we are not aware of any work fusing Ultrasonic sensor signals in deep multi-modal perception, despite their relevance for low-speed scenarios. How to fuse these sensing modalities and align them temporally and spatially are big challenges.

Finally, it is an interesting topic to combine physical constraints and model-based approaches with data-driven neural networks. For example, Ramos et al. [199] propose to fuse semantic and geometric cues in a Bayesian framework for unexpected object detection. The semantics are predicted by an FCN, whereas the geometric cues are provided by model-based stereo detections. The multi-task learning scheme also helps to add physical constraints in neural networks. For example, to aid the 3D object detection task, Liang et al. [116] design a fusion network that additionally estimates the LiDAR ground plane and camera image depth. The ground plane estimation provides useful cues for object locations, while the image depth completion contributes to a better cross-modal representation; panoptic segmentation [47] aims to achieve complete scene understanding by jointly doing semantic segmentation and instance segmentation.

2) How to Fuse: Explicitly modeling the uncertainty or informativeness of each sensing modality is important for safe autonomous driving. As an example, a multi-modal perception system should show higher uncertainty in adverse weather or detect unseen driving environments (open-world problem). It should also reflect sensor degradation or defects. The perception uncertainties need to be propagated to other modules such as motion planning [200] so that the autonomous vehicles can behave accordingly. Reliable uncertainty estimation can show the networks' robustness (cf.
Tab. VI summarizes the inference speed of several object detection and semantic segmentation networks on the KITTI test set. Each method uses different hardware, and the inference time is reported only by the authors. It is an open question how these methods perform when they are deployed on automotive hardware.

C. Others

1) Evaluation Metrics: The common way to evaluate object detection methods is the mean average precision (mAP) [6], [240]. It is the mean value of the average precision (AP) over object classes, given a certain intersection over union (IoU) threshold defined as the geometric overlap between predictions and ground truths. As for pixel-level semantic segmentation, metrics such as average precision, false positive rate, false negative rate, and IoU calculated at the pixel level [57] are often used. However, these metrics only summarize the prediction accuracy on a test dataset. They do not consider how a sensor behaves in different situations. As an example, to evaluate the performance of a multi-modal network, the IoU thresholds should depend on object distance, occlusion, and types of sensors.

Furthermore, common evaluation metrics are not designed specifically to illustrate how the algorithm handles open-set conditions or situations where some sensors are degraded or defective. There exist several metrics to evaluate the quality of predictive uncertainty, e.g. empirical calibration curves [241] and log predictive probabilities. The detection error [242] measures the effectiveness of a neural network in distinguishing in- and out-of-distribution data. The Probability-based Detection Quality (PDQ) [243] is designed to measure the object detection performance for spatial and semantic uncertainties. These metrics can be adapted to the multi-modal perception problems to compare the networks' robustness.

2) More Network Architectures: Most reviewed methods are based on CNN architectures for single-frame perception. The predictions in a frame are not dependent on previous frames, resulting in inconsistency over time. Only a few works incorporate temporal cues (e.g. [122], [244]). Future work is expected to develop multi-modal perception algorithms that can handle time series, e.g. via Recurrent Neural Networks. Furthermore, current methods are designed to propagate results to other modules in autonomous driving, such as localization, planning, and reasoning. While the modular approach is the common pipeline for autonomous driving, some works also try to map the sensor data directly to the decision policy such as steering angles or pedal positions (end-to-end learning) [245]-[247], or to some intermediate environment representations (direct perception) [248], [249]. Multi-modal end-to-end learning and direct perception can be potential research directions as well.

VII. CONCLUSION AND DISCUSSION

We have presented our survey for deep multi-modal object detection and segmentation applied to autonomous driving. We have provided a summary of both multi-modal datasets and fusion methodologies, considering "what to fuse", "how to fuse", and "when to fuse". We have also discussed challenges and open questions. Furthermore, our interactive online tool allows readers to navigate topics and methods for each reference. We plan to frequently update this tool.

Despite the fact that an increasing number of multi-modal datasets have been published, most of them record data from RGB cameras, thermal cameras, and LiDARs. Correspondingly, most of the papers we reviewed fuse RGB images either with thermal images or with LiDAR point clouds. Only recently has the fusion of Radar data been investigated. This includes the nuScenes dataset [89], the Oxford Radar RobotCar Dataset [85], the Astyx HiRes2019 Dataset [94], and the seminal work from Chadwick et al. [134] that proposes to fuse RGB camera images with Radar points for vehicle detection. In the future, we expect more datasets and fusion methods concerning Radar signals.

There are various ways to fuse sensing modalities in neural networks, encompassing different sensor representations (cf. Sec. V-A), fusion operations (cf. Sec. V-B), and fusion stages (cf. Sec. V-C). However, we do not find conclusive evidence that one fusion method is better than the others. Additionally, there is a lack of research on multi-modal perception in open-set conditions or with sensor failures. We expect more focus on these challenging research topics.

ACKNOWLEDGMENT

We thank Fabian Duffhauss for collecting literature and reviewing the paper. We also thank Bill Beluch, Rainer Stal, Peter Möller and Ulrich Michael for their suggestions and inspiring discussions.

REFERENCES

[1] E. D. Dickmanns and B. D. Mysliwetz, "Recursive 3-d road and relative ego-state recognition," IEEE Trans. Pattern Anal. Mach. Intell., no. 2, pp. 199–213, 1992.
[2] C. Urmson et al., "Autonomous driving in urban environments: Boss and the urban challenge," J. Field Robotics, vol. 25, no. 8, pp. 425–466, 2008.
[3] R. Berger, "Autonomous driving," Think Act, 2014. [Online]. Available: http://www.rolandberger.ch/media/pdf/Roland Berger TABAutonomousDrivingfinal20141211
[4] G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder, "The Mapillary Vistas dataset for semantic understanding of street scenes," in Proc. IEEE Conf. Computer Vision, Oct. 2017, pp. 5000–5009.
[5] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[6] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012.
[7] H. Yin and C. Berger, "When to use what data set for your self-driving car algorithm: An overview of publicly available driving datasets," in IEEE 20th Int. Conf. Intelligent Transportation Systems, 2017, pp. 1–8.
[8] D. Ramachandram and G. W. Taylor, "Deep multimodal learning: A survey on recent advances and trends," IEEE Signal Process. Mag., vol. 34, no. 6, pp. 96–108, 2017.
[9] J. Janai, F. Güney, A. Behl, and A. Geiger, "Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art," arXiv:1704.05519 [cs.CV], 2017.
[10] E. Arnold, O. Y. Al-Jarrah, M. Dianati, S. Fallah, D. Oxtoby, and A. Mouzakitis, "A survey on 3d object detection methods for autonomous driving applications," IEEE Trans. Intell. Transp. Syst., pp. 1–14, 2019.
[11] L. Liu et al., "Deep learning for generic object detection: A survey," arXiv:1809.02165 [cs.CV], 2018.
[12] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and [37] R. Girshick, “Fast R-CNN,” in Proc. IEEE Conf. Computer Vision,
J. Garcia-Rodriguez, “A review on deep learning techniques applied to 2015, pp. 1440–1448.
semantic segmentation,” Applied Soft Computing, 2017. [38] K. Simonyan and A. Zisserman, “Very deep convolutional networks
[13] K. Bengler, K. Dietmayer, B. Farber, M. Maurer, C. Stiller, and for large-scale image recognition,” arXiv:1409.1556 [cs.CV], 2014.
H. Winner, “Three decades of driver assistance systems: Review and [39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
future perspectives,” IEEE Intell. Transp. Syst. Mag., vol. 6, no. 4, pp. image recognition,” in Proc. IEEE Conf. Computer Vision and Pattern
6–22, 2014. Recognition, 2016, pp. 770–778.
[14] Waymo. (2017) Waymo safety report: On the road to fully self-driving. [40] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE
[Online]. Available: https://waymo.com/safety Conf. Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[15] M. Aeberhard et al., “Experience, results and lessons learned from [41] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
automated driving on Germany’s highways,” IEEE Intell. Transp. Syst. time object detection with region proposal networks,” in Advances in
Mag., vol. 7, no. 1, pp. 42–57, 2015. Neural Information Processing Systems, 2015, pp. 91–99.
[16] J. Ziegler et al., “Making Bertha drive – an autonomous journey on a [42] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-
historic route,” IEEE Intell. Transp. Syst. Mag., vol. 6, no. 2, pp. 8–20, based fully convolutional networks,” in Advances in Neural Information
2014. Processing Systems, 2016, pp. 379–387.
[17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn,
[43] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object
and A. Zisserman, “The PASCAL Visual Object Classes
detection,” in Advances in Neural Information Processing Systems,
Challenge 2007 (VOC2007) Results,” http://www.pascal-
2013, pp. 2553–2561.
network.org/challenges/VOC/voc2007/workshop/index.html.
[18] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in [44] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only
Proc. Eur. Conf. Computer Vision. Springer, 2014, pp. 740–755. look once: Unified, real-time object detection,” in Proc. IEEE Conf.
[19] M. Weber, P. Wolf, and J. M. Zöllner, “DeepTLR: A single deep Computer Vision and Pattern Recognition, 2016, pp. 779–788.
convolutional network for detection and classification of traffic lights,” [45] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. Eur. Conf.
in IEEE Intelligent Vehicles Symp., 2016, pp. 342–348. Computer Vision. Springer, 2016, pp. 21–37.
[20] J. Müller and K. Dietmayer, “Detecting traffic lights by single shot [46] J. Huang et al., “Speed/accuracy trade-offs for modern convolutional
detection,” in 21st Int. Conf. Intelligent Transportation Systems. IEEE, object detectors,” in Proc. IEEE Conf. Computer Vision and Pattern
2016, pp. 342–348. Recognition, vol. 4, 2017.
[21] M. Bach, S. Reuter, and K. Dietmayer, “Multi-camera traffic light [47] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic
recognition using a classifying labeled multi-bernoulli filter,” in IEEE segmentation,” in Proc. IEEE Conf. Computer Vision and Pattern
Intelligent Vehicles Symp., 2017, pp. 1045–1051. Recognition, 2018.
[22] K. Behrendt, L. Novak, and R. Botros, “A deep learning approach to [48] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature
traffic lights: Detection, tracking, and classification,” in IEEE Int. Conf. pyramid networks,” in Proc. IEEE Conf. Computer Vision and Pattern
Robotics and Automation, 2017, pp. 1370–1377. Recognition, 2019, pp. 6399–6408.
[23] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, “Traffic-sign [49] Y. Xiong et al., “Upsnet: A unified panoptic segmentation network,”
detection and classification in the wild,” in Proc. IEEE Conf. Computer in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2019,
Vision and Pattern Recognition, 2016, pp. 2110–2118. pp. 8818–8826.
[24] H. S. Lee and K. Kim, “Simultaneous traffic sign detection and [50] L. Porzi, S. R. Bulo, A. Colovic, and P. Kontschieder, “Seamless
boundary estimation using convolutional neural network,” IEEE Trans. scene segmentation,” in Proc. IEEE Conf. Computer Vision and Pattern
Intell. Transp. Syst., 2018. Recognition, 2019, pp. 8277–8286.
[25] H. Luo, Y. Yang, B. Tong, F. Wu, and B. Fan, “Traffic sign recognition [51] B. Wu, A. Wan, X. Yue, and K. Keutzer, “SqueezeSeg: Convolutional
using a multi-task convolutional neural network,” IEEE Trans. Intell. neural nets with recurrent CRF for real-time road-object segmentation
Transp. Syst., vol. 19, no. 4, pp. 1100–1111, 2018. from 3d lidar point cloud,” in IEEE Int. Conf. Robotics and Automation,
[26] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, May 2018, pp. 1887–1893.
“Towards reaching human performance in pedestrian detection,” IEEE [52] L. Caltagirone, S. Scheidegger, L. Svensson, and M. Wahde, “Fast
Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 973–986, 2018. lidar-based road detection using fully convolutional neural networks,”
[27] L. Zhang, L. Lin, X. Liang, and K. He, “Is Faster R-CNN doing well for in IEEE Intelligent Vehicles Symp., 2017, pp. 1019–1024.
pedestrian detection?” in Proc. Eur. Conf. Computer Vision. Springer, [53] Q. Huang, W. Wang, and U. Neumann, “Recurrent slice networks for
2016, pp. 443–457. 3d segmentation of point clouds,” in Proc. IEEE Conf. Computer Vision
[28] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun, and Pattern Recognition, 2018, pp. 2626–2635.
“3d object proposals using stereo imagery for accurate object class [54] A. Dewan, G. L. Oliveira, and W. Burgard, “Deep semantic classifica-
detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 5, pp. tion for 3d lidar data,” in IEEE/RSJ Int. Conf. Intelligent Robots and
1259–1272, 2018. Systems, 2017, pp. 3544–3549.
[29] B. Li, “3d fully convolutional network for vehicle detection in point [55] A. Dewan and W. Burgard, “DeepTemporalSeg: Temporally con-
cloud,” in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2017, sistent semantic segmentation of 3d lidar scans,” arXiv preprint
pp. 1513–1518. arXiv:1906.06962, 2019.
[30] B. Li, T. Zhang, and T. Xia, “Vehicle detection from 3d lidar using
[56] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss, “RangeNet++: Fast
fully convolutional network,” in Proc. Robotics: Science and Systems,
and Accurate LiDAR Semantic Segmentation,” in IEEE/RSJ Int. Conf.
Jun. 2016.
Intelligent Robots and Systems, 2019.
[31] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun,
“Monocular 3d object detection for autonomous driving,” in Proc. IEEE [57] M. Cordts et al., “The Cityscapes dataset for semantic urban scene
Conf. Computer Vision and Pattern Recognition, 2016, pp. 2147–2156. understanding,” in Proc. IEEE Conf. Computer Vision and Pattern
[32] J. Fang, Y. Zhou, Y. Yu, and S. Du, “Fine-grained vehicle model recog- Recognition, 2016, pp. 3213–3223.
nition using a coarse-to-fine convolutional neural network architecture,” [58] S. Wang et al., “TorontoCity: Seeing the world with a million eyes,”
IEEE Trans. Intell. Transp. Syst., vol. 18, no. 7, pp. 1782–1792, 2017. in Proc. IEEE Conf. Computer Vision, 2017, pp. 3028–3036.
[33] A. Mousavian, D. Anguelov, J. Flynn, and J. Košecká, “3d bounding [59] G. Neuhold, T. Ollmann, S. Rota Bulò, and P. Kontschieder, “The
box estimation using deep learning and geometry,” in Proc. IEEE Conf. mapillary vistas dataset for semantic understanding of street scenes,”
Computer Vision and Pattern Recognition, 2017, pp. 5632–5640. in Proc. IEEE Conf. Computer Vision, 2017. [Online]. Available:
[34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, https://www.mapillary.com/dataset/vistas
“OverFeat: Integrated recognition, localization and detection using [60] X. Huang et al., “The ApolloScape dataset for autonomous driving,” in
convolutional networks,” in Int. Conf. Learning Representations, 2013. Workshop Proc. IEEE Conf. Computer Vision and Pattern Recognition,
[35] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature 2018, pp. 954–960.
hierarchies for accurate object detection and semantic segmentation,” [61] L. Schneider et al., “Multimodal neural networks: RGB-D for semantic
in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2014, segmentation and object detection,” in Scandinavian Conf. Image
pp. 580–587. Analysis. Springer, 2017, pp. 98–109.
[36] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep [62] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep
convolutional networks for visual recognition,” IEEE Trans. Pattern convolutional encoder-decoder architecture for image segmentation,”
Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, 2015. IEEE Trans. Pattern Anal. Mach. Intell., no. 12, pp. 2481–2495, 2017.
[63] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun, [89] H. Caesar et al., “nuScenes: A multimodal dataset for autonomous
“MultiNet: Real-time joint semantic reasoning for autonomous driv- driving,” arXiv preprint arXiv:1903.11027, 2019.
ing,” in IEEE Intelligent Vehicles Symp., 2018. [90] S. Hwang, J. Park, N. Kim, Y. Choi, and I. So Kweon, “Multispectral
[64] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in pedestrian detection: Benchmark dataset and baseline,” in Proc. IEEE
Proc. IEEE Conf. Computer Vision, 2017, pp. 2980–2988. Conf. Computer Vision and Pattern Recognition, 2015, pp. 1037–1045.
[65] J. Uhrig, E. Rehder, B. Fröhlich, U. Franke, and T. Brox, “Box2Pix: [91] K. Takumi, K. Watanabe, Q. Ha, A. Tejero-De-Pablos, Y. Ushiku, and
Single-shot instance segmentation by assigning pixels to object boxes,” T. Harada, “Multispectral object detection for autonomous vehicles,”
in IEEE Intelligent Vehicles Symp., 2018. in Proc. Thematic Workshops of ACM Multimedia, 2017, pp. 35–43.
[66] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features [92] Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, and T. Harada, “MFNet:
from RGB-D images for object detection and segmentation,” in Proc. Towards real-time semantic segmentation for autonomous vehicles with
Eur. Conf. Computer Vision. Springer, 2014, pp. 345–360. multi-spectral scenes,” in IEEE/RSJ Int. Conf. Intelligent Robots and
[67] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Simultaneous Systems, 2017, pp. 5108–5115.
detection and segmentation,” in Proc. Eur. Conf. Computer Vision. [93] Y. Choi et al., “KAIST multi-spectral day/night data set for autonomous
Springer, 2014, pp. 297–312. and assisted driving,” IEEE Trans. Intell. Transp. Syst., vol. 19, no. 3,
[68] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks pp. 934–948, 2018.
for semantic segmentation,” in Proc. IEEE Conf. Computer Vision and [94] M. Meyer and G. Kuschk, “Automotive radar dataset for deep learning
Pattern Recognition, 2015, pp. 3431–3440. based 3d object detection,” in Proceedings of the 16th European Radar
[69] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, Conference, 2019.
“DeepLab: Semantic image segmentation with deep convolutional nets, [95] D. Kondermann et al., “Stereo ground truth with error bars,” in 12th
atrous convolution, and fully connected CRFs,” IEEE Trans. Pattern Asian Conf. on Computer Vision. Springer, 2014, pp. 595–610.
Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2018. [96] M. Larsson, E. Stenborg, L. Hammarstrand, T. Sattler, M. Pollefeys,
[70] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “ENet: A and F. Kahl, “A cross-season correspondence dataset for robust seman-
deep neural network architecture for real-time semantic segmentation,” tic segmentation,” in Proc. IEEE Conf. Computer Vision and Pattern
arXiv:1606.02147 [cs.CV], 2016. Recognition, 2019.
[71] A. Roy and S. Todorovic, “A multi-scale CNN for affordance segmen- [97] T. Sattler et al., “Benchmarking 6DOF outdoor visual localization in
tation in RGB images,” in Proc. Eur. Conf. Computer Vision. Springer, changing conditions,” in Proc. IEEE Conf. Computer Vision and Pattern
2016, pp. 186–201. Recognition, 2018, pp. 8601–8610.
[72] S. Zheng et al., “Conditional random fields as recurrent neural net- [98] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object
works,” in Proc. IEEE Conf. Computer Vision, 2015, pp. 1529–1537. detection network for autonomous driving,” in Proc. IEEE Conf.
[73] M. Siam, M. Gamal, M. Abdel-Razek, S. Yogamani, M. Jagersand, and Computer Vision and Pattern Recognition, 2017, pp. 6526–6534.
H. Zhang, “A comparative study of real-time semantic segmentation for [99] A. Asvadi, L. Garrote, C. Premebida, P. Peixoto, and U. J. Nunes,
autonomous driving,” in Workshop Proc. IEEE Conf. Computer Vision “Multimodal vehicle detection: fusing 3d-lidar and color camera data,”
and Pattern Recognition, 2018, pp. 587–597. Pattern Recognition Lett., 2017.
[74] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 year, 1000 [100] S.-I. Oh and H.-B. Kang, “Object detection and classification by
km: The Oxford RobotCar dataset,” Int. J. Robotics Research, vol. 36, decision-level fusion for intelligent vehicle systems,” Sensors, vol. 17,
no. 1, pp. 3–15, 2017. no. 1, p. 207, 2017.
[75] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: [101] J. Schlosser, C. K. Chow, and Z. Kira, “Fusing lidar and images for
The KITTI dataset,” Int. J. Robotics Research, 2013. pedestrian detection using convolutional neural networks,” in IEEE Int.
[76] J.-L. Blanco-Claraco, F.-Á. Moreno-Dueñas, and J. González-Jiménez, Conf. Robotics and Automation, 2016, pp. 2198–2205.
“The Málaga urban dataset: High-rate stereo and lidar in a realistic [102] Z. Wang, W. Zhan, and M. Tomizuka, “Fusing bird view lidar point
urban scenario,” Int. J. Robotics Research, vol. 33, no. 2, pp. 207–214, cloud and front view camera image for deep object detection,” in IEEE
2014. Intelligent Vehicles Symp., 2018.
[77] H. Jung, Y. Oto, O. M. Mozos, Y. Iwashita, and R. Kurazume, “Multi- [103] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander, “Joint 3d
modal panoramic 3d outdoor datasets for place categorization,” in proposal generation and object detection from view aggregation,” in
IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2016, pp. 4545– IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Oct. 2018, pp.
4550. 1–8.
[78] Y. Chen et al., “Lidar-video driving dataset: Learning driving policies [104] D. Xu, D. Anguelov, and A. Jain, “PointFusion: Deep sensor fusion for
effectively,” in Proc. IEEE Conf. Computer Vision and Pattern Recog- 3D bounding box estimation,” in Proc. IEEE Conf. Computer Vision
nition, 2018, pp. 5870–5878. and Pattern Recognition, 2018.
[79] A. Patil, S. Malla, H. Gang, and Y.-T. Chen, “The H3D dataset for [105] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum PointNets
full-surround 3D multi-object detection and tracking in crowded urban for 3d object detection from RGB-D data,” in Proc. IEEE Conf.
scenes,” in IEEE Int. Conf. Robotics and Automation, 2019. Computer Vision and Pattern Recognition, 2018.
[80] X. Jianru et al., “BLVD: Building a large-scale 5D semantics bench- [106] X. Du, M. H. Ang, and D. Rus, “Car detection for autonomous vehicle:
mark for autonomous driving,” in IEEE Int. Conf. Robotics and Lidar and vision fusion approach through deep learning framework,”
Automation, 2019. in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2017, pp. 749–
[81] R. Kesten et al. (2019) Lyft level 5 AV dataset 2019. [Online]. 754.
Available: https://level5.lyft.com/dataset/ [107] X. Du, M. H. Ang Jr., S. Karaman, and D. Rus, “A general pipeline for
[82] M.-F. Chang et al., “Argoverse: 3D tracking and forecasting with rich 3d detection of vehicles,” in IEEE Int. Conf. Robotics and Automation,
maps,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018.
June 2019. [108] D. Matti, H. K. Ekenel, and J.-P. Thiran, “Combining lidar space
[83] (2019) PandaSet: Public large-scale dataset for autonomous driving. clustering and convolutional neural networks for pedestrian detection,”
[Online]. Available: https://scale.com/open-datasets/pandaset in 14th IEEE Int. Conf. Advanced Video and Signal Based Surveillance,
[84] (2019) Waymo open dataset: An autonomous driving dataset. [Online]. 2017, pp. 1–6.
Available: https://www.waymo.com/open [109] T. Kim and J. Ghosh, “Robust detection of non-motorized road users
[85] D. Barnes, M. Gadd, P. Murcutt, P. Newman, and I. Posner, “The using deep learning on optical and lidar data,” in IEEE 19th Int. Conf.
Oxford radar RobotCar dataset: A radar extension to the Oxford Intelligent Transportation Systems, 2016, pp. 271–276.
RobotCar dataset,” arXiv preprint arXiv: 1909.01300, 2019. [Online]. [110] J. Kim, J. Koh, Y. Kim, J. Choi, Y. Hwang, and J. W. Choi, “Robust
Available: https://arxiv.org/pdf/1909.01300 deep multi-modal learning based on gated information fusion network,”
[86] Q.-H. Pham et al., “A*3D Dataset: Towards autonomous driving in in Asian Conf. Computer Vision, 2018.
challenging environments,” arXiv preprint arXiv: 1909.07541, 2019. [111] A. Pfeuffer and K. Dietmayer, “Optimal sensor data fusion architecture
[87] J. Geyer et al. (2019) A2D2: AEV autonomous driving dataset. for object detection in adverse weather conditions,” in Proc. 21st Int.
[Online]. Available: https://www.audi-electronics-venture.de/aev/web/ Conf. Information Fusion. IEEE, 2018, pp. 2592–2599.
en/driving-dataset.html [112] M. Bijelic, F. Mannan, T. Gruber, W. Ritter, K. Dietmayer, and F. Heide,
[88] M. Braun, S. Krebs, F. B. Flohr, and D. M. Gavrila, “EuroCity Persons: “Seeing through fog without seeing fog: Deep sensor fusion in the
A novel benchmark for person detection in traffic scenes,” IEEE Trans. absence of labeled training data,” in Proc. IEEE Conf. Computer Vision,
Pattern Anal. Mach. Intell., pp. 1–1, 2019. 2019.
[113] V. A. Sindagi, Y. Zhou, and O. Tuzel, “MVX-Net: Multimodal voxelnet [137] S. Shi, Z. Wang, X. Wang, and H. Li, “Part − A2 Net: 3d part-aware
for 3D object detection,” in IEEE Int. Conf. Robotics and Automation, and aggregation neural network for object detection from point cloud,”
2019. arXiv preprint arXiv:1907.03670, 2019.
[114] J. Dou, J. Xue, and J. Fang, “SEG-VoxelNet for 3D vehicle detection [138] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional
from rgb and lidar data,” in IEEE Int. Conf. Robotics and Automation. detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
IEEE, 2019, pp. 4362–4368. [139] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on
[115] Z. Wang and K. Jia, “Frustum convnet: Sliding frustums to aggregate point sets for 3d classification and segmentation,” in Proc. IEEE Conf.
local point-wise features for amodal 3D object detection,” in IEEE/RSJ Computer Vision and Pattern Recognition, Jul. 2017, pp. 77–85.
Int. Conf. Intelligent Robots and Systems. IEEE, 2019. [140] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical
[116] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, “Multi-task multi- feature learning on point sets in a metric space,” in Advances in Neural
sensor fusion for 3D object detection,” in Proc. IEEE Conf. Computer Information Processing Systems, 2017, pp. 5099–5108.
Vision and Pattern Recognition, 2019, pp. 7345–7353. [141] K. Shin, Y. P. Kwon, and M. Tomizuka, “RoarNet: A robust 3D
[117] J. Wagner, V. Fischer, M. Herman, and S. Behnke, “Multispectral object detection based on region approximation refinement,” in IEEE
pedestrian detection using deep fusion convolutional neural networks,” Intelligent Vehicles Symp., 2018.
in 24th Eur. Symp. Artificial Neural Networks, Computational Intelli- [142] S. Wang, S. Suo, M. Wei-Chiu, A. Pokrovsky, and R. Urtasun, “Deep
gence and Machine Learning, 2016, pp. 509–514. parametric continuous convolutional neural networks,” in Proc. IEEE
[118] S. W. Jingjing Liu, Shaoting Zhang and D. Metaxas, “Multispectral Conf. Computer Vision and Pattern Recognition, 2018, pp. 2589–2597.
deep neural networks for pedestrian detection,” in Proc. British Ma-
[143] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “PointCNN: Con-
chine Vision Conf., Sep. 2016, pp. 73.1–73.13.
volution on χ -transformed points,” in Advances in Neural Information
[119] D. Guan, Y. Cao, J. Liang, Y. Cao, and M. Y. Yang, “Fusion of
Processing Systems, 2018, pp. 826–836.
multispectral data through illumination-aware deep neural networks for
pedestrian detection,” Information Fusion, vol. 50, pp. 148–157, 2019. [144] A. Asvadi, L. Garrote, C. Premebida, P. Peixoto, and U. J. Nunes,
[120] O. Mees, A. Eitel, and W. Burgard, “Choosing smartly: Adaptive “DepthCN: Vehicle detection using 3d-lidar and ConvNet,” in IEEE
multimodal fusion for object detection in changing environments,” in 20th Int. Conf. Intelligent Transportation Systems, 2017.
IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2016, pp. 151– [145] C. Premebida, L. Garrote, A. Asvadi, A. P. Ribeiro, and U. Nunes,
156. “High-resolution lidar-based depth mapping using bilateral filter,” in
[121] B. Yang, M. Liang, and R. Urtasun, “HDNET: Exploiting HD maps IEEE 19th Int. Conf. Intelligent Transportation Systems, Nov. 2016,
for 3D object detection,” in Proc. 2nd Annu. Conf. Robot Learning, pp. 2469–2474.
2018, pp. 146–155. [146] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom,
[122] S. Casas, W. Luo, and R. Urtasun, “IntentNet: Learning to predict “PointPillars: Fast encoders for object detection from point clouds,” in
intention from raw sensor data,” in Proc. 2nd Annu. Conf. Robot Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018.
Learning, 2018, pp. 947–956.
[123] D.-K. Kim, D. Maturana, M. Uenoyama, and S. Scherer, "Season-invariant semantic segmentation with a deep multimodal network," in Field and Service Robotics. Springer, 2018, pp. 255–270.
[124] Y. Sun, W. Zuo, and M. Liu, "RTFNet: Rgb-thermal fusion network for semantic segmentation of urban scenes," IEEE Robotics and Automation Letters, 2019.
[125] A. Valada, G. L. Oliveira, T. Brox, and W. Burgard, "Deep multispectral semantic scene understanding of forested environments using multimodal fusion," in Int. Symp. Experimental Robotics. Springer, 2016, pp. 465–477.
[126] A. Valada, J. Vertens, A. Dhall, and W. Burgard, "AdapNet: Adaptive semantic segmentation in adverse environmental conditions," in IEEE Int. Conf. Robotics and Automation, 2017, pp. 4644–4651.
[127] A. Valada, R. Mohan, and W. Burgard, "Self-supervised model adaptation for multimodal semantic segmentation," Int. J. Computer Vision, 2018.
[128] F. Yang, J. Yang, Z. Jin, and H. Wang, "A fusion model for road detection based on deep learning and fully connected CRF," in 13th Annu. Conf. System of Systems Engineering. IEEE, 2018, pp. 29–36.
[129] L. Caltagirone, M. Bellone, L. Svensson, and M. Wahde, "Lidar-camera fusion for road detection using fully convolutional neural networks," Robotics and Autonomous Systems, vol. 111, pp. 125–131, 2019.
[130] X. Lv, Z. Liu, J. Xin, and N. Zheng, "A novel approach for detecting road based on two-stream fusion fully convolutional network," in IEEE Intelligent Vehicles Symp., 2018, pp. 1464–1469.
[131] F. Wulff, B. Schäufele, O. Sawade, D. Becker, B. Henke, and I. Radusch, "Early fusion of camera and lidar for robust road detection based on U-Net FCN," in IEEE Intelligent Vehicles Symp., 2018, pp. 1426–1431.
[132] Z. Chen, J. Zhang, and D. Tao, "Progressive lidar adaptation for road detection," IEEE/CAA Journal of Automatica Sinica, vol. 6, no. 3, pp. 693–702, 2019.
[133] F. Piewak et al., "Boosting lidar-based semantic labeling by cross-modal training data generation," in Workshop Proc. Eur. Conf. Computer Vision, 2018.
[134] S. Chadwick, W. Maddern, and P. Newman, "Distant vehicle detection using radar and vision," in IEEE Int. Conf. Robotics and Automation, 2019.
[135] Y. Zhou and O. Tuzel, "VoxelNet: End-to-end learning for point cloud based 3d object detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018.
[136] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, "Vote3Deep: Fast object detection in 3d point clouds using efficient convolutional neural networks," in IEEE Int. Conf. Robotics and Automation, 2017, pp. 1355–1361.
[147] T. Roddick, A. Kendall, and R. Cipolla, "Orthographic feature transform for monocular 3d object detection," in Proc. British Machine Vision Conf., 2019.
[148] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Weinberger, "Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2019.
[149] Y. You et al., "Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving," arXiv preprint arXiv:1906.06310, 2019.
[150] M. Liang, B. Yang, S. Wang, and R. Urtasun, "Deep continuous fusion for multi-sensor 3d object detection," in Proc. Eur. Conf. Computer Vision, 2018, pp. 641–656.
[151] K. Werber et al., "Automotive radar gridmap representations," in IEEE MTT-S Int. Conf. Microwaves for Intelligent Mobility, 2015, pp. 1–4.
[152] J. Lombacher, M. Hahn, J. Dickmann, and C. Wöhler, "Potential of radar for static object classification using deep learning methods," in IEEE MTT-S Int. Conf. Microwaves for Intelligent Mobility, 2016, pp. 1–4.
[153] J. Lombacher, K. Laudt, M. Hahn, J. Dickmann, and C. Wöhler, "Semantic radar grids," in IEEE Intelligent Vehicles Symp., 2017, pp. 1170–1175.
[154] T. Visentin, A. Sagainov, J. Hasch, and T. Zwick, "Classification of objects in polarimetric radar images using cnns at 77 ghz," in 2017 IEEE Asia Pacific Microwave Conference (APMC). IEEE, 2017, pp. 356–359.
[155] S. Kim, S. Lee, S. Doo, and B. Shim, "Moving target classification in automotive radar systems using convolutional recurrent neural networks," in 26th Eur. Signal Processing Conf. IEEE, 2018, pp. 1482–1486.
[156] M. G. Amin and B. Erol, "Understanding deep neural networks performance for radar-based human motion recognition," in IEEE Radar Conf., 2018, pp. 1461–1465.
[157] O. Schumann, M. Hahn, J. Dickmann, and C. Wöhler, "Semantic segmentation on radar point clouds," in Proc. 21st Int. Conf. Information Fusion. IEEE, 2018, pp. 2179–2186.
[158] C. Wöhler, O. Schumann, M. Hahn, and J. Dickmann, "Comparison of random forest and long short-term memory network performances in classification tasks using radar," in Sensor Data Fusion: Trends, Solutions, Applications. IEEE, 2017, pp. 1–6.
[159] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, pp. 79–87, 1991.
[160] D. Eigen, M. Ranzato, and I. Sutskever, "Learning factored representations in a deep mixture of experts," in Workshop Proc. Int. Conf. Learning Representations, 2014.
[161] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison, "Codeslam: learning a compact, optimisable representation for dense visual slam," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 2560–2568.
[162] J. Wang, Z. Wei, T. Zhang, and W. Zeng, "Deeply-fused nets," arXiv:1605.07716 [cs.CV], 2016.
[163] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[164] J. Ngiam et al., "Starnet: Targeted computation for object detection in point clouds," arXiv preprint arXiv:1908.11069, 2019.
[165] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, "Virtual worlds as proxy for multi-object tracking analysis," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016.
[166] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, "Playing for data: Ground truth from computer games," in Proc. Eur. Conf. Computer Vision, 2016, pp. 102–118.
[167] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 3234–3243.
[168] S. R. Richter, Z. Hayder, and V. Koltun, "Playing for benchmarks," in Proc. IEEE Conf. Computer Vision, Oct. 2017, pp. 2232–2241.
[169] X. Yue, B. Wu, S. A. Seshia, K. Keutzer, and A. L. Sangiovanni-Vincentelli, "A lidar point cloud generator: from a virtual world to autonomous driving," in Proc. ACM Int. Conf. Multimedia Retrieval. ACM, 2018, pp. 458–464.
[170] M. Wrenninge and J. Unger, "Synscapes: A photorealistic synthetic dataset for street scene parsing," arXiv preprint arXiv:1810.08705, 2018.
[171] P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang, "Deepmvs: Learning multi-view stereopsis," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018.
[172] D. Griffiths and J. Boehm, "SynthCity: A large scale synthetic point cloud," in ArXiv preprint, 2019.
[173] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in Proc. 1st Annu. Conf. Robot Learning, 2017, pp. 1–16.
[174] B. Hurl, K. Czarnecki, and S. Waslander, "Precise synthetic image and lidar (presil) dataset for autonomous vehicle perception," arXiv preprint arXiv:1905.00160, 2019.
[175] J. Lee, S. Walsh, A. Harakeh, and S. L. Waslander, "Leveraging pre-trained 3d object detection models for fast ground truth generation," in 21st Int. Conf. Intelligent Transportation Systems. IEEE, Nov. 2018, pp. 2504–2510.
[176] J. Mei, B. Gao, D. Xu, W. Yao, X. Zhao, and H. Zhao, "Semantic segmentation of 3d lidar data in dynamic scene using semi-supervised learning," IEEE Trans. Intell. Transp. Syst., 2018.
[177] R. Mackowiak, P. Lenz, O. Ghori, F. Diego, O. Lange, and C. Rother, "CEREALS – cost-effective region-based active learning for semantic segmentation," in Proc. British Machine Vision Conf., 2018.
[178] S. Roy, A. Unmesh, and V. P. Namboodiri, "Deep active learning for object detection," in Proc. British Machine Vision Conf., 2018, p. 91.
[179] D. Feng, X. Wei, L. Rosenbaum, A. Maki, and K. Dietmayer, "Deep active learning for efficient training of a lidar 3d object detector," in IEEE Intelligent Vehicles Symp., 2019.
[180] S. J. Pan, Q. Yang et al., "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, 2010.
[181] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa, "Visual domain adaptation: A survey of recent advances," IEEE Signal Process. Mag., vol. 32, no. 3, pp. 53–69, 2015.
[182] Y. Chen, W. Li, X. Chen, and L. V. Gool, "Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2019, pp. 1841–1850.
[183] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, "Domain adaptive faster r-cnn for object detection in the wild," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 3339–3348.
[184] K.-H. Lee, G. Ros, J. Li, and A. Gaidon, "Spigan: Privileged adversarial learning from simulation," in Proc. Int. Conf. Learning Representations, 2019.
[185] J. Tremblay et al., "Training deep networks with synthetic data: Bridging the reality gap by domain randomization," in Workshop Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 969–977.
[186] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, "Semi-supervised learning with deep generative models," in Advances in Neural Information Processing Systems, 2014, pp. 3581–3589.
[187] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, "Continual lifelong learning with neural networks: A review," Neural Networks, 2019.
[188] Y. Wang et al., "Iterative learning with open-set noisy labels," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 8688–8696.
[189] M. Ren, W. Zeng, B. Yang, and R. Urtasun, "Learning to reweight examples for robust deep learning," in Int. Conf. Machine Learning, 2018.
[190] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei, "MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels," in Int. Conf. Machine Learning, 2018, pp. 2309–2318.
[191] X. Ma et al., "Dimensionality-driven learning with noisy labels," in Int. Conf. Machine Learning, 2018, pp. 3361–3370.
[192] A. Zlateski, R. Jaroensri, P. Sharma, and F. Durand, "On the importance of label quality for semantic segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 1479–1487.
[193] P. Meletis and G. Dubbelman, "On boosting semantic street scene segmentation with weak supervision," in IEEE Intelligent Vehicles Symp., 2019.
[194] C. Haase-Schütz, H. Hertlein, and W. Wiesbeck, "Estimating labeling quality with deep object detectors," in IEEE Intelligent Vehicles Symp., June 2019, pp. 33–38.
[195] S. Chadwick and P. Newman, "Training object detectors with noisy data," in IEEE Intelligent Vehicles Symp., June 2019, pp. 1319–1325.
[196] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," IEEE Trans. Pattern Anal. Mach. Intell., 2018.
[197] M. Giering, V. Venugopalan, and K. Reddy, "Multi-modal sensor registration for vehicle perception via deep neural networks," in IEEE High Performance Extreme Computing Conf., 2015, pp. 1–6.
[198] N. Schneider, F. Piewak, C. Stiller, and U. Franke, "RegNet: Multimodal sensor registration using deep neural networks," in IEEE Intelligent Vehicles Symp., 2017, pp. 1803–1810.
[199] S. Ramos, S. Gehrig, P. Pinggera, U. Franke, and C. Rother, "Detecting unexpected obstacles for self-driving cars: Fusing deep learning and geometric modeling," in IEEE Intelligent Vehicles Symp., 2017, pp. 1025–1032.
[200] H. Banzhaf, M. Dolgov, J. Stellet, and J. M. Zöllner, "From footprints to beliefprints: Motion planning under uncertainty for maneuvering automated vehicles in dense scenarios," in 21st Int. Conf. Intelligent Transportation Systems. IEEE, 2018, pp. 1680–1687.
[201] D. J. C. MacKay, "A practical Bayesian framework for backpropagation networks," Neural Computation, vol. 4, no. 3, pp. 448–472, 1992.
[202] G. E. Hinton and D. Van Camp, "Keeping the neural networks simple by minimizing the description length of the weights," in Proc. 6th Annu. Conf. Computational Learning Theory. ACM, 1993, pp. 5–13.
[203] Y. Gal, "Uncertainty in deep learning," Ph.D. dissertation, University of Cambridge, 2016.
[204] A. Graves, "Practical variational inference for neural networks," in Advances in Neural Information Processing Systems, 2011, pp. 2348–2356.
[205] S. Mandt, M. D. Hoffman, and D. M. Blei, "Stochastic gradient descent as approximate Bayesian inference," J. Machine Learning Research, vol. 18, no. 1, pp. 4873–4907, 2017.
[206] M. Teye, H. Azizpour, and K. Smith, "Bayesian uncertainty estimation for batch normalized deep networks," in Int. Conf. Machine Learning, 2018.
[207] J. Postels, F. Ferroni, H. Coskun, N. Navab, and F. Tombari, "Sampling-free epistemic uncertainty estimation using approximated variance propagation," in Proc. IEEE Conf. Computer Vision, 2019.
[208] A. Kendall, V. Badrinarayanan, and R. Cipolla, "Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding," in Proc. British Machine Vision Conf., 2017.
[209] D. Miller, L. Nicholson, F. Dayoub, and N. Sünderhauf, "Dropout sampling for robust object detection in open-set conditions," in IEEE Int. Conf. Robotics and Automation, 2018.
[210] D. Miller, F. Dayoub, M. Milford, and N. Sünderhauf, "Evaluating merging strategies for sampling-based uncertainty techniques in object detection," in IEEE Int. Conf. Robotics and Automation, 2018.
[211] A. Kendall and Y. Gal, "What uncertainties do we need in Bayesian deep learning for computer vision?" in Advances in Neural Information Processing Systems, 2017, pp. 5574–5584.
[212] E. Ilg et al., "Uncertainty estimates and multi-hypotheses networks for optical flow," in Proc. Eur. Conf. Computer Vision, 2018.
[213] D. Feng, L. Rosenbaum, and K. Dietmayer, "Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection," in 21st Int. Conf. Intelligent Transportation Systems, Nov. 2018, pp. 3266–3273.
[214] D. Feng, L. Rosenbaum, F. Timm, and K. Dietmayer, "Leveraging heteroscedastic aleatoric uncertainties for robust real-time lidar 3d object detection," in IEEE Intelligent Vehicles Symp., 2019.
[215] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, "Lasernet: An efficient probabilistic 3d object detector for autonomous driving," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2019, pp. 12 677–12 686.
[216] S. Wirges, M. Reith-Braun, M. Lauer, and C. Stiller, "Capturing object detection uncertainty in multi-layer grid maps," in IEEE Intelligent Vehicles Symp., 2019.
[217] M. T. Le, F. Diehl, T. Brunner, and A. Knol, "Uncertainty estimation for deep neural object detectors in safety-critical applications," in IEEE 21st Int. Conf. Intelligent Transportation Systems. IEEE, 2018, pp. 3873–3878.
[218] D. Feng, L. Rosenbaum, C. Gläser, F. Timm, and K. Dietmayer, "Can we trust you? on calibration of a probabilistic object detector for autonomous driving," arXiv:1909.12358 [cs.RO], 2019.
[219] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Int. Conf. Learning Representations, 2014.
[220] I. Goodfellow et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[221] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," in Int. Conf. Machine Learning, 2016, pp. 1558–1566.
[222] A. Deshpande, J. Lu, M.-C. Yeh, M. J. Chong, and D. A. Forsyth, "Learning diverse image colorization," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 2877–2885.
[223] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017, pp. 5967–5976.
[224] T. A. Wheeler, M. Holder, H. Winner, and M. J. Kochenderfer, "Deep stochastic radar models," in IEEE Intelligent Vehicles Symp., 2017, pp. 47–53.
[225] X. Han, J. Lu, C. Zhao, S. You, and H. Li, "Semi-supervised and weakly-supervised road detection based on generative adversarial networks," IEEE Signal Process. Lett., 2018.
[226] J. L. Elman, "Learning and development in neural networks: The importance of starting small," Cognition, vol. 48, no. 1, pp. 71–99, 1993.
[227] J. Feng and T. Darrell, "Learning the structure of deep convolutional networks," in Proc. IEEE Conf. Computer Vision, 2015, pp. 2749–2757.
[228] D. Ramachandram, M. Lisicki, T. J. Shields, M. R. Amer, and G. W. Taylor, "Structure optimization for deep multimodal fusion networks using graph-induced kernels," in 25th Eur. Symp. Artificial Neural Networks, Computational Intelligence and Machine Learning, 2017.
[229] D. Whitley, T. Starkweather, and C. Bogart, "Genetic algorithms and neural networks: Optimizing connections and connectivity," Parallel Computing, vol. 14, no. 3, pp. 347–361, 1990.
[230] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv:1611.01578 [cs.LG], 2016.
[231] P. Kulkarni, J. Zepeda, F. Jurie, P. Pérez, and L. Chevallier, "Learning the structure of deep architectures using l1 regularization," in Proc. British Machine Vision Conf., 2015.
[232] C. Murdock, Z. Li, H. Zhou, and T. Duerig, "Blockout: Dynamic model selection for hierarchical deep networks," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 2583–2591.
[233] F. Li, N. Neverova, C. Wolf, and G. Taylor, "Modout: Learning multi-modal architectures by stochastic regularization," in 12th IEEE Int. Conf. Automatic Face & Gesture Recognition, 2017, pp. 422–429.
[234] A. Bilal, A. Jourabloo, M. Ye, X. Liu, and L. Ren, "Do convolutional neural networks learn class hierarchy?" IEEE Trans. Vis. Comput. Graphics, vol. 24, no. 1, pp. 152–162, 2018.
[235] M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu, "Towards better analysis of deep convolutional neural networks," IEEE Trans. Vis. Comput. Graphics, vol. 23, no. 1, pp. 91–100, 2017.
[236] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," in Int. Conf. Learning Representations, 2015.
[237] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv:1704.04861 [cs.CV], 2017.
[238] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, "A survey of model compression and acceleration for deep neural networks," IEEE Signal Process. Mag., 2017.
[239] L. Enderich, F. Timm, L. Rosenbaum, and W. Burgard, "Learning multimodal fixed-point weights using gradient descent," in 27th Eur. Symp. Artificial Neural Networks, Computational Intelligence and Machine Learning, 2019.
[240] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[241] A. P. Dawid, "The well-calibrated Bayesian," J. American Statistical Association, vol. 77, no. 379, pp. 605–610, 1982.
[242] S. Liang, Y. Li, and R. Srikant, "Enhancing the reliability of out-of-distribution image detection in neural networks," in Int. Conf. Learning Representations, 2017.
[243] D. Hall, F. Dayoub, J. Skinner, P. Corke, G. Carneiro, and N. Sünderhauf, "Probability-based detection quality (PDQ): A probabilistic approach to detection evaluation," arXiv:1811.10800 [cs.CV], 2018.
[244] W. Luo, B. Yang, and R. Urtasun, "Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 3569–3577.
[245] M. Bojarski et al., "End to end learning for self-driving cars," arXiv:1604.07316 [cs.CV], 2016.
[246] G.-H. Liu, A. Siravuru, S. Prabhakar, M. Veloso, and G. Kantor, "Learning end-to-end multimodal sensor policies for autonomous navigation," in Proc. 1st Annu. Conf. Robot Learning, 2017, pp. 249–261.
[247] M. Bansal, A. Krizhevsky, and A. Ogale, "Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst," in Proc. Robotics: Science and Systems, 2019.
[248] A. Sauer, N. Savinov, and A. Geiger, "Conditional affordance learning for driving in urban environments," in Proc. 2nd Annu. Conf. Robot Learning, 2018, pp. 237–252.
[249] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "DeepDriving: Learning affordance for direct perception in autonomous driving," in Proc. IEEE Conf. Computer Vision, 2015, pp. 2722–2730.
[250] B. Yang, W. Luo, and R. Urtasun, "PIXOR: Real-time 3d object detection from point clouds," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 7652–7660.
[251] Ö. Erkent, C. Wolf, C. Laugier, D. S. González, and V. R. Cano, "Semantic grid estimation with a hybrid bayesian and deep neural network approach," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 888–895.
[252] S. Gu, T. Lu, Y. Zhang, J. Alvarez, J. Yang, and H. Kong, "3D lidar + monocular camera: an inverse-depth induced fusion framework for urban road detection," IEEE Transactions on Intelligent Vehicles, 2018.
[253] Y. Cai, D. Li, X. Zhou, and X. Mou, "Robust drivable road region detection for fixed-route autonomous vehicles using map-fusion images," Sensors, vol. 18, no. 12, p. 4158, 2018.

Di Feng (Member, IEEE) is currently pursuing his doctoral degree in the Corporate Research of Robert Bosch GmbH, Renningen, in cooperation with the Ulm University. He finished his master's degree with distinction in electrical and computer engineering at the Technical University of Munich. During his studies, he was granted the opportunity to work in several teams with reputable companies and research institutes such as BMW AG, German Aerospace Center (DLR), and the Institute for Cognitive Systems (ICS) at the Technical University of Munich. He received the bachelor's degree in mechatronics with honor from Tongji University. His current research is centered on robust multi-modal object detection using deep learning approaches for autonomous driving. He is also interested in robotic active learning and exploration through tactile sensing and cognitive systems.
Christian Haase-Schütz (Member, IEEE) is currently pursuing his PhD degree at Chassis Systems Control, Robert Bosch GmbH, Abstatt, in cooperation with the Karlsruhe Institute of Technology. Before joining Bosch, he finished his master's degree in physics at the Friedrich-Alexander-University Erlangen-Nuremberg. He did his thesis with the Center for Medical Physics. During his master studies he was granted a scholarship by the Bavarian state to visit Huazhong University of Science and Technology, Wuhan, China, from March 2015 till July 2015. He received his bachelor's degree in physics from the University of Stuttgart in 2013, where he did his thesis with the Max Planck Institute for Intelligent Systems. His professional experience includes work with ETAS GmbH, Stuttgart, and Andreas Stihl AG, Waiblingen. His current research is centered on multi-modal object detection using deep learning approaches for autonomous driving. He is further interested in challenges of AI systems in the wild. Christian Haase-Schütz is a member of the IEEE and the German Physical Society DPG.

Fabian Timm studied Computer Science at the University of Lübeck, Germany. In 2006 he did his diploma thesis at Philips Lighting Systems in Aachen, Germany. Afterwards he started his PhD, also at Philips Lighting Systems, in the field of Machine Vision and Machine Learning and finished it in 2011 at the University of Lübeck, Institute for Neuro- and Bioinformatics. In 2012 he joined corporate research at Robert Bosch GmbH and worked on industrial image processing and machine learning. Afterwards he worked in the business unit at Bosch and developed new perception algorithms, such as pedestrian and cyclist protection with only a single radar sensor. Since 2018 he leads the research group "automated driving perception and sensors" at Bosch corporate research. His main research interests are machine and deep learning, signal processing, and sensors for automated driving.
TABLE II: OVERVIEW OF MULTI-MODAL DATASETS

Name | Sensing Modalities | Year (published) | Labelled (benchmark) | Recording area | Size | Categories / Remarks | Link
nuScenes dataset [89] | Visual cameras (6), 3D LiDAR, and Radars (5) | 2019 | 3D bounding box | Boston, Singapore | 1000 scenes, 1.4M frames (camera, Radar), 390k frames (3D LiDAR) | 25 object classes, such as Car / Van / SUV, different Trucks, Buses, Persons, Animal, Traffic Cone, Temporary Traffic Barrier, Debris, etc. | https://www.nuscenes.org/download
BLVD [80] | Visual (Stereo) camera, 3D LiDAR | 2019 | 3D bounding box, Tracking, Interaction, Intention | Changshu | 120k frames, 249,129 objects | Vehicle, Pedestrian, Rider during day and night | https://github.com/VCCIV/BLVD/
H3D dataset [79] | Visual cameras (3), 3D LiDAR | 2019 | 3D bounding box | San Francisco, Mountain View, Santa Cruz, San Mateo | 27,721 frames, 1,071,302 objects | Car, Pedestrian, Cyclist, Truck, Misc, Animals, Motorcyclist, Bus | https://usa.honda-ri.com/hdd/introduction/h3d
ApolloScape [60] | Visual (Stereo) camera, 3D LiDAR, GNSS, and inertial sensors | 2018, 2019 | 2D/3D pixel-level segmentation, lane marking, instance segmentation, depth | Multiple areas in China | 143,906 image frames, 89,430 objects | Rover, Sky, Car, Motobicycle, Bicycle, Person, Rider, Truck, Bus, Tricycle, Road, Sidewalk, Traffic Cone, Road Pile, Fence, Traffic Light, Pole, Traffic Sign, Wall, Dustbin, Billboard, Building, Bridge, Tunnel, Overpass, Vegetation | http://apolloscape.auto/scene.html
DBNet Dataset [78] | 3D LiDAR, Dashboard visual camera, GNSS | 2018 | Driving behaviours (Vehicle speed and wheel angles) | Multiple areas in China | Over 10k frames | In total seven datasets with different test scenarios, such as seaside roads, school areas, mountain roads | http://www.dbehavior.net/
KAIST multispectral dataset [93] | Visual (Stereo) and thermal camera, 3D LiDAR, GNSS, and inertial sensors | 2018 | 2D bounding box, drivable region, image enhancement, depth, and colorization | Seoul | 7,512 frames, 308,913 objects | Person, Cyclist, Car during day and night, fine time slots (sunrise, afternoon, ...) | http://multispectral.kaist.ac.kr
Multi-spectral Object Detection dataset [91] | Visual and thermal cameras | 2017 | 2D bounding box | University environment in Japan | 7,512 frames, 5,833 objects | Bike, Car, Car Stop, Color Cone, Person during day and night | https://www.mi.t.u-tokyo.ac.jp/static/projects/mil_multispectral/
Multi-spectral Semantic Segmentation dataset [92] | Visual and thermal camera | 2017 | 2D pixel-level segmentation | n.a. | 1569 frames | Bike, Car, Person, Curve, Guardrail, Color Cone, Bump during day and night | https://www.mi.t.u-tokyo.ac.jp/static/projects/mil_multispectral/
Multi-modal Panoramic 3D Outdoor (MPO) dataset [77] | Visual camera, LiDAR, and GNSS | 2016 | Place categorization | Fukuoka | 650 scans (dense), 34200 scans (sparse) | No dynamic objects | http://robotics.ait.kyushu-u.ac.jp/~kurazume/research-e.php?content=db#d08
KAIST multispectral pedestrian [90] | Visual and thermal camera | 2015 | 2D bounding box | Seoul | 95,328 frames, 103,128 objects | Person, People, Cyclist during day and night | https://sites.google.com/site/pedestrianbenchmark/home
KITTI [6], [75] | Visual (Stereo) camera, 3D LiDAR, GNSS, and inertial sensors | 2012, 2013, 2015 | 2D/3D bounding box, visual odometry, road, optical flow, tracking, depth, 2D instance and pixel-level segmentation | Karlsruhe | 7481 frames (training), 80,256 objects | Car, Van, Truck, Pedestrian, Person (sitting), Cyclist, Tram, Misc | http://www.cvlibs.net/datasets/kitti/
The Málaga Stereo and Laser Urban dataset [76] | Visual (Stereo) camera, 5× 2D LiDAR (yielding 3D information), GNSS and inertial sensors | 2014 | no | Málaga | 113,082 frames, 5,654.6 s (camera); >220,000 frames, 5,000 s (LiDARs) | n.a. | https://www.mrpt.org/MalagaUrbanDataset
TABLE III: SUMMARY OF MULTI-MODAL OBJECT DETECTION METHODS

Reference | Sensors | Obj Type | Sensing Modality Representations and Processing | Network Pipeline | How to generate Region Proposals (RP)^a | When to fuse | Fusion Operation and Method | Fusion Level^b | Dataset(s) used
Liang et al., 2019 [116] | LiDAR, visual camera | 3D Car, Pedestrian, Cyclist | LiDAR BEV maps, RGB image. Each processed by a ResNet with auxiliary tasks: depth estimation and ground segmentation | Faster R-CNN | Predictions with fused features | Before RP | Addition, continuous fusion layer | Middle | KITTI, self-recorded
Wang et al., 2019 [115] | LiDAR, visual camera | 3D Car, Pedestrian, Cyclist, Indoor objects | LiDAR voxelized frustum (each frustum processed by PointNet), RGB image (using a pre-trained detector) | R-CNN | Pre-trained RGB image detector | After RP | Using RP from RGB image detector to build LiDAR frustums | Late | KITTI, SUN-RGBD
Dou et al., 2019 [114] | LiDAR, visual camera | 3D Car | LiDAR voxel (processed by VoxelNet), RGB image (processed by a FCN to get semantic features) | Two stage detector | Predictions with fused features | Before RP | Feature concatenation | Middle | KITTI
Sindagi et al., 2019 [113] | LiDAR, visual camera | 3D Car | LiDAR voxel (processed by VoxelNet), RGB image (processed by a pre-trained 2D image detector) | One stage detector | Predictions with fused features | Before RP | Feature concatenation | Early, Middle | KITTI
Bijelic et al., 2019 [112] | LiDAR, visual camera | 2D Car in foggy weather | LiDAR front-view images (depth, intensity, height), RGB image. Each processed by VGG16 | SSD | Predictions with fused features | Before RP | Feature concatenation | From early to middle layers | Self-recorded datasets focused on foggy weather, simulated foggy images from KITTI
Chadwick et al., 2019 [134] | Radar, visual camera | 2D Vehicle | Radar range and velocity maps, RGB image. Each processed by ResNet | One stage detector | Predictions with fused features | Before RP | Addition, feature concatenation | Middle | Self-recorded
Liang et al., 2018 [150] | LiDAR, visual camera | 3D Car, Pedestrian, Cyclist | LiDAR BEV maps, RGB image. Each processed by ResNet | One stage detector | Predictions with fused features | Before RP | Addition, continuous fusion layer | Middle | KITTI, self-recorded
Du et al., 2018 [107] | LiDAR, visual camera | 3D Car | LiDAR voxel (processed by RANSAC and model fitting), RGB image (processed by VGG16 and GoogLeNet) | R-CNN | Pre-trained RGB image detector produces 2D bounding boxes to crop LiDAR points, which are then clustered | Before and at RP | Ensemble: use RGB image detector to regress car dimensions for a model fitting algorithm | Late | KITTI, self-recorded data
Kim et al., 2018 [110] | LiDAR, visual camera | 2D Car | LiDAR front-view depth image, RGB image. Each input processed by VGG16 | SSD | SSD with fused features | Before RP | Feature concatenation, Mixture of Experts | Middle | KITTI
Yang et al., 2018 [121] | LiDAR, HD-map | 3D Car | LiDAR BEV maps, road mask image from HD map. Inputs processed by PIXOR++ [250] with a backbone similar to FPN | One stage detector | Detector predictions | Before RP | Feature concatenation | Early | KITTI, TOR4D Dataset [250]
Pfeuffer et al., 2018 [111] | LiDAR, visual camera | Multiple 2D objects | LiDAR spherical, front-view sparse depth, and dense depth images, RGB image. Each processed by VGG16 | Faster R-CNN | RPN from fused features | Before RP | Feature concatenation | Early, Middle, Late | KITTI
Casas et al., 2018 [122]^c | LiDAR, HD-map | 3D Car | Sequential LiDAR BEV maps, several sequential road topology mask images from HD map. Each input processed by a base network with residual blocks | One stage detector | Detector predictions | Before RP | Feature concatenation | Middle | self-recorded data
Guan et al., 2018 [119] | Visual camera, thermal camera | 2D Pedestrian | RGB image, thermal image. Each processed by a base network built on VGG16 | Faster R-CNN | RPN with fused features | Before and after RP | Feature concatenation, Mixture of Experts | Early, Middle, Late | KAIST Pedestrian Dataset
Shin et al., 2018 [141] | LiDAR, visual camera | 3D Car | LiDAR point clouds (processed by PointNet [139]), RGB image (processed by a 2D CNN) | R-CNN | A 3D object detector for the RGB image | After RP | Using RP from RGB image detector to search LiDAR point clouds | Late | KITTI
Schneider et al., 2017 [61] | Visual camera | Multiple 2D objects | RGB image (processed by GoogLeNet), depth image from stereo camera (processed by NiN net) | SSD | SSD predictions | Before RP | Feature concatenation | Early, Middle, Late | Cityscape
Takumi et al., 2017 [91] | Visual camera, thermal camera | Multiple 2D objects | RGB, NIR, MIR, FIR images. Each processed by YOLO | YOLO | YOLO predictions for each spectral image | After RP | Ensemble: ensemble the final predictions of each YOLO detector | Late | self-recorded data
Chen et al., 2017 [98] | LiDAR, visual camera | 3D Car | LiDAR BEV and spherical maps, RGB image. Each processed by a base network built on VGG16 | Faster R-CNN | A RPN from the LiDAR BEV map | After RP | Average mean, deep fusion | Early, Middle, Late | KITTI
Asvadi et al., 2017 [99] | LiDAR, visual camera | 2D Car | LiDAR front-view dense-depth (DM) and reflectance maps (RM), RGB image. Each processed through a YOLO net | YOLO | YOLO outputs for LiDAR DM and RM maps and RGB image | After RP | Ensemble: feed engineered features from ensembled bounding boxes to a network to predict scores for NMS | Late | KITTI
Oh et al., 2017 [100] | LiDAR, visual camera | 2D Car, Pedestrian, Cyclist | LiDAR front-view dense-depth map (for fusion: processed by VGG16), LiDAR voxel (for ROIs: segmentation and region growing), RGB image (for fusion: processed by VGG16; for ROIs: segmentation and grouping) | R-CNN | LiDAR voxel and RGB image separately | After RP | Association matrix using basic belief assignment | Late | KITTI
Wang et al., 2017 [102] | LiDAR, visual camera | 3D Car, Pedestrian | LiDAR BEV map, RGB image. Each processed by a RetinaNet [196] | One stage detector | Fused LiDAR and RGB image features extracted from CNN | Before RP | Sparse mean manipulation | Middle | KITTI
Ku et al., 2017 [103] | LiDAR, visual camera | 3D Car, Pedestrian, Cyclist | LiDAR BEV map, RGB image. Each processed by VGG16 | Faster R-CNN | Fused LiDAR and RGB image features extracted from CNN | Before and after RP | Average mean | Early, Middle, Late | KITTI
Xu et al., 2017 [104] | LiDAR, visual camera | 3D Car, Pedestrian, Cyclist, Indoor objects | LiDAR points (processed by PointNet), RGB image (processed by ResNet) | R-CNN | Pre-trained RGB image detector | After RP | Feature concatenation for local and global features | Middle | KITTI, SUN-RGBD
Qi et al., 2017 [105] | LiDAR, visual camera | 3D Car, Pedestrian, Cyclist, Indoor objects | LiDAR points (processed by PointNet), RGB image (using a pre-trained detector) | R-CNN | Pre-trained RGB image detector | After RP | Feature concatenation | Middle, Late | KITTI, SUN-RGBD
Du et al., 2017 [106] | LiDAR, visual camera | 2D Car | LiDAR voxel (processed by RANSAC and model fitting), RGB image (processed by VGG16 and GoogLeNet) | Faster R-CNN | First clustered by LiDAR point clouds, then fine-tuned by a RPN on the RGB image | Before RP | Ensemble: feed LiDAR RP to RGB image-based CNN for final prediction | Late | KITTI
Matti et al., 2017 [108] | LiDAR, visual camera | 2D Pedestrian | LiDAR points (clustering with DBSCAN), RGB image (processed by ResNet) | R-CNN | Clustered by LiDAR point clouds, then size and ratio corrected on the RGB image | Before and at RP | Ensemble: feed LiDAR RP to RGB image-based CNN for final prediction | Late | KITTI
Kim et al., 2016 [109] | LiDAR, visual camera | 2D Pedestrian, Cyclist | LiDAR front-view depth image, RGB image. Each processed by a Fast R-CNN network [37] | Fast R-CNN | Selective search for LiDAR and RGB image separately | At RP | Ensemble: joint RP are fed to an RGB image based CNN | Late | KITTI
Mees et al., 2016 [120] | RGB-D camera | 2D Pedestrian | RGB image, depth image from depth camera, optical flow. Each processed by GoogLeNet | Fast R-CNN | Dense multi-scale sliding window for RGB image | After RP | Mixture of Experts | Late | RGB-D People Unihall Dataset, InOutDoor RGB-D People Dataset
Wagner et al., 2016 [117] | Visual camera, thermal camera | 2D Pedestrian | RGB image, thermal image. Each processed by CaffeNet | R-CNN | ACF+T+THOG detector | After RP | Feature concatenation | Early, Late | KAIST Pedestrian Dataset
Liu et al., 2016 [118] | Visual camera, thermal camera | 2D Pedestrian | RGB image, thermal image. Each processed by a NiN network | Faster R-CNN | RPN with fused (or separate) features | Before and after RP | Feature concatenation, average mean, score fusion (Cascaded CNN) | Early, Middle, Late | KAIST Pedestrian Dataset
Schlosser et al., 2016 [101] | LiDAR, visual camera | 2D Pedestrian | LiDAR HHA image, RGB image. Each processed by a small ConvNet | R-CNN | Deformable Parts Model with RGB image | After RP | Feature concatenation | Early, Middle, Late | KITTI

^a For one-stage detectors, we refer to the region proposals as the detection outputs of the network.
^b Some methods compare multiple fusion levels. We mark the fusion level with the best reported performance in bold.
^c Besides object detection, this paper also proposes intention prediction and trajectory prediction up to 3 s in a unified network (multi-task prediction).
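Most entries in the "Fusion Operation and Method" column above reduce to a small set of primitives, with feature concatenation at an early, middle, or late stage being the most common. Below is a minimal, hypothetical PyTorch-style sketch, not the implementation of any cited method: channel sizes, layer choices, and the assumption of spatially aligned camera and LiDAR BEV inputs are illustrative only. It shows how the choice of fusion stage changes where the two modality branches are joined.

```python
# Minimal sketch of fusion by feature concatenation at different stages.
# All layer sizes are illustrative assumptions, not any cited architecture.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class ConcatFusion(nn.Module):
    """Early fusion: stack raw input channels and use one shared branch.
    Middle fusion: run one branch per modality and concatenate feature maps.
    (Late fusion would instead combine per-branch predictions, e.g. class scores.)"""
    def __init__(self, stage="middle", n_classes=4):
        super().__init__()
        self.stage = stage
        cam_c, lidar_c = 3, 2  # RGB channels; assumed LiDAR BEV height + intensity
        if stage == "early":
            self.shared = nn.Sequential(conv_block(cam_c + lidar_c, 32), conv_block(32, 64))
            head_c = 64
        else:  # "middle"
            self.cam_branch = nn.Sequential(conv_block(cam_c, 32), conv_block(32, 64))
            self.lidar_branch = nn.Sequential(conv_block(lidar_c, 32), conv_block(32, 64))
            head_c = 128  # concatenated channels
        self.head = nn.Conv2d(head_c, n_classes, 1)  # dense per-cell class scores

    def forward(self, cam_img, lidar_bev):
        # Assumes both inputs were already projected onto the same H x W grid.
        if self.stage == "early":
            feat = self.shared(torch.cat([cam_img, lidar_bev], dim=1))
        else:
            feat = torch.cat([self.cam_branch(cam_img),
                              self.lidar_branch(lidar_bev)], dim=1)
        return self.head(feat)

# Dummy usage on an arbitrary 200 x 176 grid.
net = ConcatFusion(stage="middle")
out = net(torch.randn(1, 3, 200, 176), torch.randn(1, 2, 200, 176))  # (1, 4, 200, 176)
```

Swapping the concatenation for element-wise addition or an average mean, as several rows above do, only changes the join operation itself; the early/middle/late distinction concerns where along the two processing pipelines that join happens.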
TABLE IV: SUMMARY OF MULTI-MODAL SEMANTIC SEGMENTATION METHODS

Reference | Sensors | Semantics | Sensing Modality Representations | Fusion Operation and Method | Fusion Level^a | Dataset(s) used
Chen et al., 2019 [132] | LiDAR, visual camera | Road segmentation | RGB image, altitude difference image. Each processed by a CNN | Feature adaptation module, modified concatenation | Middle | KITTI
Valada et al., 2019 [127] | Visual camera, depth camera, thermal camera | Multiple 2D objects | RGB image, thermal image, depth image. Each processed by a FCN with ResNet backbone (AdapNet++ architecture) | Extension of Mixture of Experts | Middle | Six datasets, including Cityscape, Sun RGB-D, etc.
Sun et al., 2019 [124] | Visual camera, thermal camera | Multiple 2D objects in campus environments | RGB image, thermal image. Each processed by a base network built on ResNet | Element-wise summation in the encoder networks | Middle | Datasets published by [92]
Caltagirone et al., 2019 [129] | LiDAR, visual camera | Road segmentation | LiDAR front-view depth image, RGB image. Each input processed by a FCN | Feature concatenation (for early and late fusion), weighted addition similar to a gating network (for middle-level cross fusion) | Early, Middle, Late | KITTI
Erkent et al., 2018 [251] | LiDAR, visual camera | Multiple 2D objects | LiDAR BEV occupancy grids (processed based on Bayesian filtering and tracking), RGB image (processed by a FCN with VGG16 backbone) | Feature concatenation | Middle | KITTI, self-recorded
Lv et al., 2018 [130] | LiDAR, visual camera | Road segmentation | LiDAR BEV maps, RGB image. Each input processed by a FCN with dilated convolution operator. RGB image features are also projected onto the LiDAR BEV plane before fusion | Feature concatenation | Middle | KITTI
Wulff et al., 2018 [131] | LiDAR, visual camera | Road segmentation. Alternatives: freespace, ego-lane detection | LiDAR BEV maps, RGB image projected onto the BEV plane. Inputs processed by a FCN with U-Net | Feature concatenation | Early | KITTI
Kim et al., 2018 [123] | LiDAR, visual camera | 2D Off-road terrains | LiDAR voxel (processed by 3D convolution), RGB image (processed by ENet) | Addition | Early, Middle, Late | self-recorded data
Guan et al., 2018 [119]^b | Visual camera, thermal camera | 2D Pedestrian | RGB image, thermal image. Each processed by a base network built on VGG16 | Feature concatenation, Mixture of Experts | Early, Middle, Late | KAIST Pedestrian Dataset
Yang et al., 2018 [128] | LiDAR, visual camera | Road segmentation | LiDAR points (processed by PointNet++), RGB image (processed by a FCN with VGG16 backbone) | Optimizing a Conditional Random Field (CRF) | Late | KITTI
Gu et al., 2018 [252] | LiDAR, visual camera | Road segmentation | LiDAR front-view depth and height maps (processed by an inverse-depth histogram based line scanning strategy), RGB image (processed by a FCN) | Optimizing a Conditional Random Field | Late | KITTI
Cai et al., 2018 [253] | Satellite map with route information, visual camera | Road segmentation | Route map image, RGB image. Images are fused and processed by a FCN | Overlaying the line and curve segments in the route map onto the RGB image to generate the Map Fusion Image (MFI) | Early | self-recorded data
Ha et al., 2017 [92] | Visual camera, thermal camera | Multiple 2D objects in campus environments | RGB image, thermal image. Each processed by a FCN and mini-inception block | Feature concatenation, addition ("short-cut fusion") | Middle | self-recorded data
Valada et al., 2017 [126] | Visual camera, thermal camera | Multiple 2D objects | RGB image, thermal image, depth image. Each processed by a FCN with ResNet backbone | Mixture of Experts | Late | Cityscape, Freiburg Multispectral Dataset, Synthia
Schneider et al., 2017 [61] | Visual camera | Multiple 2D objects | RGB image (processed by GoogLeNet), depth image from stereo camera (processed by NiN net) | Feature concatenation | Early, Middle, Late | Cityscape
Valada et al., 2016 [125] | Visual camera, thermal camera | Multiple 2D objects in forested environments | RGB image, thermal image, depth image. Each processed by the UpNet (built on VGG16 and up-convolution) | Feature concatenation, addition | Early, Late | self-recorded data

^a Some methods compare multiple fusion levels. We mark the fusion level with the best reported performance in bold.
^b They also test the method on the object detection problem with different network architectures (see Table III).
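Several rows above list a "Mixture of Experts" or gating-style fusion in the spirit of [159], [160], where a small gating network predicts data-dependent weights that scale each modality branch before the weighted results are combined. A minimal, hypothetical sketch of such a gated weighting of two feature maps follows; the channel sizes and the gating design are assumptions, not the implementation of any cited method.

```python
# Minimal sketch of a gated (Mixture-of-Experts style) fusion of two modalities.
# Channel sizes and the gating design are illustrative assumptions.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Gating network: pools both feature maps and emits one logit per expert.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # global average over the spatial grid
            nn.Flatten(),
            nn.Linear(2 * channels, 2),
        )

    def forward(self, cam_feat, thermal_feat):
        # cam_feat, thermal_feat: (N, C, H, W) feature maps from the two branches.
        stacked = torch.cat([cam_feat, thermal_feat], dim=1)
        w = torch.softmax(self.gate(stacked), dim=1)   # (N, 2), weights sum to 1
        w_cam = w[:, 0].view(-1, 1, 1, 1)
        w_thermal = w[:, 1].view(-1, 1, 1, 1)
        # Weighted sum of the expert features; a segmentation head would follow.
        return w_cam * cam_feat + w_thermal * thermal_feat

fusion = GatedFusion(channels=64)
fused = fusion(torch.randn(2, 64, 48, 160), torch.randn(2, 64, 48, 160))  # (2, 64, 48, 160)
```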
TABLE V: PERFORMANCE AND RUNTIME FOR 3D OBJECT DETECTION ON KITTI TEST SET

Reference | Car (Moderate / Easy / Hard) | Pedestrian (Moderate / Easy / Hard) | Cyclist (Moderate / Easy / Hard) | Runtime | Environment
Liang et al., 2019 [116] | 76.75 % / 86.81 % / 68.41 % | 45.61 % / 52.37 % / 41.49 % | 64.68 % / 79.58 % / 57.03 % | 0.08 s | GPU @ 2.5 GHz (Python)
Wang et al., 2019 [115] | 76.51 % / 85.88 % / 68.08 % | - | - | 0.47 s | GPU @ 2.5 GHz (Python + C/C++)
Sindagi et al., 2019 [113] | 72.7 % / 83.2 % / 65.12 % | - | - | - | -
Shin et al., 2018 [141] | 73.04 % / 83.71 % / 59.16 % | - | - | - | GPU Titan X (not Pascal)
Du et al., 2018 [107] | 73.80 % / 84.33 % / 64.83 % | - | - | 0.5 s | GPU @ 2.5 GHz (Matlab + C/C++)
Liang et al., 2018 [150] | 66.22 % / 82.54 % / 64.04 % | - | - | 0.06 s | GPU @ 2.5 GHz (Python)
Ku et al., 2017 [103] | 71.88 % / 81.94 % / 66.38 % | 42.81 % / 50.80 % / 40.88 % | 52.18 % / 64.00 % / 46.61 % | 0.1 s | GPU Titan X (Pascal)
Qi et al., 2017 [105] | 70.39 % / 81.20 % / 62.19 % | 44.89 % / 51.21 % / 40.23 % | 56.77 % / 71.96 % / 50.39 % | 0.17 s | GPU @ 3.0 GHz (Python)
Chen et al., 2017 [98] | 62.35 % / 71.09 % / 55.12 % | - | - | 0.36 s | GPU @ 2.5 GHz (Python + C/C++)
TABLE VI: PERFORMANCE AND RUNTIME FOR ROAD SEGMENTATION (URBAN) ON KITTI TEST SET

Method | MaxF (max. F1-measure) | AP (average precision) | PRE (precision) | REC (recall) | FPR (false positive rate) | FNR (false negative rate) | Runtime | Environment
Chen et al., 2019 [132] | 97.03 % | 94.03 % | 97.19 % | 96.88 % | 1.54 % | 3.12 % | 0.16 s | GPU
Caltagirone et al., 2019 [129] | 96.03 % | 93.93 % | 96.23 % | 95.83 % | 2.07 % | 4.17 % | 0.15 s | GPU
Gu et al., 2018 [252] | 95.22 % | 89.31 % | 94.69 % | 95.76 % | 2.96 % | 4.24 % | 0.07 s | CPU
Lv et al., 2018 [130] | 94.48 % | 93.65 % | 94.28 % | 94.69 % | 3.17 % | 5.31 % | - | GPU Titan X
Yang et al., 2018 [128] | 91.40 % | 84.22 % | 89.09 % | 93.84 % | 6.33 % | 6.16 % | - | GPU