Deep Continuous Fusion for Multi-Sensor 3D Object Detection
M. Liang, B. Yang, S. Wang and R. Urtasun
1 Introduction
One of the fundamental problems when building perception systems for au-
tonomous driving is to be able to detect objects in 3D space. An autonomous
vehicle needs to perceive the objects present in the 3D scene from its sensors
in order to plan its motion safely. Most self-driving vehicles are equipped with
both cameras and 3D sensors. This brings potential to exploit the advantage of
different sensor modalities to conduct accurate and reliable 3D object detection.
During the past years, 2D object detection from camera images has seen signif-
icant progress [13, 12, 30, 7, 21, 29, 23, 22]. However, there is still large space for
improvement when it comes to object localization in 3D space.
LIDAR based 3D object detection has drawn much attention recently when
combined with the power of deep learning. Representative works either project
the 3D LIDAR points onto camera perspective [20], overhead view [24, 37, 19,
38], 3D volumes [39, 10] or directly conduct 3D bounding box estimation over
unordered 3D points [26]. However, these approaches suffer at long range and
when dealing with occluded objects due to the sparsity of the LIDAR returns
over these regions.
Images, on the other hand, provide dense measurements, but precise 3D lo-
calization is hard due to the loss of depth information caused by perspective
projection, particularly when using monocular cameras [3, 5]. Recently, several
approaches have tried to exploit both cameras and LIDAR jointly. In [26, 4] cam-
era view is used to generate proposals while LIDAR is used to conduct the final
3D localization. However, these cascading approaches do not exploit the capa-
bility to perform joint reasoning over multi-sensor’s inputs. As a consequence,
the 3D detection performance is bounded by the 2D image-only detection step.
Other approaches [6, 18, 8] apply 2D convolutional networks on both camera im-
age and LIDAR bird’s eye view (BEV) representations, and fuse them at the
intermediate region-wise convolutional feature map via feature concatenation.
This fusion usually happens at a coarse level, with significant resolution loss.
Thus, it remains an open problem to design 3D detectors that can better exploit
multiple modalities. The challenge lies in the fact that the LIDAR points are
sparse and continuous, while cameras capture dense features on a discrete pixel
grid; thus, fusing the two representations is non-trivial.
In this paper, we propose a 3D object detector that reasons in bird’s eye view
(BEV) and fuses image features by learning to project them into BEV space.
Towards this goal, we design an end-to-end learnable architecture that exploits
continuous convolutions to fuse image and LIDAR feature maps at different levels
of resolution. The proposed continuous fusion layer is capable of encoding dense
accurate geometric relationships between positions under the two modalities.
This enables us to design a novel, reliable and efficient 3D object detector based
on multiple sensors. Our experimental evaluation on both KITTI [11] and a large
scale 3D object detection benchmark [37] shows significant improvements over
the state of the art.
2 Related Work
Joint Camera-3D Sensor Detection: Over the past few years, many techniques
have explored both cameras and 3D sensors jointly to perform 3D reasoning.
One common practice is to perform depth image based processing, which en-
codes the 3D geometry as an additional image channel [32, 13, 15]. For instance,
[15] proposes a novel geocentric embedding for the depth image and, through
concatenating with the RGB image features, significant improvement can be
achieved. However, the output space of these approaches is on the camera image
plane. In the context of autonomous driving, this is not desirable as we wish
to localize objects in 3D space. Additional effort has to be made in order to
obtain 3D detections from these image-plane outputs.
Recently, several works [19, 38, 39, 18, 6, 5] have shown very promising results by
performing 3D object detection in BEV. These detectors are effective as BEV
maintains the structure native to 3D sensors such as LIDAR. As a consequence,
convolutional networks can be easily trained and strong priors like object size
can be exploited. Since most self-driving cars are equipped with both LIDAR and
cameras, sensor fusion between these modalities is desirable in order to further
boost performance.
Fusing information between LIDAR and images is non-trivial as images rep-
resent a projection of the world onto the camera plane, while LIDAR captures
the world’s native 3D structure. One possibility is to project the LIDAR points
onto the image, append an extra channel with depth information and exploit
traditional 2D detection architectures. This has been shown to be very effective
when reasoning in image space (e.g., [32, 15, 9]). Unfortunately, a second step is
necessary in order to obtain 3D detections from the 2D outputs.
In contrast, in this paper we perform the opposite operation. We exploit im-
age features extracted by a convolutional network, and then project the image
features into BEV and fuse them with the convolution layers of a LIDAR based
detector. This fusing operation is non-trivial, as image features happen at dis-
crete locations; thus, one needs to “interpolate” to create a dense BEV feature
Fig. 1: Architecture of our model. There are two streams, namely the camera
image stream and the BEV LIDAR stream. Continuous fusion layers are used
to fuse the image features onto the BEV feature maps.
Fig. 2: Continuous fusion layer: given a target pixel on the BEV image, we first
extract the K nearest LIDAR points (Step 1); we then project these 3D points onto
the camera image plane (Steps 2-3) and retrieve the corresponding image
features (Step 4); finally, we feed the image features together with the continuous
geometric offsets into an MLP to generate the feature for the target pixel (Step 5).
Parametric continuous convolution [36] computes the output feature of each target
point $i$ as $h_i = \sum_j \mathrm{MLP}(x_i - x_j) \cdot f_j$, where $j$ indexes over
the neighbors of point $i$, $f_j$ is the input feature and $x_j$ is the continuous
coordinate associated with point $j$. The MLP computes the convolutional weight at
each neighbor point. The advantage of parametric continuous
convolution is that it utilizes the concept of standard convolution to capture
local information from neighboring observations, without a rasterization stage
that could lead to geometric information loss. In this paper we argue that contin-
uous convolution is a good fit for our task, due to the fact that both camera view
and BEV are connected through a 3D point set, and modeling such geometric
relationships between them in a lossless manner is key to fusing information.
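To make the operation concrete, the following is a minimal PyTorch-style sketch of a parametric continuous convolution in the spirit of [36]; the module name, the two-layer weight MLP and the hidden width are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ContinuousConv(nn.Module):
    """Sketch of a parametric continuous convolution (after [36]).

    An MLP maps the continuous offset x_i - x_j to a convolutional
    weight, which is applied to the neighbor feature f_j and summed.
    """
    def __init__(self, in_dim, out_dim, hidden_dim=64):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.weight_mlp = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, in_dim * out_dim))

    def forward(self, feats, coords, neighbor_idx):
        # feats: (N, in_dim) features f_j; coords: (N, 3) continuous x_j;
        # neighbor_idx: (N, K) indices of the K neighbors of each point i.
        n, k = neighbor_idx.shape
        offsets = coords[:, None, :] - coords[neighbor_idx]            # x_i - x_j, (N, K, 3)
        w = self.weight_mlp(offsets).view(n, k, self.out_dim, self.in_dim)
        f = feats[neighbor_idx].unsqueeze(-1)                          # (N, K, in_dim, 1)
        # h_i = sum_j MLP(x_i - x_j) . f_j
        return (w @ f).squeeze(-1).sum(dim=1)                          # (N, out_dim)

# Example: 100 points with 16-d features, 8 neighbors each.
conv = ContinuousConv(16, 32)
h = conv(torch.randn(100, 16), torch.rand(100, 3), torch.randint(0, 100, (100, 8)))
```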
Continuous Fusion Layer: Our proposed continuous fusion layer exploits con-
tinuous convolutions to overcome the two aforementioned problems, namely the
sparsity in the observations and the handling of the spatially-discrete features
in the camera-view image. Given the input camera image feature map and a set
of LIDAR points, the target of the continuous fusion layer is to create a dense
BEV feature map where each discrete pixel contains features generated from
the camera image. This dense feature map can then be readily fused with BEV
feature maps extracted from LIDAR. One difficulty of image-BEV fusion is that
not all the discrete pixels on BEV space are observable in the camera. To over-
come this, for each target pixel in the dense map, we find its nearest K LIDAR
points over the 2D BEV plane using Euclidean distance. We then use an MLP
to fuse information from these K nearest points and “interpolate” the unobserved
feature at the target pixel. For each source LIDAR point, the input of our MLP
contains two parts: First, we extract the corresponding image features by pro-
jecting the source LIDAR point onto the image plane. Bilinear interpolation is
used to get the image feature at the continuous coordinates. Second, we encode
the 3D neighboring offset between the source LIDAR point and the target pixel
on the dense BEV feature map, in order to model the dependence of each LIDAR
point’s contribution on its relative position to the target. Overall, this gives us
a K × (Di + 3)-d input to the MLP for each target pixel, where Di is the input
feature dimension. For each target pixel, the MLP outputs a Do -dimensional
output feature by summing over the MLP output for all its neighbors.
That is to say,
\[
h_i = \sum_{j} \mathrm{MLP}\left(\mathrm{concat}\left[\,f_j,\; x_j - x_i\,\right]\right).
\]
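For completeness, here is a minimal sketch of the five steps above in PyTorch; the tensor layouts, the assumption that the LIDAR points have already been projected to normalized image coordinates, and the brute-force KNN search are simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousFusion(nn.Module):
    """Sketch of the continuous fusion layer (assumed tensor layouts,
    not the authors' exact code): for each BEV pixel, gather K nearby
    LIDAR points, sample their image features, and feed
    [image feature, 3D offset] through an MLP, summing over neighbors."""

    def __init__(self, img_dim, out_dim, k=1):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + 3, out_dim), nn.ReLU(inplace=True),
            nn.Linear(out_dim, out_dim))

    def forward(self, img_feat, lidar_xyz, lidar_uv, bev_xyz):
        # img_feat : (C, H, W) image feature map
        # lidar_xyz: (N, 3) LIDAR points in 3D
        # lidar_uv : (N, 2) projections onto the image plane, normalized to [-1, 1]
        # bev_xyz  : (P, 3) 3D centers of the dense BEV target pixels
        # Step 1: K nearest LIDAR points over the 2D BEV plane (x, y only).
        d2 = torch.cdist(bev_xyz[:, :2], lidar_xyz[:, :2])            # (P, N)
        knn = d2.topk(self.k, dim=1, largest=False).indices           # (P, K)
        # Steps 2-4: retrieve image features at the projected (continuous)
        # locations via bilinear interpolation.
        grid = lidar_uv.view(1, -1, 1, 2)
        sampled = F.grid_sample(img_feat[None], grid,
                                mode='bilinear', align_corners=False) # (1, C, N, 1)
        sampled = sampled[0, :, :, 0].t()                             # (N, C)
        # Step 5: MLP over [image feature, 3D offset to target], summed over K.
        feats = sampled[knn]                                          # (P, K, C)
        offsets = lidar_xyz[knn] - bev_xyz[:, None, :]                # (P, K, 3)
        return self.mlp(torch.cat([feats, offsets], dim=-1)).sum(dim=1)
```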
Our multi-sensor detection network has two streams: the image feature network
and the BEV network. We use four continuous fusion layers to fuse multiple
scales of image features into BEV network from lower level to higher level. The
overall architecture is depicted in Fig. 1. In this section we will discuss each
individual component in more detail.
Fusion Layers: Four continuous fusion layers are used to fuse multi-scale image
features into the four residual groups of the BEV network. The input of each
continuous fusion layer is an image feature map combined from the outputs of
all four image residual groups. We use the same combination approach as the
feature pyramid network (FPN) [21]. The output feature in BEV space has the
same shape as the corresponding BEV layer and is combined into BEV through
element-wise summation. Our final BEV feature output also combines the last
three residual groups’ output in a similar manner as FPN [21], in order to exploit
multi-scale information.
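As an illustration of this combination step, a small sketch of an FPN-style [21] top-down merge of the four image residual groups follows; the channel widths, the 1 × 1 lateral convolutions and the nearest-neighbor upsampling are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

def fpn_combine(feature_maps, lateral_convs):
    """FPN-style [21] top-down combination of multi-scale feature maps.

    feature_maps : list of (B, C_i, H_i, W_i) tensors, ordered fine to coarse.
    lateral_convs: 1x1 convs mapping each C_i to a common channel width.
    Returns a single map at the finest resolution.
    """
    out = lateral_convs[-1](feature_maps[-1])                    # start from the coarsest level
    for feat, lat in zip(reversed(feature_maps[:-1]),
                         reversed(lateral_convs[:-1])):
        out = F.interpolate(out, size=feat.shape[-2:], mode='nearest')
        out = out + lat(feat)                                    # element-wise summation
    return out

# Hypothetical usage with four residual-group outputs of widths 256..2048:
# laterals = nn.ModuleList(nn.Conv2d(c, 128, 1) for c in (256, 512, 1024, 2048))
# combined = fpn_combine([c2, c3, c4, c5], laterals)
```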
Training: We use a multi-task loss to train our network. Following common prac-
tice in object detection [13, 12, 30], we define the loss function as the summation
of classification and regression losses,
\[
L = L_{cls} + L_{reg}, \tag{1}
\]
where $L_{cls}$ and $L_{reg}$ are the classification loss and regression loss, respectively.
$L_{cls}$ is defined as the binary cross entropy between the class confidence and the label,
\[
L_{cls} = -\frac{1}{N} \sum \big( l_c \log(p_c) + (1 - l_c) \log(1 - p_c) \big), \tag{2}
\]
where pc is the predicted classification score, lc is the binary label, and N is the
number of samples. For 3D detection, Lreg is the sum of seven terms
\[
L_{reg} = \frac{1}{N_{pos}} \sum_{k \in \{x,\,y,\,z,\,w,\,h,\,d,\,t\}} D(p_k, l_k), \tag{3}
\]
where (x, y, z) denotes the 3D box center, (w, h, d) denotes the box size, t denotes
the orientation, and Npos is the number of positive samples. D is a smoothed
L1-norm defined as:
\[
D(p_k, l_k) =
\begin{cases}
0.5\,(p_k - l_k)^2 & \text{if } |p_k - l_k| < 1 \\
|p_k - l_k| - 0.5 & \text{otherwise,}
\end{cases} \tag{4}
\]
with pk and lk the predicted and ground truth offsets respectively. For k ∈
(x, y, z), pk is encoded as:
\[
p_k = (k - a_k)/a_k \tag{5}
\]
where ak is the coordinate of the anchor. For k ∈ (w, h, d), pk is encoded as:
\[
p_k = \log(k/a_k) \tag{6}
\]
where $a_k$ is the size of the anchor. The orientation offset is simply defined as the
difference between the predicted and labeled orientations:
\[
p_t = t - a_t \tag{7}
\]
When only BEV detections are required, the z and d terms are removed from
the regression loss. Positive and negative samples are determined based on dis-
tance to the ground-truth object center. Hard negative mining is used to sample
the negatives. In particular, we first randomly select 5% negative anchors and
then only use top-k among them for training, based on the classification score.
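The regression targets and the smoothed L1 term above map to a few lines of code; the sketch below assumes the box parameters are given as PyTorch tensors keyed by the seven names in Equation 3, which is an assumed interface rather than the authors' code.

```python
import torch

def encode_targets(gt, anchor):
    """Regression targets per Eqs. (5)-(7).
    gt, anchor: dicts of tensors with keys x, y, z (centers), w, h, d (sizes), t (orientation)."""
    p = {}
    for k in ('x', 'y', 'z'):                  # centers: (k - a_k) / a_k
        p[k] = (gt[k] - anchor[k]) / anchor[k]
    for k in ('w', 'h', 'd'):                  # sizes: log(k / a_k)
        p[k] = torch.log(gt[k] / anchor[k])
    p['t'] = gt['t'] - anchor['t']             # orientation: plain difference
    return p

def smooth_l1(pred, target):
    """Smoothed L1 of Eq. (4)."""
    diff = (pred - target).abs()
    return torch.where(diff < 1, 0.5 * diff ** 2, diff - 0.5)

def regression_loss(pred, target, num_pos):
    """Eq. (3): sum of smooth-L1 terms over the seven box parameters."""
    terms = [smooth_l1(pred[k], target[k]).sum() for k in ('x', 'y', 'z', 'w', 'h', 'd', 't')]
    return sum(terms) / num_pos
```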
We initialize the image network with ImageNet pre-trained weights and initialize
the BEV network and continuous fusion layers using Xavier initialization [14].
The whole network is trained end-to-end through back-propagation. Note that
there is no direct supervision on the image stream; instead, gradients are propagated
from the BEV feature space back through the continuous fusion layers.
4 Experiments
We evaluate our multi-sensor 3D object detector on two datasets: the public
KITTI benchmark [11] and a large-scale 3D object detection dataset (TOR4D)
[37]. On the public KITTI dataset we compare with other state-of-the-art meth-
ods in both 3D object detection and BEV object detection tasks. An ablation
study is also conducted that compares different model design choices. We also
evaluate our model on TOR4D, a large-scale 3D object detection dataset col-
lected in-house on roads of North-American cities. On this dataset we show that
the proposed approach works particularly well in long-range (> 60m) detec-
tion, which plays an important role in practical object detection systems for
autonomous driving. Finally we show qualitative results and discuss future di-
rections.
Implementation Details: All camera images are cropped to the size of 370 ×
1224. BEV input is generated by voxelizing the 3D space into a 512 × 448 × 32
volume, corresponding to 70 meters in the forward direction and ±40 meters to the
left and right of the ego-car. 8-neighbor interpolation is used during voxelization. We
train a 3D multi-sensor fusion detection model, where all seven regression terms
in Equation 3 are used. Because the height of the 2D box is required by the KITTI
evaluation server, we add another regression term to predict the 2D height. As
a result, we have a final output tensor with the size 118 × 112 × 2 × 9, where
118 × 112 is the number of spatial anchors and 2 is the number of orientation
anchors.
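A simplified voxelization routine is sketched below; the z range, the axis ordering of the 512 × 448 × 32 volume, and the hard binary occupancy (instead of the 8-neighbor interpolation used in the paper) are assumptions.

```python
import numpy as np

def voxelize_bev(points,
                 x_range=(0.0, 70.0),     # forward direction (m)
                 y_range=(-40.0, 40.0),   # left/right of the ego-car (m)
                 z_range=(-2.0, 1.2),     # assumed height range (m)
                 bins=(448, 512, 32)):    # paper reports 512 x 448 x 32; axis order assumed
    """Rasterize a LIDAR point cloud (N, 3) into a binary BEV occupancy volume."""
    ranges = np.array([x_range, y_range, z_range])
    bins = np.array(bins)
    voxel_size = (ranges[:, 1] - ranges[:, 0]) / bins
    idx = np.floor((points - ranges[:, 0]) / voxel_size).astype(np.int64)
    valid = np.all((idx >= 0) & (idx < bins), axis=1)             # drop out-of-range points
    vol = np.zeros(tuple(bins), dtype=np.float32)
    vol[tuple(idx[valid].T)] = 1.0                                # mark occupied voxels
    return vol
```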
Since the training data in KITTI is limited, we adopt several data augmen-
tation techniques to alleviate over-fitting. For each frame during training, we
apply random scaling (0.9 ∼ 1.1 for all 3 axes), translation (−5 ∼ 5 meters for the
x and y axes and −1 ∼ 1 meters for the z axis) and rotation (−5 ∼ 5 degrees about the z axis) on 3D
LIDAR point clouds, and random scaling (0.9 ∼ 1.1) and translation (−50 ∼ 50
pixels) on camera images. We modify the transformation matrix from LIDAR
to camera accordingly to ensure their correspondence. We do not apply data
augmentation during testing.
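The point-cloud augmentation and the corresponding calibration update can be sketched as follows; the 4 × 4 homogeneous-transform convention and the exact order of scaling, rotation and translation are assumptions.

```python
import numpy as np

def augment_lidar_and_calib(points, T_lidar_to_cam, rng=np.random):
    """Random scaling / translation / rotation of the LIDAR point cloud,
    with the LIDAR-to-camera transform updated to keep correspondence.
    Ranges follow the paper; the 4x4 homogeneous convention is assumed."""
    s = rng.uniform(0.9, 1.1, size=3)                        # per-axis scaling
    t = np.array([rng.uniform(-5, 5), rng.uniform(-5, 5),
                  rng.uniform(-1, 1)])                        # translation (m)
    yaw = np.deg2rad(rng.uniform(-5, 5))                      # rotation about z
    c, si = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -si, 0], [si, c, 0], [0, 0, 1]])
    # Build a 4x4 transform: scale per axis, then rotate, then translate.
    A = np.eye(4)
    A[:3, :3] = R * s                                         # equals R @ diag(s)
    A[:3, 3] = t
    pts_h = np.c_[points, np.ones(len(points))]               # (N, 4) homogeneous
    points_aug = (pts_h @ A.T)[:, :3]
    # Points move by A, so the calibration absorbs A^-1 to stay consistent.
    T_aug = T_lidar_to_cam @ np.linalg.inv(A)
    return points_aug, T_aug
```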
We train the model with a batch size of 16 on 4 GPUs. Adam [16] is used for
optimization with 0 weight decay. The learning rate is initialized as 0.001, and
decayed by 0.1 at 30 epochs and 45 epochs. The training ends after 50 epochs.
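This schedule corresponds to a standard Adam setup with step-wise decay; a self-contained sketch with stand-in model and data (the real network and loader come from Section 3) is shown below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins so the schedule sketch runs; the real model and loader are assumed elsewhere.
model = nn.Linear(8, 1)
train_loader = [(torch.randn(16, 8), torch.randn(16, 1)) for _ in range(4)]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 45], gamma=0.1)

for epoch in range(50):
    for x, y in train_loader:
        loss = F.mse_loss(model(x), y)     # placeholder for the multi-task loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                       # lr *= 0.1 after epochs 30 and 45
```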
Our model ranks third among the compared methods, but has the best AP on the easy subset.
While keeping a high detection accuracy, our model is able to run at real-time
efficiency. Our detector runs at > 15 frames per second, much faster than all
other LIDAR based and fusion based methods.
Continuous fusion has two components which enable dense and accurate fusion
between image and LIDAR. The first is KNN pooling, which gathers image fea-
ture input for dense BEV pixels through sparse neighboring points. The second is
the geometric feature input to MLP, which compensates for the continuous offsets
between the matched position pairs of the two modalities. We investigate these
components by comparing the continuous fusion model with a set of derived
models. We also investigate the model with different KNN hyper-parameters.
The experiments are conducted on the same training/validation split provided
by MV3D [6]. We modify KITTI’s AP metric from 11-point area-under-curve
to 100-point for smaller variance. In practice, these two versions have < 1%
discrepancy.
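The difference between the two AP variants is only the density of the recall grid; a sketch of interpolated AP follows (the exact recall grid used by the authors is not stated here, so the linspace below is an assumption).

```python
import numpy as np

def interpolated_ap(recall, precision, num_points=100):
    """Interpolated average precision over evenly spaced recall levels.

    num_points=11 corresponds to the classic 11-point metric;
    num_points=100 is the denser variant used in the ablation.
    """
    recall = np.asarray(recall)
    precision = np.asarray(precision)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, num_points):
        mask = recall >= r
        # Interpolated precision: max precision at any recall >= r (0 if none).
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / num_points
```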
The first derived model is a LIDAR BEV only model, which uses the BEV
stream of the continuous fusion model as its backbone net and the same de-
tection header. All continuous fusion models significantly outperform the BEV
model in all six metrics, which demonstrates the great advantage of our model.
This advantage is even larger for 3D detection, suggesting that the fused image
features provide complementary z axis information to BEV features.
The second derived model is a discrete fusion model, which has neither KNN
pooling nor geometric feature. This model projects the LIDAR points onto im-
age and BEV to find the matched pixel pairs, whose features are then fused.
Continuous fusion models outperform the discrete fusion model in all metrics.
For BEV detection, the discrete fusion model even has scores similar to the BEV
model. This result confirms that fusing image and LIDAR features is not a trivial
task.
When the geometric feature is removed from the MLP input, the performance of
the continuous fusion model drops significantly. However, even when offsets are
absent, the continuous fusion model still outperforms the discrete one, which
justifies the importance of interpolation by KNN pooling.
The continuous fusion layer has two hyper-parameters: the maximum neighbor
distance d and the number of nearest neighbors k. Setting a threshold on the dis-
tance to selected neighbors prevents propagation of wrong information from far
away neighbors. However, as shown in Table 2, the model is insensitive to this
threshold (k = 1, d = +∞). One reason might be that the model learns to “ignore”
neighbors when their distance is too large. When the number of nearest neighbors
is increased from 1 to 3, the performance is even worse. A possible reason is
that larger k will lead to more distant neighbors, which have less prediction
power than close neighbors. Empirically, for any of the distance thresholds chosen,
the model with KNN pooling consistently outperforms the model without KNN
pooling.
Fig. 3: Range-based piecewise AP (%) as a function of range x (meters) on TOR4D,
comparing the BEV baseline, PIXOR [37] and the continuous fusion (Fuse) model at
IoU thresholds 0.3, 0.5 and 0.7.
The image size is 1200 × 1920, which is also larger than the KITTI images. To achieve
real-time efficiency, we only use a narrow image crop of size 224 × 1920. Overall, after
these changes our TOR4D model is even faster than the KITTI model, running
at 0.05 seconds per frame.
A multi-class BEV object detection model is trained on the dataset. The
model detects three classes: vehicle, pedestrian and bicyclist. We change the
detection header to have separate classification and regression outputs for each
class. The two z-axis-related regression terms are removed from the loss
function. 2D rotated IoU on BEV is used as the evaluation metric. AP0.5 and
AP0.7 are used for vehicle class, and AP0.3 and AP0.5 are used for the other two
classes. On this large scale dataset we do not find significant benefit from regu-
larization techniques, such as data augmentation and dropout. We thus do not
use these techniques. The model is trained with Adam optimizer [16] at 0.001
learning rate for 1 epoch, and then 0.0001 learning rate for another 0.4 epoch.
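2D rotated IoU on BEV can be computed by polygon intersection; the sketch below uses shapely and assumes boxes parameterized as (center x, center y, width, length, yaw), which is an illustrative convention rather than the evaluation code used here.

```python
import numpy as np
from shapely.geometry import Polygon

def bev_box_corners(cx, cy, w, l, yaw):
    """Corners of a rotated BEV box given center, width/length and yaw."""
    corners = np.array([[ l / 2,  w / 2], [ l / 2, -w / 2],
                        [-l / 2, -w / 2], [-l / 2,  w / 2]])
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s], [s, c]])
    return corners @ rot.T + np.array([cx, cy])

def rotated_iou(box_a, box_b):
    """2D rotated IoU between two BEV boxes (cx, cy, w, l, yaw)."""
    pa = Polygon(bev_box_corners(*box_a))
    pb = Polygon(bev_box_corners(*box_b))
    inter = pa.intersection(pb).area
    union = pa.area + pb.area - inter
    return inter / union if union > 0 else 0.0

# Example: two identical boxes give IoU = 1.0
print(rotated_iou((0, 0, 2, 4, 0.3), (0, 0, 2, 4, 0.3)))
```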
Evaluation Results: We compare the continuous fusion model with two baseline
models. One is a BEV model which is basically the BEV stream of the fusion
model. The other one is a recent state-of-the-art BEV detection model, PIXOR
[37], which is based on LIDAR input only. We evaluate the models over the range
of 0 ≤ x ≤ 100 and −40 ≤ y ≤ 40, where x and y are axes in the LIDAR space.
Fig. 4: Qualitative results on KITTI Dataset. The BEV-image pairs and the
detected bounding boxes are shown. The 2D bounding boxes are obtained by
projecting the 3D detections onto the image. The bounding boxes of the same
object in BEV and in the image are shown in the same color.
Our continuous fusion model significantly outperforms the other two LIDAR
based methods on all classes (Table 3). To better illustrate the performance of
our model on long range object detection, we compute range based piecewise
AP for the three classes (Figure 3). Each point is computed over a 10-meter range
for vehicles and pedestrians and a 20-meter range for bicyclists along the x axis. The
continuous fusion model outperforms BEV and PIXOR [37] at most ranges, and
achieves more gains for long range detection.
Note that no extra image labels are needed, because our model can be trained
with 3D/BEV detection labels only.
5 Conclusion
References
1. Boscaini, D., Masci, J., Rodolà, E., Bronstein, M.: Learning shape correspondence
with anisotropic convolutional neural networks. In: NIPS (2016)
2. Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric
deep learning: going beyond euclidean data. IEEE Signal Processing Magazine
(2017)
3. Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., Urtasun, R.: Monocular 3d
object detection for autonomous driving. In: CVPR (2016)
4. Chen, X., Kundu, K., Zhu, Y., Berneshawi, A.G., Ma, H., Fidler, S., Urtasun, R.:
3d object proposals for accurate object class detection. In: NIPS (2015)
5. Chen, X., Kundu, K., Zhu, Y., Ma, H., Fidler, S., Urtasun, R.: 3d object proposals
using stereo imagery for accurate object class detection. TPAMI (2017)
6. Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3d object detection network
for autonomous driving. In: CVPR (2017)
7. Dai, J., Li, Y., He, K., Sun, J.: R-fcn: Object detection via region-based fully
convolutional networks. In: NIPS (2016)
8. Du, X., Ang Jr, M.H., Karaman, S., Rus, D.: A general pipeline for 3d detection
of vehicles. arXiv preprint arXiv:1803.00387 (2018)
9. Eitel, A., Springenberg, J.T., Spinello, L., Riedmiller, M., Burgard, W.: Multimodal
deep learning for robust rgb-d object recognition. In: IROS (2015)
10. Engelcke, M., Rao, D., Wang, D.Z., Tong, C.H., Posner, I.: Vote3deep: Fast object
detection in 3d point clouds using efficient convolutional neural networks. In: ICRA
(2017)
11. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti
vision benchmark suite. In: CVPR (2012)
12. Girshick, R.: Fast r-cnn. In: ICCV (2015)
13. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accu-
rate object detection and semantic segmentation. In: CVPR (2014)
14. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward
neural networks. In: AISTATS (2010)
15. Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from rgb-d
images for object detection and segmentation. In: ECCV (2014)
16. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. ICLR (2015)
17. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. ICLR (2017)
18. Ku, J., Mozifian, M., Lee, J., Harakeh, A., Waslander, S.: Joint 3d proposal gener-
ation and object detection from view aggregation. arXiv preprint arXiv:1712.02294
(2017)
19. Li, B.: 3d fully convolutional network for vehicle detection in point cloud. In: IROS
(2017)
20. Li, B., Zhang, T., Xia, T.: Vehicle detection from 3d lidar using fully convolutional
network. In: RSS (2016)
21. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature
pyramid networks for object detection. In: CVPR (2017)
22. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object
detection. In: ICCV (2017)
23. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd:
Single shot multibox detector. In: ECCV (2016)
24. Luo, W., Yang, B., Urtasun, R.: Fast and furious: Real time end-to-end 3d detec-
tion, tracking and motion forecasting with a single convolutional net. In: CVPR
(2018)
25. Monti, F., Boscaini, D., Masci, J., Rodolà, E., Svoboda, J., Bronstein, M.M.: Ge-
ometric deep learning on graphs and manifolds using mixture model cnns. CVPR
(2017)
26. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum pointnets for 3d object
detection from rgb-d data. CVPR (2018)
27. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for
3d classification and segmentation. In: CVPR (2017)
28. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learn-
ing on point sets in a metric space. In: NIPS (2017)
29. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified,
real-time object detection. In: CVPR (2016)
30. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object de-
tection with region proposal networks. In: NIPS (2015)
31. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph
neural network model. TNN (2009)
32. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support
inference from rgbd images. In: ECCV (2012)
33. Simonovsky, M., Komodakis, N.: Dynamic edge-conditioned filters in convolutional
neural networks on graphs. CVPR (2017)
34. Song, S., Xiao, J.: Sliding shapes for 3d object detection in depth images. In: ECCV
(2014)
35. Song, S., Xiao, J.: Deep sliding shapes for amodal 3d object detection in rgb-d
images. In: CVPR (2016)
36. Wang, S., Suo, S., Ma, W.C., Urtasun, R.: Deep parametric continuous convolu-
tional neural networks. In: CVPR (2018)
37. Yang, B., Luo, W., Urtasun, R.: Pixor: Real-time 3d object detection from point
clouds. In: CVPR (2018)
38. Yu, S.L., Westfechtel, T., Hamada, R., Ohno, K., Tadokoro, S.: Vehicle detection
and localization on bird’s eye view elevation images using convolutional neural
network. In: SSRR (2017)
39. Zhou, Y., Tuzel, O.: Voxelnet: End-to-end learning for point cloud based 3d object
detection. CVPR (2018)