Wu, Gu et al. 2022 - PV-RCNN++: Semantical Point-Voxel Feature Interaction for 3D Object Detection
https://doi.org/10.1007/s00371-022-02672-2
ORIGINAL ARTICLE
Abstract
Large imbalance often exists between the foreground points (i.e., objects) and the background points in outdoor LiDAR point
clouds. It hinders cutting-edge detectors from focusing on informative areas to produce accurate 3D object detection results.
This paper proposes a novel object detection network by semantical point-voxel feature interaction, dubbed PV-RCNN++.
Unlike most existing methods, PV-RCNN++ exploits semantic information to enhance the quality of object detection. First, a semantic segmentation module is proposed to retain more discriminative foreground keypoints. Such a module guides PV-RCNN++ to integrate more object-related point-wise and voxel-wise features in the pivotal areas. Then, to make points and voxels interact efficiently, we utilize a voxel query based on the Manhattan distance to quickly sample voxel-wise features around keypoints. Compared with the ball query, the voxel query reduces the time complexity from O(N) to O(K). Further, to avoid being stuck in learning only local features, an attention-based residual PointNet module is designed to expand the receptive field and adaptively aggregate the neighboring voxel-wise features into keypoints. Extensive experiments on the KITTI dataset show that PV-RCNN++ achieves 81.60%, 40.18%, and 68.21% 3D mAP on Car, Pedestrian, and Cyclist, respectively, achieving comparable or even better performance than the state of the art.
Keywords PV-RCNN++ · 3D object detection · Point-voxel feature interaction · Semantic segmentation · Voxel query
1 Introduction
Object detection in both 2D and 3D fields [1–6] is increasingly important with the development of autonomous driving [7], robot systems, and virtual reality. It also provides assistance and support for other tasks (e.g., object tracking [8–10]). Much progress has been made in 3D object detection via various data representations (e.g., monocular images [11–14], stereo cameras [15,16], and LiDAR point clouds). Compared to 3D object detection from 2D images, the LiDAR point cloud plays a critical role in detecting 3D objects, as it contains relatively precise depth and 3D spatial structure information.

LiDAR-based 3D object detectors can be roughly grouped into two prevailing categories: voxel-based [17–21] and point-based [22–25].
The former discretizes points into regular grids for the convenience of the 3D sparse convolutional neural network (CNN). Then, the voxelized feature map can be compressed to a Bird's Eye View (BEV), which is fed to a Region Proposal Network (RPN) [1,18] to produce predictions. On the contrary, the point-based ones mainly adopt PointNet++ [26] as the backbone, which takes raw points as input and abstracts sets of point features through an iterative sampling-and-grouping operation. Different from the single voxel-based and point-based methods, PV-RCNN [27] explores the interaction between point-wise and voxel-wise features. Specifically, PV-RCNN deeply integrates both a 3D voxel CNN and PointNet-based Set Abstraction (SA) to enhance the ability of feature learning. To be concrete, a Voxel Set Abstraction (VSA) module is proposed to encode voxel-wise features of different scales through keypoints sampled by furthest point sampling (FPS) [26]. Through coordinate transform and projection, VSA also concatenates the BEV features and raw point features into keypoints to obtain a more comprehensive understanding of 3D scenes.

Nevertheless, we observe that a large imbalance between small informative areas containing 3D objects and large redundant background areas exists in outdoor LiDAR point clouds. It poses a challenge for accurate 3D object detection. Generally, the point cloud obtained by LiDAR covers a long range of hundreds of meters, where only several small cars are captured and the rest are numerous background points. However, in PV-RCNN [27], the whole 3D scene is summarized through a small number of keypoints sampled by FPS. When selecting keypoints, FPS tends to choose distant points to evenly cover the whole point cloud, which causes excessive unimportant background points to be retained and many valuable foreground points to be discarded. Consequently, the performance of PV-RCNN is largely limited due to insufficient features provided by the foreground objects. Therefore, we consider whether there is any prior knowledge that can lead the detector to focus on the pivotal foreground objects to extract more valuable features. The inspiration is to leverage the result of point cloud semantic segmentation as the prior knowledge to guide the detector.

To this end, we present a novel 3D object detection network via semantical point-voxel feature interaction (termed PV-RCNN++). First, we introduce a lightweight and fast foreground point sampling head meticulously modified from PointNet++ [26] to select proper object-related keypoints. We remove the feature propagation (FP) layer in PointNet++ to avoid its heavy memory usage and time consumption [24]. We only retain the SA layers to produce more valuable keypoints. Concretely, in each SA layer, we adopt a binary segmentation module to classify the foreground and background points. Then, inspired by [28], we adopt a novel sampling strategy, semantic-guided furthest point sampling (S-FPS), taking segmentation scores as guidance to sample and group representative points. Different from FPS, S-FPS gives more preference to positive points, making more foreground points retained in the SA layers. Hence, the sampled points in the SA layers can act as the pivotal point-wise representation for the succeeding operations.

After obtaining the discriminative keypoints, the challenge is how to efficiently integrate the voxel-wise and point-wise features via keypoints. We seek (1) to speed up the interaction between points and voxels and (2) to effectively summarize the 3D information from the voxel-wise features. Specifically, 3D sparse convolution is first adopted to encode the voxelized point cloud. Then, we propose a fast voxel-to-point interaction module to efficiently sample and group the neighboring voxel-wise features around keypoints. The existing query strategy, the ball query [27], consumes too much time computing the Euclidean distance from every voxel to the keypoints to identify whether the voxel is within a given radius. Therefore, motivated by [29], we regard keypoints as voxels, which are regularly arranged in 3D space, and then a voxel query strategy based on the Manhattan distance is utilized to quickly identify the neighboring voxel-wise features of each keypoint. Compared with the ball query, our voxel query greatly reduces the time consumption from O(N) to O(K), where N is the total number of voxels and K is the number of neighboring voxels around each keypoint.

An attention-based [30] residual PointNet module is proposed to abstract the neighboring voxel-wise features and summarize the multi-scale 3D information. We apply a self-attention mechanism on the voxel set of each keypoint to produce the corresponding attention maps, allowing each voxel to have a more comprehensive perception field containing more 3D structure and scene information of other nearby voxels. Last, we introduce a lightweight residual [31,32] PointNet module to further extract and aggregate the refined voxel-wise features.

The main contributions are summarized as follows:

• We introduce a semantic-guided keypoint sampling module to retain more valuable foreground points from the point cloud, which helps the detector focus on small pivotal areas containing 3D objects.
• We utilize a voxel query based on the Manhattan distance to quickly gather the neighboring voxel-wise features around keypoints, reducing the time consumption compared to the ball query and improving the efficiency of point-voxel interaction (a sketch follows this list).
• We propose an attention-based residual PointNet module, which allows each voxel to have an adaptive and nonlocal summary of its neighborhood to achieve more accurate predictions.
• Extensive experiments show that our proposed method achieves comparable performance on the common 3D object detection benchmark, the KITTI dataset [7].
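As a concrete illustration of the voxel query idea described above, the following is a minimal NumPy sketch (not the authors' implementation) that gathers the voxel features inside a Manhattan-distance window around one keypoint after quantizing it to a voxel index. The names `voxel_query`, `voxel_hash`, and `query_range`, as well as the hash-map lookup, are our own illustrative assumptions.

```python
import numpy as np

def voxel_query(keypoint, voxel_hash, voxel_size, query_range=4, max_samples=16):
    """Gather voxel features around one keypoint using a Manhattan-distance window.

    keypoint    : (3,) xyz coordinate of the keypoint.
    voxel_hash  : dict mapping integer voxel index (ix, iy, iz) -> feature vector,
                  i.e. the non-empty voxels produced by the sparse 3D CNN.
    voxel_size  : (3,) edge lengths of a voxel.
    query_range : maximum Manhattan distance (in voxels) searched around the keypoint.
    max_samples : at most this many neighboring voxel features are kept (the K above).
    """
    center = np.floor(np.asarray(keypoint) / np.asarray(voxel_size)).astype(int)
    gathered = []
    # Enumerate only the O(K) candidate offsets instead of scanning all N voxels.
    for dx in range(-query_range, query_range + 1):
        for dy in range(-query_range, query_range + 1):
            for dz in range(-query_range, query_range + 1):
                if abs(dx) + abs(dy) + abs(dz) > query_range:
                    continue  # outside the Manhattan ball
                feat = voxel_hash.get((center[0] + dx, center[1] + dy, center[2] + dz))
                if feat is not None:
                    gathered.append(feat)
                if len(gathered) == max_samples:
                    return np.stack(gathered)
    return np.stack(gathered) if gathered else np.zeros((0,))
```

In this toy form the cost depends only on the window size (and therefore on K), whereas a ball query has to measure the Euclidean distance from the keypoint to every one of the N non-empty voxels.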
the time-consuming interaction between points and voxels; and (3) the voxel-wise features should have a more comprehensive perception of the neighboring structure instead of only local convolution features. To this end, we present our PV-RCNN++: semantical point-voxel feature interaction for 3D object detection, which consists of the following modules: (1) a binary segmentation module is introduced to guide FPS to select more object-related keypoints; (2) a voxel query based on the Manhattan distance replaces the ball query to quickly sample voxel-wise features; (3) an attention-based residual PointNet is designed to adaptively fuse the neighboring voxel-wise features and summarize the nonlocal 3D structure information. Our backbone is illustrated in Fig. 1. Given a point cloud P = {p_i | i = 1, 2, 3, ..., N} ⊆ R^{3+d} as input, where N = 16384 and d denotes the point features (e.g., reflection intensity), our goal is to predict the center location (x, y, z), box size (l, w, h), and rotation angle θ around the Z-axis of each object.

3.2 Voxel encoder and 3D region proposal network

First, the unordered point cloud is transformed into uniform 3D grids with voxel size v, and each grid contains k points. Then, the mean voxel feature encoder (VFE) [17] is adopted to compute the mean feature of the k points as the representative feature of the grid.

3.2.1 3D voxel CNN

Through the mean VFE, the point cloud is shaped into an L × W × H feature volume. We utilize 3D sparse convolutional neural networks [18] to encode the feature volumes with 1×, 2×, 4×, and 8× downsample sizes. All four downsampled voxel-wise feature maps are preserved for the subsequent point-voxel interaction module.

3.2.2 3D region proposal network

After the 3D voxel encoder, the fourth (8× downsampled) feature map is compressed to a 2D BEV feature map of L/8 × W/8 resolution. We utilize anchor-based methods [1] on the BEV map to generate 3D anchor boxes with the average size of each class. Considering the rotation angle around the Z-axis, 0 and π/2 degrees are set for each anchor pixel. Therefore, the whole BEV map produces 3 × 2 × L/8 × W/8 proposals in total for the three classes: Car, Pedestrian, and Cyclist.

3.3 Foreground point sampling

Our motivation is to retain more foreground points to capture more valuable spatial and position information while bringing no extra burden of time consumption, so we meticulously redesign PointNet++ to act as our foreground point sampling module. As mentioned in 3DSSD [24], although the FP layer in PointNet++ [26] can broadcast the semantic features to all points to improve segmentation precision, it takes much time to upsample points. Therefore, capturing more foreground points in the SA layers is a better choice. We remove the FP layer in PointNet++ and adopt a semantic-guided sampling strategy in the SA layers. We first feed the raw points P with features F into the segmentation module to compute the scores S. Then, with the guidance of S, we employ the modified sampling strategy to sample K keypoints in each SA layer. The specific process is shown in Fig. 2 and described as follows.

3.3.1 Binary segmentation module

To avoid bringing high computation, we adopt a 2-layer MLP as the binary segmentation module to directly obtain the score of each point. Concretely, given the point feature set F_k = {f_1^{d_k}, f_2^{d_k}, f_3^{d_k}, ..., f_{N_k}^{d_k}}, where d_k denotes the dimension of the point features f_i^{d_k} fed into the k-th SA layer, the score s_i ∈ [0, 1] of each point is defined as:

s_i = σ(MLP_k(f_i^{d_k})),    (1)

where σ is the sigmoid function and MLP_k represents the segmentation module in the k-th SA layer. The real segmentation labels can be obtained from the ground-truth boxes. We define the points inside the ground-truth boxes as foreground points and those outside as background points. Therefore, the loss of the segmentation module is calculated as:

L_seg = Σ_{k=1}^{m} (λ_k / N_k) Σ_{i=1}^{N_k} BCE(s_i^k, ŝ_i^k),    (2)

where s_i^k denotes the predicted score and ŝ_i^k is the ground-truth label (0 for the background and 1 for the foreground) of the i-th point in the k-th SA layer, and BCE is the binary cross-entropy loss function. m is the number of SA layers, which is set to 4. N_k ∈ {4096, 2048, 1024, 256} is the total number of input points, and λ_k ∈ {0.1, 0.01, 0.001, 0.0001} is the loss weight of the k-th SA layer.
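For readers who want Eqs. (1)–(2) in code form, here is a minimal NumPy sketch (our own illustration, not the released implementation) of the per-layer scoring MLP and the layer-weighted binary cross-entropy loss; the MLP parameters, hidden width, and function names are placeholder assumptions.

```python
import numpy as np

def segmentation_scores(feats, w1, b1, w2, b2):
    """Eq. (1): a 2-layer MLP followed by a sigmoid gives a foreground score per point.

    feats        : (N_k, d_k) point features of one SA layer.
    w1/b1, w2/b2 : placeholder MLP parameters (hidden width chosen arbitrarily).
    """
    hidden = np.maximum(feats @ w1 + b1, 0.0)       # ReLU hidden layer
    logits = hidden @ w2 + b2                        # (N_k, 1) raw scores
    scores = 1.0 / (1.0 + np.exp(-logits))           # sigmoid -> s_i in [0, 1]
    return scores.reshape(-1)

def segmentation_loss(scores_per_layer, labels_per_layer,
                      weights=(0.1, 0.01, 0.001, 0.0001), eps=1e-7):
    """Eq. (2): BCE per point, averaged within each SA layer and summed with lambda_k."""
    total = 0.0
    for lam, s, y in zip(weights, scores_per_layer, labels_per_layer):
        s = np.clip(s, eps, 1.0 - eps)
        bce = -(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))
        total += lam * bce.mean()                    # (lambda_k / N_k) * sum_i BCE
    return total
```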
Fig. 1 An overview of our PV-RCNN++. The point cloud is first fed into the 3D voxel encoder to produce 3D proposals. Then, the foreground point sampling module selects more valuable object-related keypoints through the modified SA layers. Last, according to the sampled keypoints, the voxel-wise feature, point-wise feature, and the BEV feature are concatenated to be fed into RoI-grid pooling to refine the proposals and produce more accurate 3D boxes
Fig. 2 The structure of the modified set abstraction (SA) layer in the foreground point sampling module. Points from raw data or the previous SA layer are fed to the binary segmentation module to obtain the scores. Then, S-FPS is adopted to select more foreground keypoints to have a better understanding of the 3D object-related information. Last, neighboring points around keypoints are grouped and aggregated to produce the final keypoint feature, which is fed to the next SA layer
3.3.2 Semantic-guided furthest point sampling

Since we have obtained the point scores from the binary segmentation module, the possible foreground points have been masked. The easiest way to select foreground points is to use the Top-K scores as guidance. However, as observed from Fig. 7 and Table 2, selecting too many foreground points while involving few background points decreases the perceptual ability over the whole 3D scene.

Motivated by [28], we modify the furthest point sampling strategy by adding a score weight, called S-FPS. Keeping the basic flow of FPS unchanged, we leverage the scores of the unselected points to rectify their distances to the selected points. Given the point coordinate set P = {p_1, p_2, p_3, ..., p_N} ⊆ R^3 and the corresponding score set S = {s_1, s_2, s_3, ..., s_N}, the distance set D = {d_1, d_2, d_3, ..., d_N} contains the shortest distances of the N unselected points to the already selected points. In the vanilla FPS, the point with the longest distance is picked as the furthest point. However, in S-FPS, we recalculate each distance d̂_i with the score s_i as follows:

d̂_i = (e^{γ·s_i} − 1) · d_i,    (3)

where γ is an adjustable parameter deciding the importance of the score, which is set to 1 by default. When γ is fixed, the closer the score s_i is to 1, the greater d̂_i is. Hence, S-FPS can select more positive points (i.e., foreground points) compared to FPS. Obviously, S-FPS becomes similar to the Top-K algorithm if γ = +∞. The specific process of S-FPS is shown in Algorithm 1.
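Algorithm 1 is not reproduced here, so below is a minimal NumPy sketch of S-FPS as we read Eq. (3): a standard furthest point sampling loop in which the running nearest-selected distances are rescaled by (e^{γ·s_i} − 1) before picking the next point. The function name, the loop structure, and the choice to start from the highest-score point are our own assumptions; the released code may differ.

```python
import numpy as np

def semantic_guided_fps(points, scores, num_samples, gamma=1.0):
    """S-FPS sketch: FPS whose distances are reweighted by foreground scores (Eq. 3).

    points      : (N, 3) xyz coordinates.
    scores      : (N,) foreground scores s_i in [0, 1] from the segmentation module.
    num_samples : number of keypoints to select.
    gamma       : importance of the score; gamma -> +inf approaches Top-K selection.
    """
    n = points.shape[0]
    selected = np.empty(num_samples, dtype=int)
    # Shortest distance from each point to the already-selected set (d_i), init to +inf.
    nearest = np.full(n, np.inf)
    # Start from the highest-score point (an arbitrary but sensible choice in this sketch).
    selected[0] = int(np.argmax(scores))
    for k in range(1, num_samples):
        diff = points - points[selected[k - 1]]
        nearest = np.minimum(nearest, np.einsum('ij,ij->i', diff, diff))  # squared d_i
        weighted = (np.exp(gamma * scores) - 1.0) * nearest               # Eq. (3)
        selected[k] = int(np.argmax(weighted))
    return selected
```

Note that the weight (e^{γ·s_i} − 1) vanishes for every point when γ = 0, so γ > 0 is required in practice; the ablation in Table 2 varies γ over {1, 2, 3, 100}.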
Fig. 4 Demonstration of the voxel attention module. The input is the N sampled voxel-wise features

The aggregation puts more attention on the crucial voxel-wise features to obtain more comprehensive and important features. The attention mechanism [30] has shown its great power in various visual tasks. Benefiting from the self-attention mechanism, the model can obtain a larger receptive field to summarize the nonlocal features. We note that the voxel-wise features sampled by the voxel query can provide different structure and spatial information from different areas of the point cloud. Hence, we propose an attention-based residual PointNet aggregation module to adaptively aggregate the voxel-wise features from the hotspot area of the point cloud.

3.5.1 Voxel attention module

As shown in Fig. 4, given the feature set F = {f_1, f_2, f_3, ..., f_N} ∈ R^{d_f·N} of the N voxels sampled by the voxel query, the queries Q, keys K, and values V ∈ R^{d_v·N} are generated from F. The attention output V̂_i is then computed with the attention map S as:

V̂_i = Attention(Q_i, K, V) = Σ_{m=1}^{N} S_{im} · V_m.    (8)

Finally, we add the weighted value V̂ ∈ R^{d_v·N} and the original voxel-wise feature F ∈ R^{d_f·N} to represent the attention feature Ṽ ∈ R^{(d_v+d_f)·N}.

3.5.2 Residual PointNet aggregation

Through the self-attention mechanism, each voxel point integrates the weighted features from its surrounding voxels and can adaptively focus on the hotspot area of local structure information. Next, the weighted values Ṽ = {v_1, v_2, v_3, ..., v_N} are fed into a feed-forward network to produce the final feature. Different from the transformer [30], which adopts simple linear layers, we propose a lightweight residual PointNet to forward the weighted values. As illustrated in Fig. 5, given the weighted values Ṽ, a plug-and-play module composed of Convolution1D, Batch Normalization, and ReLU is stacked in a skip-connection way to extract the features of Ṽ. In the end, we adopt a MaxPooling function to produce the representative voxel aggregation feature.
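To make the flow of Sect. 3.5 concrete, here is a minimal NumPy sketch (our own illustration, with assumed projection matrices and channel sizes, not the authors' code) of the voxel attention of Eq. (8) followed by a residual point-wise aggregation with max pooling; Batch Normalization is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def voxel_attention(F, Wq, Wk, Wv):
    """Eq. (8): self-attention over the N voxel features gathered for one keypoint.

    F          : (N, d_f) sampled voxel-wise features.
    Wq, Wk, Wv : assumed projection matrices producing queries, keys, and values.
    Returns    : (N, d_f + d_v) attention feature, i.e. [F ; V_hat] per voxel.
    """
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    S = softmax(Q @ K.T / np.sqrt(K.shape[1]))   # (N, N) attention map
    V_hat = S @ V                                # weighted sum over index m in Eq. (8)
    return np.concatenate([F, V_hat], axis=1)    # combine with the original features

def residual_pointnet_aggregate(V_tilde, W1, W2):
    """Residual PointNet sketch: two point-wise (1x1-conv-like) layers with a skip
    connection, then max pooling over the N voxels to one aggregated feature."""
    h = np.maximum(V_tilde @ W1, 0.0)            # point-wise layer + ReLU (BN omitted)
    h = h @ W2 + V_tilde                         # skip connection (shapes assumed equal)
    return h.max(axis=0)                         # MaxPooling over the voxel dimension
```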
K = { [f_i ; p_i − g_j]^T | ∀ p_i ∈ P, ∀ g_j ∈ G, ‖p_i − g_j‖ < r },    (9)
4.2.2 Sampling strategy setting

In the foreground point sampling module, we place four SA layers to summarize the semantic features with different numbers of keypoints. Before the SA layers, we sample 16384 points from the raw points as input. Because the raw points have no segmentation scores, we adopt the vanilla FPS in the first SA layer to sample 4096 keypoints. Then, in the next three SA layers, we utilize S-FPS to sample 2048, 1024, and 256 keypoints according to the segmentation scores.

4.2.3 Network architecture

As shown in Fig. 1, the point cloud is input in the form of quantized voxels with a resolution of 1600 × 1408 × 40 × 8, and 16384 points are sampled by FPS. First, the voxel-based backbone leverages the 3D voxel CNN with 1×, 2×, 4×, and 8× downsample sizes to produce four voxel-wise feature maps with 16, 32, 64, and 64 output dimensions. After the last 3D CNN layer, the output feature map has a resolution of 200 × 176 × 2 × 128. Once the feature map is reshaped to 200 × 176 × 256, the RPN [18] network is applied to generate 3D proposals. For the voxel query in Sect. 3.4, we set the query range I to 4 and sample 16 neighboring voxels around each query point. Hence, the size of each attention map in the voxel attention module is 16 × 16. In the residual PointNet, the output dimension of the Convolution1D is set to 32. The 16384 raw points are input to the foreground point sampling module, which consists of 4 subsequent SA layers. Each SA layer has 2 multi-scale radii r ∈ {[0.1 m, 0.5 m], [0.5 m, 1.0 m], [1.0 m, 2.0 m], [2.0 m, 4.0 m]} to group neighboring points, which are input to a 2-layer MLP to classify the foreground and background. Then, the sampled keypoints from the SA layers integrate the point-wise feature, voxel-wise feature, and 2D BEV feature to summarize the critical information of the 3D scene. In the end, RoI-grid pooling divides each proposal box into 6 × 6 × 6 grids to refine the proposals with the aggregation of feature-rich keypoints within 0.6 m and 0.8 m radii.

4.2.4 Training and inference

Concretely, the model is trained for 80 epochs with the Adam optimizer, and the learning rate is initially set to 0.01 and updated by the one-cycle policy. We conduct the experiments on one RTX 3090 GPU with batch size 4, which takes about 36 hours. During training, to avoid over-fitting, we leverage data augmentation strategies like [18]. We use GT-sampling to paste some foreground instances from other point cloud scenes into the current training frame. Besides, augmentation operations like random flipping along the X-axis, random rotation with an angle range of [−π/4, π/4], and random scaling with a scaling factor in [0.95, 1.05] are adopted to enhance the generalization and robustness of the model. At the RoI refinement stage, we sample 128 proposals from the RPN and set the threshold θ_fg = 0.55 to classify the foreground and background objects. Half of them are recognized as foreground objects if their 3D intersection over union (IoU) with the ground-truth box is over θ_fg, while the lower ones are determined to be the background.

At the inference stage, we employ non-maximum suppression (NMS) on the 3D proposals with a threshold of 0.7 to filter out the top-100 proposals as the input of the refinement module. Then, in RoI-grid pooling, the proposals are further refined with the feature-rich keypoints. Subsequently, NMS is adopted again with a threshold of 0.1 to remove redundant predictions.
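The settings in Sects. 4.2.2–4.2.4 can be collected into a single configuration sketch. The following Python dictionary is only our summary of the values stated above (the key names are our own), not a file from the released code.

```python
import math

# Hyperparameters as described in Sects. 4.2.2-4.2.4 (key names are illustrative).
PV_RCNN_PP_CONFIG = {
    "input": {"num_raw_points": 16384, "voxel_grid": (1600, 1408, 40, 8)},
    "sampling": {"sa_keypoints": (4096, 2048, 1024, 256),   # FPS first, then S-FPS x3
                 "sa_loss_weights": (0.1, 0.01, 0.001, 0.0001)},
    "voxel_query": {"query_range": 4, "num_neighbors": 16},
    "residual_pointnet": {"conv1d_out_channels": 32},
    "roi_grid_pooling": {"grid_size": (6, 6, 6), "radii_m": (0.6, 0.8)},
    "training": {"epochs": 80, "optimizer": "Adam", "lr": 0.01,
                 "lr_schedule": "one-cycle", "batch_size": 4,
                 "fg_iou_threshold": 0.55,
                 "augment": {"flip_axis": "x",
                             "rotation_range": (-math.pi / 4, math.pi / 4),
                             "scaling_range": (0.95, 1.05),
                             "gt_sampling": True}},
    "inference": {"proposal_nms_threshold": 0.7, "top_k_proposals": 100,
                  "final_nms_threshold": 0.1},
}
```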
4.3 Results on KITTI

4.3.1 Evaluation metrics

We follow the evaluation criteria that the KITTI benchmark provides to ensure accuracy and fairness. The IoU threshold is set to 0.7 for Car and 0.5 for both Pedestrian and Cyclist. The results reported on the official KITTI test server are calculated by the average precision (AP) with 40 recall positions to compare with the state-of-the-art methods.

We demonstrate the test results returned from the KITTI test server in Table 1 with a comparison of the 3D detection performance. For the most critical Car detection, we surpass PV-RCNN by 2.55%, 0.20%, and 0.07% on the easy, moderate, and hard levels. It is worth noting that our method improves greatly on the Cyclist class, surpassing PV-RCNN by 4.89%, 5.09%, and 4.36% on the three levels. However, our method achieves inferior results on the Pedestrian class compared with the state-of-the-art methods; we think the segmentation module has limited ability to classify small foreground objects like pedestrians, and S-FPS tends to select keypoints on big foreground objects like the Car and Cyclist.

4.4 Ablation studies

To comprehensively verify the effectiveness of our method, we conduct ablation studies on the foreground point sampling, voxel query, voxel attention module, and residual PointNet, respectively.

4.4.1 Effectiveness of foreground point sampling

In this part, we test the influence of different numbers of foreground points on detection precision, as shown in Table 2. It seems feasible that the more foreground points get caught, the more accurate the result will be. Therefore, we set γ in Eq. 3 to 1, 2, 3, and 100 to increase the number of sampled foreground points. However, it turns out that when γ becomes larger, the performance decreases instead. The reason is that sampling excessive foreground points with few background points (like the Top-K sampling strategy) makes it hard for the model to have a global perception of the whole 3D scene, which limits its ability to precisely locate the correct objects. As demonstrated in Fig. 7, when γ = 1, the sampled keypoints can focus on foreground objects and preserve proper background points at the same time.

Table 2 Performance comparison on the KITTI val split with different γ controlling the number of foreground keypoints

Sampling strategy    Easy (%)   Mod (%)   Hard (%)
FPS                  89.02      83.59     78.49
S-FPS (γ = 1)        89.65      84.34     79.11
S-FPS (γ = 2)        89.75      84.15     79.09
S-FPS (γ = 3)        89.69      84.11     79.02
S-FPS (γ = 100)      89.06      79.06     76.74

The 3D average precision is calculated by 11 recall positions for the Car class. Bold indicates the best results.

4.4.2 Effectiveness of voxel query
Fig. 7 Visualization comparison of different point sampling strategies. a Raw point cloud scene. b Vanilla FPS samples too many background points, while points on foreground objects are too sparse. c S-FPS (γ = 1) can sample more foreground points and retains proper background information. d Almost only foreground points are sampled by S-FPS (γ = 100), which loses the perception of the whole 3D scene
Table 3 Performance and FPS comparison between ball query and voxel query

Query method    Easy (%)   Mod (%)   Hard (%)   FPS (Hz)
Ball query      89.42      83.60     82.34      12.5
Voxel query     89.60      83.58     82.45      15.15

The 3D AP is computed on the KITTI val split by recall 11 positions for the Car class. Bold indicates the best results.

As shown in Table 4, the performance is further improved by the attention mechanism with the residual PointNet. Figure 8 shows the voxel-wise attention features encoded into keypoints, where almost the whole object is highly focused on and not only local spatial features are learned, indicating the importance of the attention-based residual PointNet module.

4.5 Visualization and discussion

Figure 9 shows the visualization results on the KITTI dataset. Our model performs stably and makes accurate detections in complex situations, especially in Fig. 9B. Moreover, as shown in Table 1, the higher average precision on Car and Cyclist proves the effectiveness of our proposed methods. However,
Fig. 8 Visualization of the attention feature induced by the attention-based residual PointNet module. The aggregated features cover the whole object-related regions (red points) rather than small local parts of the object
Table 4 Performance demonstration of attention-based residual PointNet aggregation

S-FPS   Voxel query   VAM   Residual PointNet   Easy (%)   Mod (%)   Hard (%)
×       ×             ×     ×                   89.02      83.59     78.49
✓       ✓             ×     ×                   89.60      84.08     78.98
✓       ✓             ✓     ×                   89.64      84.27     79.07
✓       ✓             ✓     ✓                   89.65      84.34     79.11

The 3D AP is computed on the KITTI val split by recall 11 positions for the Car class. Bold indicates the best results.
Fig. 9 Visualization results on KITTI (FOV). Detected cars, pedestrians, and cyclists are marked by green boxes, blue boxes, and yellow boxes, respectively
The proposed voxel query speeds up the interaction between keypoints and voxels to efficiently group the neighboring voxel-wise features. The proposed attention-based residual PointNet abstracts more fine-grained 3D information from nearby voxels, providing more comprehensive features for the succeeding refinement. Extensive experiments on the KITTI dataset demonstrate that our proposed semantic-guided voxel-to-keypoint detector precisely summarizes the valuable information from pivotal areas in the point cloud and improves performance compared with previous state-of-the-art methods.

Acknowledgements This work was supported by the 14th Five-Year Planning Equipment Pre-Research Program (No. JZX7Y20220301001801), by the National Natural Science Foundation of China (No. 62172218), by the Free Exploration of Basic Research Project, Local Science and Technology Development Fund Guided by the Central Government of China (No. 2021Szvup060), and by the General Program of Natural Science Foundation of Guangdong Province (No. 2022A1515010170).

Data Availability Statement The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2016)
2. Zheng, C., Shi, D., Yan, X., Liang, D., Wei, M., Yang, X., Guo, Y., Xie, H.: Glassnet: label decoupling-based three-stream neural network for robust image glass detection. Comput. Graph. Forum 41(1), 377–388 (2022)
3. Wei, Z., Liang, D., Zhang, D., Zhang, L., Geng, Q., Wei, M., Zhou, H.: Learning calibrated-guidance for object detection in aerial images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 15, 2721–2733 (2022)
4. Luo, C., Yang, X., Yuille, A.: Exploring simple 3d multi-object tracking for autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10488–10497 (2021)
5. Ji, C., Liu, G., Zhao, D.: Stereo 3d object detection via instance depth prior guidance and adaptive spatial feature aggregation. Vis. Comput. 1–12 (2022)
6. Wang, Z., Xie, Q., Wei, M., Long, K., Wang, J.: Multi-feature fusion votenet for 3d object detection. ACM Trans. Multim. Comput. Commun. Appl. 18(1), 1–17 (2022)
7. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. IEEE (2012)
8. Xia, R., Chen, Y., Ren, B.: Improved anti-occlusion object tracking algorithm using unscented Rauch–Tung–Striebel smoother and kernel correlation filter. J. King Saud Univers. Comput. Inf. Sci. 6008–6018 (2022)
9. Zhang, J., Feng, W., Yuan, T., Wang, J., Sangaiah, A.K.: Scstcf: spatial-channel selection and temporal regularized correlation filters for visual tracking. Appl. Soft Comput. 118, 108485 (2022)
10. Zhang, J., Sun, J., Wang, J., Li, Z., Chen, X.: An object tracking framework with recapture based on correlation filters and Siamese networks. Comput. Electr. Eng. 98, 107730 (2022)
11. Yan, C., Salman, E.: Mono3d: open source cell library for monolithic 3-d integrated circuits. IEEE Trans. Circuits Syst. I Regul. Pap. 65(3), 1075–1085 (2017)
12. Reading, C., Harakeh, A., Chae, J., Waslander, S.L.: Categorical depth distribution network for monocular 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8555–8564 (2021)
13. Chen, Y., Tai, L., Sun, K., Li, M.: Monopair: monocular 3d object detection using pairwise spatial relationships. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12093–12102 (2020)
14. Li, B., Ouyang, W., Sheng, L., Zeng, X., Wang, X.: Gs3d: an efficient 3d object detection framework for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1019–1028 (2019)
15. Chen, X., Kundu, K., Zhu, Y., Ma, H., Fidler, S., Urtasun, R.: 3d object proposals using stereo imagery for accurate object class detection. IEEE Trans. Pattern Anal. Mach. Intell. 40(5), 1259–1272 (2017)
16. Li, P., Chen, X., Shen, S.: Stereo r-cnn based 3d object detection for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7644–7652 (2019)
17. Zhou, Y., Tuzel, O.: Voxelnet: end-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499 (2018)
18. Yan, Y., Mao, Y., Li, B.: Second: sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)
19. Ye, M., Xu, S., Cao, T.: Hvnet: hybrid voxel network for lidar based 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1631–1640 (2020)
20. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705 (2019)
21. Shi, S., Wang, Z., Wang, X., Li, H.: Part-a2 net: 3d part-aware and aggregation neural network for object detection from point cloud. arXiv preprint arXiv:1907.03670 (2019)
22. Shi, S., Wang, X., Li, H.: Pointrcnn: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–779 (2019)
23. Shi, W., Rajkumar, R.: Point-gnn: graph neural network for 3d object detection in a point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1711–1719 (2020)
24. Yang, Z., Sun, Y., Liu, S., Jia, J.: 3dssd: point-based 3d single stage object detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11040–11048 (2020)
25. Xie, Q., Lai, Y.-K., Wu, J., Wang, Z., Lu, D., Wei, M., Wang, J.: Venet: voting enhancement network for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3712–3721 (2021)
26. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5105–5114 (2017)
27. Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H.: Pv-rcnn: point-voxel feature set abstraction for 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10529–10538 (2020)
28. Chen, C., Chen, Z., Zhang, J., Tao, D.: Sasa: semantics-augmented set abstraction for point-based 3d object detection. arXiv e-prints, 2201 (2022)
29. Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., Li, H.: Voxel r-cnn: towards high performance voxel-based 3d object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1201–1209 (2021)
30. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv e-prints, 1706 (2017)
31. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
32. Ma, X., Qin, C., You, H., Ran, H., Fu, Y.: Rethinking network design and local geometry in point cloud: a simple residual mlp framework. arXiv preprint arXiv:2202.07123 (2022)
33. Zarzar, J., Giancola, S., Ghanem, B.: Pointrgcn: graph convolution networks for 3d vehicles detection refinement. arXiv preprint arXiv:1911.12236 (2019)
34. Zhang, Y., Huang, D., Wang, Y.: Pc-rgnn: point cloud completion and graph neural network for 3d object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 3430–3437 (2021)
35. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Trans. Neural Netw. 20(1), 61–80 (2008)
36. Yuan, W., Khot, T., Held, D., Mertz, C., Hebert, M.: Pcn: point completion network. In: 2018 International Conference on 3D Vision (3DV), pp. 728–737. IEEE (2018)
37. Zheng, W., Tang, W., Chen, S., Jiang, L., Fu, C.-W.: Cia-ssd: confident iou-aware single-stage object detector from point cloud. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 3555–3562 (2021)
38. Zheng, W., Tang, W., Jiang, L., Fu, C.-W.: Se-ssd: self-ensembling single-stage object detector from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14494–14503 (2021)
39. Hinton, G., Vinyals, O., Dean, J., et al.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
40. Yang, Z., Sun, Y., Liu, S., Shen, X., Jia, J.: Std: sparse-to-dense 3d object detector for point cloud. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1951–1960 (2019)
41. He, C., Zeng, H., Huang, J., Hua, X.-S., Zhang, L.: Structure aware single-stage 3d object detection from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11873–11882 (2020)
42. Jiang, T., Song, N., Liu, H., Yin, R., Gong, Y., Yao, J.: Vic-net: voxelization information compensation network for point cloud 3d object detection. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 13408–13414. IEEE (2021)
43. Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3d object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915 (2017)
44. Ku, J., Mozifian, M., Lee, J., Harakeh, A., Waslander, S.L.: Joint 3d proposal generation and object detection from view aggregation. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8. IEEE (2018)
45. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum pointnets for 3d object detection from rgb-d data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927 (2018)
46. Huang, T., Liu, Z., Chen, X., Bai, X.: Epnet: enhancing point features with image semantics for 3d object detection. In: European Conference on Computer Vision, pp. 35–52. Springer (2020)
47. Vora, S., Lang, A.H., Helou, B., Beijbom, O.: Pointpainting: sequential fusion for 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4604–4612 (2020)
48. Yoo, J.H., Kim, Y., Kim, J., Choi, J.W.: 3d-cvf: generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In: European Conference on Computer Vision, pp. 720–736. Springer (2020)
49. Pang, S., Morris, D., Radha, H.: Fast-clocs: fast camera-lidar object candidates fusion for 3d object detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 187–196 (2022)
50. Zhang, Y., Chen, J., Huang, D.: Cat-det: contrastively augmented transformer for multi-modal 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 908–917 (2022)
51. Wang, Z., Xie, Q., Wei, M., Long, K., Wang, J.: Multi-feature fusion votenet for 3d object detection. ACM Trans. Multimed. Comput. Commun. Appl. 18(1), 1–17 (2022)
52. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
53. Shu, X., Yang, J., Yan, R., Song, Y.: Expansion-squeeze-excitation fusion network for elderly activity recognition. IEEE Trans. Circuits Syst. Video Technol. 1–1 (2022)
54. Shu, X., Zhang, L., Qi, G.-J., Liu, W., Tang, J.: Spatiotemporal co-attention recurrent neural networks for human-skeleton motion prediction. IEEE Trans. Pattern Anal. Mach. Intell. 44(6), 3300–3315 (2021)
55. Tang, J., Shu, X., Yan, R., Zhang, L.: Coherence constrained graph lstm for group activity recognition. IEEE Trans. Pattern Anal. Mach. Intell. 44(2), 636–647 (2022)
56. Li, P., Chen, Y.: Research into an image inpainting algorithm via multilevel attention progression mechanism. Math. Probl. Eng. 2022, 1–12 (2022)
57. Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V.: Point transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16259–16268 (2021)
58. Guo, M.-H., Cai, J.-X., Liu, Z.-N., Mu, T.-J., Martin, R.R., Hu, S.-M.: Pct: point cloud transformer. Comput. Vis. Media 7(2), 187–199 (2021)
59. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
60. Noh, J., Lee, S., Ham, B.: Hvpr: hybrid voxel-point representation for single-stage 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14605–14614 (2021)
61. He, Q., Wang, Z., Zeng, H., Zeng, Y., Liu, Y.: Svga-net: sparse voxel-graph attention network for 3d object detection from point clouds. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 870–878 (2022)
62. Zhang, Y., Hu, Q., Xu, G., Ma, Y., Wan, J., Guo, Y.: Not all points are equal: learning highly efficient point-based detectors for 3d lidar point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18953–18962 (2022)
63. Team, O.D.: OpenPCDet: an open-source toolbox for 3D object detection from point clouds. https://github.com/open-mmlab/OpenPCDet (2020)