0% found this document useful (0 votes)

28 views

Wu, Gu Et Al 2022 - PV-RCNN++ Semantical Point-Voxel Feature Interaction For 3D Object Detection

This document summarizes a research paper that proposes PV-RCNN++, a 3D object detection network that uses semantic point-voxel feature interaction. It introduces 1) a semantic segmentation module to sample more discriminative foreground keypoints, 2) a fast voxel query to efficiently interact point and voxel features around keypoints, and 3) an attention-based PointNet module to aggregate neighboring voxel features. Experiments on the KITTI dataset show PV-RCNN++ achieves state-of-the-art or comparable performance for car, pedestrian, and cyclist detection.

Uploaded by

bish

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

28 views

Wu, Gu Et Al 2022 - PV-RCNN++ Semantical Point-Voxel Feature Interaction For 3D Object Detection

Uploaded by

bish

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

The Visual Computer

https://doi.org/10.1007/s00371-022-02672-2

ORIGINAL ARTICLE

PV-RCNN++: semantical point-voxel feature interaction for 3D object

detection
Peng Wu1 · Lipeng Gu1 · Xuefeng Yan1 · Haoran Xie2 · Fu Lee Wang3 · Gary Cheng4 · Mingqiang Wei5

Accepted: 6 September 2022

Abstract
Large imbalance often exists between the foreground points (i.e., objects) and the background points in outdoor LiDAR point
clouds. It hinders cutting-edge detectors from focusing on informative areas to produce accurate 3D object detection results.
This paper proposes a novel object detection network by semantical point-voxel feature interaction, dubbed PV-RCNN++.
Unlike most of existing methods, PV-RCNN++ explores the semantic information to enhance the quality of object detection.
First, a semantic segmentation module is proposed to retain more discriminative foreground keypoints. Such a module will
guide our PV-RCNN++ to integrate more object-related point-wise and voxel-wise features in the pivotal areas. Then, to
make points and voxels interact efﬁciently, we utilize voxel query based on Manhattan distance to quickly sample voxel-wise
features around keypoints. Such the voxel query will reduce the time complexity from O(N) to O(K), compared to the ball
query. Further, to avoid being stuck in learning only local features, an attention-based residual PointNet module is designed to
expand the receptive ﬁeld to adaptively aggregate the neighboring voxel-wise features into keypoints. Extensive experiments
on the KITTI dataset show that PV-RCNN++ achieves 81.60%, 40.18%, 68.21% 3D mAP on Car, Pedestrian, and Cyclist,
achieving comparable or even better performance to the state-of-the-arts.

Keywords PV-RCNN++ · 3D object detection · Point-voxel feature interaction · Semantic segmentation · Voxel query

1 Introduction
B Xuefeng Yan
[email protected] Object detection in both 2D and 3D ﬁelds [1–6] is increas-
Peng Wu ingly important with the development of autonomous driving
[email protected] [7], robot systems, and virtual reality. It also provides assis-
Lipeng Gu tance and support for other tasks (e.g., object tracking
[email protected] [8–10]). Much progress has been made in 3D object detec-
Haoran Xie tion via various data representation (e.g., monocular images
[email protected] [11–14], stereo cameras [15,16], and LiDAR point clouds).
Fu Lee Wang Compared to 3D object detection from 2D images, LiDAR
[email protected] point cloud casts a critical role in detecting 3D objects as
Gary Cheng it contains relatively precise depth and 3D spatial structure
[email protected] information.
Mingqiang Wei LiDAR-based 3D object detectors can be roughly grouped
[email protected] into two prevailing categories: voxel-based [17–21] and
point-based [22–25]. The former discretizes points into reg-
1 College of Computer Science and Technology, Nanjing
University of Aeronautics and Astronautics, Nanjing 211106,
China 4 Department of Mathematics and Information Technology,
2 Department of Computing and Decision Sciences, Lingnan The Education University of Hong Kong, Ting Kok 999077,
University, Lingnan 999077, Hong Kong, China Hong Kong, China
3 5 Shenzhen Research Institute, Nanjing University of
School of Science and Technology, Hong Kong Metropolitan
University, Ho Man Tin 999077, Hong Kong, China Aeronautics and Astronautics, Shenzhen 518063, China

123
P. Wu et al.

ular grids for the convenience of the 3D sparse convolutional gives more preference to positive points, making more fore-
neural network (CNN). Then, the voxelized feature map can ground points retained in the SA layers. Hence, the sampled
be compressed to Bird’s Eye View (BEV) which is fed to points in the SA layers can act as the pivotal point-wise rep-
Region Proposal Network (RPN) [1,18] to produce predic- resentation for the succeeding operation.
tions. On the contrary, the point-based ones mainly adopt After obtaining the discriminative keypoints, the chal-
PointNet++ [26] as the backbone, which take raw points as lenge would be how to efficiently integrate the voxel-wise
input, and abstract sets of point features through an itera- and point-wise features via keypoints. We seek to (1) to
tive sampling-and-grouping operation. Different from only speed up the interaction between points and voxels and (2) to
the single voxel-based and point-based methods, PV-RCNN effectively summarize the 3D information from voxel-wise
[27] explores the interaction between point-wise and voxel- feature. Specifically, 3D sparse convolution is first adopted
wise features. Specially, PV-RCNN deeply integrates both to encode the voxelized point cloud. Then, we propose a fast
3D Voxel CNN and PointNet-based Set Abstraction (SA) voxel-to-point interaction module to efficiently sample and
to enhance the ability of feature learning. To be concrete, a group neighboring voxel-wise feature around keypoints. The
Voxel Set Abstraction (VSA) is proposed to encode voxel- existing query strategy, ball query [27], consumes too much
wise features of different scales through sampled keypoints time to compute the Euclidean distance from every voxel
by furthest point sampling [26] (FPS). Through coordinate to keypoints to identify whether the voxel is within a given
transform and projection, VSA also concatenates the features radius or not. Therefore, motivated by [29], we regard key-
of BEV and raw point features into keypoints to have a more points as voxels, which are regularly arranged in 3D space,
comprehensive understanding of 3D scenes. and then a voxel query strategy based on Manhattan distance
Nevertheless, we observe that the large imbalance between is utilized to quickly identify the neighboring voxel-wise fea-
small informative areas containing 3D objects and large ture of each keypoint. Compared with ball query, our voxel
redundant background areas exists in outdoor LiDAR point query greatly reduces the time consumption from O(N ) to
clouds. It poses a challenge for accurate 3D object detection. O(K ), where N is the total number of voxels and K means
Generally, the point cloud obtained by LiDAR covers a long the number of neighboring voxels around keypoints.
range of hundreds of meters, where only several small cars An Attention-based [30] Residual PointNet Module is
are captured and the rest are numerous background points. proposed to abstract the neighboring voxel-wise features to
However, in PV-RCNN [27], the whole 3D scene is sum- summarize the multi-scale 3D information. We apply self-
marized through a small number of sampled keypoints by attention mechanism on voxel set of each keypoint to produce
FPS. When selecting keypoints, FPS tends to choose distant corresponding attention maps, allowing each voxel to have
points to evenly cover the whole point cloud, which causes a more comprehensive perception field containing more 3D
excessive unimportant background points retained and many structure and scene information of other nearby voxels. Last,
valuable foreground points discarded. Consequently, the per- we introduce a lightweight residual [31,32] PointNet mod-
formance of PV-RCNN is limited largely due to insufficient ule to further extract and aggregate the refined voxel-wise
features provided by the foreground objects. Therefore, we feature.
consider that if there is any prior knowledge that can lead The main contributions are summarized as follows:
the detector to focusing on the pivotal foreground objects to
extract more valuable features. The inspiration is to leverage
the result of point cloud semantic segmentation as the prior • We introduce a semantic-guided keypoint sampling mod-
knowledge to guide the detector. ule to retain more valuable foreground points from the
To this end, we present a novel 3D object detection net- point cloud, which helps the detector focus on small piv-
work via semantical point-voxel feature interaction (termed otal areas containing 3D objects.
PV-RCNN++). First, we introduce a lightweight and fast • We utilize voxel query based on Manhattan distance
foreground point sampling head meticulously modified from to quickly gather the neighboring voxel-wise features
PointNet++ [26] to select proper object-related keypoints. around keypoints, reducing the time consumption com-
We remove the feature propagation (FP) layer in PointNet++ pared to ball query and improving the efficiency of
to avoid the heavy memory usage and time consumption point-voxel interaction.
[24]. We only remain the SA layers to produce more valu- • We propose an attention-based residual PointNet module,
able keypoints. Concretely, in each SA layer, we adopt a which allows each voxel to have an adaptive and nonlocal
binary segmentation module to classify the foreground and summary of the neighborhood to achieve more accurate
background points. Then, inspired by [28], we adopt a novel predictions.
sampling strategy, semantic-guided furthest point sampling • Extensive experiments and results show that our proposed
(S-FPS), taking segmentation scores as guidance to sample method achieves comparable performance on common
and group representative points. Different from FPS, S-FPS 3D object detection benchmark, KITTI dataset [7].

123
PV-RCNN++: semantical point-voxel feature interaction for 3D object detection

2 Related work clouds with the help of PointNet-like networks. Then at

the second stage, a local VFE is utilized to extract the
2.1 Point-based method voxel-wise feature to help box prediction. At the basis of
SECOND [18], SA-SSD [41] adds a detachable auxiliary
Generally, point-based methods [22–24,33,34] mainly rely network to transform voxel-wise tensors into the point-wise
on PointNet++ [26] and Graph Neural Network [35]. To feature and then leverages point-wise structure information
preserve the position information, point-based methods oper- to help train the backbone network. PV-RCNN [27] designs
ate directly on the point cloud to extract point-wise feature. a voxel set abstraction module to integrate the multi-scale
Inspired by Faster R-CNN [1], PointRCNN [22] utilizes voxel-wise and point feature through sampled keypoints.
PointNet++ to design a bottom-up 3D Region Proposal Net- VIC-Net [42] presents a two-branch network, which con-
work to generate 3D proposal boxes which are then refined sists of a point branch for geometry detail extraction and
in canonical coordinate at the second stage. To reduce the a voxel branch for efficient proposals generation. In this
time-consuming operation of the feature propagation (FP) paper, we propose a semantic-guided point-voxel interaction
layer in PointNet++, 3DSSD [24] removes the FP layer and method, PV-RCNN++, to efficiently integrate the voxel-wise
proposes a feature-guided sampling method to lead the SA and point-wise feature, providing a more comprehensive and
layer to select more discriminative points. Different from discriminative feature for accurate prediction.
common point-based methods, PC-RGNN [34] proposes a
graph-based point cloud completion [36] module to capture 2.4 LiDAR-camera fusion method
the geometry clues of 3D proposals, which provide more
complete structure and shape information for refinement. Recently, many multi-modal works [43–51] have been pro-
Although point-based methods have the potential to achieve posed to explore the fusion of different sensor data for 3D
more accurate detection, the problem of time complexity and object detection. MV3D [43] is a pioneering work to directly
memory consumption cannot be settled properly, which lim- combine the feature from point cloud BEV map, front view
its their further development. map, and 2D images to locate objects. EPNet [46] adopts a
refined way in which each point in the point cloud is fused
2.2 Voxel-based method with the corresponding image pixel to obtain more accurate
detection. However, all these methods inevitably consume a
Voxel-based methods [17–21,37,38] adopt the regular vox- lot of computation time and memory. Moreover, the align-
elized point cloud as data representation. VoxelNet [17] first ment error between points and pixels will have a negative
proposes a generic single-stage network for 3D box regres- impact on the accuracy.
sion. In VoxelNet, voxel feature encoder (VFE) is designed
to divide unordered points into grids and utilizes simplified 2.5 Attention mechanism
PointNet to produce a representative feature of each voxel.
3D CNN is then applied to extract the feature of the whole In the past several years, attention mechanism [30] has
3D scene. CIA-SSD [37] proposes a spatial-semantic fea- shown its great power in many 2D visual tasks [52–56] as
ture aggregation module to extract low-level spatial features it can effectively capture nonlocal information for feature
and high-level semantic features of the BEV feature map extraction. Nowadays, some works [57,58] apply attention
for more accurate proposals. SE-SSD [38] utilizes knowl- mechanism on 3D point cloud processing (e.g., point cloud
edge distillation [39] to design a pair of teacher and student segmentation) to obtain outperforming results. However,
SSDs. An effective IoU-based matching strategy is proposed these methods only explore the attentive relation among
to filter soft targets from the teacher and formulate a con- points while ignoring the potential effects on voxels. There-
sistency loss to align student predictions with them. Despite fore, we introduce an attention-based voxel aggregation
that voxel-based methods improve the speed and efficiency module to adaptively extract the spatial information in the
of detection, they pay a price for degrading the localization 3D scenes for more accurate 3D object detection.
accuracy due to the loss of point-wise feature.

2.3 Point-voxel hybrid method 3 Methodology

To achieve both efﬁciency and accuracy, point-voxel hybrid 3.1 Overview

methods [27,40–42] try to take advantage of the efﬁcient
computation of voxel-based backbone and accurate posi- Beyond previous wisdom, we argue that (1) more foreground
tion information from raw point clouds. STD [40] designs points are beneﬁcial to capturing pivotal structure and posi-
a spherical anchor to generate proposals from raw point tion feature; (2) faster query strategy is needed to relieve

123
P. Wu et al.

the time-consuming interaction between points and voxels; redesign the PointNet++ to act as our foreground point sam-
and (3) voxel-wise feature should have a more comprehen- pling module. As mentioned in 3D-SSD [24], although the
sive perception of neighboring structure feature instead of FP layer in PointNet++ [26] can broadcast the semantic fea-
local convolution feature. To this end, we present our PV- ture to all points to improve segmentation precision, it takes
RCNN++: semantical point-voxel feature interaction for 3D much time to upsample points. Therefore, capturing more
object detection, which consists of the following modules: foreground points in the SA layer would be a better choice.
(1) a binary segmentation module is introduced to guide FPS We remove the FP layer in PointNet++ and adopt a semantic-
to select more object-related keypoints; (2) voxel query based guided sampling strategy in SA layer. We first feed the raw
on Manhattan distance replaces the ball query to quickly points P with feature F into the segmentation module to com-
sample voxel-wise feature; (3) an attention-based residual pute the score S. Then with the guidance of S, we employ the
PointNet is designed to adaptively fuse the neighboring modified sampling strategy to sample K keypoints in the SA
voxel-wise feature to summarize the nonlocal 3D structure layer. The specific process is shown in Fig. 2 and described
information. Our backbone is illustrated as Fig. 1. Given as follows.
point cloud P = { pi | i = 1, 2, 3 . . . N } ⊆ R3+d as input,
where N = 16384 and d denotes the point feature (e.g., 3.3.1 Binary segmentation module
reflection intensity), our goal is to predict the center loca-
tion (x, y, z), box size (l, w, h) and rotation angle θ around To avoid bringing high computation, we adopt a 2-layer
Z −axis of each object. MLP as binary segmentation module to directly obtain the
score of point. Concretely, given point feature set Fk =
3.2 Voxel encoder and 3D region proposal network { f 1dk , f 2dk , f 3dk , . . . f Ndkk }, where dk denotes the d-dimension
of point feature f idk fed into the k-th SA layer, score si ∈
First, the unordered point cloud is transformed to uniformed [0, 1] of each point is defined as:
3D grids with voxel size v, and each grid contains k points.
Then, the mean voxel feature encoder (VFE) [17] is adopted si = (MLP k ( f idk )) (1)
to compute the mean feature of k points as the representative
feature of the grid. where means the sigmoid function; MLP k represents the
segmentation module in k-th SA layer. The real segmentation
3.2.1 3D voxel CNN labels can be obtained from ground-truth boxes. We define
the points inside the ground-truth boxes as foreground points
Through mean VFE, the point cloud shapes as a L × while the outside as background points. Therefore the loss
W × H feature volume. We utilize 3D sparse convolutional of the segmentation module can be calculated as:
neural networks [18] to encode the feature volumes with
m Nk
1×, 2×, 4×, 8× downsample sizes. All four downsampled λk
voxel-wise feature maps are preserved for the subsequent Lkseg = · BC E(sik , ŝik ) (2)
Nk
k=1 i=1
point-voxel interaction module.
where sik denotes the predicted score and ŝik is the ground-
3.2.2 3D region proposal network truth label (0 for the background and 1 for the foreground) of
the i-th point in the k-th SA layer. BC E is the binary cross-
After 3D voxel encoder, the forth 8× downsampled feature entropy loss function. m means the number of SA layers,
map is compressed to 2D BEV feature map which is of L8 × W8 which is set to 4. Nk ∈ {4096, 2048, 1024, 256} is the total
resolution. We utilize anchor-based methods [1] on the BEV number of the input points, and λk ∈ {0.1, 0.01, 0.001, 0.0001}
map to generate 3D anchor boxes with the average size of is the loss weight of k-th SA layer.
each class. Considering the rotation angle around Z -axis, 0
and π2 degree are set for each anchor pixel. Therefore, the 3.3.2 Semantic-guided further point sampling
whole BEV map produces 3 × 2 × L8 × W8 proposals in total
for three classes: Car, Pedestrian, and Cyclist. Since we have obtained the point scores from the binary
segmentation module, it means that the possible foreground
3.3 Foreground point sampling points have been masked. The easiest way to select fore-
ground points is to use Top-K scores as guidance. However,
Our motivation is to retain more foreground points to cap- as observed from Fig. 7 and Table 2, it will decrease the per-
ture more valuable spatial and position information while ceptual ability of the whole 3D scene if too many foreground
bringing no burden of time consumption, so we meticulously points are selected and few background points are involved.

123
PV-RCNN++: semantical point-voxel feature interaction for 3D object detection

Foreground Point Sampling

raw point(N) Segmentaon Segmentaon Segmentaon sampled key point(K)

FPS
S-FPS(4096) S-FPS(2048) S-FPS(1024)

Group Group Group

Aggregaon Aggregaon Aggregaon

VFE feature concatenaon

3D Voxel CNN Aenon-based Residual PointNet

K keypoint features

…
Voxel Query Keypoint Re-Weighng Module

3D RPN RoI-grid pooling

Proposals reﬁned
…
Classiﬁcaon

Fig. 1 An overview of our PV-RCNN++. The point cloud is first fed points, the voxel-wise feature, point-wise feature, and the BEV feature
into 3D voxel encoder to produce 3D proposals. Then, the foreground are concatenated to be fed into RoI-grid pooling to refine the proposals
point sampling module selects more valuable object-related keypoints to produce more accurate 3D boxes
through the modified SA layers. Last, according to the sampled key-

Binary Segmentaon S-FPS (K points) Group & Aggregaon

CAR
Raw points or from last SA layer Points with scores Selected key points Output key points with features

Fig. 2 The structure of the modiﬁed set abstract (SA) layer in the fore- better understanding of the 3D object-related information. Last, neigh-
ground point sampling module. Points from raw data or the previous boring points around keypoints are grouped and aggregated to produce
SA layer are fed to the binary segmentation module to obtain the scores. the ﬁnal keypoint feature, which is fed to the next SA layer
Then, S-FPS is adopted to select more foreground keypoints to have a

Motivated by [28], we modified the furthest point sampling vanilla FPS, the point with the longest distance is picked up
strategy by adding a score weight called S-FPS. Keeping the as the furthest point. However, in S-FPS, we recalculate each
basic flow of FPS unchanged, we leverage scores of unse- distance d̂i with score si as the following formula:
lected points to rectify the distances to the selected points.
Given point coordinate set P = { p1 , p2 , p3 , . . . p N } ⊆ R3 d̂i = (eγ si − 1) · di , (3)
and the corresponding score set S = {s1 , s2 , s3 , . . . s N }, the
distance set D = {d1 , d2 , d3 , . . . d N } is the shortest distances where γ is an adjustable parameter deciding the importance
of N unselected points to the already selected points. In the of the score, which is set to 1 by default. When γ is fixed,

123
P. Wu et al.

the closer score si is to 1, the greater d̂i is. Hence, S-FPS can Voxel grid Voxel center Query point
select more positive points (i.e., foreground points) compared
to FPS. Obviously, S-FPS will be similar to Top-K algorithm
if γ = +∞. The speciﬁc process of S-FPS is shown in
Algorithm 1.

Algorithm 1 Semantic-guided furthest point sampling. N is

the number of input points. M is the number of keypoints to
be sampled.
Data: Coordinates of Points P = { p1 , p2 , . . . p N };
Scores of Points S = {s1 , s2 , . . . s N }
Result: Sampled keypoints K = {k1 , k2 , . . . k M } Ball Query Voxel Query
1: define an empty points set K
Fig. 3 Demonstration of ball query and voxel query in 2D view (per-
2: define a distance list D = {d1 , d2 , . . . d N } with all +∞
formed in 3D space)
3: define a flag list F = { f 1 , f 2 , . . . f N } with all 0
4: for i ← 1 to M do
5: if i = 1 then
6: ki ← argmax(S) [xmin , xmax , ymin , ymax , z min , z max ], and 3D CNN downsam-
7: else ple stride ck , the voxel coordinate [xv , yv , z v ] in the k-th 3D
8: for k ← 0 to N do
voxel feature map can be calculated as follows:
9: if f k = 0 then
10: dk ← (eγ sk − 1) · dk
11: end if x p − xmin y p − ymin z p − z min
12: end for [xv , yv , z v ] = , , , (4)
vx · ck v y · ck vz · ck
13: ki ← argmax(D)
14: end if
15: add ki to K, f ki ← 1 3.4.2 Voxel query
16: for j ← 1 to N do
17: d j ← min(d j , p j − pki )
18: end for Compared to ball query, voxel query utilizes a positive inte-
19: end for ger I (default I = 4) as query range to produce offsets
20: return K {x , y , z | ∈ [−I , +I ]} to access the neighbor-
ing voxels around center voxel [xv , yv , z v ]. To determine
whether a neighboring voxel is within the query radius, a
distance radius threshold is set to compute the Manhattan
3.4 Faster neighboring voxel group distance. To be concrete, given the center voxel coordinate
[xv , yv , z v ] and queried voxel coordinate [xq , yq , z q ], the
The original neighboring voxel sampling strategy in PV- Manhattan distance dman can be computed as:
RCNN [27] employs ball query [26] to sample 3D voxel
feature around keypoints. However, the ball query occupies
dman =| xv − xq | + | yv − yq | + | z v − z q | (5)
too much time to compute the Euclidean distance between
each voxel and keypoint, which is rather low efficiency. It is
worth noting that the voxels are regularly arranged in ordered To sample M nearby voxels in total N voxels, the ball query
3D space, which can be easily accessed by indices. Motivated needs to compute distance N times, while the voxel query
by Voxel R-CNN [29], we apply the Manhattan distance to only needs K times (M < K < N ), where K is the number
replace the Euclidean distance to compute the query distance. of neighboring voxels around the center voxel, reducing the
Differently, we directly operate on keypoints, while voxel R- time complexity from O(N ) to O(K ).
CNN operates on each 3D proposal grid, which makes our
approach take less computation but focus more on critical 3.5 Attention-based residual PointNet
information (Fig. 3).
In previous work [27], a simple PointNet-like MLP is directly
3.4.1 Point to voxel coordinate adopted to aggregate the coarse feature of sampled vox-
els. Nevertheless, each queried voxel feature contributes
The coordinates of selected keypoints must be transformed unequally to learning the local structure information. 3D
to voxel coordinates to query voxel indices in the cor- sparse convolution operation limits the voxel’s ability to bet-
responding 3D voxel feature map. Given point coordi- ter understand the neighboring 3D structure and semantic
nate [x p , y p , z p ], voxel size [vx , v y , vz ], point cloud range information. Therefore, we consider how to adaptively focus

123
PV-RCNN++: semantical point-voxel feature interaction for 3D object detection

Fig. 5 Demonstration of residual PointNet aggregation. It consists of

simple Convolution1D, Batch Normalization, Relu, and MaxPooling

V ∈ Rdv ·N :
N
V̂i = Attention(Qi , K, V) = Sim · Vm . (8)
Fig. 4 Demonstration of voxel attention module. The input is N sam- m=1
pled voxel-wise feature
Finally, we add the weighted value V̂ ∈ Rdv ·N and origi-
nal voxel-wise feature F ∈ Rd f ·N to represent the attention
on the crucial voxel-wise feature to obtain more comprehen-
feature Ṽ ∈ R(dv +d f )·N .
sive and important features.
Attention mechanism [30] has shown its great power
3.5.2 Residual PointNet aggregation
in various visual tasks. Benefiting from the self-attention
mechanism, the model can obtain a larger receptive field
Through the self-attention mechanism, each voxel point inte-
to summarize the nonlocal feature. We note that sampled
grates the weighted feature from surrounding voxels and can
voxel-wise feature by voxel query can provide different
adaptively focus on the hotspot area of local structure infor-
structure and spatial information from different areas of the
mation. Next, the weighted values Ṽ = {v1 , v2 , v3 , . . . v N }
point cloud. Hence, we propose an attention-based resid-
are fed into a feed-forward network to produce the final fea-
ual PointNet aggregation module to adaptively aggregate the
ture. Different from transformer [30], which adopts simple
voxel-wise feature from the hotspot area of the point cloud.
linear layers, we propose a lightweight residual PointNet to
forward the weighted values. As illustrated in Fig. 5, given
3.5.1 Voxel attention module weighted values Ṽ, a plug-and-play module composed of
Convolution1D, Batch Normalization and Relu is stacked in
As shown in Fig. 4, given feature set F = { f 1 , f 2 , f 3 , . . . f N } a skip-connection way to extract the feature of Ṽ. In the end,
∈ Rd f ·N of N sampled voxels by voxel query, queries Q, keys we adopt the MaxPooling function to produce the represen-
K and values V are generated from F as following formula: tative voxel aggregation feature.

Q = Wq F ∈ Rdk ·N , K = Wk F ∈ Rdk ·N , 3.6 RoI grid pooling

V = Wv F ∈ Rdv ·N , (6) As shown in Fig. 1, besides summarizing the voxel-wise fea-
ture, keypoints P = { p1 , p2 , p3 , . . . pk } also concatenate the
where Wq ∈ Rdk ·d f , Wk ∈ Rdk ·d f , Wv ∈ Rdv ·d f are lin- grouped feature from raw points and BEV features, which
ear projections consisting of learnable matrices. Then the make keypoints rich in 3D spatial feature and semantic fea-
attention weights Si = {s1i , s2i , . . . , s N
i | i ∈ [1, N ]} of the
ture. Next, all the keypoint feature F̂ = { f 1 , f 2 , f 3 , . . . f N }
i-th query are calculated by softmax function on dot-product are fed to the second stage, RoI-grid pooling, to refine the
similarity between keys Ki and queries Qi : 3D proposals generated by RPN to achieve more accurate
and robust results. RoI-grid pooling uniformly divides each
3D proposal into 6 × 6 × 6 grids which can be denoted
KiT Qi
Si = softmax √ , (7) as G = {g1 , g2 , g3 , . . . g216 }. Benefiting from our fore-
dk ground sampling module, the keypoint feature contains more
object-related feature, which makes each grid gi contain
where dk is a scaling factor, which is set to the length of the more informative foreground keypoints. Specifically, given
voxel-wise feature dimension. Since we have obtained the a radius r and grid point g j , the feature f i of keypoint pi
attention weights of each query, then we generate the fine- is grouped if the keypoint is within r . The grouped keypoint
grained values V̂i ∈ Rdv by computing the weighted sum of feature set K is defined as follows:

123
P. Wu et al.

K = [ f i ; pi − g j ]T | ∀ pi ∈ P, ∀g j ∈ G, pi − g j < r , (9)

where pi − g j is the relative location from pi to g j , which

is concatenated to the feature f i . Then the grouped keypoint
feature set K is fed to a PointNet-like [26] module to pro-
duce a refined feature. After obtaining the refined RoI-grid
feature of each proposal, a 2-layer MLP is adopted to vector-
ize the RoI-grid feature to 256 dimensions to represent the
final feature.

3.7 Loss function

Our method is an end-to-end trainable network, which is

optimized by a multi-task loss Ltotal as follows:

Ltotal = Lseg + Lrpn + Lrcnn + Lkey (10)

As we have mentioned in Sect. 3.3, the segmentation loss

Lseg is computed by binary cross-entropy loss on sampled
keypoints in SA layers. Similar to [18], Lrpn is composed of
three partial loss:

Lrpn = α1 Lcls + α2 Lloc + α3 Ldir (11)

Fig. 6 Demonstration of keypoint re-weighting module

where α1 , α2 , α3 are assigned {1.0, 2.0, 0.2} to represent the

weight coefficient of object classification loss Lcls , loca-
tion regression loss Lloc , and direction regression loss Ldir , 4 Experiments
respectively. To avoid the case in which our model is stuck
in determining the directions of objects, we give α3 a rel- 4.1 Datasets
atively small parameter. To be concrete, Lcls is computed
by focal loss introduced by RetinaNet [59] due to the large 4.1.1 KITTI
imbalance between the foreground and background classes.
Lloc is optimized by smooth-L1 loss function for box regres- KITTI Dataset [7] is one of the most popular benchmarks
sion, and Ldir is computed by sine-error loss [18] for angle for 3D object detection in autonomous driving. It collects
regression. 7481 LiDAR point cloud frames for training and 7518 for
Lrcnn [27] is the loss of between RoIs and ground-truth testing. Although LiDAR scans the 360◦ point cloud scene,
labels, which consists of the classification confidence loss only the objects in field of vision (FOV) are annotated with
Lrcnn_cls , location regression loss Lrcnn_loc , and box corner 3D boxes. Specifically, the training dataset is divided into
loss Lrcnn_corner in the refinement stage. The computation of train split (3712 samples) and val split (3769 samples) for our
Lrcnn is defined as follows: experiments. When submitting the test results to the KITTI
server, we train 90% training data to obtain a robust and
Lrcnn = Lrcnn_cls + Lrcnn_loc + Lrcnn_corner (12) highly generalized model.

where Lrcnn_cls is optimized with the binary cross-entropy

loss function, and Lrcnn_loc and Lrcnn_corner are optimized 4.2 Implementation details
smooth-L1 loss function as in [18,27].
Besides, after the whole 3D scene is summarized into a 4.2.1 Voxelization
small number of keypoints, it is reasonable to re-weight them
to ensure that the foreground point feature has greater weights For KITTI Dataset, we only use the FOV point cloud as raw
to contribute more to the reﬁnement stage. As demonstrated data. The range of FOV ﬁeld is [0, 70.4] meters along X -axis,
in Fig. 6, the re-weighting loss of keypoints Lkey is computed [−40, 40] meters along Y -axis and [−3, 1] meters along Z -
by the focal loss [59] between predicted keypoint scores and axis, which is regularly divided into regularly arranged voxels
true labels in ground-truth boxes. with voxel size 0.05 m × 0.05 m × 0.1m.

123
PV-RCNN++: semantical point-voxel feature interaction for 3D object detection

4.2.2 Sampling strategy setting the generalization and robustness of the model. At the RoI
refinement stage, we sample 128 proposals from RPN and set
In the foreground point sampling module, we place four SA the threshold θ f g = 0.55 to classify the foreground objects
layers for summarizing the semantic feature with different and background objects. Half of them are recognized as fore-
numbers of keypoints. Before SA layers, we sample 16384 ground objects if the 3D intersection over union (IoU) with
points from the raw points as input. For the reason that raw the ground-truth box is over θ f g , while the lower is deter-
points have no segmentation scores, we adopt the vanilla FPS mined to be the background.
in the first SA layer to sample 4096 keypoints. Then, in the At the inference stage, we employ nonmaximum-
next three SA layers, we utilize S-FPS to sample 2048, 1024, suppression (NMS) on the 3D proposals with the thresh-
and 256 keypoints according to the segmentation scores. old of 0.7 to filter out top-100 proposals as the input of the
refinement module. Then, in RoI Grid Pooling, proposals
4.2.3 Network architecture are further refined with the feature-rich keypoints. Subse-
quently, NMS is adopted again with a threshold of 0.1 to
As shown in Fig. 1, the point cloud is input in the form of remove redundant predictions.
quantitative voxels with the resolution of 1600×1408×40×8
and sampled 16384 points by FPS. First, the voxel-based 4.3 Results on KITTI
backbone leverages 3D voxel CNN with 1×, 2×, 4×, 8×
downsample sizes to produce four voxel-wise feature maps 4.3.1 Evaluation metrics
with 16, 32, 64, and 64 output dimensions. After the last
3D CNN layer, the output feature map is of the resolu- We follow the evaluation criteria that the KITTI benchmark
tion of 200 × 176 × 2 × 128. Once the feature map is provides to ensure accuracy and fairness. The IoU threshold
reshaped to 200 × 176 × 256, the RPN [18] network is is set to 0.7 for Car and 0.5 for both Pedestrian and Cyclist.
applied to generate 3D proposals. For the voxel query in The results reported to the official KITTI test server are calcu-
Sect. 3.4, we set the query range I to 4 and sample 16 lated by average precision (AP) setting of recall 40 positions
neighboring voxels around query point. Hence, the size of to compare with the state-of-the-art methods.
each attention map in voxel attention module is 16 × 16. We demonstrate the test results returned from the KITTI
In the residual PointNet, the output dimension of Convolu- test server in Table 1 with the comparison of performance
tion1D is set to 32. The 16384 raw points are input to the on 3D detection. For the most critical Car detection, we
foreground point sampling module which consists of 4 sub- surpass the PV-RCNN by 2.55%, 0.20%, 0.07% on easy,
sequent SA layers. Each SA layer has 2 multi-scale radii r ∈ moderate, and hard levels. It is worth noting that our method
{[0.1 m, 0.5 m], [0.5 m, 1.0 m], [1.0 m, 2.0 m], [2.0 m, 4.0 m]} improves greatly on the Cyclist class, surpassing PV-RCNN
to group neighboring points which are input to 2-layer MLP by 4.89%, 5.09%, 4.36% on three levels. However, our
to classify the foreground and background. Then, the sam- method achieves inferior results on the Pedestrian class com-
pled keypoints from SA layer integrate the point-wise feature, pared with the state-of-the-art methods, where we think the
voxel-wise feature, and 2D BEV feature to summarize the segmentation module has limited ability to classify the small
critical information of the 3D scene. In the end, RoI-grid foreground objects like Pedestrian, and S-FPS tends to select
pooling divides the proposal box into 6 × 6 × 6 grids to refine keypoints on the big foreground objects like the Car and
the proposals with the aggregation of feature-rich keypoints Cyclist.
within 0.6 m and 0.8 m radius.
4.4 Ablation studies
4.2.4 Training and inference
To comprehensively verify the effectiveness of our method,
Concretely, the model is trained 80 epochs with Adam opti- we conduct ablation studies on foreground point sampling,
mizer and the learning rate is initially set to 0.01 updated by voxel query, voxel attention module, and residual PointNet,
one-cycle policy. We conduct the experiments on one RTX respectively.
3090 GPU with batch size 4, which takes about 36 hours.
During training, to avoid over-fitting, we leverage data 4.4.1 Effectiveness of foreground point sampling
augmentation strategies like [18]. We use GT-sampling to
paste some foreground instances from other point cloud In this part, we test the influence of the different numbers of
scenes to current training frame. Besides, augmentation oper- foreground points on detection precision, as shown in Table 2.
ations like random flip along X -axis, random rotation with It seems feasible that the more foreground points get caught,
angle range in [− π4 , π4 ], and random scaling with a ran- the more accurate the result will be. Therefore, we set γ in
dom scaling factor in [0.95, 1.05] are adopted to enhance Eq. 3 with 1, 2, 3, and 100 to increase the number of sampled

123
P. Wu et al.

Table 1 Performance comparison on KITTI ofﬁcial test server

Method Reference Modality Car 3D detection(%) Ped 3D detection (%) Cyclist 3D detection (%)
Easy Mod Hard Easy Mod Hard Easy Mod Hard

MV3D [43] CVPR2017 L+C 74.97 63.63 54.00 – – – – – –

AVOD-FPN [44] CVPR2018 L+C 83.07 71.76 65.73 50.46 42.27 39.04 63.76 50.55 44.93
F-PointNet [45] CVPR2018 L+C 82.19 69.79 60.59 50.53 42.15 38.08 72.27 56.12 49.01
EPNet [46] CVPR2020 L+C 89.81 79.28 74.49 – – – – – –
PointPainting [47] CVPR2020 L+C 82.11 71.70 67.08 50.32 40.97 37.87 77.63 63.78 55.89
3D-CVF [48] CVPR2020 L+C 89.20 80.05 73.11 – – – – – –
FAST-CLOCs [49] CVPR2022 L+C 89.11 80.34 76.98 52.10 42.72 39.08 82.83 65.31 57.43
CAT-Det [50] CVPR2022 L+C 89.87 81.32 76.68 54.26 45.44 41.94 83.68 68.81 61.45
SECOND-V1.5 [18] Sensors2018 L 84.65 75.96 68.71 – – – – – –
PointPillars [20] CVPR2019 L 82.58 74.31 68.99 51.45 41.92 38.89 77.1 58.65 51.92
Part-A2 [21] CVPR2019 L 87.81 78.49 73.51 53.10 43.35 40.06 79.17 63.52 56.93
PointRCNN [22] CVPR2019 L 86.96 75.64 70.7 47.98 39.37 36.01 74.96 58.82 52.53
STD [40] ICCV2019 L 87.95 79.71 75.09 53.29 42.47 38.35 78.69 61.59 55.3
SA-SSD [41] CVPR2020 L 88.75 79.79 74.16 – – – – – –
3DSSD [24] CVPR2020 L 88.36 79.57 74.55 54.64 44.27 40.23 82.48 64.1 56.90
CIA-SSD [37] AAAI2021 L 89.59 80.28 72.87 – – – – – –
VIC-Net [42] ICRA2021 L 88.25 80.61 75.83 43.82 37.18 35.35 78.29 63.65 57.27
HVPR [60] CVPR2021 L 86.38 77.92 73.04 53.47 43.96 40.64 – – –
SVGA-Net [61] AAAI2022 L 87.33 80.47 75.91 48.48 40.39 37.92 78.58 62.28 54.88
IA-SSD [62] CVPR2022 L 88.34 80.13 75.04 46.51 39.03 35.60 78.35 61.94 55.70
PV-RCNN [27] CVPR2020 L 87.98 81.40 77.00 44.61 39.04 36.89 78.57 63.12 56.81
Ours – L 90.53 81.60 77.07 45.94 40.18 37.28 83.46 68.21 61.17
Improvements – – +2.55 +0.20 +0.07 +1.33 +1.14 +0.39 +4.89 +5.09 +4.36
The 3D average precision is calculated by 40 recall positions
L+C denotes LiDAR-Camera fusion methods, and L represents the LiDAR-only methods. The result of PV-RCNN is obtained from the KITTI
server tested with the local model trained on the ofﬁcial source code: OpenPCDet [63]. The top two results are in bold

foreground points. However, it turns out that when γ becomes Table 2 Performance comparison on the KITTI val split with different
larger, the performance decreases instead. The reason is that γ controlling the number of foreground keypoints
sampling excessive foreground points with few background Sampling strategy Easy (%) Mod (%) Hard (%)
points (like the Top-K sampling strategy) make it hard for
FPS 89.02 83.59 78.49
the model to have a global perception of the whole 3D scene,
S-FPS(γ = 1) 89.65 84.34 79.11
which limits its ability to precisely locate the correct objects.
As demonstrated in Fig. 7, when γ = 1, the sampled key- S-FPS(γ = 2) 89.75 84.15 79.09
points can focus on foreground objects and preserve proper S-FPS(γ = 3) 89.69 84.11 79.02
background points at the same time. S-FPS(γ = 100) 89.06 79.06 76.74
The 3D average precision is calculated by 11 recall positions for the
Car class
4.4.2 Effectiveness of voxel query Bold indicates the best results

The function of voxel query is to boost the speed of voxel

sampling while bringing no loss of performance. Table 3 4.4.3 Effectiveness of attention-based residual PointNet
shows the 3D detection precision for Car class and frame
per second (FPS) at the condition of ball query and voxel We propose the voxel attention module so that voxel-wise
query. We only replace the ball query with the voxel query, feature can obtain a larger receptive ﬁeld to integrate other
while keeping other modules unchanged. Results show that voxels’ feature instead of the only local feature of itself.
the voxel query improves the inference speed but causes no Besides, a lightweight residual PointNet module is added to
harm to precision. efﬁciently aggregate the voxel-wise feature. As demonstrated

123
PV-RCNN++: semantical point-voxel feature interaction for 3D object detection

Fig. 7 Visualization comparison of different point sampling strategies. remains proper background information. d Almost only foreground
a Raw point cloud scene. b Vanilla FPS samples too many background points are sampled by S-FPS (γ = 100), which loses the perception of
points, while points on foreground objects are too sparse. c Demon- the whole 3D scene
strates that S-FPS (γ = 1) can sample more foreground points and

Table 3 Performance and FPS comparison between ball query and voxel-wise attention features encoded into keypoints, where
voxel query almost the whole object is highly focused on, not only local
Query method Easy (%) Mod (%) Hard (%) FPS (Hz) spatial features are learned, indicating the importance of the
attention-based residual PointNet module.
Ball query 89.42 83.60 82.34 12.5
Voxel query 89.60 83.58 82.45 15.15
4.5 Visualization and discussion
The 3D AP is computed on the KITTI val split by recall 11 positions
for the Car class
Bold indicates the best results
Figure 9 shows the visualization results on the KITTI dataset.
Our model performs stably and makes accurate detection
in complex situations, especially in Fig. 9B. Moreover, as
in Table 4, the performance is further improved by atten- shown in Table 1, higher average precision on car and cyclist
tion mechanism with residual PointNet. Figure 8 shows the proves the effectiveness of our proposed methods. However,

123
P. Wu et al.

Fig. 8 Visualization of attention feature induced by attention-based residual PointNet module. The aggregated features cover the whole object-
related regions (red points) rather than small local parts of the object

Table 4 Performance
S-FPS Voxel query VAM Residual PointNet Easy (%) Mod (%) Hard (%)
demonstration of
attention-based residual × × × × 89.02 83.59 78.49
PointNet aggregation
× × 89.60 84.08 78.98
× 89.64 84.27 79.07
89.65 84.34 79.11
The 3D AP is computed on the KITTI val split by recall 11 positions for the Car class
Bold indicates the best results

Fig. 9 Visualization results on KITTI(FOV). Detected cars, pedestrians, and cyclists are marked by green boxes, blue boxes, and yellow boxes

our method still lags behind some state-of-the-art methods 5 Conclusion

for detecting small objects such as pedestrian. The reason
is that our segmentation module performs badly to classify In this paper, we present the PV-RCNN++ framework, a novel
small objects, resulting in fewer keypoints on pedestrians semantical point-voxel feature interaction for 3D object
retained to integrate valuable feature for reﬁnement. detection, which achieves 81.60%, 40.18%, 68.21% 3D mAP
on Car, Pedestrian, and Cyclist on the KITTI benchmark.
We introduce a carefully designed point cloud segmentation
module as guidance to sample more object-related keypoints.
Through fast voxel query based on Manhattan distance, we

123
PV-RCNN++: semantical point-voxel feature interaction for 3D object detection

speed up the interaction between keypoints and voxels to kernel correlation filter. J. King Saud Univers. Comput. Inf. Sci.
efficiently group the neighboring voxel-wise feature. The 6008–6018 (2022)
9. Zhang, J., Feng, W., Yuan, T., Wang, J., Sangaiah, A.K.: Scstcf:
proposed attention-based residual PointNet abstracts more spatial-channel selection and temporal regularized correlation fil-
fine-grained 3D information from nearby voxels, providing ters for visual tracking. Appl. Soft Comput. 118, 108485 (2022)
more comprehensive features for the succeeding refinement. 10. Zhang, J., Sun, J., Wang, J., Li, Z., Chen, X.: An object tracking
Extensive experiments on KITTI dataset demonstrate that our framework with recapture based on correlation filters and Siamese
networks. Comput. Electr. Eng. 98, 107730 (2022)
proposed semantic-guided voxel-to-keypoint detector pre- 11. Yan, C., Salman, E.: Mono3d: open source cell library for mono-
cisely summarizes the valuable information from pivotal lithic 3-d integrated circuits. IEEE Trans. Circuits Syst. I Regul.
areas in the point cloud and improves performance compared Pap. 65(3), 1075–1085 (2017)
with previous state-of-the-art methods. 12. Reading, C., Harakeh, A., Chae, J., Waslander, S.L.: Categorical
depth distribution network for monocular 3d object detection. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and
Acknowledgements This work was supported by the 14th Five-Year
Pattern Recognition, pp. 8555–8564 (2021)
Planning Equipment Pre-Research Program (No. JZX7Y20220301001
13. Chen, Y., Tai, L., Sun, K., Li, M.: Monopair: monocular 3d object
801), by the National Natural Science Foundation of China (No.
detection using pairwise spatial relationships. In: Proceedings of
62172218), by the Free Exploration of Basic Research Project, Local
the IEEE/CVF Conference on Computer Vision and Pattern Recog-
Science and Technology Development Fund Guided by the Central
nition, pp. 12093–12102 (2020)
Government of China (No. 2021Szvup060), and by the General Pro-
14. Li, B., Ouyang, W., Sheng, L., Zeng, X., Wang, X.: Gs3d: an effi-
gram of Natural Science Foundation of Guangdong Province (No.
cient 3d object detection framework for autonomous driving. In:
2022A1515010170).
Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 1019–1028 (2019)
Data Availability Statement The datasets generated during and/or ana-
15. Chen, X., Kundu, K., Zhu, Y., Ma, H., Fidler, S., Urtasun, R.:
lyzed during the current study are available from the corresponding
3d object proposals using stereo imagery for accurate object class
author on reasonable request.
detection. IEEE Trans. Pattern Anal. Mach. Intell. 40(5), 1259–
1272 (2017)
16. Li, P., Chen, X., Shen, S.: Stereo r-cnn based 3d object detection
Declarations for autonomous driving. In: Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pp. 7644–7652
(2019)
Conflict of interest The authors declare that they have no conflict of 17. Zhou, Y., Tuzel, O.: Voxelnet: end-to-end learning for point cloud
interest. based 3d object detection. In: Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp. 4490–4499
(2018)
18. Yan, Y., Mao, Y., Li, B.: Second: sparsely embedded convolutional
detection. Sensors 18(10), 3337 (2018)
References 19. Ye, M., Xu, S., Cao, T.: Hvnet: hybrid voxel network for lidar based
3d object detection. In: Proceedings of the IEEE/CVF Confer-
1. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: towards real- ence on Computer Vision and Pattern Recognition, pp. 1631–1640
time object detection with region proposal networks. IEEE Trans. (2020)
Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2016) 20. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.:
2. Zheng, C., Shi, D., Yan, X., Liang, D., Wei, M., Yang, X., Guo, Pointpillars: fast encoders for object detection from point clouds.
Y., Xie, H.: Glassnet: label decoupling-based three-stream neural In: Proceedings of the IEEE/CVF Conference on Computer Vision
network for robust image glass detection. Comput. Graph. Forum and Pattern Recognition, pp. 12697–12705 (2019)
41(1), 377–388 (2022) 21. Shi, S., Wang, Z., Wang, X., Li, H.: Part-a2 net: 3d part-aware and
3. Wei, Z., Liang, D., Zhang, D., Zhang, L., Geng, Q., Wei, M., aggregation neural network for object detection from point cloud.
Zhou, H.: Learning calibrated-guidance for object detection in arXiv preprint arXiv:1907.03670 vol. 2, no. (3) (2019)
aerial images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 22. Shi, S., Wang, X., Li, H.: Pointrcnn: 3d object proposal generation
15, 2721–2733 (2022) and detection from point cloud. In: Proceedings of the IEEE/CVF
4. Luo, C., Yang, X., Yuille, A.: Exploring simple 3d multi-object Conference on Computer Vision and Pattern Recognition, pp. 770–
tracking for autonomous driving. In: Proceedings of the IEEE/CVF 779 (2019)
International Conference on Computer Vision, pp. 10488–10497 23. Shi, W., Rajkumar, R.: Point-gnn: graph neural network for 3d
(2021) object detection in a point cloud. In: Proceedings of the IEEE/CVF
5. Ji, C., Liu, G., Zhao, D.: Stereo 3d object detection via instance Conference on Computer Vision and Pattern Recognition, pp.
depth prior guidance and adaptive spatial feature aggregation. Vis. 1711–1719 (2020)
Comput. 1–12 (2022) 24. Yang, Z., Sun, Y., Liu, S., Jia, J.: 3dssd: point-based 3d single
6. Wang, Z., Xie, Q., Wei, M., Long, K., Wang, J.: Multi-feature stage object detector. In: Proceedings of the IEEE/CVF Conference
fusion votenet for 3d object detection. ACM Trans. Multim. Com- on Computer Vision and Pattern Recognition, pp. 11040–11048
put. Commun. Appl. 18(1), 6–1617 (2022) (2020)
7. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous 25. Xie, Q., Lai, Y.-K., Wu, J., Wang, Z., Lu, D., Wei, M., Wang, J.:
driving? The Kitti vision benchmark suite. In: 2012 IEEE Confer- Venet: voting enhancement network for 3d object detection. In:
ence on Computer Vision and Pattern Recognition, pp. 3354–3361. Proceedings of the IEEE/CVF International Conference on Com-
IEEE (2012) puter Vision, pp. 3712–3721 (2021)
8. Xia, R., Chen, Y., Ren, B.: Improved anti-occlusion object track- 26. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++ deep hierarchical
ing algorithm using unscented Rauch–Tung–Striebel smoother and feature learning on point sets in a metric space. In: Proceedings of

123
P. Wu et al.

the 31st International Conference on Neural Information Process- 45. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum pointnets
ing Systems, pp. 5105–5114 (2017) for 3d object detection from rgb-d data. In: Proceedings of the
27. Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H.: Pv- IEEE Conference on Computer Vision and Pattern Recognition,
rcnn: point-voxel feature set abstraction for 3d object detection. pp. 918–927 (2018)
In: Proceedings of the IEEE/CVF Conference on Computer Vision 46. Huang, T., Liu, Z., Chen, X., Bai, X.: Epnet: Enhancing point fea-
and Pattern Recognition, pp. 10529–10538 (2020) tures with image semantics for 3d object detection. In: European
28. Chen, C., Chen, Z., Zhang, J., Tao, D.: Sasa: Semantics-augmented Conference on Computer Vision, pp. 35–52 (2020). Springer
set abstraction for point-based 3d object detection. arXiv e-prints, 47. Vora, S., Lang, A.H., Helou, B., Beijbom, O.: Pointpainting:
2201 (2022) Sequential fusion for 3d object detection. In: Proceedings of the
29. Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., Li, H.: Voxel r- IEEE/CVF Conference on Computer Vision and Pattern Recogni-
cnn: towards high performance voxel-based 3d object detection. tion, pp. 4604–4612 (2020)
In: Proceedings of the AAAI Conference on Artificial Intelligence, 48. Yoo, J.H., Kim, Y., Kim, J., Choi, J.W.: 3d-cvf: Generating joint
vol. 35, pp. 1201–1209 (2021) camera and lidar features using cross-view spatial feature fusion for
30. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., 3d object detection. In: European Conference on Computer Vision,
Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. pp. 720–736 (2020). Springer
arXiv e-prints, 1706 (2017) 49. Pang, S., Morris, D., Radha, H.: Fast-clocs: fast camera-lidar
31. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image object candidates fusion for 3d object detection. In: Proceedings
recognition. In: Proceedings of the IEEE Conference on Computer of the IEEE/CVF Winter Conference on Applications of Computer
Vision and Pattern Recognition, pp. 770–778 (2016) Vision, pp. 187–196 (2022)
32. Ma, X., Qin, C., You, H., Ran, H., Fu, Y.: Rethinking network 50. Zhang, Y., Chen, J., Huang, D.: Cat-det: contrastively augmented
design and local geometry in point cloud: a simple residual mlp transformer for multi-modal 3d object detection. In: Proceedings of
framework. arXiv preprint arXiv:2202.07123 (2022) the IEEE/CVF Conference on Computer Vision and Pattern Recog-
33. Zarzar, J., Giancola, S., Ghanem, B.: Pointrgcn: graph convolu- nition (CVPR), pp. 908–917 (2022)
tion networks for 3d vehicles detection refinement. arXiv preprint 51. Wang, Z., Xie, Q., Wei, M., Long, K., Wang, J.: Multi-feature fusion
arXiv:1911.12236 (2019) votenet for 3d object detection. ACM Trans. Multimed. Comput.
34. Zhang, Y., Huang, D., Wang, Y.: Pc-rgnn: Point cloud completion Commun. Appl. 18(1), 1–17 (2022)
and graph neural network for 3d object detection. In: Proceedings 52. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai,
of the AAAI Conference on Artificial Intelligence, vol. 35, pp. X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly,
3430–3437 (2021) S., et al.: An image is worth 16x16 words: transformers for image
35. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
G.: The graph neural network model. IEEE Trans. Neural Netw. 53. Shu, X., Yang, J., Yan, R., Song, Y.: Expansion-squeeze-excitation
20(1), 61–80 (2008) fusion network for elderly activity recognition. IEEE Trans. Cir-
36. Yuan, W., Khot, T., Held, D., Mertz, C., Hebert, M.: Pcn: point com- cuits Syst. Video Technol. 1–1 (2022)
pletion network. In: 2018 International Conference on 3D Vision 54. Shu, X., Zhang, L., Qi, G.-J., Liu, W., Tang, J.: Spatiotemporal
(3DV), pp. 728–737. IEEE (2018) co-attention recurrent neural networks for human-skeleton motion
37. Zheng, W., Tang, W., Chen, S., Jiang, L., Fu, C.-W.: Cia-ssd: con- prediction. IEEE Trans. Pattern Anal. Mach. Intell. 44(6), 3300–
fident iou-aware single-stage object detector from point cloud. In: 3315 (2021)
Proceedings of the AAAI Conference on Artificial Intelligence, 55. Tang, J., Shu, X., Yan, R., Zhang, L.: Coherence constrained graph
vol. 35, pp. 3555–3562 (2021) lstm for group activity recognition. IEEE Trans. Pattern Anal.
38. Zheng, W., Tang, W., Jiang, L., Fu, C.-W.: Se-ssd: self-ensembling Mach. Intell. 44(2), 636–647 (2022)
single-stage object detector from point cloud. In: Proceedings of 56. Li, P., Chen, Y.: Research into an image inpainting algorithm via
the IEEE/CVF Conference on Computer Vision and Pattern Recog- multilevel attention progression mechanism. Math. Probl. Eng.
nition, pp. 14494–14503 (2021) 2022, 1–12 (2022)
39. Hinton, G., Vinyals, O., Dean, J., et al.: Distilling the knowledge 57. Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V.: Point trans-
in a neural network. arXiv preprint arXiv:1503.02531, vol. 2, no. former. In: Proceedings of the IEEE/CVF International Conference
(7) (2015) on Computer Vision, pp. 16259–16268 (2021)
40. Yang, Z., Sun, Y., Liu, S., Shen, X., Jia, J.: Std: sparse-to-dense 3d 58. Guo, M.-H., Cai, J.-X., Liu, Z.-N., Mu, T.-J., Martin, R.R., Hu,
object detector for point cloud. In: Proceedings of the IEEE/CVF S.-M.: Pct: point cloud transformer. Comput. Vis. Media 7(2), 187–
International Conference on Computer Vision, pp. 1951–1960 199 (2021)
(2019) 59. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for
41. He, C., Zeng, H., Huang, J., Hua, X.-S., Zhang, L.: Structure aware dense object detection. In: Proceedings of the IEEE International
single-stage 3d object detection from point cloud. In: Proceed- Conference on Computer Vision, pp. 2980–2988 (2017)
ings of the IEEE/CVF Conference on Computer Vision and Pattern 60. Noh, J., Lee, S., Ham, B.: Hvpr: hybrid voxel-point representa-
Recognition, pp. 11873–11882 (2020) tion for single-stage 3d object detection. In: Proceedings of the
42. Jiang, T., Song, N., Liu, H., Yin, R., Gong, Y., Yao, J.: Vic-net: IEEE/CVF Conference on Computer Vision and Pattern Recogni-
voxelization information compensation network for point cloud tion (CVPR), pp. 14605–14614 (2021)
3d object detection. In: 2021 IEEE International Conference on 61. He, Q., Wang, Z., Zeng, H., Zeng, Y., Liu, Y.: Svga-net: sparse
Robotics and Automation (ICRA), pp. 13408–13414 (2021). IEEE voxel-graph attention network for 3d object detection from point
43. Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3d object clouds. In: Proceedings of the AAAI Conference on Artificial Intel-
detection network for autonomous driving. In: Proceedings of the ligence, vol. 36, pp. 870–878 (2022)
IEEE Conference on Computer Vision and Pattern Recognition, 62. Zhang, Y., Hu, Q., Xu, G., Ma, Y., Wan, J., Guo, Y.: Not all points
pp. 1907–1915 (2017) are equal: Learning highly efficient point-based detectors for 3d
44. Ku, J., Mozifian, M., Lee, J., Harakeh, A., Waslander, S.L.: Joint 3d lidar point clouds. In: Proceedings of the IEEE/CVF Conference
proposal generation and object detection from view aggregation. on Computer Vision and Pattern Recognition, pp. 18953–18962
In: 2018 IEEE/RSJ International Conference on Intelligent Robots (2022)
and Systems (IROS), pp. 1–8 (2018). IEEE

123
PV-RCNN++: semantical point-voxel feature interaction for 3D object detection

63. Team, O.D.: OpenPCDet: An Open-source Toolbox for 3D Object Haoran Xie received his PhD in
Detection from Point Clouds. https://github.com/open-mmlab/ computer science from the City
OpenPCDet (2020) University of Hong Kong, Hong
Kong, and his EdD degree in dig-
ital learning from the University
Publisher’s Note Springer Nature remains neutral with regard to juris- of Bristol, UK. He is currently an
dictional claims in published maps and institutional affiliations. associate professor at the Depart-
ment of Computing and Decision
Springer Nature or its licensor holds exclusive rights to this article Sciences, Lingnan University,
under a publishing agreement with the author(s) or other rightsholder(s); Hong Kong. His research interests
author self-archiving of the accepted manuscript version of this article include artificial intelligence, big
is solely governed by the terms of such publishing agreement and appli- data, and educational technology.
cable law. He has published 290 research
publications, including 140 jour-
nal articles. He is the editor-in-
chief of Computers & Education: Artificial Intelligence. He is a senior
Peng Wu is currently pursu- member of IEEE and ACM.
ing the PhD degree with com-
puter science and technology at Fu Lee Wang received his BEng
Nanjing University of Aeronau- degree in computer engineering
tics and Astronautics (NUAA). He and his MPhil degree in com-
received the B.Sc. degree from puter science and information sys-
Nanjing University of Information tems from the University of Hong
Science and Technology in 2021. Kong, Hong Kong, and his PhD
His research interests include deep in systems engineering and engi-
learning, computer vision, and com- neering management from Chi-
puter graphics. nese University of Hong Kong,
Hong Kong. He is currently the
dean of the School of Science
and Technology, Hong Kong
Metropolitan University. He has
over 250 publications in interna-
Lipeng Gu is currently pursu- tional journals and conferences
ing the PhD degree with com- and led more than 20 competitive grants with a total value greater than
puter science and technology at 20 million. His current research interests include educational technol-
Nanjing University of Aeronau- ogy, information retrieval, computer graphics, and bioinformatics. He
tics and Astronautics (NUAA). He is a fellow of BCS, IET, and HKIE and a senior member of ACM
received the M.Sc. degree from and IEEE. He was the chair of the IEEE Hong Kong section computer
Donghua University in 2021. His chapter and ACM Hong Kong chapter.
research interests include deep
learning, computer vision, and com-
Gary Cheng is currently an asso-
puter graphics.
ciate professor at the Department
of Mathematics and Information
Technology at the Education Uni-
versity of Hong Kong. He received
his PhD in computing from Hong
Kong Polytechnic University. His
research interests include artifi-
Xuefeng Yan is a professor of cial intelligence in education,
the School of Computer Science technology-enhanced language
and Technology, Nanjing Univer- learning, and machine learning.
sity of Aeronautics and Astronau-
tics (NUAA), China. He obtained
his PhD degree from Beijing Insti-
tute of Technology in 2005. He
was the visiting scholar at Geor-
gia State University in 2008 and
2012. His research interests include
intelligent computing, MBSE/
complex system modeling, simu-
lation, and evaluation.

123
P. Wu et al.

Mingqiang Wei (Senior Member,

IEEE) received his PhD degree
(2014) in computer science and
engineering from the Chinese Uni-
versity of Hong Kong (CUHK).
He is a professor at the School of
Computer Science and Technol-
ogy, Nanjing University of Aero-
nautics and Astronautics (NUAA).
He was a recipient of the CUHK
Young Scholar Thesis Awards in
2014. His research interests focus
on 3D vision, computer graphics,
and deep learning.

123

PV_RCNN_Plus_1912.13192
No ratings yet
PV_RCNN_Plus_1912.13192
19 pages
PV-RCNN: Point-Voxel Feature Set Abstraction For 3D Object Detection
No ratings yet
PV-RCNN: Point-Voxel Feature Set Abstraction For 3D Object Detection
11 pages
PV-RCNN: Point-Voxel Feature Set Abstraction For 3D Object Detection
No ratings yet
PV-RCNN: Point-Voxel Feature Set Abstraction For 3D Object Detection
11 pages
PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation For 3D Object Detection
No ratings yet
PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation For 3D Object Detection
17 pages
Voxelnet: End-To-End Learning For Point Cloud Based 3D Object Detection
No ratings yet
Voxelnet: End-To-End Learning For Point Cloud Based 3D Object Detection
10 pages
Yang, Liu Et Al 2022 - Graph R-CNN - Towards Accurate 3D Object Detection With Semantic-Decorated Local Graph
No ratings yet
Yang, Liu Et Al 2022 - Graph R-CNN - Towards Accurate 3D Object Detection With Semantic-Decorated Local Graph
18 pages
SA-Det3D
No ratings yet
SA-Det3D
16 pages
16456-Article Text-19950-1-2-20210518
No ratings yet
16456-Article Text-19950-1-2-20210518
8 pages
Voxel RCNN Code Walk Through-rev1
No ratings yet
Voxel RCNN Code Walk Through-rev1
4 pages
Lang PointPillars Fast Encoders For Object Detection From Point Clouds CVPR 2019 Paper
No ratings yet
Lang PointPillars Fast Encoders For Object Detection From Point Clouds CVPR 2019 Paper
9 pages
RangeRCNN Towards Fast and Accurate 3D Object Detection
No ratings yet
RangeRCNN Towards Fast and Accurate 3D Object Detection
9 pages
2018_Frustum_PointNets
No ratings yet
2018_Frustum_PointNets
10 pages
Patch Refinement - Localized 3D Object Detection
No ratings yet
Patch Refinement - Localized 3D Object Detection
10 pages
Nerf RPN
No ratings yet
Nerf RPN
13 pages
(My)Anti-Noise 3D Object Detection of Multimodal Feature Attention Fusion Based on PV-RCNN
No ratings yet
(My)Anti-Noise 3D Object Detection of Multimodal Feature Attention Fusion Based on PV-RCNN
19 pages
Fast and Accurate Deep Learning-Based Framework for 3D Multi-Object Detector for Autonomous Vehicles
No ratings yet
Fast and Accurate Deep Learning-Based Framework for 3D Multi-Object Detector for Autonomous Vehicles
3 pages
Journal.pone.0314086
No ratings yet
Journal.pone.0314086
14 pages
RangeDet in Defense of Range View For LiDAR-Based 3D Object ICCV 2021 Paper
No ratings yet
RangeDet in Defense of Range View For LiDAR-Based 3D Object ICCV 2021 Paper
10 pages
Frustum PointNets For 3D Object Detection From RGB-D Data
No ratings yet
Frustum PointNets For 3D Object Detection From RGB-D Data
10 pages
Randla-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds
No ratings yet
Randla-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds
16 pages
2403.06433
No ratings yet
2403.06433
7 pages
Dark Green Light Green White Corporate Geometric Company Internal Deck Business Presentation
No ratings yet
Dark Green Light Green White Corporate Geometric Company Internal Deck Business Presentation
17 pages
2207.12691v1
No ratings yet
2207.12691v1
6 pages
Lenet
No ratings yet
Lenet
7 pages
Detecting Objects in Scene Point Cloud: A Combinational Approach
No ratings yet
Detecting Objects in Scene Point Cloud: A Combinational Approach
9 pages
Lecture 16 Hao
No ratings yet
Lecture 16 Hao
56 pages
Point-Trajectory Transformer For Efficient Temporal 3D Object Detection
No ratings yet
Point-Trajectory Transformer For Efficient Temporal 3D Object Detection
10 pages
Deep Learning on Point Clouds and Its Application_ a Survey
No ratings yet
Deep Learning on Point Clouds and Its Application_ a Survey
22 pages
ZoomNet Part-Aware Adaptive Zooming Neural Network For 3D Object Detection
No ratings yet
ZoomNet Part-Aware Adaptive Zooming Neural Network For 3D Object Detection
9 pages
1608 07916 PDF
No ratings yet
1608 07916 PDF
8 pages
Pan_3D_Object_Detection_With_Pointformer_CVPR_2021_paper
No ratings yet
Pan_3D_Object_Detection_With_Pointformer_CVPR_2021_paper
10 pages
Pointwise Convolutional Neural Networks
No ratings yet
Pointwise Convolutional Neural Networks
10 pages
2111.03098
No ratings yet
2111.03098
14 pages
3 D Point Cloud Classification
No ratings yet
3 D Point Cloud Classification
7 pages
OD Trans Ishan-Mishra2021 A
No ratings yet
OD Trans Ishan-Mishra2021 A
12 pages
Lyu Learning To Segment 3D Point Clouds in 2D Image Space CVPR 2020 Paper
No ratings yet
Lyu Learning To Segment 3D Point Clouds in 2D Image Space CVPR 2020 Paper
10 pages
Lu-Net: An Efficient Network For 3D Lidar Point Cloud Semantic Segmentation Based On End-To-End-Learned 3D Features and U-Net
No ratings yet
Lu-Net: An Efficient Network For 3D Lidar Point Cloud Semantic Segmentation Based On End-To-End-Learned 3D Features and U-Net
9 pages
point transformers
No ratings yet
point transformers
11 pages
3D Vehicle Detection Using Multi-Level Fusion From Point Clouds and Images
No ratings yet
3D Vehicle Detection Using Multi-Level Fusion From Point Clouds and Images
9 pages
A LiDAR-Camera Fusion 3D Object Detection Algorith
No ratings yet
A LiDAR-Camera Fusion 3D Object Detection Algorith
11 pages
Autoshape: Real-Time Shape-Aware Monocular 3D Object Detection
No ratings yet
Autoshape: Real-Time Shape-Aware Monocular 3D Object Detection
11 pages
Point Transformer
No ratings yet
Point Transformer
11 pages
00005_3D_Semantic_Novelty_Detection
No ratings yet
00005_3D_Semantic_Novelty_Detection
21 pages
Pointnet: Deep Learning On Point Sets For 3D Classification and Segmentation
No ratings yet
Pointnet: Deep Learning On Point Sets For 3D Classification and Segmentation
19 pages
Li DeepFusion Lidar-Camera Deep Fusion For Multi-Modal 3D Object Detection CVPR 2022 Paper
No ratings yet
Li DeepFusion Lidar-Camera Deep Fusion For Multi-Modal 3D Object Detection CVPR 2022 Paper
10 pages
Hashimoto Normal Estimation For Accurate 3D Mesh Reconstruction With Point Cloud CVPRW 2019 Paper
No ratings yet
Hashimoto Normal Estimation For Accurate 3D Mesh Reconstruction With Point Cloud CVPRW 2019 Paper
10 pages
Learning Local Displacements For Point Cloud Completion
No ratings yet
Learning Local Displacements For Point Cloud Completion
10 pages
FastPillars A Deployment-Friendly Pillar-Based 3D Detector
No ratings yet
FastPillars A Deployment-Friendly Pillar-Based 3D Detector
13 pages
entropy-25-00635
No ratings yet
entropy-25-00635
35 pages
2403.00325
No ratings yet
2403.00325
13 pages
SGPN: Similarity Group Proposal Network For 3D Point Cloud Instance Segmentation
No ratings yet
SGPN: Similarity Group Proposal Network For 3D Point Cloud Instance Segmentation
13 pages
BirdNet A 3D Object Detection Framework
No ratings yet
BirdNet A 3D Object Detection Framework
8 pages
Snake:shape-Aware Neural 3d Keypoint Field
No ratings yet
Snake:shape-Aware Neural 3d Keypoint Field
13 pages
Li, Kong 2022 - SRIF-RCNN Sparsely Represented Inputs Fusion of Different Sensors For 3D Object Detection
No ratings yet
Li, Kong 2022 - SRIF-RCNN Sparsely Represented Inputs Fusion of Different Sensors For 3D Object Detection
22 pages
Point Transformer
No ratings yet
Point Transformer
10 pages
Advancing 3D point cloud understanding through deep transfer learning: A comprehensive survey
No ratings yet
Advancing 3D point cloud understanding through deep transfer learning: A comprehensive survey
38 pages
Large-Scale Point Cloud Semantic Segmentation With Superpoint Graphs
No ratings yet
Large-Scale Point Cloud Semantic Segmentation With Superpoint Graphs
13 pages
SO Net
No ratings yet
SO Net
17 pages
Deep SCNN-Based Real-Time Object Detection For Self-Driving Vehicles Using LiDAR Temporal Data
No ratings yet
Deep SCNN-Based Real-Time Object Detection For Self-Driving Vehicles Using LiDAR Temporal Data
10 pages
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Effective Attention Modeling For Neural Relation Extraction
No ratings yet
Effective Attention Modeling For Neural Relation Extraction
10 pages
Cybernetics and The Dialectic Materialism of Marx and Lenin
No ratings yet
Cybernetics and The Dialectic Materialism of Marx and Lenin
19 pages
A Review On The Effectiveness of Machine Learning and Deep Learning Algorithms For Cyber Security
No ratings yet
A Review On The Effectiveness of Machine Learning and Deep Learning Algorithms For Cyber Security
19 pages
Ktu Students: For More Study Materials WWW - Ktustudents.in
No ratings yet
Ktu Students: For More Study Materials WWW - Ktustudents.in
3 pages
Multimodal Post Attentive Profiling For Influencer Marketing
No ratings yet
Multimodal Post Attentive Profiling For Influencer Marketing
7 pages
Updated Paper ID-118
No ratings yet
Updated Paper ID-118
5 pages
Candidates Are Required To Give Their Answers in Their Own Words As Far As Practicable. The Figures in The Margin Indicate Full Marks
No ratings yet
Candidates Are Required To Give Their Answers in Their Own Words As Far As Practicable. The Figures in The Margin Indicate Full Marks
1 page
NLP and ML Project
100% (1)
NLP and ML Project
37 pages
Artificial Intelligence & Expert System
100% (1)
Artificial Intelligence & Expert System
18 pages
DBMS Practical 6
No ratings yet
DBMS Practical 6
9 pages
Hosni 2017
No ratings yet
Hosni 2017
14 pages
Large-Scale Multi-Class and Hierarchical Product Categorization For An E-Commerce Giant
No ratings yet
Large-Scale Multi-Class and Hierarchical Product Categorization For An E-Commerce Giant
11 pages
DM - 05 - 04 - Rule-Based Classification PDF
No ratings yet
DM - 05 - 04 - Rule-Based Classification PDF
72 pages
Memory 2 Images and Propositions
No ratings yet
Memory 2 Images and Propositions
40 pages
EML 5311 Project 28 April Ramin Shamshiri
No ratings yet
EML 5311 Project 28 April Ramin Shamshiri
33 pages
c11 Solutions v12
No ratings yet
c11 Solutions v12
216 pages
Lecture 3-Access
No ratings yet
Lecture 3-Access
32 pages
DBMS Unit 3
No ratings yet
DBMS Unit 3
77 pages
Human Machine Interaction
No ratings yet
Human Machine Interaction
27 pages
Abida System Theory-1
No ratings yet
Abida System Theory-1
13 pages
Lecture 9
No ratings yet
Lecture 9
24 pages
Intelligent Systems Notes
No ratings yet
Intelligent Systems Notes
7 pages
Rr411301 Neural Networks Fuzzy Logic Control
No ratings yet
Rr411301 Neural Networks Fuzzy Logic Control
7 pages
Paper 1 Vol. 16, No - 4, 2023
No ratings yet
Paper 1 Vol. 16, No - 4, 2023
11 pages
Data Science ML Full Stack Roadmap
No ratings yet
Data Science ML Full Stack Roadmap
35 pages
Medical Image Segmentation Using Squeeze-and-Expansion Transformers
No ratings yet
Medical Image Segmentation Using Squeeze-and-Expansion Transformers
9 pages
Time Series Classification From Scratch With Deep Neural Networks
No ratings yet
Time Series Classification From Scratch With Deep Neural Networks
9 pages
Hackathon
No ratings yet
Hackathon
1 page
Project Work: Final-ISA (Review 4)
No ratings yet
Project Work: Final-ISA (Review 4)
29 pages
Data Science and Machine Learning Brochure Skillovilla
No ratings yet
Data Science and Machine Learning Brochure Skillovilla
40 pages

Wu, Gu Et Al 2022 - PV-RCNN++ Semantical Point-Voxel Feature Interaction For 3D Object Detection

Uploaded by

Wu, Gu Et Al 2022 - PV-RCNN++ Semantical Point-Voxel Feature Interaction For 3D Object Detection

Uploaded by

The Visual Computer

PV-RCNN++: semantical point-voxel feature interaction for 3D object

Accepted: 6 September 2022

2 Related work clouds with the help of PointNet-like networks. Then at

2.3 Point-voxel hybrid method 3 Methodology

To achieve both efﬁciency and accuracy, point-voxel hybrid 3.1 Overview

Foreground Point Sampling

Group Group Group

Aggregaon Aggregaon Aggregaon

VFE feature concatenaon

3D Voxel CNN Aenon-based Residual PointNet

3D RPN RoI-grid pooling

Binary Segmentaon S-FPS (K points) Group & Aggregaon

Algorithm 1 Semantic-guided furthest point sampling. N is

Fig. 5 Demonstration of residual PointNet aggregation. It consists of

Q = Wq F ∈ Rdk ·N , K = Wk F ∈ Rdk ·N , 3.6 RoI grid pooling

where pi − g j is the relative location from pi to g j , which

3.7 Loss function

Our method is an end-to-end trainable network, which is

Ltotal = Lseg + Lrpn + Lrcnn + Lkey (10)

As we have mentioned in Sect. 3.3, the segmentation loss

Lrpn = α1 Lcls + α2 Lloc + α3 Ldir (11)

where α1 , α2 , α3 are assigned {1.0, 2.0, 0.2} to represent the

where Lrcnn_cls is optimized with the binary cross-entropy

Table 1 Performance comparison on KITTI ofﬁcial test server

MV3D [43] CVPR2017 L+C 74.97 63.63 54.00 – – – – – –

The function of voxel query is to boost the speed of voxel

our method still lags behind some state-of-the-art methods 5 Conclusion

Mingqiang Wei (Senior Member,

You might also like