[Point Cloud Object Detection] 3DSSD: Point-based 3D Single Stage Object Detector


梦醒时分1218


License: CC 4.0 BY-SA

Category: Paper Reading

Article link: https://blog.csdn.net/qq_35632833/article/details/105800007


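Summary: 3DSSD is a lightweight single-stage 3D point cloud object detector. Combined with a Fusion Sampling strategy, it addresses the scarcity of foreground points and balances high accuracy with real-time speed.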


Paper link: 3DSSD
Code link: Github
At the time of writing, 3DSSD ranks 8th on the KITTI leaderboard and runs at 25 FPS.

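I. Abstract

The paper proposes 3DSSD, a lightweight and efficient single-stage 3D point cloud object detector that balances detection accuracy and efficiency.
The paper notes that the upsampling layers and the refinement stage, both indispensable in current mainstream point-based methods, are dropped in 3DSSD to reduce computational cost.
It proposes a fusion sampling strategy, which addresses the shortage of foreground points among the representative points kept after downsampling.
The box prediction part of the network comprises the following three components: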

Candidate Generation Layer
Anchor-free Regression Head
3D Center-ness Assignment Strategy

II. Introduction

1. Voxel-based methods

Drawback:

Although these methods are straightforward and efficient, they suffer from information loss during voxelization and encounter performance bottleneck.

2. Point-based methods

Two-stage methods:
Stage one:

In the first stage, they first utilize set abstraction (SA) layers for downsampling and extracting context features. Afterwards, feature propagation (FP) layers are applied for upsampling and broadcasting features to points which are discarded during downsampling. A 3D region proposal network (RPN) is then applied for generating proposals centered at each point.

Stage two:

Based on these proposals, a refinement module is developed as the second stage to give final predictions.

Drawback: performance is good, but inference is slow.

These methods achieve better performance, but their inference time is usually intolerable in many real-time systems.

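III. 3DSSD

Overall structure: (figure)
The network consists of three parts:

SA layers
Fusion Sampling combines two sampling methods, one based on spatial distance and one based on feature distance, each selecting half of the points. Several SA layers are then stacked to form the backbone.
Candidate generation layer
Produces candidate points and uses them to aggregate the semantic information of the surrounding points.
Anchor-free prediction head
Regression and classification.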

1. Fusion Sampling

Motivation

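The authors observe that current point-based methods are time-consuming. In the first stage, SA layers downsample the point cloud and enlarge the receptive field, and FP layers then use interpolation to recover features for the points that were discarded during downsampling. See this figure from PointNet++:
(figure from PointNet++)

In the second stage, the proposals obtained from the RPN are refined to produce more accurate predictions.
SA layers are indispensable for feature extraction, so how can the FP layers and the refinement module be removed to obtain a more efficient detector? Read on.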

Challenge

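The authors first explain why removing the FP layers is not easy. Many of the points selected by distance-based FPS are background points (FPS considers only the geometry of the point cloud, not its semantics), so many foreground points are lost. Without FP layers to interpolate features back to all points, detection becomes difficult. This is why many methods still keep the FP layers: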

To avoid this circumstance, most of existing methods apply FP layers to recall those abandoned useful points during downsampling, but they have to pay the overhead of computation with longer inference time.

Feature-FPS

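To keep the useful foreground points, i.e. positive points (interior points within any instance), while removing useless background points, i.e. negative points (points locating on background), the authors propose sampling by feature distance rather than spatial distance.
A deep network captures the semantic information of points, so useless background points can largely be removed. Positive points of distant objects can still be kept, because points from different objects have different semantic features and therefore a large feature distance.
There is a problem, however: using feature distance as the only criterion causes redundancy. For example, points on the wheels and on the windows of a car have very different features, so both groups are sampled, even though points from either part alone would be enough for regression.
Therefore both the spatial distance and the feature distance are taken into account when running FPS: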

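$C(A, B)=\lambda L_{d}(A, B)+L_{f}(A, B)$

where $L_{d}(A, B)$ and $L_{f}(A, B)$ are the L2 distances between points $A$ and $B$ in XYZ space and in feature space, and $\lambda$ is a balance factor.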
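The combined criterion can be plugged into a standard greedy farthest point sampling loop. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `fps_with_metric`, the fixed starting index, and the default `lam=1.0` are all hypothetical choices made for illustration.

```python
import numpy as np

def fps_with_metric(xyz, feats, n_samples, lam=1.0, start=0):
    """Greedy farthest point sampling under C(A, B) = lam * L_d(A, B) + L_f(A, B).

    xyz:   (N, 3) point coordinates
    feats: (N, C) per-point semantic features
    Returns the indices of the n_samples selected points.
    """
    n = xyz.shape[0]
    selected = [start]  # assumption: sampling starts from a fixed index
    # min_dist[i] = smallest C(i, s) over all points s selected so far
    min_dist = np.full(n, np.inf)
    for _ in range(n_samples - 1):
        last = selected[-1]
        l_d = np.linalg.norm(xyz - xyz[last], axis=1)      # spatial distance
        l_f = np.linalg.norm(feats - feats[last], axis=1)  # feature distance
        min_dist = np.minimum(min_dist, lam * l_d + l_f)
        # pick the point farthest from the current selection
        selected.append(int(np.argmax(min_dist)))
    return np.array(selected)
```

In this sketch, a very large `lam` makes the spatial term dominate, so the behaviour approaches plain D-FPS, while `lam=0` samples purely by feature distance.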

The authors also give a comparison: with 1024 sampled points, F-FPS retains 89.2% of the objects, far more than D-FPS.
(figure: comparison of sampling strategies)

Fusion Sampling

F-FPS sampling preserves many positive points from different objects, but it discards many negative points. This helps regression but hurts classification.

That is, during the grouping stage in a SA layer, which aggregates features from neighboring points, a negative point is unable to find enough surrounding points, making it impossible to enlarge its receptive field. As a result, the model finds difficulty in distinguishing positive and negative points, leading to a poor performance in classification.

So the network needs not only as many positive points as possible, but also enough negative points for better classification. F-FPS and D-FPS are therefore each used to select half of the points.
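The half-and-half split can be sketched as follows. This is only an illustration in NumPy: the helper names `pairwise_l2`, `fps`, and `fusion_sampling` are hypothetical, the dense (N, N) distance matrices are only practical for small N, and the F-FPS branch assumes the combined spatial-plus-feature criterion with a balance factor `lam`.

```python
import numpy as np

def pairwise_l2(x):
    """(N, N) matrix of L2 distances between the rows of x."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

def fps(dist, n_samples, start=0):
    """Farthest point sampling on a precomputed (N, N) distance matrix."""
    selected = [start]
    min_dist = dist[start].copy()  # distance of every point to the selected set
    for _ in range(n_samples - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return np.array(selected)

def fusion_sampling(xyz, feats, n_samples, lam=1.0):
    """Select half of the points with D-FPS and the other half with F-FPS.

    Assumes n_samples >= 2. The two index sets are sampled independently,
    so in this sketch they may overlap (both start from index 0).
    """
    half = n_samples // 2
    d_xyz = pairwise_l2(xyz)
    idx_d = fps(d_xyz, half)                                          # D-FPS
    idx_f = fps(lam * d_xyz + pairwise_l2(feats), n_samples - half)   # F-FPS
    return idx_d, idx_f
```

Returning the two index sets separately keeps it possible to treat them differently later, for example using only the F-FPS points as initial centers.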

2. Box Prediction Network

Candidate Generation Layer

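After the SA modules, the authors design a CG (candidate generation) layer to aggregate local information. The CG layer first generates candidate points from the representative points.
Many of the points sampled by D-FPS are negative points and contribute little to bounding box regression, so only the points sampled by F-FPS are used as initial center points.
These initial center points are then shifted by regression. The shift is supervised by the offset between each initial center point and the center of its corresponding object, similar to VoteNet, and the shifted points are the candidate points. As shown below:
(figure)
Next, the candidate points are treated as center points. Their surrounding points are searched in the combined set of F-FPS and D-FPS points, and MLP layers extract their features. These features are finally fed into an anchor-free prediction head to predict the final 3D bounding boxes.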

Then we treat these candidate points as the center points in our CG layer. We use candidate points rather than original points as the center points for the sake of performance, which will be discussed in detail later. Next, we find the surrounding points of each candidate point from the whole representative point set containing points from both D-FPS and F-FPS with a pre-defined range threshold, concatenate their normalized locations and semantic features as input, and apply MLP layers to extract features. These features will be sent to the prediction head for regression and classification.
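The shifting and grouping steps described above can be sketched in NumPy. This is an assumption-laden illustration, not the paper's code: the function name `candidate_generation` is hypothetical, the shifts are passed in as an array (in the real network they are predicted), and "normalized locations" is interpreted here as the offset to the candidate point divided by the search radius. The MLP and pooling that follow are omitted.

```python
import numpy as np

def candidate_generation(init_centers, shifts, rep_xyz, rep_feats, radius):
    """Shift initial centers into candidate points and group their neighbors.

    init_centers: (M, 3) points taken from F-FPS
    shifts:       (M, 3) predicted offsets towards the object centers
    rep_xyz:      (N, 3) all representative points (D-FPS and F-FPS together)
    rep_feats:    (N, C) their semantic features
    radius:       pre-defined range threshold for the neighbor search
    """
    candidates = init_centers + shifts
    groups = []
    for c in candidates:
        # neighbors of this candidate within the range threshold
        mask = np.linalg.norm(rep_xyz - c, axis=1) <= radius
        # assumption: normalize locations as offsets relative to the radius
        local_xyz = (rep_xyz[mask] - c) / radius
        # concatenate normalized locations with semantic features: (K, 3 + C)
        groups.append(np.concatenate([local_xyz, rep_feats[mask]], axis=1))
    return candidates, groups
```

Each entry of `groups` would then be passed through shared MLP layers and pooled into one feature vector per candidate point.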

Anchor-free Regression Head

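The authors first explain why they do not use an anchor-based approach: anchors have to be configured for the different objects in the scene, and each anchor also needs several orientations. For example, the nuScenes dataset has 10 object classes, each with different orientations, so at least 20 anchors would be needed, which is complicated. The authors therefore adopt an anchor-free approach. Here is an illustration of anchors:
(figure: anchor illustration)
Seven quantities are regressed:

distance: $(d_{x}, d_{y}, d_{z})$
size: $(d_{l}, d_{w}, d_{h})$
orientation

Since a point has no prior orientation, the authors use a hybrid of classification and regression for the orientation (to be expanded).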

Since there is no prior orientation of each point, we apply a hybrid of classification and regression formulation following [Frustum PointNets] in orientation angle regression.
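A common form of this hybrid formulation splits the angle range into equal bins, classifies the bin, and regresses the residual to the bin center. The sketch below shows that encoding and decoding under stated assumptions: the function names are hypothetical, the choice of 12 bins is an assumption for illustration, and bin centers are placed at multiples of the bin size.

```python
import numpy as np

def encode_angle(angle, num_bins=12):
    """Split an angle (radians) into a bin class and a residual to the bin center."""
    two_pi = 2 * np.pi
    bin_size = two_pi / num_bins
    # shift by half a bin so that bin k is centered at k * bin_size
    shifted = (angle % two_pi + bin_size / 2) % two_pi
    cls = int(shifted // bin_size)
    residual = shifted - (cls * bin_size + bin_size / 2)
    return cls, residual

def decode_angle(cls, residual, num_bins=12):
    """Inverse of encode_angle: bin center plus residual, wrapped to [0, 2*pi)."""
    bin_size = 2 * np.pi / num_bins
    return (cls * bin_size + residual) % (2 * np.pi)
```

With this encoding the classification target is the bin index and the regression target is a small residual bounded by half a bin width, which is easier to learn than a raw angle with a wrap-around at $2\pi$.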

3D Center-ness Assignment Strategy

During training, each candidate point has to be assigned a label.
In 2D object detection, labels are usually assigned to pixels with an IoU threshold or a mask. FCOS additionally proposes a continuous center-ness label:

$\text { centerness }^{*}=\sqrt{\frac{\min \left(l^{*}, r^{*}\right)}{\max \left(l^{*}, r^{*}\right)} \times \frac{\min \left(t^{*}, b^{*}\right)}{\max \left(t^{*}, b^{*}\right)}}$

The closer a pixel is to the object center, the closer its center-ness is to 1 and the higher the score it is assigned, as shown below:
(figure)
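As a small worked example of the FCOS center-ness formula, the sketch below evaluates it for a pixel given its distances to the left, right, top, and bottom sides of the box. The function name `fcos_centerness` is chosen here for illustration.

```python
import math

def fcos_centerness(l, r, t, b):
    """FCOS center-ness of a pixel from its distances to the four box sides."""
    return math.sqrt(min(l, r) / max(l, r) * min(t, b) / max(t, b))

# A pixel at the exact center has equal distances on both axes.
center_score = fcos_centerness(2.0, 2.0, 1.0, 1.0)
# A pixel three times closer to the left side than to the right side.
off_center_score = fcos_centerness(1.0, 3.0, 1.0, 1.0)
```

Here `center_score` equals 1, while `off_center_score` equals $\sqrt{1/3}$, so pixels away from the center are down-weighted.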

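In the 3D detection task, however, all LiDAR points lie on object surfaces, so their center-ness values are all very small and similar, which makes it unlikely to get good predictions from these points. This is why the original sampled points are not used as candidate points. Instead, the points sampled by F-FPS are shifted by center regression and the shifted points serve as candidates: candidates close to the center give closer and more accurate results, and the center-ness label easily separates them from points on the object surface.
The center-ness label is defined in two steps:

$l_{mask}$: whether the point lies inside an object (binary value)
$l_{ctrness}$: draw the object's six-sided box, compute the distances from the point to its front, back, left, right, top, and bottom surfaces, and obtain the center-ness value with the following formula
$l_{ctrness}=\sqrt[3]{\frac{\min (f, b)}{\max (f, b)} \times \frac{\min (l, r)}{\max (l, r)} \times \frac{\min (t, d)}{\max (t, d)}}$
The final classification label is $l_{mask} \times l_{ctrness}$.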
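The two-step label can be computed directly from a candidate point and a ground-truth box. The following NumPy sketch assumes a box given by its center, its size as (length, width, height) with strictly positive extents, and a yaw rotation around the z axis. The function name `centerness_label` and the axis conventions are assumptions for illustration.

```python
import numpy as np

def centerness_label(point, box_center, box_size, yaw):
    """Return l_mask * l_ctrness for one candidate point and one 3D box."""
    c, s = np.cos(yaw), np.sin(yaw)
    d = np.asarray(point, dtype=float) - np.asarray(box_center, dtype=float)
    # rotate the offset by -yaw to express it in the box's own frame
    local = np.array([c * d[0] + s * d[1], -s * d[0] + c * d[1], d[2]])
    half = np.asarray(box_size, dtype=float) / 2.0
    dist_pos = half - local  # distances to the front, left, and top faces
    dist_neg = half + local  # distances to the back, right, and bottom faces
    # l_mask: 1 if the point lies inside the box, 0 otherwise
    inside = bool(np.all(dist_pos >= 0) and np.all(dist_neg >= 0))
    if not inside:
        return 0.0
    ratios = np.minimum(dist_pos, dist_neg) / np.maximum(dist_pos, dist_neg)
    # l_ctrness: cube root of the product of the three min/max ratios
    return float(np.cbrt(np.prod(ratios)))
```

A point at the box center gets a label of 1, a point on any face gets 0, and a point outside the box is masked to 0.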

IV. Experimental Results

(figure: experimental results)