
Density-preserving Deep Point Cloud Compression

Yun He¹* Xinlin Ren¹* Danhang Tang² Yinda Zhang² Xiangyang Xue¹ Yanwei Fu¹
¹Fudan University   ²Google

Abstract

Local density of point clouds is crucial for representing local details, but has been overlooked by existing point cloud compression methods. To address this, we propose a novel deep point cloud compression method that preserves local density information. Our method works in an auto-encoder fashion: the encoder downsamples the points and learns point-wise features, while the decoder upsamples the points using these features. Specifically, we propose to encode local geometry and density with three embeddings: density embedding, local position embedding and ancestor embedding. During decoding, we explicitly predict the upsampling factor for each point, and the directions and scales of the upsampled points. To mitigate the clustered points issue in existing methods, we design a novel sub-point convolution layer, and an upsampling block with adaptive scale. Furthermore, our method can also compress point-wise attributes, such as normal. Extensive qualitative and quantitative results on SemanticKITTI and ShapeNet demonstrate that our method achieves the state-of-the-art rate-distortion trade-off.

Figure 1. We argue that local density is an important characteristic of the point cloud and should be preserved during compression. Existing methods that ignore the local density exhibit artifacts such as uniform distribution (G-PCC [12]) and clustered points (Depoco [40]), resulting in worse reconstruction, especially when the bitrate is low. Panels: GT, Ours (Bpp: 1.61), G-PCC (Bpp: 1.66), Depoco (Bpp: 1.73).

∗ indicates equal contribution. Yun He, Xinlin Ren and Xiangyang Xue are with the School of Computer Science, Fudan University. Yanwei Fu is with the School of Data Science, Fudan University.

1. Introduction

Point cloud is one of the most important and widely used 3D representations in many applications, such as autonomous driving, robotics and physics simulation [13]. With the rapid development of 3D scanning technology, complex geometry can now be effectively captured as large point clouds with fine details. As a consequence, point cloud compression becomes crucial for storage and transmission. Particularly, to achieve a favorable compression ratio, the community has been focusing on lossy methods and pondering the key question: what properties of point clouds should be preserved, given a limited bitrate budget?

Besides the global geometry, we argue that local density is an important characteristic and should be preserved as much as possible. Firstly, preserving density usually leads to fewer outliers, and thus smaller reconstruction error. Secondly, point clouds captured in practice, e.g. from LiDAR, rarely have uniformly distributed points; losing local density means losing important traits such as scanning resolution and occlusion. Thirdly, point clouds are often processed or simplified to be denser on regions of interest or with complex geometry, such as human faces, hands, etc., so preserving density during compression means more budget is spent on these regions. Last but not least, if the decompressed point cloud has a significantly different density from the raw one, downstream applications such as semantic segmentation may be affected.

Mathematically, a point cloud can be considered as a set, often with different cardinality and permutation settings [8], which makes it difficult for image/video compression or conventional learning-based solutions that assume fixed-dimensional and ordered input. A typical strategy of existing lossy methods is to voxelize the points before
compression [12, 27, 28, 37, 38]. While this allows leveraging conventional methods [7, 22], it obviously loses the local density and has a precision capped by the voxel size. Recent methods [17, 41] utilize PointNet [24] or PointNet++ [25] to ignore the cardinality and permutation with max pooling, and preserve density to some extent. However, the decompressed point clouds always lose local details and suffer from the clustered points issue, since most of the local geometry has been discarded by max pooling. Depoco [40] adopts KPConv [35] to capture more local spatial information than pooling, but the clustered points artifact still exists due to feature replication; see Fig 1. Alternatively, Zhao et al. [46] introduce an attention mechanism to handle different cardinalities and permutations, though it is not designed for compression purposes.

In this paper, we propose a novel density-preserving deep point cloud compression method which yields a superior rate-distortion trade-off to prior arts, and more importantly preserves the local density. Our method has an auto-encoder architecture, trained with an entropy encoder end-to-end. The contributions of our paper are summarized as follows. On the encoder side: three types of feature embeddings are designed to capture local geometry distribution and density. On the decoder side: to mitigate the clustered points issue, we propose 1) the sub-point convolution to promote feature diversity during upsampling; 2) a learnable number of upsampling points, and a learnable scale for their offsets in different regions.

We conduct extensive experiments and ablation studies to justify these contributions. Additionally, we demonstrate that our method can be easily extended to jointly compress attributes such as normals.

2. Related Work

Point Cloud Analysis. Point clouds are typically unstructured, irregular and unordered, and cannot be immediately processed by conventional convolution. To tackle this issue, many works [21, 30] first voxelize points and then apply 3D convolution, which however can be computationally expensive. Another type of approach directly operates on point clouds, hence termed point-based. For example, PointNet [24] and PointNet++ [25] use max pooling to ignore the order of points. DGCNN [39] proposes dynamic graph convolution for non-local feature aggregation. And Point Transformer [46] introduces a purely self-attention [36] based network.

Point Cloud Compression. Traditional point cloud compression algorithms [10–12, 23, 31, 32] usually rely on octree [22] or KD-tree [5] structures for storage efficiency. Inspired by the great success of deep learning in point cloud analysis [24, 25, 39, 46] and image compression [2, 3], the community has begun to focus on learning-based point cloud compression. Similarly, lossy methods can also be categorized into voxel-based [27, 28, 37, 38] and point-based [17, 40, 41]. While sharing the pros and cons discussed for point cloud analysis, point-based methods are able to preserve local density since they take the raw 3D points as input. Specifically, Yan et al. [41] integrate PointNet [24] into an auto-encoder framework, while Huang et al. [17] use PointNet++ [25] instead. Architecture-wise, Wiesmann et al. [40] propose to downsample the point cloud while encoding and upsample during decoding. Moreover, research on deep entropy models [6, 16, 29] is also active, although it is nearly lossless since its loss comes only from quantization. In this paper we focus on the more lossy compression regime in favor of a higher compression ratio.

Point Cloud Upsampling. Point cloud upsampling aims to upsample a sparse point cloud into a dense and uniform one, and previous methods typically design various feature expansion modules to achieve this. In particular, Yu et al. [44] replicate features and transform them by multi-branch MLPs. Some other methods [19, 20, 43] employ folding-based [42] upsampling, which also duplicates features first. Specifically, Wang et al. [43] assign each duplicated feature a 1D code, while Li et al. [19] and Li et al. [20] concatenate each replicated feature with a point sampled from a 2D grid. However, the upsampled features generated by these methods can be too similar to each other due to replication, which inevitably results in clustered points.

3. Methodology

The proposed density-preserving deep point cloud compression framework is based on a symmetric auto-encoder architecture, where the encoder has S downsampling stages indexed by 0, 1, ..., S − 1, and the decoder also has S upsampling stages indexed reversely by S − 1, S − 2, ..., 0. For stage s of the encoder, the input point cloud is notated as Ps and the output as Ps+1. Reversely, on the decoder side, the input and output of stage s are P̂s+1 and P̂s respectively, as shown in Fig 2. Note that to distinguish from encoding, the hat symbol is used for reconstructed point clouds and associated features.

The input point cloud P0 is first partitioned into smaller blocks which are compressed individually. For simplicity, we use the same notation P0 for a block. Specifically, on the encoder side, the input Ps is downsampled to Ps+1 by a factor fs at each stage s, while local geometry and density are also encoded into features Fs+1. At the bottleneck, the features FS are then fed into an end-to-end trained entropy encoder for further compression. When decompressing, we recover the downsampled point cloud P̂S, along with the features F̂S extracted by the entropy decoder. Our upsampling module then utilizes F̂S to upsample P̂S back to the reconstructed point cloud P̂0, stage by stage.
Figure 2. Our pipeline first partitions the point cloud into small blocks. Each block is then downsampled three times (DS blocks, stages 0–2: P0 → P1 → P2 → P3) while the local density and geometry patterns of collapsed points are encoded into features. At the bottleneck, the downsampled features F3 are further compressed by an entropy encoder. The decoder can then use the decoded features F̂3 to adaptively upsample the downsampled point cloud P̂3 back to the original geometry and density (US blocks, stages 2–0: P̂3 → P̂2 → P̂1 → P̂0). The details of the downsampling (DS) block and upsampling (US) block are shown in Fig 3 and Fig 5 respectively.
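As a structural illustration of the symmetric S-stage design above, the following is a minimal sketch in PyTorch-style Python. DownBlock, UpBlock and EntropyBottleneck are hypothetical stand-ins for the paper's downsampling blocks, adaptive upsampling blocks and learned entropy model; this only shows the data flow, not the authors' implementation.

```python
import torch.nn as nn

class DensityPreservingCodec(nn.Module):
    """Minimal sketch of the symmetric auto-encoder in Fig 2 (S stages).
    The blocks passed in are placeholders, not the paper's actual modules."""
    def __init__(self, down_blocks, up_blocks, entropy_bottleneck):
        super().__init__()
        self.down_blocks = nn.ModuleList(down_blocks)   # encoder stages 0 .. S-1
        self.up_blocks = nn.ModuleList(up_blocks)       # decoder stages S-1 .. 0
        self.bottleneck = entropy_bottleneck            # learned entropy model

    def forward(self, p0):
        p, f = p0, None
        for down in self.down_blocks:                   # P_s -> P_{s+1}, F_{s+1}
            p, f = down(p, f)
        f_hat, rate_bits = self.bottleneck(f)           # quantize + entropy-code F_S
        p_hat = p                                       # P_S itself is stored as half floats
        for up in self.up_blocks:                       # P̂_{s+1} -> P̂_s guided by F̂
            p_hat, f_hat = up(p_hat, f_hat)
        return p_hat, rate_bits                         # reconstruction P̂_0 and rate estimate
```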

3.1. Density-preserving Encoder

Downsampling. At each stage s of the encoder, an input point cloud block Ps is downsampled to Ps+1 by a factor of fs using farthest point sampling (FPS), which encourages the sampled points to have a good coverage of the point cloud Ps. Please refer to the supplementary section for the ablation study of different sampling techniques.

Feature embedding. Ps+1 itself does not preserve the distribution of the points discarded from Ps, so simply upsampling Ps+1 by 1/fs would end up with a reconstruction with poor accuracy and uniform density. To address this, for each point p ∈ Ps+1, we calculate three different embeddings: density embedding, local position embedding and ancestor embedding, to capture the geometry and density of the discarded points Ps − Ps+1 in a compact form with low entropy.

First we define the concept of a collapsed points set C(p). After the downsampled points set is decided, each discarded point is deemed to collapse into its nearest downsampled point exclusively. Thus all the points that collapse into a downsampled point p form a collapsed points set C(p), and we term u = |C(p)| the downsampling factor of point p.

The density embedding FD captures the cardinality of C(p) by mapping the downsampling factor u to a d-dimensional embedding via MLPs. Secondly, the local position embedding captures the distribution of C(p). Specifically, for each pk ∈ C(p), the direction and distance of the offset pk − p are calculated as below:

\left( \frac{p_k - p}{\|p_k - p\|_2},\; \|p_k - p\|_2 \right), \quad p \in P_{s+1},\ p_k \in C(p), \tag{1}

where the direction (3D) and distance (scalar) are represented by this 4D vector. Consequently, the local point distribution centered at p can be represented by a u × 4 feature, which is mapped to a higher-dimensional (u × d) space with MLPs, before an attention mechanism [36] is applied to aggregate them into a d-dimensional embedding FP.

While the density and position embeddings capture the local density and geometry at stage s, it is necessary to pass along this information from previous stages without adding much rate cost. To this end, we employ the point transformer layer [46] to aggregate the previous-stage features of the collapsed points set C(p) into the representative sampled point p, due to its simplicity and effectiveness. We term this d-dimensional vector FA the ancestor embedding.

At last, an MLP fuses these three embeddings (FP, FD, FA) into a new d-dimensional feature Fs+1 for the next stage. This process is illustrated in Fig 3.

Figure 3. The downsampling block: first a subset of points are chosen as samples, and then three types of embeddings are computed and fused into Fs+1.

Entropy encoding. At the bottleneck, we have a downsampled point cloud PS and per-point features FS. For PS, we use a half-float representation to reduce bitrate, while FS is further compressed by an entropy encoder. Following recent success in deep image compression [2, 3], we integrate an arithmetic encoder into the training process to jointly optimize the entropy of the features. This process is accompanied by a rate loss function that will be introduced later in Sec 3.3.
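To make the collapsed points sets and the local position features of Eq. (1) concrete, the following is a small self-contained sketch (PyTorch). The helper names farthest_point_sample and collapse_and_embed are hypothetical, and the real encoder additionally maps these 4D features through MLPs and attention to obtain FP, FD and FA.

```python
import torch

def farthest_point_sample(points, m):
    """Greedy FPS over an (N, 3) tensor; returns indices of m well-spread points.
    A simple reference implementation, not the authors' code."""
    n = points.shape[0]
    idx = torch.zeros(m, dtype=torch.long)          # idx[0] = 0 as the seed
    dist = torch.full((n,), float("inf"))
    for i in range(1, m):
        dist = torch.minimum(dist, (points - points[idx[i - 1]]).pow(2).sum(-1))
        idx[i] = torch.argmax(dist)
    return idx

def collapse_and_embed(points, m):
    """Assign every discarded point to its nearest kept point (the collapsed set
    C(p)) and build the per-point (direction, distance) features of Eq. (1)."""
    keep = farthest_point_sample(points, m)
    sampled = points[keep]                                    # P_{s+1}: (m, 3)
    mask = torch.ones(points.shape[0], dtype=torch.bool)
    mask[keep] = False
    discarded = points[mask]                                  # P_s - P_{s+1}
    owner = torch.cdist(discarded, sampled).argmin(dim=1)     # nearest sampled point
    local_feats, factors = [], []
    for j in range(m):
        group = discarded[owner == j]                         # C(p_j)
        offset = group - sampled[j]
        dist = offset.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        local_feats.append(torch.cat([offset / dist, dist], dim=-1))  # |C(p_j)| x 4
        factors.append(group.shape[0])                        # downsampling factor u_j
    return sampled, local_feats, torch.tensor(factors)

pts = torch.rand(1024, 3)
sampled, feats, u = collapse_and_embed(pts, 256)
print(sampled.shape, feats[0].shape, int(u.sum()))            # u sums to 1024 - 256
```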
3.2. Density-recovering Decoder

Overview. During decoding, symmetrically, we have S upsampling stages. At the bottleneck, we have the downsampled point cloud P̂S and the decoded features F̂S extracted by the entropy decoder. Recall that during encoding, for each downsampled point p, u discarded points collapse into it. This information is not losslessly transmitted but fused into the features. During decoding, in order to properly upsample each point, we apply MLPs to predict an upsampling factor û ≈ u from the features. Similar to the collapsed set C(p) on the encoder, we define the upsampled set of a specific point p̂ ∈ P̂s+1 as Ĉ(p̂).

In addition to Ĉ(p̂), the feature of each upsampled point is also predicted. Therefore the output for each point p̂ at upsampling stage s is:

\big( \underset{U \times 3}{\hat{C}(\hat{p})},\; \underset{U \times d}{\hat{F}(\hat{p})},\; \hat{u} \big), \quad \hat{p} \in \hat{P}_{s+1},\ \hat{u} \le U, \tag{2}

where Ĉ(p̂) and F̂(p̂) here have U items, but only the first û points and features are chosen as the final outputs. The union of all chosen points is the upsampled point cloud P̂s for the next stage, and the same goes for F̂s.

Sub-point convolution. At upsampling stage s, guided by the features F̂s+1, we aim to upsample each point p̂ ∈ P̂s+1 by the predicted upsampling factor û. Additionally, F̂s+1 also needs to be expanded into û features F̂s for the next stage. To do so, prior upsampling methods either use multi-branch MLPs for feature expansion [40, 44] or apply folding-based [42] upsampling modules [19, 20, 43]. Despite efforts of regularization and refinement, they still suffer from the aforementioned clustered points artifact due to feature replication. To address this, we propose a novel and efficient operator, the sub-point convolution (Fig 4), inspired by the sub-pixel convolution [33].

Specifically, given the input N × din features F̂s+1, we first divide them into U groups along the channel dimension, such that each group has din/U channels. A convolution layer per group is applied to expand the features to a space with dimension N × U dout. At last, we use a periodic shuffle to reshape the upsampled features to UN × dout. Compared with prior methods [19, 20, 40, 43, 44], sub-point convolution has the following advantages: 1) the clustered points issue is mitigated by preventing feature replication; 2) convolution is applied to each group with a lower dimension, which significantly reduces the parameters and computations.

Figure 4. The illustration of sub-point convolution.
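Taking the description above literally, sub-point convolution can be sketched as a channel-grouped 1×1 convolution followed by a periodic shuffle, analogous to sub-pixel convolution for images. The snippet below is such a sketch under assumed tensor layouts; it is not the authors' released code.

```python
import torch
import torch.nn as nn

class SubPointConv(nn.Module):
    """Sketch of sub-point convolution: a grouped 1x1 convolution expands
    N point features into U*N features via a periodic shuffle, avoiding
    feature replication."""
    def __init__(self, d_in, d_out, up_factor):
        super().__init__()
        assert d_in % up_factor == 0, "channels must split evenly into U groups"
        self.u, self.d_out = up_factor, d_out
        # each of the U groups sees d_in/U channels and emits d_out channels
        self.conv = nn.Conv1d(d_in, up_factor * d_out, kernel_size=1, groups=up_factor)

    def forward(self, feats):                          # feats: (B, d_in, N)
        b, _, n = feats.shape
        x = self.conv(feats)                           # (B, U*d_out, N)
        x = x.view(b, self.u, self.d_out, n)           # split the U groups
        x = x.permute(0, 2, 3, 1).reshape(b, self.d_out, n * self.u)  # periodic shuffle
        return x                                       # (B, d_out, U*N): U children per input point

# toy usage: 8 points with 32-dim features, upsampled by U=4 into 64-dim features
feats = torch.randn(1, 32, 8)
print(SubPointConv(32, 64, 4)(feats).shape)            # torch.Size([1, 64, 32])
```

Because each of the U children of a point is produced by its own channel group rather than by copying a shared feature, the expanded features differ by construction, which is what mitigates the clustered-points artifact of feature replication.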
Upsampling block with adaptive scale. Based on the sub-point convolution, we build our upsampling block for points and associated features, as depicted in Fig 5. Centering at each point p̂ ∈ P̂s+1, offsets of the upsampled points are predicted. Since both downsampling and upsampling happen in local regions, the scales of the predicted offsets need to be constrained. To this end, folding-based methods [19, 20, 43] use predefined small grid sizes, while Wiesmann et al. [40] constrain predicted offsets to [-1, 1] and then scale them with a predefined factor. However, this scaling factor may vary significantly across different regions and different point clouds. Hence we design a new upsampling module with learnable scales.

Figure 5. The scale-adaptive upsampling block, which includes both point upsampling and feature upsampling.

In particular, a pool of M vectors is first sampled from a unit sphere and kept fixed as candidate directions for both training and inference. During upsampling, weights for these candidates are predicted such that the weighted-sum result is the most probable direction. Scaling factors, or magnitudes, are also predicted from the input features F̂s+1 to obtain the offsets and thus the upsampled points.

The feature expansion is performed by sub-point convolution within a residual block [14]. Once we obtain the final points and features, a refinement layer is added to finetune the upsampled points and features. It is essentially an upsampling block with upsampling factor û = 1.
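One possible reading of the direction and scale prediction is sketched below: a fixed pool of unit-sphere candidate directions, per-point softmax weights over the candidates, and a learned per-point magnitude produce the offsets. The module name, layer sizes and the use of softmax are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaptiveOffsetHead(nn.Module):
    """Sketch of offset prediction in the scale-adaptive upsampling block."""
    def __init__(self, d_feat, num_candidates=32):
        super().__init__()
        dirs = torch.randn(num_candidates, 3)                       # random draw, normalized
        self.register_buffer("dirs", dirs / dirs.norm(dim=-1, keepdim=True))  # fixed M x 3 pool
        self.weight_head = nn.Linear(d_feat, num_candidates)        # weights over candidates
        self.scale_head = nn.Linear(d_feat, 1)                      # learned magnitude

    def forward(self, parent_xyz, up_feats):
        # parent_xyz: (U*N, 3) duplicated parent positions; up_feats: (U*N, d_feat)
        w = torch.softmax(self.weight_head(up_feats), dim=-1)
        direction = w @ self.dirs                                    # weighted-sum direction
        scale = self.scale_head(up_feats)                            # per-point offset scale
        return parent_xyz + scale * direction                        # upsampled points

# toy usage with U*N = 40 upsampled features of dimension 128
feats = torch.randn(40, 128)
parents = torch.zeros(40, 3)
print(AdaptiveOffsetHead(128)(parents, feats).shape)                 # torch.Size([40, 3])
```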
3.3. Loss Function

We employ the standard rate-distortion loss function during training for a better trade-off:

\mathcal{L} = D + \lambda R, \tag{3}

where D penalizes distortion and R penalizes bitrate.

Distortion loss. For distortion (reconstruction error), we utilize the symmetric point-to-point Chamfer Distance [16] to measure the difference between the reconstructed point cloud P̂s and the ground truth Ps. Since the decoder has S stages, to avoid error accumulation, we compute the distortion loss at each stage and aggregate them as Dcha.

A density term is also designed to encourage recovering local density. At stage s of the decoder, a point p̂ is upsampled to a new chosen points set Ĉ(p̂) (see Sec 3.2). We then find its nearest counter point p on the encoder side, which is collapsed from a set C(p) (see Sec 3.1). Hence we can define the density loss Dden as:

D_{den} = \sum_{s=0}^{S-1} \sum_{\hat{p} \in \hat{P}_{s+1}} \frac{\big| |C(p)| - |\hat{C}(\hat{p})| \big| + \gamma \big| \overline{C(p)} - \overline{\hat{C}(\hat{p})} \big|}{|\hat{P}_{s+1}|}, \tag{4}

where the first term of the numerator calculates the cardinality difference between the two sets, the second calculates the difference between the mean distances of all points in the sets to the center points p or p̂, and γ is the weight.

To further facilitate the density estimation, for each stage s, we utilize another loss to measure the cardinality difference between the ground truth Ps and the reconstructed point cloud P̂s:

D_{num} = \sum_{s=0}^{S-1} \big| |P_s| - |\hat{P}_s| \big|. \tag{5}

Finally, the overall distortion loss is as follows:

D = D_{cha} + \alpha D_{den} + \beta D_{num}, \tag{6}

where α and β are the weights of the respective terms.

Rate loss. Since entropy encoding is non-differentiable, a differentiable proxy is applied during training. Following [2, 3], we replace the quantization step with an additive uniform noise, and estimate the number of bits as the rate loss R. During inference, features are properly quantized and compressed by a range encoder.
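Assuming the decoder points have already been matched to their nearest encoder counterparts as described above, a per-stage version of Eqs. (4) and (5), together with the uniform-noise quantization proxy used for the rate term, can be sketched as follows (tensor layouts and helper names are assumptions):

```python
import torch

def density_loss_stage(u_enc, mean_enc, u_dec, mean_dec, gamma=1.0):
    """One-stage density term of Eq. (4): the i-th entries of the four tensors
    correspond to a matched pair (p, p̂); gamma weights the mean-distance term."""
    card = (u_enc.float() - u_dec.float()).abs()       # | |C(p)| - |Ĉ(p̂)| |
    spread = (mean_enc - mean_dec).abs()                # mean-distance difference
    return (card + gamma * spread).mean()               # averaged over |P̂_{s+1}|

def cardinality_loss(sizes_enc, sizes_dec):
    """Eq. (5): difference between |P_s| and |P̂_s|, summed over stages."""
    return sum(abs(a - b) for a, b in zip(sizes_enc, sizes_dec))

def noisy_quantize(features):
    """Training-time stand-in for quantization (cf. [2, 3]): additive uniform
    noise in [-0.5, 0.5] instead of rounding, so the rate proxy stays differentiable."""
    return features + torch.empty_like(features).uniform_(-0.5, 0.5)

# toy check of the density term
u_enc = torch.tensor([3, 5, 2]); u_dec = torch.tensor([3, 4, 2])
m_enc = torch.tensor([0.10, 0.20, 0.15]); m_dec = torch.tensor([0.12, 0.18, 0.15])
print(density_loss_stage(u_enc, m_enc, u_dec, m_dec))
```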
3.4. Attribute Compression

Our framework can also compress point cloud attributes such as color, normal, etc. As an example, we incorporate normal compression into our framework. To avoid extra bitrate cost, we keep the same network architecture and hyperparameters; the only difference is that the input/output dimension changes from 3D to 6D (position + normal). To facilitate this, we employ a simple L2 loss to minimize the normal reconstruction error.

4. Evaluation

In this section, we evaluate our method by comparing to state-of-the-art methods on compression rate, reconstruction accuracy and local density recovery. We then provide ablation studies to justify the design choices. Lastly, we demonstrate that additional attributes like normals can also be compressed. Please refer to the supplementary section for implementation details and parameter settings.

4.1. Experiment Setup

Datasets. We conduct our main experiments on SemanticKITTI [4] and ShapeNet [9]. For SemanticKITTI, we follow the official training/testing split [4]. For ShapeNet, we follow [17] to split the training/testing sets and sample points from meshes based on [15]. All point clouds are first normalized to 100m³ cubes and divided into non-overlapping blocks of 12m³ and 22m³ for SemanticKITTI and ShapeNet respectively, while each block is further normalized to [-1, 1]. For the downstream surface reconstruction task, we use the RenderPeople [1] dataset.

Baselines. We compare to both state-of-the-art non-learning based methods: G-PCC [12], Google Draco [11], MPEG Anchor [23]; and learning-based methods: Depoco [40], PCGC [38]. Note that all learning-based methods have been retrained on the same datasets as our method.

Evaluation metrics. Following [6, 16], we adopt the symmetric point-to-point Chamfer Distance (CD) and point-to-plane PSNR for geometry accuracy, and Bits per Point (Bpp) for compression rate. Moreover, we design a new metric to measure local density differences. All these metrics are evaluated on each block. Specifically, for each point p, we notate its neighbor points within radius r = 0.15 as K(p). Since the cardinalities of the ground truth P0 and the reconstructed point cloud P̂0 are not necessarily the same, we define a symmetric density metric DM as:

DM(P_0, \hat{P}_0) = \frac{1}{|P_0|} \sum_{p \in P_0} \delta(p, \hat{p}) + \frac{1}{|\hat{P}_0|} \sum_{\hat{p} \in \hat{P}_0} \delta(\hat{p}, p), \quad \text{where}\ \delta(a, b) = \frac{\big| |K(a)| - |K(b)| \big|}{|K(a)|} + \mu \frac{\big| \overline{K(a)} - \overline{K(b)} \big|}{\overline{K(a)}}, \tag{7}

where b is the nearest counter point of a, µ is the weight, |K(a)| denotes the cardinality of K(a) and K̄(a) denotes the mean distance of all points in K(a) to a.
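A direct, brute-force reading of Eq. (7) is sketched below; neighborhood statistics are computed from pairwise distances, which is only practical on a per-block basis, and the handling of empty neighborhoods is an assumption.

```python
import torch

def density_metric(gt, rec, radius=0.15, mu=1.0):
    """Sketch of the symmetric density metric DM in Eq. (7): for every point,
    its radius neighborhood K(.) is compared (cardinality and mean neighbor
    distance) against that of its nearest counterpart in the other cloud."""
    def neighborhood_stats(cloud):
        d = torch.cdist(cloud, cloud)                    # (N, N) pairwise distances
        mask = (d < radius) & (d > 0)                    # exclude the point itself
        count = mask.sum(dim=1).clamp_min(1)             # avoid division by zero
        mean_dist = (d * mask).sum(dim=1) / count
        return count.float(), mean_dist.clamp_min(1e-8)

    def one_direction(a, b):
        cnt_a, mean_a = neighborhood_stats(a)
        cnt_b, mean_b = neighborhood_stats(b)
        nn_idx = torch.cdist(a, b).argmin(dim=1)         # nearest counterpart in b
        delta = (cnt_a - cnt_b[nn_idx]).abs() / cnt_a \
              + mu * (mean_a - mean_b[nn_idx]).abs() / mean_a
        return delta.mean()

    return one_direction(gt, rec) + one_direction(rec, gt)

gt = torch.rand(2048, 3); rec = torch.rand(1800, 3)
print(density_metric(gt, rec))
```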
Figure 6. Qualitative results on SemanticKITTI (the first two columns) and ShapeNet (the last two columns). From top to bottom: Ground Truth, Ours, G-PCC [12], Draco [11], MPEG Anchor [23], Depoco [40] and PCGC [38]. We utilize the distance between each point in the decompressed point clouds and its nearest neighbor in the ground truth as the error. The Bpp and PSNR metrics are averaged over the blocks of the full point clouds. Our method achieves both the most accurate geometry and the lowest bitrates. Per-panel Bpp / PSNR (left to right):

Ours               1.94/44.73   4.23/47.98   1.67/39.65   4.06/44.00
G-PCC [12]         1.95/39.77   4.52/45.29   1.71/36.78   4.21/42.87
Draco [11]         2.89/26.50   4.83/38.32   1.85/32.32   4.25/41.31
MPEG Anchor [23]   2.56/24.61   4.89/38.65   1.99/34.79   4.16/40.93
Depoco [40]        2.39/34.34   4.98/40.01   1.76/35.52   4.13/39.12
PCGC [38]          2.54/36.02   4.91/40.22   1.69/29.44   4.09/39.97
Figure 7. Quantitative results on SemanticKITTI (the first row) and ShapeNet (the second row). Our method consistently achieves more accurate geometry and density recovery across the full range of bitrates.

4.2. Comparison with SOTA Methods

We first compare our method with the state of the art on the rate-distortion trade-off. In Fig 7, we show the per-block Chamfer Distance, PSNR and density metric of all methods against Bits per Point (Bpp). Our method yields more accurate reconstruction consistently across the full spectrum of Bpp on both the SemanticKITTI and ShapeNet datasets. Note that the differences are more evident under the density metric.

Fig 6 shows qualitative results at various bitrates. Draco [11] and MPEG Anchor [23] typically need a high Bpp (e.g. >4) to achieve a satisfactory reconstruction, and they perform poorly at low bitrates due to quantization. Depoco [40] often generates clustered points caused by feature replication. PCGC [38] tends to miss continuous chunks of points, because it regards decompression as a binary classification process (occupied or not), which suffers from extremely imbalanced data due to the intrinsic sparsity of point clouds; besides, it also significantly alters the density. Although G-PCC [12] recovers the overall geometry successfully, it loses local details due to voxelization. Our method achieves the highest compression performance in terms of both geometry and local density while spending the lowest bitrates.

Complexity analysis. Table 1 shows the per-block latency and memory footprint of different methods. For G-PCC [12], Draco [11] and MPEG Anchor [23], we use the sizes of their executable files. For Depoco [40] and PCGC [38], we use their checkpoint sizes. Our model is competitive in computational efficiency, only second to Depoco [40], but achieves a better rate-distortion trade-off.

                     Enc. time (ms)   Dec. time (ms)   Size (MB)
G-PCC [12]           180/165          163/152          3.49
Draco [11]           147/153          147/153          2.49
MPEG Anchor [23]     151/142          136/130          27.8
Depoco [40]          32/126           2/2              0.54
PCGC [38]            130/96           24/19            7.73
Ours                 80/81            24/31            0.70

Table 1. The average per-block encoding time, decoding time and model size of different methods on SemanticKITTI/ShapeNet, using a TITAN X GPU.

4.3. Ablation Study

For a fair comparison, we conduct all the ablation experiments on SemanticKITTI while fixing the Bpp at 2.1.

Effectiveness of each component. We build a baseline model consisting of a point transformer encoder [46], an entropy encoder and a multi-branch MLPs decoder [44]. The proposed components, including the dynamic upsampling factor û, local position embedding FP, density embedding FD, scale-adaptive upsampling block, sub-point convolution and upsampling refinement layer, are then added incrementally, as shown in Table 2. All the modules contribute to the reconstruction quality under a fixed Bpp.
Components           CD (10⁻²) ↓   PSNR ↑   DM ↓
Baseline             2.61          38.82    4.17
+û                   2.29          39.64    3.23
+FP                  1.67          40.96    3.02
+FD                  1.32          41.68    2.58
+Adaptive Scale      0.98          42.49    2.31
+Sub-point Conv      0.45          43.73    2.07
+Refinement          0.36          44.03    1.98

Table 2. The effectiveness of each component in our method. In each row, a component is added on top of the previous row.

Effectiveness of our decoder. To show that our decoder, consisting of our upsampling block and sub-point convolution, is more effective in leveraging the information provided by the encoder for recovering density, we use various point upsampling modules from previous works as the decoder and jointly train them with our encoder, as shown in Table 3. Our decoder significantly outperforms the others on all the reconstruction quality metrics, indicating that it preserves geometry and local density better.

Decoders             CD (10⁻²) ↓   PSNR ↑   DM ↓
Yu et al. [44]       1.25          41.51    2.60
Wang et al. [43]     1.03          42.54    2.46
Li et al. [19]       0.98          42.57    2.45
Li et al. [20]       0.90          42.83    2.32
Qian et al. [26]     0.81          43.06    2.25
Ours                 0.36          44.03    1.98

Table 3. The effectiveness of our decoder. In each row, we replace our decoder with the decoder from another work.

4.4. Normal Compression

Besides positions, we also evaluate the capability of compressing attributes, using normals as an example. The normals are concatenated with the point locations and fed into our model. The decompressed locations and normals are then compared with the inputs by the per-block F1 score [6]. As modifying learning-based approaches such as PCGC [38] and Depoco [40] to support attribute compression is non-trivial, we only compare to Draco [11], G-PCC [12] and MPEG Anchor [23], as shown in Fig 8. Our method consistently outperforms the others, especially by a large margin on the SemanticKITTI dataset.

Figure 8. Quantitative normal compression results. Left: SemanticKITTI; right: ShapeNet. Our method consistently performs better than Draco [11], G-PCC [12] and MPEG Anchor [23] across the bitrate spectrum.

4.5. Impact on Downstream Tasks

Point cloud compression, as an upstream task, should not affect the performance of downstream applications much. In this section, we compare the impact of different compression algorithms on two downstream tasks: surface reconstruction and semantic segmentation. Since some methods do not support attribute compression, all methods only compress the positions for a fair comparison.

In the surface reconstruction experiments, Poisson reconstruction [18] is run on the full decompressed point clouds. The reconstructed meshes are then compared with the ground truth using the symmetric point-to-plane Chamfer Distance [34]. For semantic segmentation, we train PolarNet [45] on raw point clouds from the SemanticKITTI training set, and test on the full decompressed point clouds. The mean intersection-over-union (IoU) is used as the metric, following [16]. As shown in Fig 9, our method consistently yields the best rate-distortion trade-off, which reiterates the importance of recovering local density. Please refer to the supplementary section for qualitative comparisons.

Figure 9. Quantitative results of downstream tasks. Left: surface reconstruction on RenderPeople; right: semantic segmentation on SemanticKITTI.

5. Conclusion

We introduce a novel deep point cloud compression framework that can preserve local density. Not only does it yield the best rate-distortion trade-off against prior arts, it also recovers local density more accurately under our density metric. Qualitative results show that our algorithm can mitigate the two main density issues of other methods: uniformly distributed and clustered points. Complexity-wise, our method is only second to Depoco while achieving much better accuracy.

Acknowledgments. This work was supported in part by NSFC under Grant (No. 62076067), SMSTM Project (2021SHZDZX0103), and Shanghai Research and Innovation Functional Program (17DZ2260900). Danhang Tang, Yinda Zhang and Yanwei Fu are the corresponding authors.
References

[1] Renderpeople. https://renderpeople.com/free-3d-people, 2018.
[2] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704, 2016.
[3] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436, 2018.
[4] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9297–9307, 2019.
[5] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[6] Sourav Biswas, Jerry Liu, Kelvin Wong, Shenlong Wang, and Raquel Urtasun. MuSCLE: Multi sweep compression of lidar using deep entropy models. arXiv preprint arXiv:2011.07590, 2020.
[7] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236, 2016.
[8] Christian Bueno and Alan Hylton. On the representation power of set pooling networks. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
[9] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[10] Ricardo L De Queiroz and Philip A Chou. Compression of 3d point clouds using a region-adaptive hierarchical transform. IEEE Transactions on Image Processing, 25(8):3947–3956, 2016.
[11] Frank Galligan, Michael Hemmer, Ondrej Stava, Fan Zhang, and Jamieson Brettle. Google/Draco: a library for compressing and decompressing 3d geometric meshes and point clouds. https://github.com/google/draco, 2018.
[12] D Graziosi, O Nakagami, S Kuma, A Zaghetto, T Suzuki, and A Tabatabai. An overview of ongoing point cloud compression standardization activities: video-based (V-PCC) and geometry-based (G-PCC). APSIPA Transactions on Signal and Information Processing, 9, 2020.
[13] Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. Deep learning for 3d point clouds: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[15] Pedro Hermosilla, Tobias Ritschel, Pere-Pau Vázquez, Àlvar Vinacua, and Timo Ropinski. Monte carlo convolution for learning on non-uniformly sampled point clouds. ACM Transactions on Graphics (TOG), 37(6):1–12, 2018.
[16] Lila Huang, Shenlong Wang, Kelvin Wong, Jerry Liu, and Raquel Urtasun. OctSqueeze: Octree-structured entropy model for lidar compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1313–1323, 2020.
[17] Tianxin Huang and Yong Liu. 3d point cloud geometry compression on deep learning. In Proceedings of the 27th ACM International Conference on Multimedia, pages 890–898, 2019.
[18] Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM Transactions on Graphics (TOG), 32(3):1–13, 2013.
[19] Ruihui Li, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. PU-GAN: a point cloud upsampling adversarial network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7203–7212, 2019.
[20] Ruihui Li, Xianzhi Li, Pheng-Ann Heng, and Chi-Wing Fu. Point cloud upsampling via disentangled refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 344–353, 2021.
[21] Daniel Maturana and Sebastian Scherer. VoxNet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.
[22] Donald Meagher. Geometric modeling using octree encoding. Computer Graphics and Image Processing, 19(2):129–147, 1982.
[23] Rufael Mekuria, Kees Blom, and Pablo Cesar. Design, implementation, and evaluation of a point cloud codec for tele-immersive video. IEEE Transactions on Circuits and Systems for Video Technology, 27(4):828–842, 2016.
[24] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
[25] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.
[26] Guocheng Qian, Abdulellah Abualshour, Guohao Li, Ali Thabet, and Bernard Ghanem. PU-GCN: Point cloud upsampling using graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11683–11692, 2021.
[27] Maurice Quach, Giuseppe Valenzise, and Frederic Dufaux. Learning convolutional transforms for lossy point cloud geometry compression. In 2019 IEEE International Conference on Image Processing (ICIP), pages 4320–4324. IEEE, 2019.
[28] Maurice Quach, Giuseppe Valenzise, and Frederic Dufaux. Improved deep point cloud geometry compression. In 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), pages 1–6. IEEE, 2020.
[29] Zizheng Que, Guo Lu, and Dong Xu. VoxelContext-Net: An octree based framework for point cloud compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6042–6051, 2021.
[30] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3577–3586, 2017.
[31] Radu Bogdan Rusu and Steve Cousins. 3d is here: Point cloud library (PCL). In 2011 IEEE International Conference on Robotics and Automation, pages 1–4. IEEE, 2011.
[32] Ruwen Schnabel and Reinhard Klein. Octree-based point-cloud compression. In PBG@SIGGRAPH, pages 111–120, 2006.
[33] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
[34] Danhang Tang, Saurabh Singh, Philip A Chou, Christian Hane, Mingsong Dou, Sean Fanello, Jonathan Taylor, Philip Davidson, Onur G Guleryuz, Yinda Zhang, et al. Deep implicit volume compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1293–1303, 2020.
[35] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. KPConv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6411–6420, 2019.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[37] Jianqiang Wang, Dandan Ding, Zhu Li, and Zhan Ma. Multiscale point cloud geometry compression. In 2021 Data Compression Conference (DCC), pages 73–82. IEEE, 2021.
[38] Jianqiang Wang, Hao Zhu, Haojie Liu, and Zhan Ma. Lossy point cloud geometry compression via end-to-end learning. IEEE Transactions on Circuits and Systems for Video Technology, 2021.
[39] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.
[40] Louis Wiesmann, Andres Milioto, Xieyuanli Chen, Cyrill Stachniss, and Jens Behley. Deep compression for dense point cloud maps. IEEE Robotics and Automation Letters, 6(2):2060–2067, 2021.
[41] Wei Yan, Shan Liu, Thomas H Li, Zhu Li, Ge Li, et al. Deep autoencoder-based lossy geometry compression for point clouds. arXiv preprint arXiv:1905.03691, 2019.
[42] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. FoldingNet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 206–215, 2018.
[43] Wang Yifan, Shihao Wu, Hui Huang, Daniel Cohen-Or, and Olga Sorkine-Hornung. Patch-based progressive 3d point set upsampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5958–5967, 2019.
[44] Lequan Yu, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. PU-Net: Point cloud upsampling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2790–2799, 2018.
[45] Yang Zhang, Zixiang Zhou, Philip David, Xiangyu Yue, Zerong Xi, Boqing Gong, and Hassan Foroosh. PolarNet: An improved grid representation for online lidar point clouds semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9601–9610, 2020.
[46] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16259–16268, 2021.
