SO-Net: Self-Organizing Network for Point Cloud Analysis

Jiaxin Li Ben M. Chen Gim Hee Lee


National University of Singapore

arXiv:1803.04249v4 [cs.CV] 27 Mar 2018

Abstract

This paper presents SO-Net, a permutation-invariant architecture for deep learning with orderless point clouds. The SO-Net models the spatial distribution of the input point cloud by building a Self-Organizing Map (SOM). Based on the SOM, SO-Net performs hierarchical feature extraction on individual points and SOM nodes, and ultimately represents the input point cloud by a single feature vector. The receptive field of the network can be systematically adjusted by conducting point-to-node k nearest neighbor search. In recognition tasks such as point cloud reconstruction, classification, object part segmentation and shape retrieval, our proposed network demonstrates performance that is similar to or better than state-of-the-art approaches. In addition, the training speed is significantly faster than that of existing point cloud recognition networks because of the parallelizability and simplicity of the proposed architecture. Our code is available at the project website.¹

¹ https://github.com/lijx10/SO-Net

Figure 1. Our SO-Net applies hierarchical feature aggregation using SOM. Point clouds are converted into SOM node features and a global feature vector that can be applied to classification, autoencoder reconstruction, part segmentation, shape retrieval, etc.

1. Introduction

After many years of intensive research, convolutional neural networks (ConvNets) are now the foundation for many state-of-the-art computer vision algorithms, e.g. image recognition, object classification and semantic segmentation. Despite the great success of ConvNets for 2D images, the use of deep learning on 3D data remains a challenging problem. Although 3D convolutional networks (3D ConvNets) can be applied to 3D data that is rasterized into voxel representations, most computations are redundant because of the sparsity of most 3D data. Additionally, the performance of naive 3D ConvNets is largely limited by the resolution loss and exponentially growing computational cost. Meanwhile, the accelerating development of depth sensors and the huge demand from applications such as autonomous vehicles make it imperative to process 3D data efficiently. The recent availability of 3D datasets including ModelNet [37], ShapeNet [8] and 2D-3D-S [2] adds to the popularity of research on 3D data.

To avoid the shortcomings of naive voxelization, one option is to explicitly exploit the sparsity of the voxel grids [35, 21, 11]. Although the sparse design allows higher grid resolution, the induced complexity and limitations make it difficult to realize large-scale or flexible deep networks [30]. Another option is to utilize scalable indexing structures, including the kd-tree [4] and octree [25]. Deep networks based on these structures have shown encouraging results. Compared to tree-based structures, the point cloud representation is mathematically more concise and straightforward because each point is simply represented by a 3-vector. Additionally, point clouds can be easily acquired with popular sensors such as RGB-D cameras, LiDAR, or conventional cameras with the help of Structure-from-Motion (SfM) algorithms. Despite the widespread usage and easy acquisition of point clouds, recognition tasks with point clouds remain challenging. Traditional deep learning methods such as ConvNets are not directly applicable because point clouds are spatially irregular and can be permuted arbitrarily. Due to these difficulties, few attempts had been made to apply deep learning techniques directly to point clouds until the very recent PointNet [26].

Despite being a pioneer in applying deep learning to point clouds, PointNet is unable to handle local feature extraction adequately. PointNet++ [28] was later proposed to address this problem by building a pyramid-like feature aggregation scheme, but the point sampling and grouping strategy in [28] does not reveal the spatial distribution of the input point cloud. Kd-Net [18] builds a kd-tree for the input point cloud, followed by hierarchical feature extraction from the leaves to the root. Kd-Net explicitly utilizes the spatial distribution of point clouds, but it has limitations such as the lack of overlapping receptive fields.
In this paper, we propose the SO-Net to address the problems in existing point cloud based networks. Specifically, a SOM [19] is built to model the spatial distribution of the input point cloud, which enables hierarchical feature extraction on both individual points and SOM nodes. Ultimately, the input point cloud can be compressed into a single feature vector. During the feature aggregation process, the receptive field overlap is controlled by performing point-to-node k-nearest neighbor (kNN) search on the SOM. The SO-Net theoretically guarantees invariance to the order of input points, by the network design and our permutation invariant SOM training. Applications of our SO-Net include point cloud based classification, autoencoder reconstruction, part segmentation and shape retrieval, as shown in Fig. 1.

The key contributions of this paper are as follows:

• We design a permutation invariant network, the SO-Net, that explicitly utilizes the spatial distribution of point clouds.

• With point-to-node kNN search on the SOM, hierarchical feature extraction is performed with systematically adjustable receptive field overlap.

• We propose a point cloud autoencoder as pre-training to improve network performance in various tasks.

• Compared with state-of-the-art approaches, similar or better performance is achieved in various applications with significantly faster training speed.

2. Related Work

It is intuitive to represent 3D shapes with voxel grids because they are compatible with 3D ConvNets. [24, 37] use binary variables to indicate whether a voxel is occupied or free. Several enhancements are proposed in [27]: overfitting is mitigated by predicting labels from partial subvolumes, an orientation pooling layer is designed to fuse shapes with various orientations, and anisotropic probing kernels are used to project 3D shapes into 2D features. Brock et al. [6] propose to combine voxel based variational autoencoders with object recognition networks. Despite its simplicity, voxelization is able to achieve state-of-the-art performance. Unfortunately, it suffers from loss of resolution and exponentially growing computational cost. Sparse methods [35, 21, 11] are proposed to improve the efficiency. However, these methods still rely on uniform voxel grids and experience various limitations such as the lack of parallelization capacity [21]. Spectral ConvNets [23, 5, 7] are explored to work on non-Euclidean geometries, but they are mostly limited to manifold meshes.

Rendering 3D data into multi-view 2D images turns the 3D problem into a 2D problem that can be solved using standard 2D ConvNets. The view-pooling layer [33] is designed to aggregate features from multiple rendered images. Qi et al. [27] substitute traditional 3D-to-2D rendering with multi-resolution sphere rendering. Wang et al. [34] further propose dominant set pooling and utilize features like color and surface normal. Despite the improved efficiency compared to 3D ConvNets, the multi-view strategy still suffers from information loss [18] and cannot be easily extended to tasks like per-point labeling.

Indexing techniques such as the kd-tree and octree are scalable compared to uniform grids, and their regular structures are suitable for deep learning techniques. To enable convolution and pooling operations over an octree, Riegler et al. [30] build a hybrid grid-octree structure by placing several small octrees into a regular grid. With a bit string representation, a single voxel in the hybrid structure is fully determined by its bit index. As a result, simple arithmetic can be used to visit the parent or child nodes. Similarly, Wang et al. [36] introduce a label buffer to find correspondences of octants at various depths. Klokov et al. propose the Kd-Net [18] that computes vectorial representations for each node of a pre-built balanced kd-tree. A parent feature vector is computed by applying a non-linearity and an affine transformation on its two child feature vectors, following a bottom-up fashion.

PointNet [26] is the pioneer in the direct use of point clouds. It uses channel-wise max pooling to aggregate per-point features into a global descriptor vector. PointNet is invariant to order permutation of the input points because the per-point feature extraction is identical for every point and the max pooling operation is permutation invariant. A similar permutation equivariant layer [29] was also proposed at almost the same time as [26], with the major difference that the permutation equivariant layer is max-normalized. Although the max-pooling idea is proven to be effective, it suffers from the lack of ConvNet-like hierarchical feature aggregation. PointNet++ [28] is later designed to group points into several groups at different levels, so that features from multiple scales can be extracted hierarchically.

Unlike networks based on octree or kd-tree, the spatial distribution of points is not explicitly modeled in PointNet++. Instead, heuristic grouping and sampling schemes, e.g. multi-scale and multi-resolution grouping, are designed to combine features from multiple scales. In this paper, we propose our SO-Net that explicitly models the spatial distribution of the input point cloud during hierarchical feature extraction. In addition, the adjustable receptive field overlap leads to more effective local feature aggregation.

3. Self-Organizing Network

The input to the network is a point set P = {p_i ∈ R³, i = 0, · · · , N − 1}, which will be processed into M SOM nodes S = {s_j ∈ R³, j = 0, · · · , M − 1} as shown in Sec. 3.1. Similarly, in the encoder described in Sec. 3.2, individual point features are max-pooled into M node features, which can be further aggregated into a global feature vector. Our SO-Net can be applied to various computer vision tasks including classification, per-point segmentation (Sec. 3.3), and point cloud reconstruction (Sec. 3.4).
Figure 2. (a) The initial nodes of an 8 × 8 SOM. For each SOM configuration, the initial nodes are fixed for every point cloud. (b) Example of a SOM training result.

3.1. Permutation Invariant SOM

A SOM is used to produce a low-dimensional, in this case two-dimensional, representation of the input point cloud. We construct a SOM of size m × m, where m ∈ [5, 11], i.e. the total number of nodes M ranges from 25 to 121. The SOM is trained with unsupervised competitive learning instead of the backpropagation commonly used in deep networks. However, naive SOM training schemes are not permutation invariant for two reasons: the training result is highly related to the initial nodes, and the per-sample update rule depends on the order of the input points.

The first problem is solved by assigning fixed initial nodes for any given SOM configuration. Because the input point cloud is normalized to be within [−1, 1] in all three axes, we generate a proper initial guess by dispersing the nodes uniformly inside a unit ball, as shown in Fig. 2(a). Simple approaches such as the potential field method can be used to construct such a uniform initial guess. To solve the second problem, instead of updating the nodes once per point, we perform one update after accumulating the effects of all the points. This batch update process is deterministic [19] for a given point cloud, making it permutation invariant. Another advantage of the batch update is that it can be implemented as matrix operations, which are highly efficient on GPU. Details of the initialization and batch training algorithms can be found in our supplementary material.

3.2. Encoder Architecture

As shown in Fig. 3, the SOM is a guide for hierarchical feature extraction, and a tool to systematically adjust the receptive field overlap. Given the output of the SOM, we search for the k nearest neighbors (kNN) on the SOM nodes S for each point p_i, i.e., point-to-node kNN search:

    s_ik = kNN(p_i | s_j, j = 0, · · · , M − 1).   (1)

Each p_i is then normalized into k points by subtraction with its associated nodes:

    p_ik = p_i − s_ik.   (2)

The resulting kN normalized points are forwarded into a series of fully connected layers to extract individual point features. There is a shared fully connected layer on each level l, where φ is the non-linear activation function. The output of level l is given by

    p_ik^(l+1) = φ(W^l p_ik^l + b^l).   (3)

The input to the first layer, p_ik^0, can simply be the normalized point coordinates p_ik, or the combination of the coordinates and other features like surface normal vectors.

Node feature extraction begins with max-pooling the kN point features into M node features following the above kNN association. We apply a channel-wise max pooling operation to get the node feature s_j^0 for those point features associated with the same node s_j:

    s_j^0 = max({p_ik^l, ∀ s_ik = s_j}).   (4)

Since each point is normalized into k coordinates according to the point-to-node kNN search, it is guaranteed that the receptive fields of the M max pooling operations overlap. Specifically, M nodes cover kN normalized points, and k is an adjustable parameter that controls the overlap.

Each node feature produced by the above max pooling operation is further concatenated with the associated SOM node. The M augmented node features are forwarded into a series of shared layers, and then aggregated into a feature vector that represents the input point cloud.
Figure 3. The architecture of the SO-Net and its application to classification and segmentation. In the encoder, input points are normalized with the k-nearest SOM nodes. The normalized point features are later max-pooled into node features based on the point-to-node kNN search on the SOM. k determines the receptive field overlap. In the segmentation network, M node features are concatenated with the kN normalized points following the same kNN association. Finally, kN features are aggregated into N features by average pooling.

Feature aggregation as point cloud separation and assembly. There is an intuitive reason behind the SOM feature extraction and node concatenation. Since the input points to the first layer are normalized with M SOM nodes, they are actually separated into M mini point clouds as shown in Fig. 3. Each mini point cloud contains a small number of points in a coordinate frame whose origin is the associated SOM node. For a point cloud of size 2048, with M = 64 and k = 3, a typical mini point cloud may consist of around 90 points inside a small space of x, y, z ∈ [−0.3, 0.3]. The number and coverage of points in a mini point cloud are determined by the SOM training and kNN search, i.e. by M and k.

The first batch of fully connected layers can be regarded as a small PointNet that encodes these mini point clouds. The concatenation with SOM nodes plays the role of assembling these mini point clouds back into the original point cloud. Because the SOM explicitly reveals the spatial distribution of the input point cloud, our separate-and-assemble process is more efficient than the grouping strategy used in PointNet++ [28], as shown in Sec. 4.

Permutation Invariance. There are two levels of feature aggregation in SO-Net: from point features to node features, and from node features to the global feature vector. The first phase applies a shared PointNet to M mini point clouds. The generation of these M mini point clouds is irrelevant to the order of the input points, because the SOM training in Sec. 3.1 and the kNN search in Fig. 3 are deterministic. PointNet [26] is permutation invariant as well. Consequently, both the node features and the global feature vector are theoretically guaranteed to be permutation invariant.

Effect of suboptimal SOM training. It is possible that the SOM training converges into a local minimum with isolated nodes outside the coverage of the input point cloud. In some situations no point will be associated with the isolated nodes during the point-to-node kNN search, and we set the corresponding node features to zero. This phenomenon is quite common because the initial nodes are dispersed uniformly in a unit ball, while the input point cloud may occupy only a small corner. Despite the existence of suboptimal SOMs, the proposed SO-Net still out-performs state-of-the-art approaches in applications like object classification. The effect of invalid node features is further investigated in Sec. 4 by inserting noise into the SOM results.

Exploration with ConvNets. It is interesting to note that the node feature extraction generates an image-like feature matrix, which is invariant to the order of input points. It is possible to apply standard ConvNets to further fuse the node features with increasing receptive field. However, the classification accuracy decreased slightly in our experiments, where we replaced the second batch of fully connected layers with 2D convolutions and pooling. It remains a promising direction to investigate the reason and solution to this phenomenon.

3.3. Extension to Segmentation

The extension to per-point annotations, e.g. segmentation, requires the integration of both local and global features. The integration process is similar to the inverse operation of the encoder in Sec. 3.2. The global feature vector can be directly expanded and concatenated with the kN normalized points. The M node features are attached to the points that are associated with them during the encoding process. The integration results in kN features that combine point, node and global features, which are then forwarded into a chain of shared fully connected layers.

The kN features are actually redundant for generating N per-point classification scores because of the receptive field overlap. Average or max pooling are methods to fuse the redundant information. Additionally, similar to many deep networks, early, middle or late fusion may exhibit different performance [9]. With a series of experiments, we found that middle fusion with average pooling is the most effective compared to other fusion methods.
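As a rough sketch of this integration (not the authors' exact layer sizes or fusion point), the point, node and global features can be combined and then averaged from kN back to N per-point features; it assumes the same kNN association as in the encoder and omits the shared fully connected layers.

    import torch

    def segment_features(normalized_pts, knn_idx, node_feats, global_feat):
        """Concatenate point, node and global features, then average the
        kN redundant features back to N per-point features.

        normalized_pts: (N, k, 3)  points normalized by their k associated nodes
        knn_idx:        (N, k)     node index of each normalized point
        node_feats:     (M, Cn)    node features from the encoder
        global_feat:    (Cg,)      global feature vector
        """
        N, k, _ = normalized_pts.shape
        gathered_nodes = node_feats[knn_idx]                # (N, k, Cn)
        expanded_global = global_feat.expand(N, k, -1)      # (N, k, Cg)
        fused = torch.cat([normalized_pts, gathered_nodes, expanded_global], dim=-1)
        # shared fully connected layers would be applied on `fused` here
        per_point = fused.mean(dim=1)                       # average fusion: kN -> N
        return per_point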
Figure 4. The architecture of the decoder that takes 5000 points and reconstructs 4608 points. The up-convolution branch is designed to recover the main body of the input, while the more flexible fully connected branch recovers the details. The "upconv" module consists of a nearest neighbor upsampling layer and a 3 × 3 convolution layer. The "conv2pc" module consists of two 1 × 1 convolution layers.

3.4. Autoencoder

In this section, we design a decoder network to recover the input point cloud from the encoded global feature vector. A straightforward design is to stack a series of fully connected layers on top of the feature vector, and generate an output vector of length 3N̂ which can be reshaped into N̂ × 3. However, the memory and computation footprint will be too heavy if N̂ is sufficiently large.

Instead of generating point clouds with fully connected layers [1], we design a network with two parallel branches similar to [13], i.e., a fully connected branch and a convolution branch as shown in Fig. 4. The fully connected branch predicts N̂₁ points by reshaping an output of 3N̂₁ elements. This branch enjoys high flexibility because each coordinate is predicted independently. On the other hand, the convolution branch predicts a feature matrix of size 3 × H × W, i.e. N̂₂ = H × W points. Due to the spatial continuity of convolution layers, the predicted N̂₂ points may exhibit more geometric consistency. Another advantage of the convolution branch is that it requires far fewer parameters compared to the fully connected branch.

Similar to common practice in many depth estimation networks [14, 15], the convolution branch is designed as an up-convolution (upconv) chain in a pyramid style. Instead of deconvolution layers, each upconv module consists of a nearest neighbor upsampling layer and a 3 × 3 convolution layer. According to our experiments, this design is much more effective than deconvolution layers in the case of the point cloud autoencoder. In addition, intermediate upconv products are converted to coarse reconstructed point clouds and compared with the input. The conversion from upconv products to point clouds is a 2-layer 1 × 1 convolution stack in order to give more flexibility to each recovered point. This coarse-to-fine strategy produces another boost in the reconstruction performance.
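A minimal sketch of these two building blocks is given below. Only the nearest-neighbor-upsampling-plus-3 × 3-convolution structure of "upconv" and the two 1 × 1 convolutions of "conv2pc" follow the text; channel widths, BatchNorm and ReLU placement are assumptions.

    import torch.nn as nn

    class UpConv(nn.Module):
        """One 'upconv' module: nearest neighbor upsampling followed by a 3x3 conv."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.block = nn.Sequential(
                nn.Upsample(scale_factor=2, mode='nearest'),
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch), nn.ReLU())

        def forward(self, x):
            return self.block(x)

    def conv2pc(in_ch, mid_ch=64):
        """'conv2pc': two 1x1 convolutions turning an intermediate feature map
        into a coarse point cloud with 3 output channels (x, y, z)."""
        return nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1), nn.ReLU(),
            nn.Conv2d(mid_ch, 3, kernel_size=1))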
To supervise the reconstruction process, the loss function should be differentiable, ready for parallel computation and robust against outliers [13]. Here we use the Chamfer loss:

    d(Ps, Pt) = (1/|Ps|) Σ_{x∈Ps} min_{y∈Pt} ‖x − y‖₂ + (1/|Pt|) Σ_{y∈Pt} min_{x∈Ps} ‖x − y‖₂,   (5)

where Ps and Pt are the input and the recovered point cloud respectively. The numbers of points in Ps and Pt are not necessarily the same. Intuitively, for each point in Ps, Eq. (5) computes its distance to the nearest neighbor in Pt, and vice versa for the points in Pt.
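A brute-force PyTorch sketch of the symmetric Chamfer distance in Eq. (5) is shown below; it computes the full O(|Ps||Pt|) distance matrix and does not use the faiss-based nearest neighbor search mentioned later in Sec. 4.3.

    import torch

    def chamfer_distance(ps, pt):
        """Eq. (5): mean nearest-neighbor distance from ps to pt plus from pt to ps.

        ps: (Ns, 3) input point cloud
        pt: (Nt, 3) recovered point cloud (sizes may differ)
        """
        dist = torch.cdist(ps, pt)   # (Ns, Nt) pairwise Euclidean distances
        return dist.min(dim=1).values.mean() + dist.min(dim=0).values.mean()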
4. Experiments

In this section, the performance of our SO-Net is evaluated in three different applications, namely point cloud autoencoding, object classification and object part segmentation. In particular, the encoder trained in the autoencoder can be used as pre-training for the other two tasks. The encoder structure and SOM configuration remain identical across all experiments without delicate finetuning, except for the 2D MNIST classification.

4.1. Implementation Detail

Our network is implemented with PyTorch on an NVIDIA GTX1080Ti. We choose a SOM of size 8 × 8 and k = 3 in most experiments. We optimize the networks using Adam [17] with an initial learning rate of 0.001 and a batch size of 8. For experiments with 5000 or more points as input, the learning rate is decreased by half every 20 epochs; otherwise the learning rate decay is executed every 40 epochs. Generally the networks converge after around 5 rounds of learning rate decay. Batch normalization and ReLU activation are applied to every layer.

4.2. Datasets

As a 2D toy example, we adopt the MNIST dataset [20] in Sec. 4.4. For each digit, 512 two-dimensional points are sampled from the non-zero pixels to serve as our input.

Two variants of ModelNet [37], i.e. ModelNet10 and ModelNet40, are used as the benchmarks for the autoencoder task in Sec. 4.3 and the classification task in Sec. 4.4. ModelNet40 contains 13,834 objects from 40 categories, among which 9,843 objects belong to the training set and the other 3,991 samples are for testing. Similarly, ModelNet10 is split into 2,468 training samples and 909 testing samples. The original ModelNet provides CAD models represented by vertices and faces. Point clouds are generated by sampling from the models uniformly. For fair comparison, we use the prepared ModelNet10/40 dataset from [28], where each model is represented by 10,000 points. Various sizes of point clouds, e.g., 2,048 or 5,000, can be sampled from the 10k points in different experiments.

Object part segmentation is demonstrated with the ShapeNetPart dataset [38]. It contains 16,881 objects from 16 categories, represented as point clouds. Each object consists of 2 to 6 parts, and in total there are 50 parts in the dataset. We sample fixed-size point clouds, e.g. 1,024 points, in our experiments.

Data augmentation. Input point clouds are normalized to be zero-mean inside a unit cube. The following data augmentations are applied at training phase: (a) Gaussian noise N (0, 0.01) is added to the point coordinates and surface normal vectors (if applicable). (b) Gaussian noise N (0, 0.04) is added to the SOM nodes. (c) Point clouds, surface normal vectors (if applicable) and SOM nodes are scaled by a factor sampled from a uniform distribution U(0.8, 1.2). Further augmentation like random shift or rotation does not improve the results.
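A sketch of this augmentation for a single training cloud is given below; the noise and scale parameters follow the text, while the exact order of operations and the treatment of normals are assumptions.

    import numpy as np

    def augment(points, som_nodes, normals=None):
        """Training-time augmentation from Sec. 4.2: Gaussian jitter on points
        (and normals), Gaussian noise on SOM nodes, and a global random scale."""
        points = points + np.random.normal(0.0, 0.01, points.shape)           # (a)
        if normals is not None:
            normals = normals + np.random.normal(0.0, 0.01, normals.shape)    # (a)
        som_nodes = som_nodes + np.random.normal(0.0, 0.04, som_nodes.shape)  # (b)
        scale = np.random.uniform(0.8, 1.2)                                   # (c)
        points, som_nodes = points * scale, som_nodes * scale
        if normals is not None:
            normals = normals * scale
        return points, som_nodes, normals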
4.3. Point Cloud Autoencoder

Figure 5. Examples of point cloud autoencoder results. First row: input point clouds of size 1024. Second row: reconstructed point clouds of size 1280. From left to right: chair, table, earphone.

In this section, we demonstrate that a point cloud can be reconstructed from the SO-Net encoded feature vector, e.g. a vector of length 1024. The nearest neighbor search in the Chamfer distance (Eq. 5) is conducted with Facebook's faiss [16]. There are two configurations for the decoder to reconstruct point clouds of different sizes. The first configuration generates 64 × 64 points from the convolution branch and 512 points from the fully connected branch. The other one produces 32 × 32 and 256 points respectively, by removing the last upconv module of Fig. 4.

It is difficult to provide a quantitative comparison for the point cloud autoencoder task because little research has been done on this topic. The most related works are the point set generation network [13] and the point cloud generative models [1]. Examples of our reconstructed ShapeNetPart point clouds are visualized in Fig. 5, where the 1024 points recovered from the convolution branch are denoted in red and the other 256 points in green. The overall testing Chamfer distance (Eq. 5) is 0.033. Similar to the results in [13], the convolution branch recovers the main body of the object, while the more flexible fully connected branch focuses on details such as the legs of a table. Nevertheless, many finer details are lost. For example, the reconstructed earphone is blurry. This is probably because the encoder is still not powerful enough to capture fine-grained structures.

Despite the imperfect reconstruction, the autoencoder enhances SO-Net's performance in other tasks by providing a pre-trained encoder, as illustrated in Sec. 4.4 and 4.5. More results are visualized in the supplementary materials.

4.4. Classification Tasks

To classify the point clouds, we attach a 3-layer multi-layer perceptron (MLP) on top of the encoded global feature vector. Random dropout is applied to the last two layers with a keep-ratio of 0.4. Table 1 illustrates the classification accuracy for state-of-the-art methods using scalable 3D representations, such as point cloud, kd-tree and octree. On the MNIST dataset, our network achieves a relative 13.7% error rate reduction compared with PointNet++. On ModelNet10 and ModelNet40, our approach out-performs state-of-the-art methods by 1.7% and 1.5% respectively in terms of instance accuracy. Our SO-Net even out-performs single networks using multi-view images or uniform voxel grids as input, like qi-MVCNN [27] (ModelNet40 at 92.0%) and VRN [6] (ModelNet40 at 91.3%). Methods that integrate multiple networks, i.e., qi-MVCNN-MultiRes [27] and VRN Ensemble [6], are still better than SO-Net in ModelNet classification, but their multi-view / voxel grid representations are far less scalable and flexible than our point cloud representation, as illustrated in Sec. 1 and 2.
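A sketch of such a classifier head is shown below. Only the 3-layer structure and the dropout on the last two layers (keep-ratio 0.4, i.e. dropout probability 0.6) follow the text; the layer widths and normalization are assumptions.

    import torch.nn as nn

    # Hypothetical layer widths; 40 output classes corresponds to ModelNet40.
    classifier = nn.Sequential(
        nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(),
        nn.Dropout(p=0.6),
        nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(),
        nn.Dropout(p=0.6),
        nn.Linear(256, 40),
    )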
Method              Representation    Input      ModelNet10       ModelNet40       Training   MNIST
                                                 Class  Instance  Class  Instance             Input      Error rate
PointNet [26]       points            1024 × 3   -      -         86.2   89.2      3-6h       256 × 2    0.78
PointNet++ [28]     points + normal   5000 × 6   -      -         -      91.9      20h        512 × 2    0.51
DeepSets [29, 39]   points            5000 × 3   -      -         -      90.0      -          -          -
Kd-Net [18]         points            2^15 × 3   93.5   94.0      88.5   91.8      120h       1024 × 2   0.90
ECC [32]            points            1000 × 3   90.0   90.8      83.2   87.4      -          -          0.63
OctNet [30]         octree            128^3      90.1   90.9      83.8   86.5      -          -          -
O-CNN [36]          octree            64^3       -      -         -      90.6      -          -          -
Ours (2-layer)*     points + normal   5000 × 6   94.9   95.0      89.4   92.5      3h         -          -
Ours (2-layer)      points + normal   5000 × 6   94.4   94.5      89.3   92.3      3h         -          -
Ours (2-layer)      points            2048 × 3   93.9   94.1      87.3   90.9      3h         512 × 2    0.44
Ours (3-layer)      points + normal   5000 × 6   95.5   95.7      90.8   93.4      3h         -          -
Table 1. Object classification results for methods using scalable 3D representations like point cloud, kd-tree and octree. Our network produces the best accuracy with significantly faster training speed. * represents pre-training.

Effect of pre-training. The performance of the network can be improved with pre-training using the autoencoder in Sec. 3.4. The autoencoder is trained on ModelNet40, using 5000 points and surface normal vectors as input. The autoencoder brings a boost of 0.5% in ModelNet10 classification, but only 0.2% in ModelNet40 classification. This is not surprising because pre-training with a much larger dataset may lead to convergence basins [12] that are more resistant to over-fitting.

Figure 6. Robustness test on point or SOM corruption. (a) The network is trained with point clouds of size 2048, while there is random point dropout during testing. (b) The network is trained with a SOM of size 8 × 8, but SOMs of various sizes are used at testing phase. (c) Gaussian noise N (0, σ) is added to the SOM during testing. (d) Example of a SOM with Gaussian noise N (0, 0.2).

Robustness to point corruption. We train our network with point clouds of size 2048 but test it with point dropout. As shown in Fig. 6(a), our accuracy drops by 1.7% with 50% of the points missing (2048 to 1024), and 14.2% with 75% of the points missing (2048 to 512). As a comparison, the accuracy of PointNet drops by 3.8% with 50% of the points missing (1024 to 512).

Robustness to SOM corruption. One of our major concerns when designing the SO-Net is whether it relies too much on the SOM. With the results shown in Fig. 6, we demonstrate that our SO-Net is quite robust to noise or corruption of the SOM results. In Fig. 6(b), we train a network with a SOM of size 8 × 8 as the noise-free version, but test the network with SOM sizes varying from 5 × 5 to 11 × 11. It is interesting that the performance decay is much slower if the SOM size is larger than the training configuration, which is consistent with the theory in Sec. 3.2. The SO-Net separates the input point cloud into M mini point clouds, encodes them into M node features with a mini PointNet, and assembles them during the global feature extraction. In the case that the SOM becomes smaller during testing, the mini point clouds are too large for the mini PointNet to encode. Therefore the network performs worse when the testing SOM is smaller than expected.

In Fig. 6(c), we add Gaussian noise N (0, σ) onto the SOM during testing. Given that the input points have been normalized into a unit cube, Gaussian noise with σ = 0.2 is rather considerable, as shown in Fig. 6(d). Even in that difficult case, our network achieves an accuracy of 91.1% on ModelNet40 and 94.6% on ModelNet10.
Effect of hierarchical layer number. Our framework shown in Fig. 3 can be made to further out-perform state-of-the-art methods by simply adding more layers. The vanilla SO-Net is a 2-layer structure "grouping & PN (PointNet) - PN", where the grouping is based on the SOM and point-to-node kNN. We make it a 3-layer structure by simply repeating the SOM/kNN based "grouping & PN" with this protocol: for each SOM node, find the k′ = 9 nearest nodes and process the k′ node features with a PointNet. The output is a new SOM feature map of the same size but with a larger receptive field. As shown in Table 1, our 3-layer SO-Net increases the accuracy to 1.5% higher (relatively 19% lower error rate) than PointNet++ on ModelNet40, and 1.7% higher (relatively 28% lower error rate) than Kd-Net on ModelNet10. The effect of the hierarchical layer number is illustrated in Fig. 7, where too many layers may lead to over-fitting.

Figure 7. Effect of layer number on classification accuracy with ModelNet40 (left) and ModelNet10 (right).

Training speed. The batch training of the SOM allows a parallel implementation on GPU. Moreover, the training of the SOM is completely deterministic in our approach, so it can be isolated as data preprocessing before network optimization. Compared to the randomized kd-tree construction in [18], our deterministic design provides a great boost during training. In addition to the decoupled SOM, the hierarchical feature aggregation based on the SOM can be implemented efficiently on GPU. As shown in Table 1, it takes about 3 hours to train our best network on ModelNet40 with a GTX1080Ti, which is significantly faster than state-of-the-art networks that can provide comparable performance.

4.5. Part Segmentation on ShapeNetPart

We formulate the object part segmentation problem as a per-point classification task, as illustrated in Fig. 3. The networks are evaluated using the mean Intersection over Union (IoU) protocol proposed in [26]. For each instance, the IoU is computed for each part that belongs to that object category. The mean of the part IoUs is regarded as the IoU for that instance. The overall IoU is calculated as the mean of IoUs over all instances, and the category-wise IoU is computed as an average over instances under that category. Similar to O-CNN [36] and PointNet++ [28], surface normal vectors are fed into the network together with the point coordinates.
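The instance-level IoU described above can be computed as follows. This is a sketch; part labels are assumed to be integer ids restricted to the parts of the instance's category, and the convention that a part absent from both prediction and ground truth counts as IoU 1 is an assumption.

    import numpy as np

    def instance_iou(pred, gt, category_parts):
        """Mean IoU over the parts of one object instance.

        pred, gt:       (N,) integer part labels per point
        category_parts: iterable of part ids belonging to this object's category
        """
        ious = []
        for part in category_parts:
            union = np.logical_or(pred == part, gt == part).sum()
            if union == 0:
                ious.append(1.0)   # part absent from both prediction and ground truth
                continue
            inter = np.logical_and(pred == part, gt == part).sum()
            ious.append(inter / union)
        return float(np.mean(ious))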
                      Intersection over Union (IoU)
Method              mean  air   bag   cap   car   chair  ear.  gui.  knife  lamp  lap.  motor  mug   pistol  rocket  skate  table
PointNet [26]       83.7  83.4  78.7  82.5  74.9  89.6   73.0  91.5  85.9   80.8  95.3  65.2   93.0  81.2    57.9    72.8   80.6
PointNet++ [28]     85.1  82.4  79.0  87.7  77.3  90.8   71.8  91.0  85.9   83.7  95.3  71.6   94.1  81.3    58.7    76.4   82.6
Kd-Net [18]         82.3  80.1  74.6  74.3  70.3  88.6   73.5  90.2  87.2   81.0  94.9  57.4   86.7  78.1    51.8    69.9   80.3
O-CNN + CRF [36]    85.9  85.5  87.1  84.7  77.0  91.1   85.1  91.9  87.4   83.3  95.4  56.9   96.2  81.6    53.5    74.1   84.4
Ours (pre-trained)  84.9  82.8  77.8  88.0  77.3  90.6   73.5  90.7  83.9   82.8  94.8  69.1   94.2  80.9    53.1    72.9   83.0
Ours                84.6  81.9  83.5  84.8  78.1  90.8   72.2  90.1  83.6   82.3  95.2  69.3   94.2  80.0    51.6    72.1   82.6
Table 2. Object part segmentation results on the ShapeNetPart dataset.

By optimizing per-point softmax loss functions, we achieve competitive results as reported in Table 2. Although O-CNN reports the best IoU, it adopts an additional dense conditional random field (CRF) to refine the output of its network, while the others do not contain this post-processing step. Some segmentation results are visualized in Fig. 8, and we further visualize one instance per category in the supplementary material. Although in some hard cases our network may fail to annotate the fine-grained details correctly, generally our segmentation results are visually satisfying. The low computation cost remains one of our advantages. Additionally, pre-training with our autoencoder produces a performance boost, which is consistent with our classification results.

Figure 8. Visualization of object part segmentation results. First row: ground truth. Second row: predicted segmentation. From left to right: chair, lamp, table.

5. Conclusion

In this paper, we propose the novel SO-Net that performs hierarchical feature extraction for point clouds by explicitly modeling the spatial distribution of the input points and systematically adjusting the receptive field overlap. In a series of experiments including point cloud reconstruction, object classification and object part segmentation, our network achieves competitive performance. In particular, we out-perform state-of-the-art deep learning approaches in point cloud classification and shape retrieval, with significantly faster training speed. As the SOM preserves the topological properties of the input space and our SO-Net converts point clouds into feature matrices accordingly, one promising future direction is to apply classical ConvNets or graph-based ConvNets to realize deeper hierarchical feature aggregation.

Acknowledgment. This work is supported partially by an ODPRT start-up grant R-252-000-636-133 from the National University of Singapore.
References

[1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. arXiv preprint arXiv:1707.02392, 2017.
[2] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-Semantic Data for Indoor Scene Understanding. ArXiv e-prints, Feb. 2017.
[3] S. Bai, X. Bai, Z. Zhou, Z. Zhang, and L. Jan Latecki. GIFT: A real-time and scalable 3d shape search engine. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5023-5032, 2016.
[4] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, 1975.
[5] D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and P. Vandergheynst. Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks. In Computer Graphics Forum, volume 34, pages 13-23. Wiley Online Library, 2015.
[6] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236, 2016.
[7] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[8] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[9] Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang. Locality-sensitive deconvolution networks with gated fusion for rgb-d indoor semantic segmentation. In CVPR, 2017.
[10] D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3642-3649. IEEE, 2012.
[11] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA), 2017.
[12] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625-660, 2010.
[13] H. Fan, H. Su, and L. Guibas. A point set generation network for 3d object reconstruction from a single image. 2017.
[14] R. Garg, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
[15] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. arXiv preprint arXiv:1609.03677, 2016.
[16] J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017.
[17] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. arXiv preprint arXiv:1704.01222, 2017.
[19] T. Kohonen. The self-organizing map. Neurocomputing, 21(1):1-6, 1998.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[21] Y. Li, S. Pirk, H. Su, C. R. Qi, and L. J. Guibas. FPNN: Field probing neural networks for 3d data. In Advances in Neural Information Processing Systems, pages 307-315, 2016.
[22] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[23] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In ICCV Workshops, 2015.
[24] D. Maturana and S. Scherer. VoxNet: A 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015.
[25] D. J. Meagher. Octree encoding: A new technique for the representation, manipulation and display of arbitrary 3-d objects by computer. Electrical and Systems Engineering Department, Rensselaer Polytechnic Institute, Image Processing Laboratory, 1980.
[26] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593, 2016.
[27] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In CVPR, 2016.
[28] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.
[29] S. Ravanbakhsh, J. Schneider, and B. Poczos. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.
[30] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning deep 3d representations at high resolutions. In CVPR, 2017.
[31] P. Y. Simard, D. Steinkraus, J. C. Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, pages 958-962, 2003.
[32] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. arXiv preprint arXiv:1704.02901, 2017.
[33] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In ICCV, 2015.
[34] C. Wang, M. Pelillo, and K. Siddiqi. Dominant set clustering and pooling for multi-view 3d object recognition. In BMVC, 2017.
[35] D. Z. Wang and I. Posner. Voting for voting in online point cloud object detection. In Robotics: Science and Systems, 2015.
[36] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-CNN: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
[37] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015.
[38] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, A. Lu, Q. Huang, A. Sheffer, L. Guibas, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.
[39] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola. Deep sets. arXiv preprint arXiv:1703.06114, 2017.

Supplementary

A. Overview

This supplementary document provides more technical details and experimental results for the main paper. Shape retrieval experiments are demonstrated with the ShapeNet Core55 dataset in Sec. B. The time and space complexity is analyzed in Sec. C, followed by a detailed illustration of our permutation invariant SOM training algorithms in Sec. D. More experiments and results are presented in Sec. E.
arXiv:1703.06114, 2017. 7
Our object classification network can be easily extended
to the task of 3D shape retrieval by regarding the classifica-
tion score as the feature vector. Given a query shape and a
shape library, the similarity between the query and the can-
didates can be computed as their feature vector distances.

B.1. Dataset
We perform the 3D shape retrieval task using the ShapeNet
Core55 dataset, which contains 51,190 shapes from 55 cat-
egories and 204 subcategories. Specifically, we adopt the
dataset split provided by the 3D Shape Retrieval Contest
2016 (SHREC16), where 70% of the models are used for
training, 10% for validation and 20% for testing. Since the
3D shapes are represented by CAD models, i.e., vertices and
faces, we sample 5,000 points and surface normal vectors
from each CAD model. Data augmentation is identical with
the previous classification and segmentation experiments -
random jitter and scale.

B.2. Procedures
We train a classification network on the ShapeNet
Core55 dataset using identical configurations as our Model-
Net40 classification experiment, i.e. a SOM of size 8×8 and
k = 3. For simplicity, the softmax loss is minimized with
only the category labels (without any subcategory informa-
tion). The classification score vector of length 55 is used
as the feature vector. We calculate the L2 feature distance
between each shape in the test set and all shapes in the same
predicted category from the test set (including itself). The
corresponding retrieval list is constructed by sorting these
shapes according to the feature distances.
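A sketch of this retrieval step, assuming the length-55 score vectors and predicted labels for all test shapes are already computed (array layout and function name are illustrative):

    import numpy as np

    def retrieval_list(query_idx, scores, predicted_labels):
        """Rank test shapes of the same predicted category by the L2 distance
        between their classification score vectors (Sec. B.2).

        scores:           (num_shapes, 55) classification score vectors
        predicted_labels: (num_shapes,)    predicted category per shape
        """
        same_cat = np.flatnonzero(predicted_labels == predicted_labels[query_idx])
        dists = np.linalg.norm(scores[same_cat] - scores[query_idx], axis=1)
        return same_cat[np.argsort(dists)]   # includes the query itself, distance 0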

B.3. Performance
SHREC16 provides several evaluation metrics includ-
ing Precision-Recall curve, F-score, mean average precision
(mAP), normalized discounted cumulative gain (NDCG).
These metrics are computed under two contexts - macro and
micro. Macro metric is a simple average across all cate-
gories while micro metric is a weighted average according
to the number of shapes in each category. As shown in Table 3, our SO-Net out-performs state-of-the-art approaches on most metrics. The precision-recall curves are illustrated in Fig. 9, where SO-Net demonstrates the largest area under the curve (AUC). Some shape retrieval results are visualized in Fig. 11.

                       Micro                                    Macro
Method                 P@N    R@N    F1@N   mAP    NDCG@N      P@N    R@N    F1@N   mAP    NDCG@N
Tatsuma                0.427  0.689  0.472  0.728  0.875       0.154  0.730  0.203  0.596  0.806
Wang CCMLT             0.718  0.350  0.391  0.823  0.886       0.313  0.536  0.286  0.661  0.820
Li ViewAggregation     0.508  0.868  0.582  0.829  0.904       0.147  0.813  0.201  0.711  0.846
Bai GIFT [3]           0.706  0.695  0.689  0.825  0.896       0.444  0.531  0.454  0.740  0.850
Su MVCNN [33]          0.770  0.770  0.764  0.873  0.899       0.571  0.625  0.575  0.817  0.880
Kd-Net [18]            0.760  0.768  0.743  0.850  0.905       0.492  0.676  0.519  0.746  0.864
O-CNN [36]             0.778  0.782  0.776  0.875  0.905       -      -      -      -      -
Ours                   0.799  0.800  0.795  0.869  0.907       0.615  0.673  0.622  0.805  0.888
Table 3. 3D shape retrieval results with SHREC16. Our SO-Net out-performs state-of-the-art deep networks on most metrics.

Figure 9. Precision-recall curves for the micro (a) and macro (b) metrics in the 3D shape retrieval task. In both curves, the SO-Net demonstrates the largest AUC.

C. Time and Space Complexity

We evaluate the model size, forward (inference) time and training time of several point cloud based networks on the task of ModelNet40 classification, as shown in Table 4. The forward timings are acquired with a batch size of 8 and an input point cloud size of 1024. In the comparison, we choose the networks with the best classification accuracy among the various configurations of PointNet and PointNet++, i.e., PointNet with transformations and PointNet++ with multi-scale grouping (MSG). Because of the parallelizability and simplicity of our network design, our model size is smaller and the training speed is significantly faster compared to PointNet and its successor PointNet++. Meanwhile, our inference time is around 1/3 of that of PointNet++.

D. Permutation Invariant SOM

We apply two methods to ensure that the SOM is invariant to the permutation of the input points - fixed initialization and a deterministic training rule.

D.1. Initialization for SOM Training

In addition to permutation invariance, the initialization should be reasonable so that the SOM training is less prone to local minima. A suboptimal SOM may lead to many isolated nodes outside the coverage of the input points. For simplicity, we use a fixed initialization for any point cloud input, although there are other initialization approaches that are permutation invariant, e.g., principal component initialization. We generate a set of node coordinates that are uniformly distributed in a unit ball to serve as a reasonable initialization because the input point clouds come in various shapes. Unfortunately, as shown in Fig. 10, isolated nodes are inevitable even with uniform initialization. Isolated nodes may not be associated during the kNN search, and their corresponding node features will be set to zero, i.e. the node features are invalid. Nevertheless, our SO-Net is robust to a small number of invalid nodes, as demonstrated in the experiments.

We propose a simple algorithm based on potential field methods to generate the initialization, as shown in Algorithm 1. S = {s_j ∈ R³, j = 0, · · · , M − 1} represents the SOM nodes and η is the learning rate. The key idea is to apply a repulsion force between any pair of nodes, and external forces to attract the nodes toward the origin. The parameter λ is used to control the weighting between the repulsion and attraction forces, so that the resulting nodes are within the unit ball.

Algorithm 1 Potential field method for SOM initialization
  Set random seed.
  Random initialization: S ← U(−1, 1)
  repeat
    for all s_j ∈ S do
      f_j^wall ← −s_j
      f_j^node ← 0
      for all s_k ∈ S, k ≠ j do
        f_j^node ← f_j^node + λ (s_j − s_k) / ‖s_j − s_k‖₂²
      end for
    end for
    for all s_j ∈ S do
      s_j ← s_j + η (f_j^wall + f_j^node)
    end for
  until convergence

D.2. Batch Update Training

Instead of updating the SOM once per point, the batch update rule conducts one update after accumulating the effect of all points in the point cloud. As a result, each SOM update iteration is unrelated to the order of the points, i.e., it is permutation invariant. During SOM training, each training sample affects the winner node and all its neighbors. We define the neighborhood function as a Gaussian distribution as follows:

    w_xy(x, y | p, q, σ_x, σ_y) = exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)) / √((2π)² |Σ|),
    with µ = [p, q]ᵀ and Σ = diag(σ_x², σ_y²),   (6)

where (x, y) is the grid coordinate of a node and (p, q) is the grid coordinate of the training sample's nearest (winner) node.

The pseudo code of the training scheme is shown in Algorithm 2. P = {p_i ∈ R³, i = 0, · · · , N − 1} and S = {s_j ∈ R³, j = 0, · · · , M − 1} represent the input points and SOM nodes respectively. The learning rate η_t and the neighborhood parameters (σ_x, σ_y) should be decreased slowly during training. In addition, Algorithm 2 can be easily implemented as matrix operations which are highly efficient on GPU.
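As a compact illustration of how these steps map to array operations, the following NumPy sketch implements the initialization of Algorithm 1 and one batch update of Algorithm 2 with the neighborhood of Eq. (6) (Algorithm 2 itself is reproduced further below). All hyper-parameter values, the isotropic σ and the row-major grid ordering are assumptions.

    import numpy as np

    def init_som_nodes(m=8, lam=0.01, eta=0.1, iters=1000, seed=0):
        """Algorithm 1 (sketch): pairwise repulsion plus attraction to the origin,
        producing M = m*m nodes roughly uniform inside the unit ball."""
        rng = np.random.RandomState(seed)
        nodes = rng.uniform(-1.0, 1.0, size=(m * m, 3))
        for _ in range(iters):
            diff = nodes[:, None, :] - nodes[None, :, :]              # (M, M, 3)
            dist_sq = (diff ** 2).sum(-1) + np.eye(len(nodes))        # avoid division by zero
            f_node = lam * (diff / dist_sq[..., None]).sum(axis=1)    # repulsion from other nodes
            f_wall = -nodes                                           # attraction toward the origin
            nodes = nodes + eta * (f_wall + f_node)
        return nodes

    def som_batch_update(points, nodes, m, eta_t=0.1, sigma=1.0):
        """One permutation-invariant batch update (Algorithm 2 with Eq. (6))."""
        grid = np.stack(np.meshgrid(np.arange(m), np.arange(m), indexing='ij'),
                        axis=-1).reshape(-1, 2)                       # (M, 2) node grid coordinates
        winner = np.argmin(np.linalg.norm(points[:, None] - nodes[None], axis=-1), axis=1)
        d = grid[None, :, :] - grid[winner][:, None, :]               # (N, M, 2) grid offsets
        w = np.exp(-0.5 * (d ** 2).sum(-1) / sigma ** 2) / (2 * np.pi * sigma ** 2)  # Eq. (6)
        delta = (w[..., None] * (points[:, None, :] - nodes[None])).sum(axis=0)      # accumulate
        return nodes + eta_t * delta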

Method            Size / MB   Forward / ms   Train / h
PointNet [26]     40          25.3           3-6
PointNet++ [28]   12          163.2          20
Kd-Net [18]       -           -              16
Ours              11.5        59.6           1.5
Table 4. Time and space complexity of point cloud based networks in ModelNet40 classification.

Figure 10. Results of SOM training with uniform initialization. Isolated nodes are inevitable even with uniform initialization.
E.2. Classification with SOM Only
Algorithm 2 SOM batch update rule
  Initialize the m × m SOM S with Algorithm 1
  for t < MaxIter do
    ▷ Set the update vectors to zero
    for all s_xy ∈ S do
      D_xy ← 0
    end for
    ▷ Accumulate the effect of all points
    for all p_i ∈ P do
      Obtain the nearest neighbor (winner) coordinate (p, q)
      for all s_xy ∈ S do
        w_xy ← Eq. (6)
        D_xy ← D_xy + w_xy (p_i − s_xy)
      end for
    end for
    ▷ Conduct one update
    for all s_xy ∈ S do
      s_xy ← s_xy + η_t D_xy
    end for
    t ← t + 1
    Adjust σ_x, σ_y and η_t
  end for

E. More Experiments

E.1. MNIST Classification

We evaluate our network using the 2D MNIST dataset, which contains 60,000 28 × 28 images for training and 10,000 images for testing. 2D coordinates are extracted from the non-zero pixels of the images. In order to up-sample these 2D coordinates into point clouds of a certain size, e.g., 512 in our experiment, we augment the original pixel coordinates with Gaussian noise N (0, 0.01). Other than the acquisition of point clouds, the data augmentation is exactly the same as in the other experiments using ModelNet or ShapeNetPart. We reduce the SOM size to 4 × 4 and set k = 4 because the point clouds are in 2D and the cloud size is relatively small. The neurons in the shared fully connected layers are reduced as well: 2-64-64-128-128 during point feature extraction and (128+2)-256-512-512-1024 during node feature extraction.

Similar to the 3D classification tasks, our network out-performs existing point cloud based deep networks, although the best performance is still from the well engineered 2D ConvNets, as shown in Table 6. Despite using a point cloud representation instead of images, our network demonstrates better results compared with ConvNets such as Network in Network [22] and LeNet5 [20].

Method                        Error rate (%)
Multi-column DNN [10]         0.23
Network in Network [22]       0.47
LeNet5 [20]                   0.80
Multi-layer perceptron [31]   1.60
PointNet [26]                 0.78
PointNet++ [28]               0.51
Kd-Net [18]                   0.90
ECC [32]                      0.63
Ours                          0.44
Table 6. MNIST classification results.

E.2. Classification with SOM Only

There are two sources of information utilized by the SO-Net - the point cloud and the trained SOM. The information from the SOM is explicitly used when the nodes are concatenated with the node features at the beginning of node feature extraction. Additionally, the SOM is implicitly utilized because the point normalization, kNN search and max pooling are based on the nodes. To analyze the contribution of the SOM, we perform classification using the SOM nodes without the point coordinates of the point cloud. We feed the SOM nodes into a 3-layer MLP on the MNIST, ModelNet10 and ModelNet40 datasets. Similarly, in the Kd-Net [18], experiments are conducted using the kd-tree split directions without point information, i.e. feeding the directions of the splits into a MLP. The results are shown in Table 5.

Method                        Input       MNIST   ModelNet10   ModelNet40
Kd-Net split based MLP [18]   splits      82.40   83.4         73.2
Kd-Net depth 10 [18]          point       99.10   93.3         90.6
Ours - SOM based MLP          SOM nodes   91.37   88.9         75.7
Ours                          point       99.56   94.5         92.3
Table 5. Classification results using structure information - SOM nodes and kd-tree split directions.

It is interesting that we can achieve reasonable performance in the classification tasks by combining the SOM and a simple MLP. However, there is still a large gap between this variant and the full SO-Net, which suggests that the integration of the SOM and the point cloud is important. Another intriguing phenomenon is that the SOM based MLP achieves better results than the split-based MLP. It suggests that the SOM may be more expressive than kd-trees in the context of classification.

E.3. Result Visualization

To visualize the shape retrieval results, we present the top 5 retrieval results for a few shapes, as shown in Fig. 11. For the point cloud autoencoder, we present results from two networks. The first network consumes 1024 points and reconstructs 1280 points with the ShapeNetPart dataset (Fig. 12), while the second one consumes 5000 points and reconstructs 4608 points using the ModelNet40 dataset (Fig. 13). We present one instance for each category. For the results of object part segmentation using the ShapeNetPart dataset, we visualize one instance per category in Fig. 14. The inputs to the network are point clouds of size 1024 and the corresponding surface normal vectors.
Figure 11. Top 5 retrieval results. First column: query shapes. Column 2-6: retrieved shapes ordered by feature similarity.
Figure 12. Results of our ShapeNetPart autoencoder. Red points are recovered by the convolution branch and green ones are by the fully
connected branch. Odd rows: input point clouds. Even rows: reconstructed point clouds.
Figure 13. Results of our ModelNet40 autoencoder. Red points are recovered by the convolution branch and green ones are by the fully
connected branch. Odd rows: input point clouds. Even rows: reconstructed point clouds.
Figure 14. Results of object part segmentation. Odd rows: ground truth segmentation. Even rows: predicted segmentation.
