
Unsupervised Multi-Task Feature Learning on Point Clouds

Kaveh Hassani                          Mike Haley
Autodesk AI Lab, Toronto, Canada       Autodesk AI Lab, San Francisco, USA
[email protected]                    [email protected]

Abstract

We introduce an unsupervised multi-task model to jointly learn point and shape features on point clouds. We define three unsupervised tasks including clustering, reconstruction, and self-supervised classification to train a multi-scale graph-based encoder. We evaluate our model on shape classification and segmentation benchmarks. The results suggest that it outperforms prior state-of-the-art unsupervised models: in the ModelNet40 classification task, it achieves an accuracy of 89.1%, and in the ShapeNet segmentation task, it achieves an mIoU of 68.2 and an accuracy of 88.6%.

1. Introduction

Point clouds are sparse order-invariant sets of interacting points defined in a coordinate space and sampled from the surface of objects to capture their spatial-semantic information. They are the output of 3D sensors such as LiDAR scanners and RGB-D cameras, and are used in applications such as human-computer interaction [21], self-driving cars [51], and robotics [60]. Their sparse nature makes them computationally efficient and less sensitive to noise compared to volumetric and multi-view representations.

Classic methods craft salient geometric features on point clouds to capture their local or global statistical properties. Intrinsic features such as the wave kernel signature (WKS) [6], the heat kernel signature (HKS) [7], multi-scale Gaussian curvature [66], and the global point signature [57], and extrinsic features such as persistent point feature histograms [59] and fast point feature histograms [58] are examples of such features. These features cannot address the semantic tasks required by modern applications and hence have been replaced by the unparalleled representation capacity of deep models.

Feeding point clouds to deep models, however, is not trivial. Standard deep models operate on regular-structured inputs such as grids (images and volumetric data) and sequences (speech and text), whereas point clouds are permutation-invariant and irregular in nature. One can rasterize the point clouds into voxels [81, 43, 53], but doing so demands excessive time and memory, and suffers from information loss and quantization artifacts [78].

Some recent deep models can directly consume point clouds and learn to perform various tasks such as classification [78], semantic segmentation [89, 17], part segmentation [78], image-point cloud translation [19], object detection and region proposal [97], consolidation and surface reconstruction [92, 45, 47], registration [16, 74, 34], generation [68, 67], and up-sampling [93]. These models achieve promising results thanks to their feature learning capabilities. However, to successfully learn such features, they require large amounts of labeled data.

A few works explore unsupervised feature learning on point sets using autoencoders [88, 13, 39, 96, 2, 16] and generative models, e.g., generative adversarial networks (GAN) [68, 67, 2], variational autoencoders (VAE) [20], and Gaussian mixture models (GMM) [2]. Despite their good feature learning capabilities, they suffer from not having access to supervisory signals and from targeting a single task. These shortcomings can be addressed by self-supervised learning and multi-task learning, respectively. Self-supervised learning defines a pretext task using only the information present in the data to provide a surrogate supervisory signal, whereas multi-task learning uses the commonalities across tasks by jointly learning them [95].

We introduce a multi-task model that exploits three regimes of unsupervised learning, namely self-supervision, autoencoding, and clustering, as its target tasks to jointly learn point and shape features. Inspired by [9, 22], we show that leveraging joint clustering and self-supervised classification along with enforcing reconstruction achieves promising results while avoiding trivial solutions. The key contributions of our work are as follows:

• We introduce a multi-scale graph-based encoder for point clouds and train it within an unsupervised multi-task learning setting.

• We exhaustively evaluate our model under various learning settings on the ModelNet40 shape classification and ShapeNetPart segmentation tasks.

• We show that our model achieves state-of-the-art results w.r.t. prior unsupervised models and narrows the gap between unsupervised and supervised models.

2. Related Work

2.1. Deep Learning on Point Clouds

PointNet [52] is an MLP that learns point features independently and aggregates them into a shape feature. PointNet++ [54] defines multi-scale regions, uses PointNet to learn their features, and then hierarchically aggregates them. Models based on KD-trees [37, 94, 20] spatially partition the points using kd-trees and then recursively aggregate them. RNNs [31, 89, 17, 41] are applied to point clouds under the assumption that "order matters" [72] and achieve promising results on semantic segmentation tasks, but the quality of their learned features is not clear.

CNN models introduce non-Euclidean convolutions to operate on point sets. A few models such as RGCNN [70], SyncSpecCNN [91], and Local Spectral GCNN [75] operate in the spectral domain. These models tend to be computationally expensive. Spatial CNNs learn point features by aggregating the contributions of neighbor points. Pointwise convolution [30], edge convolution [78], Spider convolution [84], sparse convolution [65, 25], Monte Carlo convolution [27], parametric continuous convolution [76], feature-steered graph convolution [71], point-set convolution [63], χ-convolution [40], and spherical convolution [38] are examples of these models. Spatial models provide strong localized filters but struggle to learn global structures [70].

A few works train generative models on point sets. Multiresolution VAE [20] introduces a VAE with multiresolution convolution and deconvolution layers. PointGrow [68] is an auto-regressive model that can generate point clouds from scratch or conditioned on given semantic contexts. It is shown that GMMs trained on PointNet features achieve better performance compared to GANs [2].

A few recent works explore representation learning using autoencoders. A simple autoencoder based on PointNet is shown to achieve good results on various tasks [2]. FoldingNet [88] uses an encoder with graph pooling and MLP layers and introduces a decoder of folding operations that deform a 2D grid onto the underlying object surface. PPF-FoldNet [13] projects the points into point pair feature (PPF) space and then applies a PointNet encoder and a FoldingNet decoder to reconstruct that space. AtlasNet [26] extends FoldingNet to multiple grid patches, whereas SO-Net [39] aggregates the point features into SOM node features to encode the spatial distributions. PointCapsNet [96] introduces an autoencoder based on dynamic routing to extract latent capsules, and a few MLPs that generate multiple point patches from the latent capsules with distinct grids.

2.2. Self-Supervised Learning

Self-supervised learning defines a proxy task on unlabeled data and uses the pseudo-labels of that task to provide the model with supervisory signals. It is used in machine vision with proxy tasks such as predicting the arrow of time [79], missing pixels [50], position of patches [14], image rotations [23], synthetic artifacts [33], image clusters [9], camera transformation in consecutive frames [3], rearranging shuffled patches [48], video colourization [73], and tracking of image patches [77], and has demonstrated promising results in learning and transferring visual features.

The main challenge in self-supervised learning is to define tasks that relate most to the down-stream tasks that use the learned features [33]. Unsupervised learning, e.g., density estimation and clustering, on the other hand, is not domain specific [9]. Deep clustering models [4, 44, 86, 28, 83, 22, 61, 87, 29] have recently been proposed to learn cluster-friendly features by jointly optimizing a clustering loss with a network-specific loss. A few recent works combine these two approaches and define deep clustering as a surrogate task for self-supervised learning. It is shown that alternating between clustering the latent representation and predicting the cluster assignments achieves state-of-the-art results in visual feature learning [9, 22].

2.3. Multi-Task Learning

Multi-task learning leverages the commonalities across relevant tasks to enhance the performance over those tasks [95, 18]. It learns a shared feature with adequate expressive power to capture the useful information across the tasks. Multi-task learning has been successfully used in machine vision applications such as image classification [42], image segmentation [12], video captioning [49], and activity recognition [85]. A few works explore self-supervised multi-task learning to learn high-level visual features [15, 55]. Our approach is related to these models, except that we use self-supervised tasks in addition to other unsupervised tasks such as clustering and autoencoding.

3. Methodology

Assume a training set $S = [s_1, s_2, \dots, s_N]$ of $N$ point sets, where a point set $s_i = \{p_1^i, p_2^i, \dots, p_M^i\}$ is an order-invariant set of $M$ points and each point $p_j^i \in \mathbb{R}^{d_{in}}$. In the simplest case $p_j^i = (x_j^i, y_j^i, z_j^i)$ only contains coordinates, but it can be extended to carry other features, e.g., normals. We define an encoder $E_\theta : S \mapsto Z$ that maps input point sets from $\mathbb{R}^{M \times d_{in}}$ into the latent space $Z \in \mathbb{R}^{d_z}$ such that $d_z \gg d_{in}$. For each point $p_j^i$, the encoder first learns a point (local) feature $z_j^i \in \mathbb{R}^{d_z}$ and then aggregates them into a shape (global) feature $Z^i \in \mathbb{R}^{d_z}$. It basically projects the input points to a feature subspace of higher dimension to encode richer local information than the original space.

Any parametric non-linear function parametrized by $\theta$ can be used as the encoder. To learn $\theta$ in an unsupervised multi-task fashion, we define three parametric functions on the latent variable $Z$ as follows:

Clustering function $\Gamma_c : Z \mapsto Y$ maps the latent variable into $K$ categories $Y = [y_1, y_2, \dots, y_n]$ such that $y_i \in \{0,1\}^K$ and $y_n^T 1_K = 1$. This function encourages the encoder to generate features that are clustering-friendly, by pushing similar samples in the feature space closer together and pushing dissimilar ones away. It also provides the model with pseudo-labels for self-supervised learning through its hard cluster assignments.

Classifier function $f_\psi : Z \mapsto \hat{Y}$ predicts the cluster assignments of the latent variable such that the predictions correspond to the hard cluster assignments of $\Gamma_c$. In other words, $f_\psi$ maps the latent variable into $K$ predicted categories $\hat{Y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n]$ such that $\hat{y}_i \in \{0,1\}^K$. This function uses the pseudo-labels generated by the clustering function, i.e., the cluster assignments, as its proxy training data. The difference between the cluster assignments and the predicted cluster assignments provides the supervisory signals.
Decoder function $g_\phi : Z \mapsto \hat{S}$ reconstructs the original point set from the latent variable, i.e., it maps the latent variable $Z \in \mathbb{R}^{d_z}$ to a point set $\hat{S} \in \mathbb{R}^{M \times d_{in}}$. Training a deep model with a clustering loss alone collapses the features into a single cluster [9]. Some heuristics, such as penalizing a minimal number of points per cluster and randomly reassigning empty clusters, have been introduced to prevent this. We instead introduce the decoder function to prevent the model from converging to trivial solutions.

3.1. Training

The model alternates between clustering the latent variable $Z$ to generate pseudo-labels $Y$ for self-supervised learning, and learning the model parameters by jointly predicting the pseudo-labels $\hat{Y}$ and reconstructing the input point set $\hat{S}$. Assuming K-means clustering, the model learns a centroid matrix $C \in \mathbb{R}^{d_z \times K}$ and cluster assignments $y_n$ by optimizing the following clustering objective [9]:

$$\min_{C,\theta} \; \frac{1}{N} \sum_{n=1}^{N} \min_{y_n \in \{0,1\}^K} \left\| z_n - C y_n \right\|_2^2 \tag{1}$$

where $z_n = E_\theta(s_n)$ and $y_n^T 1_K = 1$. The centroid matrix is initialized randomly. It is noteworthy that: (i) when assigning cluster labels, the centroid matrix is fixed, and (ii) the centroid matrix is updated epoch-wise rather than batch-wise to prevent the learning process from diverging.
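To make the inner minimization of Eq. (1) concrete, the following NumPy sketch computes the hard assignments and the clustering loss with the centroid matrix held fixed; the function name and array layout are our own illustrative choices, not code from the paper.

```python
import numpy as np

def assign_clusters(Z, C):
    """Inner minimization of Eq. (1): hard assignments with C fixed.

    Z: (N, d_z) shape features z_n produced by the encoder.
    C: (d_z, K) centroid matrix.
    Returns one-hot assignments Y of shape (N, K) and the mean clustering loss.
    """
    # Squared Euclidean distance from every feature to every centroid.
    dists = ((Z[:, :, None] - C[None, :, :]) ** 2).sum(axis=1)  # (N, K)
    labels = dists.argmin(axis=1)                               # nearest centroid
    Y = np.eye(C.shape[1])[labels]                              # one-hot, y_n^T 1_K = 1
    loss = dists[np.arange(len(Z)), labels].mean()              # (1/N) sum ||z_n - C y_n||^2
    return Y, loss
```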
For the classification function, we minimize the cross-entropy loss between the cluster assignments and the predicted cluster assignments as follows:

$$\min_{\theta,\psi} \; \frac{-1}{N} \sum_{n=1}^{N} y_n \log \hat{y}_n \tag{2}$$

where $y_n = \Gamma_c(z_n)$ and $\hat{y}_n = f_\psi(z_n)$ are the cluster assignments and the predicted cluster assignments, respectively.

We use the Chamfer distance to measure the difference between the original point cloud and its reconstruction. The Chamfer distance is differentiable with respect to the points and is computationally efficient. It is computed by finding the nearest neighbor of each point of the original set in the reconstructed set and vice versa, and summing up their Euclidean distances. Hence, we optimize the decoding loss as follows:

$$\min_{\theta,\phi} \; \frac{1}{2NM} \sum_{n=1}^{N} \sum_{m=1}^{M} \left( \min_{\hat{p} \in \hat{s}_n} \left\| p_m^n - \hat{p} \right\|_2^2 + \min_{p \in s_n} \left\| \hat{p}_m^n - p \right\|_2^2 \right) \tag{3}$$

where $\hat{s}_n = g_\phi(z_n)$, and $s_n$ and $\hat{s}_n$ are the original and reconstructed point sets, respectively. $N$ and $M$ denote the number of point sets in the training set and the number of points in each point set, respectively.
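As a concrete reference for Eq. (3), here is a minimal NumPy sketch of the symmetric Chamfer distance for a single pair of point sets; the helper name is ours, and averaging it over shapes with the $\frac{1}{2NM}$ factor recovers the decoding loss above.

```python
import numpy as np

def chamfer_distance(s, s_hat):
    """Symmetric Chamfer distance between one original point set s (M, 3)
    and its reconstruction s_hat (M, 3), as in Eq. (3)."""
    # Pairwise squared Euclidean distances between the two sets.
    d = ((s[:, None, :] - s_hat[None, :, :]) ** 2).sum(axis=-1)  # (M, M)
    # Nearest reconstructed point for each original point, and vice versa.
    return d.min(axis=1).sum() + d.min(axis=0).sum()
```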
For the classification function, we minimize the cross- For each point, it extracts three intermediate features by ap-
entropy loss between the cluster assignments and the pre- plying graph convolutions on three neighborhood radii and
dicted cluster assignments as follows. concatenates them with the input point feature and its con-
volved feature. The first three features encode the interac-
−1 X
N
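One plausible reading of this epoch-wise schedule is sketched below in NumPy: features and hard assignments collected during the epoch are averaged per cluster, and clusters that received no samples simply keep their old centroid (the paper does not reassign empty clusters, which is how some clusters end up emptied). The helper is illustrative, not the authors' code.

```python
import numpy as np

def update_centroids(Z_epoch, labels_epoch, C):
    """Epoch-wise centroid refresh from features aggregated over one epoch.

    Z_epoch:      (N, d_z) encoder features collected during the epoch.
    labels_epoch: (N,) hard cluster indices for those features.
    C:            (d_z, K) current centroids.
    """
    C_new = C.copy()
    for k in np.unique(labels_epoch):
        # Mean of all features assigned to cluster k during the epoch;
        # clusters with no members are left untouched and may stay empty.
        C_new[:, k] = Z_epoch[labels_epoch == k].mean(axis=0)
    return C_new
```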
3.2. Architecture

Inspired by the Inception [69] and Dynamic Graph CNN (DGCNN) [78] architectures, we introduce a graph-based architecture, shown in Figure 1, which consists of an encoder and three task-specific decoders. The encoder uses a series of graph convolution, convolution, and pooling layers in a multi-scale fashion to learn point and shape features from an input point cloud jittered by Gaussian noise. For each point, it extracts three intermediate features by applying graph convolutions on three neighborhood sizes and concatenates them with the input point feature and its convolved feature. The first three features encode the interactions between each point and its neighbors, whereas the last two features encode the information about each point.
Algorithm 1: Unsupervised multi-task training algorithm.

1  θ, φ, ψ ← Random()                    ▷ initial parameters
2  K ← K_UB                              ▷ upper bound on #clusters
3  C ← E_θ(Choice(S, K))                 ▷ initial centroids
4  for epoch in epochs do
5      while epoch not completed do
6          ▷ forward pass
7          S_x ← Sample(S)               ▷ mini-batch
8          Z_x ← E_θ(S_x)                ▷ encoding
9          Y_x ← Γ_c(Z_x)                ▷ cluster assignment
10         Ŷ_x ← f_ψ(Z_x)                ▷ cluster prediction
11         Ŝ_x ← g_φ(Z_x)                ▷ decoding
12         (Z, Y) ← Aggregate(Z_x, Y_x)
13         ▷ backward pass
14         ∇_{θ,φ,ψ}(αL_Γ(Z_x, C; θ) +
15             βL_f(Y_x, Ŷ_x; θ, ψ) +
16             γL_g(S_x, Ŝ_x; θ, φ))     ▷ compute gradients
17         Update(θ, φ, ψ)               ▷ update with gradients
18     end
19     C ← Update(Z, Y)                  ▷ update centroids
20 end

The concatenation of the intermediate features is then passed through a few convolution and pooling layers to learn another level of intermediate features. These point-wise features are then pooled and fed to an MLP to learn the final shape feature. They are also concatenated with the shape feature to represent the final point features. Similar to [78], we define the graph convolution as follows:

$$z_i = \sum_{p_k \in \mathcal{N}(p_i)} h_\theta\left(\left[p_i \,\|\, p_k - p_i\right]\right) \tag{4}$$

where $z_i$ is the learned feature for point $p_i$ based on its neighbor contributions, $p_k \in \mathcal{N}(p_i)$ are the $k$ nearest points to $p_i$ in Euclidean space, $h_\theta$ is a nonlinear function parameterized by $\theta$, and $\|$ is the concatenation operator. We use a shared MLP for $h_\theta$. The reason for using both $p_i$ and $p_k - p_i$ is to encode both the global information ($p_i$) and the local interactions ($p_k - p_i$) of each point.
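The sketch below spells out Eq. (4) in NumPy for a single scale: a brute-force kNN graph is built in Euclidean space, each edge is described by the concatenation $[p_i \,\|\, p_k - p_i]$, and a shared nonlinearity $h_\theta$ (passed in as any callable) is summed over the neighborhood. The function name and the dense kNN are illustrative simplifications, not the authors' implementation.

```python
import numpy as np

def graph_conv(P, h_theta, k=20):
    """Single-scale graph convolution of Eq. (4).

    P:       (M, d) point features.
    h_theta: shared nonlinear map from (k, 2*d) edge features to (k, d_out),
             e.g. one MLP layer applied to every edge.
    k:       neighborhood size of the kNN graph.
    """
    # Brute-force kNN in Euclidean space (exclude the point itself).
    d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)      # (M, M)
    knn = np.argsort(d2, axis=1)[:, 1:k + 1]                 # (M, k)
    out = []
    for i in range(len(P)):
        center = np.repeat(P[i:i + 1], k, axis=0)            # global term p_i
        edges = np.concatenate([center, P[knn[i]] - P[i]], axis=1)
        out.append(h_theta(edges).sum(axis=0))               # sum over neighbors
    return np.stack(out)                                     # (M, d_out)

# Example shared "MLP": a random linear layer followed by ReLU.
rng = np.random.default_rng(0)
P = rng.normal(size=(128, 3))
W = rng.normal(size=(6, 64))
feats = graph_conv(P, lambda e: np.maximum(e @ W, 0.0))      # (128, 64)
```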
To perform the target tasks, i.e., clustering, classification, and autoencoding, we proceed as follows. For clustering, we use a standard implementation of K-means to cluster the shape features. For self-supervised classification, we feed the shape features to an MLP to predict the category of the shape (i.e., the cluster assignment produced by the clustering module). And for the autoencoding task, we use an MLP to reconstruct the original point cloud from the shape feature. This MLP is denoising: it reconstructs the original point cloud before the addition of the Gaussian noise. All these models, along with the encoder, are trained jointly and end-to-end. Note that all these tasks are defined on the shape features. Because a shape feature is an aggregation of its corresponding point features, learning a good shape feature pushes the model to learn good point features too.

4. Experiments

4.1. Implementation Details

We optimize the network using Adam [36] with an initial learning rate of 0.003 and a batch size of 40. The learning rate is scheduled to decrease by a factor of 0.8 every 50 epochs. We apply batch normalization [32] and ReLU activations to each layer and use dropout [64] with p = 0.5. To normalize the task weights to the same scale, we set the weights of clustering (α), classification (β), and reconstruction (γ) to 0.005, 1.0, and 500, respectively. For graph convolutions, we use neighborhood sizes of 15, 20, and 25 (as suggested in [78]), and for normal convolutions we use 1×1 kernels. We set the upper bound on the number of clusters (K_UB) to 500. We also set the sizes of the MLPs in the prediction and reconstruction tasks to [2048, 1024, 500] and [2048, 1024, 6144], respectively. Note that the sizes of the last layers correspond to the upper bound on the number of clusters (500) and to the reconstruction size (6144 = 2048×3). Following [2], we set the shape and point feature sizes to 512 and 1024, respectively.

For preprocessing and augmentation, we follow [52, 78]: we uniformly sample 2048 points and normalize them to a unit sphere. We also apply point-wise Gaussian noise of N(0, 0.01) and shape-wise random rotations between [-180, 180] degrees along the z-axis and random rotations between [-20, +20] degrees along the x and y axes.
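A compact NumPy sketch of this preprocessing pipeline follows; it is our own illustrative code, and the rotation-composition order is one arbitrary choice the paper does not specify.

```python
import numpy as np

def preprocess(points, sigma=0.01, tilt=20.0):
    """Normalize a (2048, 3) point set to the unit sphere, rotate it
    randomly ([-180, 180] deg about z, [-20, 20] deg about x and y),
    and add point-wise Gaussian jitter, as in Sec. 4.1."""
    points = points - points.mean(axis=0)
    points = points / np.linalg.norm(points, axis=1).max()
    ax, ay = np.radians(np.random.uniform(-tilt, tilt, size=2))
    az = np.radians(np.random.uniform(-180.0, 180.0))
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    rotated = points @ (Rz @ Ry @ Rx).T
    # The encoder consumes the jittered cloud; the decoder reconstructs
    # the clean (pre-noise) cloud.
    return rotated + np.random.normal(0.0, sigma, rotated.shape)
```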
The model is implemented with TensorFlow [1] on an Nvidia DGX-1 server with 8 Volta V100 GPUs. We used synchronous parallel training by distributing the training mini-batches over all GPUs and averaging the gradients to update the model parameters. With this setting, our model takes 830s on average to train one epoch on ShapeNet (i.e., ∼55k samples of size 2048×3). We train the model for 500 epochs. At test time, inference takes 8ms on an input point cloud of size 2048×3.

4.2. Pre-training for Transfer Learning

Following the experimental protocol introduced in [2], we pre-train the model across all categories of the ShapeNet dataset [10] (i.e., 57,000 models across 55 categories), and then transfer the trained model to two down-stream tasks: shape classification and part segmentation. After pre-training the model, we freeze its weights and do not fine-tune it for the down-stream tasks.
Figure 1. Proposed architecture for unsupervised multi-task feature learning on point clouds. It consists of a multi-scale graph-based encoder that generates point and shape features for an input point cloud, and three task decoders (reconstruction, clustering, and prediction) that jointly provide the architecture with a multi-task loss.

Following [9], we use Normalized Mutual Information (NMI) to measure the correlation between cluster assignments and the categories without leaking the category information to the model. This measure gives insight into the capability of the model to predict category-level information without observing the ground-truth labels. The model reaches an NMI of 0.68 and 0.62 on the train and validation sets, respectively, which suggests that the learned features progressively encode category-wise information.
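For reference, this evaluation reduces to a single scikit-learn call; the sketch assumes the assignments and labels are given as integer arrays.

```python
from sklearn.metrics import normalized_mutual_info_score

def clustering_nmi(cluster_assignments, category_labels):
    """NMI between the model's hard cluster assignments and held-out
    category labels; the labels are used only for evaluation and never
    reach the model during training."""
    return normalized_mutual_info_score(category_labels, cluster_assignments)
```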
We also observe that the model converges to 88 clusters (from the initial 500 clusters), which is 33 more clusters than the number of ShapeNet categories. This is consistent with the observation that "some amount of over-segmentation is beneficial" [9]. The model empties more than 80% of the clusters but does not converge to the trivial solution of one cluster. We also trained our model on the 10 largest ShapeNet categories to investigate the clustering behavior; there, the model converged to 17 clusters. This confirms that the model converges to a fixed number of clusters which is less than the initial upper bound and more than the actual number of categories in the data.

To investigate the dynamics of the learned features, we selected the 10 largest ShapeNet categories and randomly sampled 200 shapes from each category. The evolution of the features of the sampled shapes, visualized using t-SNE (Figure 2), suggests that the learned features progressively demonstrate clustering-friendly behavior along the training epochs.

4.3. Shape Classification

To evaluate the performance of the model on shape feature learning, we follow the experimental protocol in [2] and report the classification accuracy on transfer learning from the ShapeNet dataset [10] to the ModelNet40 dataset [82] (i.e., 13,834 models across 40 categories, divided into 9,843 train and 3,991 test samples). Similar to [2], we extract the shape features of the ModelNet40 samples from the pre-trained model without any fine-tuning, train a linear SVM on them, and report the classification accuracy. This approach is a common practice in evaluating unsupervised visual feature learning [9] and provides insight into the effectiveness of the learned features in classification tasks.
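A minimal sketch of this protocol with scikit-learn, assuming the frozen encoder's 512-dimensional shape features have already been extracted into arrays:

```python
from sklearn.svm import LinearSVC

def transfer_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear SVM on frozen shape features and report test accuracy,
    mirroring the transfer-learning protocol of [2]."""
    svm = LinearSVC()
    svm.fit(train_feats, train_labels)
    return svm.score(test_feats, test_labels)
```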
Results shown in Table 1 suggest that our model achieves state-of-the-art accuracy on the ModelNet40 shape classification task compared to other unsupervised feature learning models. It is noteworthy that the reported result is obtained without any hyper-parameter tuning; with random hyper-parameter search, we observed a 0.4 absolute increase in accuracy (i.e., 89.5%). The results also suggest that the unsupervised model is competitive with the supervised models. Error analysis reveals that the misclassifications occur between geometrically similar shapes. For example, the three most frequent misclassifications are between the (table, desk), (nightstand, dresser), and (flowerpot, plant) categories. A similar observation is reported in [2], where it is suggested that stronger supervision signals may be required to learn the subtle details that discriminate these categories.

To further investigate the quality of the learned shape features, we evaluated them in a zero-shot setting. For this purpose, we cluster the learned features using agglomerative hierarchical clustering (AHC) [46] and then align the assigned cluster labels with the ground-truth labels (ModelNet40 categories) based on majority voting within each cluster. The results suggest that the model achieves 68.88% accuracy on the shape classification task with zero supervision. This result is consistent with the observed NMI between cluster assignments and ground-truth labels on the ShapeNet dataset.
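The alignment step can be sketched as follows (scikit-learn AHC plus a per-cluster majority vote); the cluster count and the helper name are our illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def zero_shot_accuracy(shape_feats, true_labels, n_clusters=40):
    """Cluster frozen shape features with AHC, map each cluster to its
    majority ground-truth category, and score the induced labeling."""
    clusters = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(shape_feats)
    mapped = np.empty_like(true_labels)
    for c in np.unique(clusters):
        members = clusters == c
        vals, counts = np.unique(true_labels[members], return_counts=True)
        mapped[members] = vals[counts.argmax()]   # majority vote in the cluster
    return float((mapped == true_labels).mean())
```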

Figure 2. Evolution of the learned features along the training epochs (visualized using t-SNE), showing progressively clustering-friendly behavior. Panels show epochs 1, 100, 250, and 500.

Unsupervised transfer learning          Supervised learning
Model                  Accuracy         Model             Accuracy
SPH [35]               68.2             PointNet [52]     89.2
LFD [11]               75.5             PointNet++ [54]   90.7
T-L Network [24]       74.4             PointCNN [30]     86.1
VConv-DAE [62]         75.5             DGCNN [78]        92.2
3D-GAN [80]            83.3             KCNet [63]        91.0
Latent-GAN [2]         85.7             KDNet [37]        91.8
MRTNet-VAE [20]        86.4             MRTNet [20]       91.7
FoldingNet [88]        88.4             SpecGCN [75]      91.5
PointCapsNet [96]      88.9
Ours                   89.1

Table 1. Left: Accuracy of classification by transfer learning from ShapeNet on the ModelNet40 data. Right: Classification accuracy of supervised learning on the ModelNet40 data. Our model narrows the gap with supervised models.

                       1% of train data        5% of train data
Model                  Accuracy   IoU          Accuracy   IoU
SO-Net [39]            78.0       64.0         84.0       69.0
PointCapsNet [96]      85.0       67.0         86.0       70.0
Ours                   88.6       68.2         93.7       77.7

Table 2. Results on the semi-supervised ShapeNetPart segmentation task.
4.4. Part Segmentation

Part segmentation is a fine-grained point-wise classification task where the goal is to predict the part category label of each point in a given shape. We evaluate the learned point features on the ShapeNetPart dataset [90], which contains 16,881 objects from 16 categories (12,149 train, 2,874 test, and 1,858 validation). Each object consists of 2 to 6 parts, with a total of 50 distinct parts among all categories. Following [52], we use mean Intersection-over-Union (mIoU) as the evaluation metric, computed by averaging the IoUs of the different parts occurring in a shape. We also report part classification accuracy.
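A sketch of the per-shape metric follows, under one common convention for this benchmark (a part absent from both prediction and ground truth contributes an IoU of 1); the helper name is ours.

```python
import numpy as np

def shape_miou(pred, gt, part_ids):
    """mIoU for one shape: average IoU over the parts of its category.

    pred, gt: (M,) predicted and ground-truth part labels.
    part_ids: iterable of part labels belonging to the shape's category.
    """
    ious = []
    for part in part_ids:
        inter = np.sum((pred == part) & (gt == part))
        union = np.sum((pred == part) | (gt == part))
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))
```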
Following [96], we randomly sample 1% and 5% of the ShapeNetPart train set to evaluate the point features in a semi-supervised setting. We use the same pre-trained model to extract the point features of the sampled training data, along with the validation and test samples, without any fine-tuning. We then train a 4-layer MLP [2048, 4096, 1024, 50] on the sampled training sets and evaluate it on all test data. Results shown in Table 2 suggest that our model achieves state-of-the-art accuracy and mIoU on the ShapeNetPart segmentation task compared to other unsupervised feature learning models. Comparisons between our model (trained on 5% of the training data) and the fully supervised models are shown in Table 3. The results suggest that our model achieves an mIoU which is only 8% less than that of the best supervised model, and hence narrows the gap with supervised models.

We also performed intrinsic evaluations to investigate the consistency of the learned point features within each category. We sampled a few shapes from each category, stacked their point features, and reduced the feature dimension from 1024 to 512 using PCA. We then co-clustered the features using the AHC method. The result of co-clustering on the airplane category is shown in Figure 3; we observed similarly consistent behavior over all categories. We also used AHC and hierarchical density-based spatial clustering (HDBSCAN) [8] to cluster the point features of each shape, aligning the assigned cluster labels with the ground-truth labels based on majority voting within each cluster. A few sample shapes along with their ground-truth part labels, the part labels predicted by the trained MLP, and the AHC and HDBSCAN clusterings are illustrated in Figure 4. As shown, HDBSCAN clustering yields a decent segmentation from the learned features in a fully unsupervised setting.
Model             %train  Cat.   Ins.   Aero  Bag   Cap   Car   Chair Ear-   Guitar Knife Lamp  Laptop Motor Mug   Pistol Rocket Skate- Table
                  data    mIoU   mIoU                                  phone                                                      board
PointNet [52]     100%    80.4   83.7   83.4  78.7  82.5  74.9  89.6  73.0   91.5   85.9  80.8  95.3   65.2  93.0  81.2   57.9   72.8   80.6
PointNet++ [54]   100%    81.9   85.1   82.4  79.0  87.7  77.3  90.8  71.8   91.0   85.9  83.7  95.3   71.6  94.1  81.3   58.7   76.4   82.6
DGCNN [78]        100%    82.3   85.1   84.2  83.7  84.4  77.1  90.9  78.5   91.5   87.3  82.9  96.0   67.8  93.3  82.6   59.7   75.5   82.0
KCNet [63]        100%    82.2   84.7   82.8  81.5  86.4  77.6  90.3  76.8   91.0   87.2  84.5  95.5   69.2  94.4  81.6   60.1   75.2   81.3
RSNet [31]        100%    81.4   84.9   82.7  86.4  84.1  78.2  90.4  69.3   91.4   87.0  83.5  95.4   66.0  92.6  81.8   56.1   75.8   82.2
SyncSpecCNN [91]  100%    82.0   84.7   81.6  81.7  81.9  75.2  90.2  74.9   93.0   86.1  84.7  95.6   66.7  92.7  81.6   60.6   82.9   82.1
RGCNN [70]        100%    79.5   84.3   80.2  82.8  92.6  75.3  89.2  73.7   91.3   88.4  83.3  96.0   63.9  95.7  60.9   44.6   72.9   80.4
SpiderCNN [84]    100%    82.4   85.3   83.5  81.0  87.2  77.5  90.7  76.8   91.1   87.3  83.3  95.8   70.2  93.5  82.7   59.7   75.8   82.8
SPLATNet [65]     100%    83.7   85.4   83.2  84.3  89.1  80.3  90.7  75.5   92.1   87.1  83.9  96.3   75.6  95.8  83.8   64.0   75.5   81.8
FCPN [56]         100%    84.0   84.0   84.0  82.8  86.4  88.3  83.3  73.6   93.4   87.4  77.4  97.7   81.4  95.8  87.7   68.4   83.6   73.4
Ours              5%      72.1   77.7   78.4  67.7  78.2  66.2  85.5  52.6   87.7   81.6  76.3  93.7   56.1  80.1  70.9   44.7   60.7   73.0

Table 3. Comparison between our semi-supervised model and supervised models on the ShapeNetPart segmentation task. Average mIoU over instances (Ins.) and categories (Cat.) are reported.

Figure 3. Co-clustering of the learned point features within the airplane category using hierarchical clustering, demonstrating the consistency of the learned point features within the category.

4.5. Ablation Study

We first investigate the effectiveness of the graph-based encoder on the shape classification task. In the first experiment, we replace our encoder with a PointNet [52] encoder and keep the multi-task decoders. We train and test the network with the same transfer learning protocol, which results in a classification accuracy of 86.2%. Compared to the graph-based encoder, with an accuracy of 89.1%, this suggests that our encoder learns better features and hence contributes to the state-of-the-art results that we achieve. To investigate the effectiveness of the multi-task learning, we compare our result against the result reported for a PointNet autoencoder (i.e., a single reconstruction decoder) [2], which achieves a classification accuracy of 85.7%. This suggests that multi-task learning improves the quality of the learned features. A summary of the results is shown in Table 4.

Encoder     Decoder          Accuracy
PointNet    Reconstruction   85.7
PointNet    Multi-Task       86.2
Ours        Reconstruction   86.7
Ours        Multi-Task       89.1

Table 4. Effect of the encoder and of multi-task learning on accuracy on ModelNet40.

We also investigate the effect of the different tasks on the quality of the learned features by masking the task losses and training and testing the model on each configuration. The results, shown in Table 5, suggest that the reconstruction task has the highest impact on performance. This is because, contrary to [9], we do not apply any heuristics to avoid trivial solutions, and hence when the reconstruction task is masked, both the clustering and classification tasks tend to collapse the features to one cluster, which results in degraded feature learning.

Moreover, the results suggest that masking the cross-entropy loss degrades the accuracy to 87.6% (an absolute decrease of 1.5%), whereas masking the K-means loss has a less adverse effect (a degraded accuracy of 88.3%, i.e., an absolute decrease of 0.8%). This implies that the cross-entropy loss (classifier) plays a more important role than the clustering loss. Furthermore, the results indicate that having both the K-means and cross-entropy losses along with the reconstruction task yields the best result (i.e., an accuracy of 89.1%). This may seem counter-intuitive, as one might assume that using the clustering pseudo-labels to learn a classification function would push the classifier to replicate the K-means behavior and hence make the K-means loss redundant.

Figure 4. A few sample shapes along with their ground-truth part labels, the part labels predicted by the MLP trained on 1% of the training data, and the part labels predicted by the AHC and HDBSCAN methods.

Classification   Reconstruction   Clustering   Overall
Task             Task             Task         Accuracy
✓                ×                ×            22.8
×                ✓                ×            86.7
×                ×                ✓            6.9
✓                ✓                ×            88.3
✓                ×                ✓            15.2
×                ✓                ✓            87.6
✓                ✓                ✓            89.1

Table 5. Effect of the tasks on classification accuracy on ModelNet40.
which uses three unsupervised tasks including clustering,
However, we think this is not the case, because the classifier introduces non-linearity into the feature space by learning non-linear boundaries to approximate the predictions of the linear K-means model, which in turn affects the clustering outcomes in the following epoch. The K-means loss, on the other hand, pushes the features within the same cluster closer together while pushing the features of other clusters away.

Finally, we report some of our failed experiments:

• We tried K-Means++ [5] to warm-start the cluster centroids. We did not observe any significant improvement over randomly selected centroids.

• We tried soft parameter sharing between the decoder and classifier models. We observed that this destabilizes the model, and hence we isolated them.

• Similar to [78], we tried stacking more graph convolution layers and recomputing the input adjacency to each layer based on the feature space of its predecessor layer. We observed that this has an adverse effect on both the classification and segmentation tasks.

5. Conclusion

We proposed an unsupervised multi-task learning approach to learn point and shape features on point clouds, which uses three unsupervised tasks, namely clustering, autoencoding, and self-supervised classification, to train a multi-scale graph-based encoder. We exhaustively evaluated our model on point cloud classification and segmentation benchmarks. The results suggest that the learned features outperform those of prior state-of-the-art models in unsupervised representation learning. For example, in the ModelNet40 shape classification task, our model achieved a state-of-the-art (among unsupervised models) accuracy of 89.1%, which is also competitive with supervised models. In the ShapeNetPart segmentation task, it achieved an mIoU of 77.7, which is only 8% less than the state-of-the-art supervised model. For future directions, we plan to: (i) introduce more powerful decoders to enhance the quality of the learned features, (ii) investigate the effect of other input features such as normals and geodesics, and (iii) adapt the model to perform semantic segmentation tasks as well.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.
[2] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. 2018.
[3] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In The IEEE International Conference on Computer Vision (ICCV), pages 37–45, December 2015.
[4] Elie Aljalbout, Vladimir Golkov, Yawar Siddiqui, and Daniel Cremers. Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648, 2018.
[5] David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
[6] Mathieu Aubry, Ulrich Schlickewei, and Daniel Cremers. The wave kernel signature: A quantum mechanical approach to shape analysis. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 1626–1633, Nov 2011.
[7] Michael M. Bronstein and Iasonas Kokkinos. Scale-invariant heat kernel signatures for non-rigid shape recognition. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1704–1711, June 2010.
[8] Ricardo J. G. B. Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 160–172. Springer, 2013.
[9] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In The European Conference on Computer Vision (ECCV), pages 132–149, September 2018.
[10] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[11] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On visual similarity based 3d model retrieval. In Computer Graphics Forum, volume 22, pages 223–232. Wiley Online Library, 2003.
[12] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.
[13] Haowen Deng, Tolga Birdal, and Slobodan Ilic. PPF-FoldNet: Unsupervised learning of rotation invariant 3d local descriptors. In The European Conference on Computer Vision (ECCV), September 2018.
[14] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In The IEEE International Conference on Computer Vision (ICCV), pages 1422–1430, December 2015.
[15] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2051–2060, 2017.
[16] Gil Elbaz, Tamar Avraham, and Anath Fischer. 3d point cloud registration for localization using a deep neural network auto-encoder. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4631–4640, July 2017.
[17] Francis Engelmann, Theodora Kontogianni, Alexander Hermans, and Bastian Leibe. Exploring spatial context for 3d semantic segmentation of point clouds. In The IEEE International Conference on Computer Vision (ICCV) Workshops, pages 716–724, Oct 2017.
[18] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, page 41. MIT Press, 2007.
[19] Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point set generation network for 3d object reconstruction from a single image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 605–613, July 2017.
[20] Matheus Gadelha, Rui Wang, and Subhransu Maji. Multiresolution tree networks for 3d point cloud processing. In The European Conference on Computer Vision (ECCV), pages 103–118, September 2018.
[21] Liuhao Ge, Yujun Cai, Junwu Weng, and Junsong Yuan. Hand PointNet: 3d hand pose estimation using point sets. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8417–8426, June 2018.
[22] Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In The IEEE International Conference on Computer Vision (ICCV), pages 5736–5745, Oct 2017.
[23] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), 2018.
[24] Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision, pages 484–499. Springer, 2016.
[25] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9224–9232, June 2018.
[26] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan Russell, and Mathieu Aubry. AtlasNet: A papier-mâché approach to learning 3d surface generation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[27] Pedro Hermosilla, Tobias Ritschel, Pere-Pau Vázquez, Àlvar Vinacua, and Timo Ropinski. Monte Carlo convolution for learning on non-uniformly sampled point clouds. ACM Trans. Graph., 37(6):235:1–235:12, 2018.
[28] John R. Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 31–35, March 2016.
[29] C. Hsu and C. Lin. CNN-based joint clustering and representation learning with feature drift compensation for large-scale image data. IEEE Transactions on Multimedia, 20(2):421–429, Feb 2018.
[30] Binh-Son Hua, Minh-Khoi Tran, and Sai-Kit Yeung. Pointwise convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 984–993, June 2018.
[31] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent slice networks for 3d segmentation of point clouds. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2626–2635, June 2018.
[32] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pages 448–456, Jul 2015.
[33] Simon Jenni and Paolo Favaro. Self-supervised feature learning by learning to spot artifacts. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2733–2742, June 2018.
[34] Felix Järemo Lawin, Martin Danelljan, Fahad Shahbaz Khan, Per-Erik Forssén, and Michael Felsberg. Density adaptive point set registration. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3829–3837, June 2018.
[35] Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Rotation invariant spherical harmonic representation of 3d shape descriptors. In Proceedings of the 2003 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing (SGP '03), pages 156–164. Eurographics Association, 2003.
[36] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2014.
[37] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In The IEEE International Conference on Computer Vision (ICCV), pages 863–872, Oct 2017.
[38] Huan Lei, Naveed Akhtar, and Ajmal Mian. Spherical convolutional neural network for 3d point clouds. arXiv preprint arXiv:1805.07872, 2018.
[39] Jiaxin Li, Ben M. Chen, and Gim Hee Lee. SO-Net: Self-organizing network for point cloud analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9397–9406, June 2018.
[40] Yangyan Li, Rui Bu, Mingchao Sun, and Baoquan Chen. PointCNN: Convolution on X-transformed points. In Advances in Neural Information Processing Systems, pages 828–838. Curran Associates, Inc., 2018.
[41] Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. Point2Sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. arXiv preprint arXiv:1811.02565, 2018.
[42] Yong Luo, Yonggang Wen, Dacheng Tao, Jie Gui, and Chao Xu. Large margin multi-modal multi-task feature extraction for image classification. IEEE Transactions on Image Processing, 25(1):414–427, 2015.
[43] Daniel Maturana and Sebastian Scherer. VoxNet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928, Sept 2015.
[44] Erxue Min, Xifeng Guo, Qiang Liu, Gen Zhang, Jianjing Cui, and Jun Long. A survey of clustering with deep learning: From the perspective of network architecture. IEEE Access, 6:39501–39514, 2018.
[45] Christian Mostegel, Rudolf Prettenthaler, Friedrich Fraundorfer, and Horst Bischof. Scalable surface reconstruction from point clouds with extreme scale and density diversity. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 904–913, July 2017.
[46] Daniel Müllner. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378, 2011.
[47] Liangliang Nan and Peter Wonka. PolyFit: Polygonal surface reconstruction from point clouds. In The IEEE International Conference on Computer Vision (ICCV), pages 2353–2361, Oct 2017.
[48] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision (ECCV), pages 69–84. Springer, 2016.
[49] Ramakanth Pasunuru and Mohit Bansal. Multi-task video captioning with video and entailment generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1273–1283, 2017.
[50] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2536–2544, June 2016.
[51] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum PointNets for 3d object detection from RGB-D data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 918–927, June 2018.
[52] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3d classification and segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 652–660, July 2017.
[53] Charles R. Qi, Hao Su, Matthias Niessner, Angela Dai, Mengyuan Yan, and Leonidas J. Guibas. Volumetric and multi-view CNNs for object classification on 3d data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5648–5656, June 2016.
[54] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems 30, pages 5099–5108. Curran Associates, Inc., 2017.

[55] Zhongzheng Ren and Yong Jae Lee. Cross-domain self-supervised multi-task feature learning using synthetic imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 762–771, 2018.
[56] Dario Rethage, Johanna Wald, Jürgen Sturm, Nassir Navab, and Federico Tombari. Fully-convolutional point networks for large-scale point clouds. In The European Conference on Computer Vision (ECCV), pages 596–611, September 2018.
[57] Raif M. Rustamov. Laplace-Beltrami eigenfunctions for deformation invariant shape representation. In Proceedings of the Fifth Eurographics Symposium on Geometry Processing, pages 225–233. Eurographics Association, 2007.
[58] Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (FPFH) for 3d registration. In 2009 IEEE International Conference on Robotics and Automation, pages 3212–3217, May 2009.
[59] Radu Bogdan Rusu, Nico Blodow, Zoltan C. Marton, and Michael Beetz. Aligning point cloud views using persistent feature histograms. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3384–3391, Sept 2008.
[60] Radu Bogdan Rusu, Zoltan Csaba Marton, Nico Blodow, Mihai Dolha, and Michael Beetz. Towards 3d point cloud based object maps for household environments. Robotics and Autonomous Systems, 56(11):927–941, 2008.
[61] Uri Shaham, Kelly Stanton, Henry Li, Ronen Basri, Boaz Nadler, and Yuval Kluger. SpectralNet: Spectral clustering using deep neural networks. In International Conference on Learning Representations (ICLR), 2018.
[62] Abhishek Sharma, Oliver Grau, and Mario Fritz. VConv-DAE: Deep volumetric shape learning without object labels. In European Conference on Computer Vision, pages 236–250. Springer, 2016.
[63] Yiru Shen, Chen Feng, Yaoqing Yang, and Dong Tian. Mining point cloud local structures by kernel correlation and graph pooling. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4548–4557, June 2018.
[64] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[65] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2530–2539, June 2018.
[66] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. Computer Graphics Forum, 28(5):1383–1392, 2009.
[67] Yongbin Sun, Yue Wang, Ziwei Liu, Joshua E. Siegel, and Sanjay Sarma. Point cloud GAN. arXiv preprint arXiv:1810.05591, 2018.
[68] Yongbin Sun, Yue Wang, Ziwei Liu, Joshua E. Siegel, and Sanjay Sarma. PointGrow: Autoregressively learned point cloud generation with self-attention. arXiv preprint arXiv:1810.05591, 2018.
[69] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[70] Gusi Te, Wei Hu, Amin Zheng, and Zongming Guo. RGCNN: Regularized graph CNN for point cloud segmentation. In Proceedings of the 26th ACM International Conference on Multimedia, pages 746–754. ACM, 2018.
[71] Nitika Verma, Edmond Boyer, and Jakob Verbeek. FeaStNet: Feature-steered graph convolutions for 3d shape analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2598–2606, June 2018.
[72] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR), 2015.
[73] Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos. In The European Conference on Computer Vision (ECCV), pages 391–408, September 2018.
[74] Jayakorn Vongkulbhisal, Beat Irastorza Ugalde, Fernando De la Torre, and João P. Costeira. Inverse composition discriminative optimization for point cloud registration. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2993–3001, June 2018.
[75] Chu Wang, Babak Samari, and Kaleem Siddiqi. Local spectral graph convolution for point set feature learning. In The European Conference on Computer Vision (ECCV), pages 52–66, September 2018.
[76] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2589–2597, June 2018.
[77] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In The IEEE International Conference on Computer Vision (ICCV), pages 2794–2802, December 2015.
[78] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
[79] Donglai Wei, Joseph J. Lim, Andrew Zisserman, and William T. Freeman. Learning and using the arrow of time. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8052–8060, June 2018.
[80] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems 29, pages 82–90. Curran Associates, Inc., 2016.

[81] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1912–1920, June 2015.
[82] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1912–1920, June 2015.
[83] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning (ICML), volume 48, pages 478–487. PMLR, June 2016.
[84] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. SpiderCNN: Deep learning on point sets with parameterized convolutional filters. In The European Conference on Computer Vision (ECCV), pages 87–102, September 2018.
[85] Yan Yan, Elisa Ricci, Gaowen Liu, and Nicu Sebe. Egocentric daily activity recognition via multitask clustering. IEEE Transactions on Image Processing, 24(10):2984–2995, 2015.
[86] Bo Yang, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 3861–3870. PMLR, August 2017.
[87] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5147–5156, June 2016.
[88] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. FoldingNet: Point cloud auto-encoder via deep grid deformation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 206–215, June 2018.
[89] Xiaoqing Ye, Jiamao Li, Hexiao Huang, Liang Du, and Xiaolin Zhang. 3d recurrent neural networks with context fusion for point cloud semantic segmentation. In The European Conference on Computer Vision (ECCV), pages 403–417, September 2018.
[90] Li Yi, Vladimir G. Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. A scalable active framework for region annotation in 3d shape collections. ACM Trans. Graph., 35(6):210:1–210:12, Nov. 2016.
[91] Li Yi, Hao Su, Xingwen Guo, and Leonidas J. Guibas. SyncSpecCNN: Synchronized spectral CNN for 3d shape segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2282–2290, July 2017.
[92] Lequan Yu, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. EC-Net: An edge-aware point set consolidation network. In The European Conference on Computer Vision (ECCV), pages 386–402, September 2018.
[93] Lequan Yu, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. PU-Net: Point cloud upsampling network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2790–2799, June 2018.
[94] Wei Zeng and Theo Gevers. 3DContextNet: K-d tree guided hierarchical learning of point clouds using local and global contextual cues. In The European Conference on Computer Vision (ECCV) Workshops, September 2018.
[95] Yu Zhang and Qiang Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.
[96] Yongheng Zhao, Tolga Birdal, Haowen Deng, and Federico Tombari. 3D point capsule networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[97] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3d object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4490–4499, June 2018.
