
IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 2, APRIL 2022

SegContrast: 3D Point Cloud Feature Representation Learning Through Self-Supervised Segment Discrimination

Lucas Nunes, Graduate Student Member, IEEE, Rodrigo Marcuzzi, Graduate Student Member, IEEE, Xieyuanli Chen, Graduate Student Member, IEEE, Jens Behley, and Cyrill Stachniss, Member, IEEE

Abstract—Semantic scene interpretation is essential for autonomous systems to operate in complex scenarios. While deep learning-based methods excel at this task, they rely on vast amounts of labeled data that is tedious to generate and might not cover all relevant classes sufficiently. Self-supervised representation learning has the prospect of reducing the amount of required labeled data by learning descriptive representations from unlabeled data. In this letter, we address the problem of representation learning for 3D point cloud data in the context of autonomous driving. We propose a new contrastive learning approach that aims at learning the structural context of the scene. Our approach extracts class-agnostic segments over the point cloud and applies the contrastive loss over these segments to discriminate between similar and dissimilar structures. We apply our method on data recorded with a 3D LiDAR. We show that our method achieves competitive performance and can learn a more descriptive feature representation than other state-of-the-art self-supervised contrastive point cloud methods.

Index Terms—Semantic scene understanding, deep learning methods, representation learning.

I. INTRODUCTION

Fine-grained scene interpretation is crucial to understand the surroundings of a robot or autonomous vehicle. Given the extensive research effort on convolutional neural networks (CNNs) [5], [23], [56], 2D image tasks have been defined and adapted to this scope, e.g., instance segmentation [33], [53] or panoptic segmentation [32], [57]. For 3D scene understanding, we can employ semantic segmentation [43], [58] to achieve a point-wise classification of the scene. Such semantic information is important for autonomous systems to interpret and safely interact with their surroundings.

Image data, however, does not provide straightforward 3D information, which is also essential in this domain. LiDAR sensors provide 3D information in the form of spatial positions and intensity values as point-wise information. Such data is often more challenging to interpret than images due to the lack of color information and the sparser representation of the objects. Therefore, many recent studies [10], [19], [34], [38], [40], [44] focus on convolution operations applied to point clouds to boost their performance.

Fig. 1. Results trained with only 0.1% of the labels: trained from scratch (middle), i.e., without any pre-training, and pre-trained with our approach SegContrast (bottom). With SegContrast pre-training, the semantic segmentation can better delineate the structural information in the scene and better classify the more fine-grained objects, like trees, traffic signs, and poles, highlighted by black arrows.

Manuscript received September 8, 2021; accepted December 31, 2021. Date of publication January 13, 2022; date of current version January 25, 2022. This letter was recommended for publication by Associate Editor S. Jain and Editor M. Vincze Lerma upon evaluation of the reviewers' comments. This work was supported in part by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy, Grant EXC-2070–390732324–PhenoRob, and in part by the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 101017008 (Harmony). (Corresponding author: Lucas Nunes.)

The authors are with the Institute of Geodesy and Geoinformation, University of Bonn, 53115 Bonn, Germany (e-mail: [email protected]; [email protected]; [email protected]; jens.behley@igg.uni-bonn.de; [email protected]).

Digital Object Identifier 10.1109/LRA.2022.3142440



Supervised methods need a substantial amount of labeled data to achieve high performance, which is particularly hard to acquire for LiDAR data. Data annotation is expensive [3] due to the density of measurements, which are sensor-specific and also depend on the sensor mounting. For semantic segmentation, point-wise labeling is required, making it even harder to annotate. There are also fewer labeled LiDAR datasets available compared to image datasets [11], [30]. Given the sensor-specific characteristics, learned features from one dataset cannot be easily transferred to a dataset with a different sensor setup. Therefore, recent work in this domain addresses label efficiency [27], [47], [51] or transferability across datasets [1], [28], [29], [35], [52].

Self-supervised representation learning has the prospect of using the data in an unsupervised way to learn a robust feature representation that can be transferred to different downstream supervised tasks. One of the self-supervised learning methods that has received increasing attention recently is contrastive learning [6]–[8], [20], [22]. These methods take advantage of data augmentation to generate augmented versions of one anchor sample, then learn a representation that maps these augmented samples to similar features in the feature space and pushes them apart from different samples. Starting from networks pre-trained with this paradigm, the network is then fine-tuned for different downstream tasks.

The main contribution of this letter is a contrastive representation learning method for 3D LiDAR point clouds that is able to learn structural information. Our method extracts segments from the LiDAR data and uses contrastive learning to discriminate between similar and dissimilar structures. The learned feature representation is then used as a starting point for supervised fine-tuning, reducing the amount of labeled training data needed. Our results suggest that our method can better learn the structural information (see Fig. 1) and a more descriptive feature representation during the self-supervised pre-training, surpassing previous point cloud-based contrastive methods in different evaluations. We explicitly show that, compared to the state of the art, our approach (i) is more efficient when using fewer labels, (ii) can better describe fine-grained structures, and (iii) is more transferable between different datasets.

The implementation of our approach is publicly available at https://github.com/PRBonn/segcontrast.

II. RELATED WORK

Contrastive learning methods are receiving increasing attention in the vision community. Such methods improve the performance of different image-based classification methods using fewer or even no labels [6]–[8], [20], [22], [45], [54]. Contrastive learning takes advantage of data augmentation to generate two distinct versions of one anchor sample, creating a positive pair. Then, the network is trained to learn a feature representation that maximizes the similarity between this positive pair and minimizes the similarity with other, so-called negative samples. Recent contrastive learning works aimed at fine-grained tasks, such as semantic segmentation [46] or object detection [24]. These methods apply the contrastive loss over image segments extracted using a class-agnostic segmentation approach. Such works improved the network performance using the contrastive loss for pre-training [24], [46] or as an auxiliary supervised loss [37], [48].

Point cloud data has been the source of extensive research in the field of autonomous vehicles [21]. Much effort has been dedicated to applying convolutional operations over point cloud data, either by projecting the point cloud to images [12], [32], [34], [49] or by defining 3D convolution operations [10], [19], [38], [40], [44]. Between both approaches, 3D convolutions recently gained more visibility on LiDAR data due to their performance on different tasks, e.g., semantic segmentation [43], [58] and panoptic segmentation [16], [57].

The compelling results of contrastive learning methods on 2D image data drew attention to the application of such techniques on 3D data, especially in autonomous driving, for example, for semantic segmentation [43], [58] and object detection [9], [39] tasks. Xie et al. [50] address point cloud contrastive learning using a point-wise loss. However, the point-wise contrastive loss depends on an involved pre-processing step to generate a map of corresponding points between sequential scans. Hou et al. [26] add more context to the contrastive pre-training by dividing the scene into different spatial partitions, learning both the feature representation and the spatial partitioning of the points. Zhang et al. [55] propose a more global contrastive loss. For each scan, a pair of augmented views is generated, i.e., the positive pair. Then, the other augmented scans are taken as the negative samples, and the contrastive loss is computed over the features extracted from the scans. However, their method relies on a multi-branch architecture using two backbones, one for points and one for voxels. Despite promising results, none of these methods focuses on point clouds generated by an automotive LiDAR sensor.

Our contribution is a contrastive representation learning method for outdoor LiDAR data used in autonomous driving. We extract class-agnostic segments from the point cloud and propose a contrastive loss applied over the extracted segments. Distinct from prior work, our representation learning method learns more contextualized information by discriminating the segmented structures in the point cloud, and it learns a more robust and descriptive embedding space.

III. OUR APPROACH

Fig. 2 provides an overview of our approach. We rely on a class-agnostic point cloud segmentation to segment the structures in the scene. Unlike most prior work, our method uses a single backbone and does not require a point-wise mapping. We use a momentum encoder network and a feature bank [22] during training to increase the number of negative samples and to contrast structures segmented over different scans. Then, point-wise features are computed for the whole point cloud, and the mapping of the segments is used to extract the segment-wise points and features. We apply dropout [42] and global max pooling for each segment and compute a feature vector using an MLP projection head [6]. After that, we calculate the contrastive loss over the segments and update the feature bank with these segment features. In the next sections, we provide more details on the individual parts of our method.

Fig. 2. (A) From a point cloud P, we generate the augmented views P^q and P^k by data augmentation and extract class-agnostic segments S from P. We compute the point-wise features F^q and F^k and determine the augmented segments S^q and S^k with their point-wise features using the point indexes of S extracted from P. Then, we apply dropout followed by global max pooling over each segment, and we project the segment feature vectors using the projection head to get the final features s_m^q and s_m^k from the M segments and compute the contrastive loss. (B) After the pre-training, we fine-tune the pre-trained backbone for the downstream task, i.e., semantic segmentation.

A. Unsupervised Segment Extraction

Our method relies on segments of different structures in the point cloud. In outdoor LiDAR data, the different objects in a scene are mainly connected by the ground and usually better separated compared to indoor scenes. Given this characteristic, it is comparably easy to extract segments without labels in two steps, by first removing the ground and then clustering the remaining points [4], [18], [25].

Given a point cloud P = {p_1, ..., p_N} with |P| = N points p_i ∈ R^3, we can fit the ground plane and partition the point cloud into ground G and non-ground points P', such that P = P' ∪ G and P' ∩ G = ∅, similarly to prior approaches [4], [25]. Then, using a clustering algorithm, we can divide P' into M segments S_m, such that P' = ∪_{m=1}^{M} S_m and ∩_{m=1}^{M} S_m = ∅, i.e., the segments are mutually exclusive. Each segment S_m in this partition represents a different structure from the original point cloud.

More specifically, we use RANSAC [15] to fit the ground plane and define the ground G and non-ground points P'. Given the fitted plane and a distance threshold α, the inlier (ground) and outlier (non-ground) points are partitioned. We use DBSCAN [14] to cluster the non-ground points P' and determine the segments S_m.

A common problem of such class-agnostic segmentation is over- and under-segmentation [4]. To overcome this and extract representative segments, we define a minimum number of points ε for a cluster to be considered a segment. Moreover, since we will have a different number of segments for every point cloud segmentation, we select the δ segments with the highest number of points to avoid memory overflow during training. The remaining segments are added to the set of ground points G.

Despite its simplicity, this segmentation method can divide the scene into distinct structures, as illustrated in Fig. 3. To save computations while training, we cache the segments after the first pass over the point clouds.

Fig. 3. From P, we extract the segments S and generate the augmented views P^q and P^k. With the augmented views and the segments given as point indexes, we can extract the set of augmented segments S^q and S^k. We highlight some of the structures for a better visualization of the segments (solid colored squares).

B. Segment Augmentation

With each point assigned to a segment, we apply data augmentations to generate the augmented segment pairs S^q and S^k. We extract random views, P^q and P^k, by cropping a random cuboid region from the anchor point cloud P [55]. Then, we apply random augmentations individually over the point clouds P^q and P^k. We use random rotation around the z axis, random scale, random flip, random cuboid dropout [55], point jittering, and rotation perturbations around the x, y, and z axes to augment the views. All augmentations are combined and applied once to each augmented view P^q and P^k.

By extracting a pair of views and applying those augmentations on the point clouds, we indirectly apply augmentations over the extracted segments, see Fig. 3 for an illustration. Since we maintain the point-to-segment assignments via the point indexes during the augmentation, we can easily extract the augmented segments S^q and S^k from the augmented views and compute the contrastive loss.
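To make the segment extraction of Sec. III-A and the index bookkeeping of Sec. III-B concrete, the following is a minimal Python sketch, not the authors' implementation: it uses a hand-rolled RANSAC plane fit, scikit-learn's DBSCAN, a synthetic stand-in point cloud, and illustrative values for the DBSCAN radius and the augmentation ranges; only α, ε, and δ mirror parameters named in the letter.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def fit_ground_plane_ransac(points, alpha=0.25, iterations=200, rng=None):
    """Fit a plane with RANSAC and return a boolean inlier (ground) mask."""
    rng = np.random.default_rng() if rng is None else rng
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iterations):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-8:                          # degenerate, nearly collinear sample
            continue
        normal /= norm
        dist = np.abs((points - p0) @ normal)    # point-to-plane distance
        inliers = dist < alpha                   # alpha: RANSAC distance threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers


def extract_segments(points, alpha=0.25, eps_points=20, delta=50, dbscan_radius=0.5):
    """Return a list of point-index arrays, one per extracted segment S_m."""
    ground = fit_ground_plane_ransac(points, alpha)
    non_ground_idx = np.where(~ground)[0]
    labels = DBSCAN(eps=dbscan_radius, min_samples=5).fit_predict(points[non_ground_idx])
    segments = []
    for lbl in np.unique(labels):
        if lbl == -1:                            # DBSCAN noise points
            continue
        idx = non_ground_idx[labels == lbl]
        if len(idx) >= eps_points:               # keep clusters with at least eps_points points
            segments.append(idx)
    segments.sort(key=len, reverse=True)
    return segments[:delta]                      # keep only the delta largest segments


def augment_view(points, rng):
    """One augmented view: random rotation about z, random scale, point jitter."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    scale = rng.uniform(0.95, 1.05)
    jitter = rng.normal(0.0, 0.01, size=points.shape)
    return scale * points @ rot_z.T + jitter


# Segments are stored as point indexes into P, so the same index map recovers
# the augmented segment pairs S^q and S^k from the two augmented views.
rng = np.random.default_rng(0)
P = rng.uniform([-20, -20, -2], [20, 20, 2], size=(5000, 3))   # synthetic stand-in scan
segments = extract_segments(P)
P_q, P_k = augment_view(P, rng), augment_view(P, rng)
S_q = [P_q[idx] for idx in segments]
S_k = [P_k[idx] for idx in segments]
```

Because the segments are kept as index arrays rather than copies of the points, any augmentation applied to a full view automatically carries over to its segments, which is the property exploited in Sec. III-B.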


C. Segment Contrastive Loss

The contrastive loss function is designed to discriminate between the positive and negative pairs. In this letter, we use the InfoNCE loss [45], together with the momentum encoder and the feature bank proposed by He et al. [22]. For each augmented sample feature vector q and its positive pair k in the batch, the loss is calculated as:

L_q = -\log \frac{\exp(q^\top k / \tau)}{\exp(q^\top k / \tau) + \sum_{i=1}^{K} \exp(q^\top f_i / \tau)},   (1)

where τ is a normalization temperature parameter and Q = {f_1, ..., f_K} is the feature bank with |Q| = K, implemented as a FIFO queue, whose features are used as negative samples.

In our case, we have two augmented views, P^q and P^k, from the anchor point cloud P. We compute the point-wise features F^q and F^k from both augmented views and extract the augmented segments S^q and S^k from them. Then, we pass the segments through the projection head to compute the segment-wise feature vectors s_m^q and s_m^k. Therefore, we define our contrastive loss as a segment discrimination:

L_q = -\sum_{m=1}^{M} \log \frac{\exp({s_m^q}^\top s_m^k / \tau)}{\exp({s_m^q}^\top s_m^k / \tau) + \sum_{i=1}^{K} \exp({s_m^q}^\top f_i / \tau)},   (2)

where, as in (1), τ is a normalization temperature parameter and f_i are the features from the feature bank as before. Note that the number of segments M may vary for different point clouds but is the same for the positive pairs S^q and S^k.

At the end of each iteration, the feature bank is updated with the segment features from the current batch, maintaining only the last K segments seen by the network.

D. Pre-Training Pipeline

Given one input point cloud P and its augmented pair P^q and P^k, we use the backbone to compute the point-wise features F^q and F^k. We use the entire point cloud during the backbone forward pass to learn the relationship between the segments and the scene. Then, we extract the augmented segments S^q and S^k from the point cloud with their point-wise features. Next, we apply dropout and global max-pooling over each segment to compute a feature vector. These feature vectors are then passed through the projection head to get s_m^q and s_m^k and to calculate the contrastive loss.

IV. IMPLEMENTATION AND EXPERIMENTAL SETUP

Our experimental setup follows the usual evaluation protocol for contrastive learning methods. First, we pre-train the backbone using our contrastive method. Then, we fine-tune it on different setups for a more in-depth ablation. We used the SemanticKITTI [2], [3], [17] and SemanticPOSS [36] datasets for the self-supervised pre-training, both collected in an outdoor urban environment. We compare our approach to PointContrast [50] and DepthContrast [55], using their official implementations for pre-training and data pre-processing. We use MinkUNet [10] as the backbone, which employs sparse convolutions for 3D processing. For the class-agnostic segmentation (see Section III-A), we set α = 0.25 cm, ε = 20, and δ = 50, which are the RANSAC distance threshold, the minimum number of points per segment, and the maximum number of segments extracted per point cloud, respectively.

For Our Pre-Training: we use the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and set the learning rate to 0.12 and the weight decay to 0.0004. We use a cosine annealing learning rate schedule [31] with a minimum learning rate of 0.00012, following the pre-training scheme used by Zhang et al. [55]. For all the methods, we randomly sample 20,000 points from the entire point cloud after data augmentation to limit the number of points per sample, and we pre-train the backbone for 200 epochs. Given the size of SemanticKITTI, we use K = 8,152 for the feature bank of the DepthContrast method. For our approach, we set it to K = 65,536, with the temperature parameter τ = 0.1. It is important to highlight that we have a bigger feature bank since we save the M segment features extracted from the point cloud instead of the complete point cloud. We use the two linear layers proposed by Chen et al. [6] for our projection head, which is also used in DepthContrast, and we set p = 0.4 for the dropout layer. For our method and DepthContrast, we use batch size 8, and for PointContrast, we use batch size 16 to maintain the batch size of 4 per GPU as in the original letter. Given the characteristics of the PointContrast method, it is known that a change in batch size may affect its performance. However, we used the same cluster with four NVIDIA GTX1080TI 12 GB GPUs for all evaluated methods to make the comparisons as fair as possible.

For Fine-Tuning: we use the same datasets used for pre-training, i.e., SemanticKITTI and SemanticPOSS. For the semantic segmentation experiments, we use SGD with momentum of 0.9, a learning rate of 0.24, and a weight decay of 0.0004. We also use a cosine annealing scheduler with a minimum learning rate equal to 0.00024, following the semantic segmentation setup used by Tang et al. [43]. We set the batch size to 2 and randomly sample 80,000 points per point cloud during training. For the object detection experiment, we use PointRCNN [41] as the base detector and the same 5% label set used by Zhang et al. [55] for a fair comparison. For all the experiments, we use one NVIDIA RTX2080 Super 8 GB GPU. The experimental results are collected on the validation sequences from both datasets, i.e., sequence 8 for SemanticKITTI and sequences 4 and 5 for SemanticPOSS. In both datasets, the validation sequences were not used for pre-training or fine-tuning.

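The following PyTorch sketch illustrates the segment-wise loss in (2) and one pre-training step of Sec. III-D. It is a simplified stand-in rather than the released implementation: the backbone and momentum encoder are placeholder linear layers, the projection head is shared between the two branches, and the feature-bank size, feature dimension, and toy inputs are illustrative assumptions; only the optimizer and schedule values follow Sec. IV.

```python
import torch
import torch.nn.functional as F


def segment_embeddings(point_feats, segments, proj_head, p_drop=0.4):
    """Dropout + global max pooling per segment, followed by the projection head."""
    pooled = []
    for idx in segments:                           # idx: point indexes of one segment
        feats = F.dropout(point_feats[idx], p=p_drop, training=True)
        pooled.append(feats.max(dim=0).values)     # global max pooling over the segment
    z = proj_head(torch.stack(pooled))             # (M, D)
    return F.normalize(z, dim=1)


def segcontrast_loss(s_q, s_k, bank, tau=0.1):
    """Segment discrimination as in (2): the matching segment is the positive,
    the feature bank provides the K negatives."""
    pos = (s_q * s_k).sum(dim=1, keepdim=True) / tau    # (M, 1)
    neg = s_q @ bank.t() / tau                          # (M, K)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(len(s_q), dtype=torch.long)    # the positive sits in column 0
    return F.cross_entropy(logits, labels)


# Toy dimensions and stand-in modules.
D, K, M, N = 96, 1024, 8, 2000
proj_head = torch.nn.Sequential(torch.nn.Linear(D, D), torch.nn.ReLU(),
                                torch.nn.Linear(D, D))  # two linear layers as in [6]
backbone_q = torch.nn.Linear(3, D)    # stand-in for the sparse-convolution backbone
backbone_k = torch.nn.Linear(3, D)    # stand-in for the momentum encoder
bank = F.normalize(torch.randn(K, D), dim=1)            # FIFO feature bank

optimizer = torch.optim.SGD(list(backbone_q.parameters()) + list(proj_head.parameters()),
                            lr=0.12, momentum=0.9, weight_decay=0.0004)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200, eta_min=0.00012)

# One pre-training step on a toy pair of augmented views with shared segment indexes.
P_q, P_k = torch.randn(N, 3), torch.randn(N, 3)
segments = [torch.randint(0, N, (50,)) for _ in range(M)]
s_q = segment_embeddings(backbone_q(P_q), segments, proj_head)
with torch.no_grad():                                   # keys are not back-propagated
    s_k = segment_embeddings(backbone_k(P_k), segments, proj_head)

optimizer.zero_grad()
loss = segcontrast_loss(s_q, s_k, bank)
loss.backward()
optimizer.step()
scheduler.step()                                        # stepped once per epoch in practice

# FIFO update: enqueue the newest segment features, drop the oldest ones.
bank = torch.cat([s_k.detach(), bank], dim=0)[:K]
```

Using the cross-entropy over the concatenated positive and negative logits is numerically equivalent to the per-segment term inside the sum of (2).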
Fig. 4. Qualitative results on three different validation scans (rows). We show results of the networks fine-tuned with only 0.1% of the labels that are pre-trained with different contrastive methods or trained from scratch (without pre-training). We compare the results from PointContrast [50], DepthContrast [55], and our method, SegContrast. With our pre-training, SegContrast, the fine-tuned network can better distinguish the division between different structures, i.e., sidewalk and road, as shown by the highlighted areas (solid red circles).

TABLE I
NUMBER OF TRAINING EPOCHS USED FOR DIFFERENT LABEL REGIMES ON SEMANTICKITTI

TABLE II
FINE-TUNING AT DIFFERENT LABEL REGIMES ON SEMANTICKITTI FOR SEMANTIC SEGMENTATION (MIOU)

V. EXPERIMENTAL EVALUATION

We present our experiments to show the capabilities of our method. We compare our method to the state of the art and show that our approach (i) is more efficient when using fewer labels, (ii) can better describe fine-grained structures, and (iii) is more transferable between different datasets.

A. Label Efficiency

The first experiment evaluates the robustness of the features learned by the different representation learning methods by fine-tuning them to the semantic segmentation task using the SemanticKITTI dataset. Since it is a larger dataset, we can divide it into different label percentage regimes and compare the methods across them to support our first claim. We define five different label percentage regimes, i.e., using 0.1%, 1%, 10%, 50%, and 100% of the labeled training data, where we select a fixed subset of scans from the dataset given the regime percentage. Every subset is chosen from the entire dataset, such that all the classes are present. When training with fewer labels, the total number of training iterations decreases accordingly. Therefore, we increase the number of epochs as we reduce the number of training scans to achieve convergence at every label regime (see Table I).

Fig. 4 gives a qualitative comparison between the different contrastive methods and the network without pre-training when training with 0.1% of the labels. From the top views, we observe that the network trained from scratch could not learn much structural information, leading to a noisy division between different structures. This same noisy division can be seen in the other contrastive methods. Our approach better learns the structural information of the point clouds, which leads to better boundaries between different structures in the scene. This improvement can also be seen in more fine-grained classes, e.g., poles and traffic signs, shown in Fig. 1.

Table II shows the results on the different label regimes. All methods perform better than training from scratch, i.e., without pre-training, and the gap between the results diminishes as the number of scans used for training increases. Our method is better at lower label regimes. As the amount of labels increases, DepthContrast achieves a comparable performance. At the full label training, all the methods converge to a similar result as training from scratch. This is expected since the data used for pre-training and fine-tuning are the same. Moreover, Table III presents the per-class IoU at the lowest label regime. It is possible to see that our method performs better in the per-class IoU, outperforming previous approaches in the majority of the classes. This evaluation shows that our approach performs better when using fewer labels compared to the other contrastive learning methods, being able to better describe the different structures in the point cloud.

For a more complete evaluation, we also compare the methods when fine-tuning to object detection. Table IV presents our results on the object detection task on the KITTI dataset [17]. In this experiment, we use the same 5% label set used by Zhang et al. [55]. With 100% of the labels, the comparison between the network without pre-training and the pre-trained models shows no significant difference. Since we use KITTI for pre-training and fine-tuning, this is expected, as no new data is seen during pre-training. With 5% of the labels, it is possible to show the gain of the pre-trained network. Except for the pedestrian class, all the methods outperform the network without pre-training by a large margin. In the car and cyclist classes, our method surpasses previous methods in almost all the difficulties.
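A small sketch of how such a label-regime protocol can be realized, under the assumption of simple rejection sampling for class coverage and an inverse scaling of epochs to keep the iteration count roughly constant; the actual epoch counts used in the letter are those reported in Table I.

```python
import random


def select_subset(scan_ids, scan_classes, fraction, num_classes, seed=42, tries=1000):
    """Pick a fixed fraction of scans such that every class is still present.
    scan_classes[i] is the set of semantic classes occurring in scan_ids[i]."""
    rng = random.Random(seed)
    target = max(1, int(len(scan_ids) * fraction))
    for _ in range(tries):                       # simple rejection sampling
        subset = rng.sample(range(len(scan_ids)), target)
        covered = set().union(*(scan_classes[i] for i in subset))
        if len(covered) == num_classes:          # all classes represented
            return [scan_ids[i] for i in subset]
    raise RuntimeError("no subset with full class coverage found")


def scaled_epochs(base_epochs, fraction):
    """Keep the total number of training iterations roughly constant."""
    return int(round(base_epochs / fraction))
```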

TABLE III
PER-CLASS IOU FINE-TUNING WITH 0.1% LABELS

TABLE IV
FINE-TUNING WITH 5% AND 100% OF LABELS ON KITTI FOR OBJECT DETECTION (MAP WITH 40 RECALL POSITIONS)

TABLE V
LINEAR EVALUATION MIOU AND PER-CLASS IOU (%) PRE-TRAINED AND EVALUATED ON SEMANTICKITTI

These results indicate that our approach can learn a robust feature representation, outperforming previous methods on different tasks when using fewer labels.

B. Linear Evaluation

A typical experiment used for image-based contrastive methods is the so-called linear evaluation. This evaluation freezes the pre-trained backbone and trains only a linear layer on top of it to compare how well the feature representation can describe the different classes even without fine-tuning the whole network. Our experiment follows the same setup: the pre-trained backbone weights are frozen, and we train only a linear segmentation head. The result of this experiment supports our claim that our method can learn a feature representation that better describes fine-grained structures.

Table V displays the results of the linear evaluation over the different self-supervised contrastive methods and over the randomly initialized network without pre-training as a lower bound on the performance. DepthContrast shows the worst performance in this evaluation, which indicates that the method cannot learn a feature representation as descriptive as the other methods. PointContrast shows a better performance, since the method uses a point-wise contrastive loss. Furthermore, when looking at the underrepresented classes, e.g., parking, trunk, fence, pole, or traffic sign, our method outperforms the other methods. Thus, PointContrast achieves a higher mIoU by learning the more represented classes, e.g., road and building. In contrast, our method seems better suited to represent the fine-grained classes.
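Referring back to the linear evaluation protocol of Sec. V-B, here is a minimal PyTorch sketch of freezing a pre-trained backbone and training only a linear head; the backbone module, feature dimension, learning rate, and toy data are illustrative stand-ins, not the exact setup of the letter.

```python
import torch

feat_dim, num_classes = 96, 19
backbone = torch.nn.Linear(3, feat_dim)            # stand-in for the pre-trained backbone
linear_head = torch.nn.Linear(feat_dim, num_classes)

for p in backbone.parameters():                    # freeze the pre-trained weights
    p.requires_grad = False
backbone.eval()

optimizer = torch.optim.SGD(linear_head.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

points = torch.randn(4096, 3)                      # toy scan
labels = torch.randint(0, num_classes, (4096,))    # toy point-wise labels
with torch.no_grad():
    feats = backbone(points)                       # frozen point-wise features
optimizer.zero_grad()
loss = criterion(linear_head(feats), labels)
loss.backward()                                    # gradients flow only into the linear head
optimizer.step()
```

In the actual experiments, the frozen backbone is the pre-trained MinkUNet and the head predicts the per-point semantic labels used to compute the mIoU in Table V.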

C. Feature Representation Transferability

In this third experiment, we evaluate the transferability of our learned feature representation. The results support our claim that our method is more transferable between different datasets. We use SemanticKITTI for unsupervised pre-training of the backbone with the different contrastive methods or use the commonly used supervised pre-training. Then, we fine-tune the differently pre-trained networks on the SemanticPOSS dataset and compare their performance. SemanticPOSS is smaller than SemanticKITTI, which gives a setup aligned with the standard image-based setting, where a larger dataset, e.g., ImageNet [13], is used for pre-training and fine-tuning is then performed on smaller datasets.

Fig. 5 displays the network performance at different training epochs during fine-tuning. Here, we see a comparable performance of the unsupervised and supervised pre-training. As shown, after only one training epoch, our method shows a considerably better performance than the other contrastive methods and achieves the same performance as the supervised pre-training. Even though our approach does not use any labels, our method learns a feature representation comparable with the supervised pre-training on SemanticKITTI. This result suggests that our method can learn a more general feature representation than previous methods, and it seems to be more suitable for fine-tuning on a different dataset.

Fig. 5. Comparison between contrastive pre-trained networks fine-tuned to the SemanticPOSS dataset and the network trained from scratch. At the beginning of training, our method shows a comparable performance to supervised pre-training on SemanticKITTI, evidencing that our learned feature representation is as robust as the supervised pre-training.

TABLE VI
PRE-TRAINING WITH SEMANTICKITTI AND FINE-TUNING ON SEMANTICPOSS WITH DIFFERENT LABEL REGIMES (MIOU)

Table VI displays the results of the different contrastive pre-training methods and of the supervised pre-training on SemanticKITTI. The pre-training methods improve the network performance at the lower label regimes, but the previous contrastive methods cannot surpass the supervised pre-training. However, our method outperforms both the self-supervised and the supervised pre-training at all label regimes. This experiment indicates the robustness of the learned feature representation when fine-tuning to different LiDAR data, exceeding even the supervised pre-training.

TABLE VII
LINEAR EVALUATION ON SEMANTICPOSS WITH PRE-TRAINING ON SEMANTICKITTI (MIOU)

In Table VII, we show the linear evaluation on SemanticPOSS with the network pre-trained on SemanticKITTI. DepthContrast shows the lowest performance in this evaluation, showing that its feature representation is less descriptive when transferring to a different dataset compared to the other methods. Our method surpasses the other approaches, outperforming them by a large margin. These results suggest that our approach can learn a point cloud representation transferable across different datasets and LiDAR sensors.

TABLE VIII
FINE-TUNING TO SEMANTICPOSS WITH PRE-TRAINING ON SEMANTICKITTI AND SEMANTICPOSS USING OUR METHOD (MIOU)

Finally, we evaluate the self-supervised pre-training on SemanticKITTI and SemanticPOSS, fine-tuning it on SemanticPOSS. In Table VIII, we show both the linear evaluation and the fine-tuning. As we can see, the performance of the pre-training on SemanticKITTI is better in both experiments. We obtain a performance gain when using a larger dataset for self-supervised pre-training. This also highlights the generalization of the feature representation achieved by our method. The pre-training on SemanticKITTI performed better than on SemanticPOSS, even though the two datasets were collected with different LiDAR sensors and different sensor mounting positions.

VI. CONCLUSION

In this letter, we present a novel representation learning approach for LiDAR point clouds in outdoor environments. Our approach exploits the characteristics of outdoor LiDAR data to extract class-agnostic segments and applies the contrastive loss over these segments. We evaluate our strategy on different datasets and provide comparisons with other state-of-the-art feature representation learning approaches. The experiments suggest that our approach can learn a more robust feature representation than previous works, outperforming them on different downstream tasks. Furthermore, our self-supervised feature representation seems to be more transferable when fine-tuning on a different target dataset, outperforming even the supervised pre-training.

REFERENCES

[1] I. Achituve, H. Maron, and G. Chechik, "Self-supervised learning for domain adaptation on point clouds," in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2021, pp. 123–133.
[2] J. Behley et al., "Towards 3D LiDAR-based semantic scene understanding of 3D point cloud sequences: The SemanticKITTI dataset," Int. J. Robot. Res., vol. 40, no. 8/9, pp. 959–967, 2021.
[3] J. Behley et al., "SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 9296–9306.
[4] J. Behley, V. Steinhage, and A. Cremers, "Laser-based segment classification using a mixture of bag-of-words," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2013, pp. 4195–4200.
[5] Q. Chen, Y. Wang, T. Yang, X. Zhang, J. Cheng, and J. Sun, "You only look one-level feature," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 13034–13043.
[6] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in Proc. Int. Conf. Mach. Learn., 2020, pp. 1597–1607.
[7] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton, "Big self-supervised models are strong semi-supervised learners," in Proc. Conf. Neural Inf. Process. Syst., 2020, pp. 22243–22255.
[8] X. Chen and K. He, "Exploring simple siamese representation learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 15745–15753.
[9] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6526–6534.

[10] C. Choy, J. Gwak, and S. Savarese, "4D spatio-temporal ConvNets: Minkowski convolutional neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3070–3079.
[11] M. Cordts et al., "The Cityscapes dataset for semantic urban scene understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3213–3223.
[12] T. Cortinhal, G. Tzelepis, and E. Aksoy, "SalsaNext: Fast, uncertainty-aware semantic segmentation of LiDAR point clouds," in Adv. Visual Comput., 2020, pp. 207–222.
[13] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[14] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. Conf. Knowl. Discov. Data Mining, 1996, pp. 226–231.
[15] M. Fischler and R. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[16] S. Gasperini, M. N. Mahani, A. Marcos-Ramiro, N. Navab, and F. Tombari, "Panoster: End-to-end panoptic segmentation of LiDAR point clouds," IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 3216–3223, Apr. 2021.
[17] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 3354–3361.
[18] Z. Gojcic, O. Litany, A. Wieser, L. J. Guibas, and T. Birdal, "Weakly supervised learning of rigid 3D scene flow," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 5688–5699.
[19] B. Graham and L. van der Maaten, "3D semantic segmentation with submanifold sparse convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[20] J. B. Grill et al., "Bootstrap your own latent - a new approach to self-supervised learning," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 21271–21284.
[21] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun, "Deep learning for 3D point clouds: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 12, pp. 4338–4364, Dec. 2021.
[22] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9726–9735.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[24] O. J. Hénaff, S. Koppula, J. B. Alayrac, A. van den Oord, O. Vinyals, and J. Carreira, "Efficient visual pretraining with contrastive detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 10086–10096.
[25] M. Himmelsbach, F. v. Hundelshausen, and H. J. Wuensche, "Fast segmentation of 3D point clouds for ground vehicles," in Proc. IEEE Veh. Symp., 2010, pp. 560–565.
[26] J. Hou, B. Graham, M. Niessner, and S. Xie, "Exploring data-efficient 3D scene understanding with contrastive scene contexts," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 15582–15592.
[27] Q. Hu et al., "SQN: Weakly-supervised semantic segmentation of large-scale 3D point clouds with 1000x fewer labels," 2021, arXiv:2104.04891.
[28] P. Jiang and S. Saripalli, "LiDARNet: A boundary-aware domain adaptation model for point cloud semantic segmentation," in Proc. IEEE Int. Conf. Robot. Automat., 2021, pp. 2457–2464.
[29] F. Langer, A. Milioto, A. Haag, J. Behley, and C. Stachniss, "Domain transfer for semantic segmentation of LiDAR data using deep neural networks," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2020, pp. 8263–8270.
[30] T. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[31] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with restarts," in Proc. Int. Conf. Learn. Representations, 2017.
[32] A. Milioto, J. Behley, C. McCool, and C. Stachniss, "LiDAR panoptic segmentation for autonomous driving," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2020, pp. 8505–8512.
[33] A. Milioto, L. Mandtler, and C. Stachniss, "Fast instance and semantic segmentation exploiting local connectivity, metric learning, and one-shot detection for robotics," in Proc. IEEE Int. Conf. Robot. Automat., 2019, pp. 5481–5487.
[34] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss, "RangeNet++: Fast and accurate LiDAR semantic segmentation," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2019, pp. 4213–4220.
[35] A. Nekrasov, J. Schult, O. Litany, B. Leibe, and F. Engelmann, "Mix3D: Out-of-context data augmentation for 3D scenes," in Proc. Int. Conf. 3D Vis., 2021, pp. 116–125.
[36] Y. Pan, B. Gao, J. Mei, S. Geng, C. Li, and H. Zhao, "SemanticPOSS: A point cloud dataset with large quantity of dynamic instances," 2020, arXiv:2002.09147.
[37] T. Park, A. A. Efros, R. Zhang, and J. Y. Zhu, "Contrastive learning for unpaired image-to-image translation," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 319–345.
[38] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 77–85.
[39] C. R. Qi, O. Litany, K. He, and L. J. Guibas, "Deep Hough voting for 3D object detection in point clouds," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 9276–9285.
[40] C. Qi, K. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," in Proc. Conf. Neural Inf. Process. Syst., 2017, pp. 5099–5108.
[41] S. Shi, X. Wang, and H. Li, "PointRCNN: 3D object proposal generation and detection from point cloud," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 770–779.
[42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[43] H. Tang et al., "Searching efficient 3D architectures with sparse point-voxel convolution," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 685–702.
[44] H. Thomas, C. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. Guibas, "KPConv: Flexible and deformable convolution for point clouds," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 6410–6419.
[45] A. van den Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," 2018, arXiv:1807.03748.
[46] W. Van Gansbeke, S. Vandenhende, S. Georgoulis, and L. Van Gool, "Unsupervised semantic segmentation by contrasting object mask proposals," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 10052–10062.
[47] H. Wang, Y. Cong, O. Litany, Y. Gao, and L. J. Guibas, "3DIoUMatch: Leveraging IoU prediction for semi-supervised 3D object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 14610–14619.
[48] W. Wang, T. Zhou, F. Yu, J. Dai, E. Konukoglu, and L. Van Gool, "Exploring cross-image pixel contrast for semantic segmentation," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 7303–7313.
[49] B. Wu, A. Wan, X. Yue, and K. Keutzer, "SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud," in Proc. IEEE Int. Conf. Robot. Automat., 2018, pp. 1887–1893.
[50] S. Xie, J. Gu, D. Guo, C. R. Qi, L. Guibas, and O. Litany, "PointContrast: Unsupervised pre-training for 3D point cloud understanding," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 574–591.
[51] X. Xu and G. H. Lee, "Weakly supervised semantic point cloud segmentation: Towards 10x fewer labels," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 13706–13715.
[52] L. Yi, B. Gong, and T. Funkhouser, "Complete & label: A domain adaptation approach to semantic segmentation of LiDAR point clouds," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 15358–15368.
[53] X. Yuan, A. Kortylewski, Y. Sun, and A. Yuille, "Robust instance segmentation through reasoning about multi-object occlusion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 11136–11145.
[54] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, "Barlow twins: Self-supervised learning via redundancy reduction," in Proc. Int. Conf. Mach. Learn., 2021, pp. 12310–12320.
[55] Z. Zhang, R. Girdhar, A. Joulin, and I. Misra, "Self-supervised pretraining of 3D features on any point-cloud," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 10252–10263.
[56] L. Zheng, M. Tang, Y. Chen, G. Zhu, J. Wang, and H. Lu, "Improving multiple object tracking with single object tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 2453–2462.
[57] Z. Zhou, Y. Zhang, and H. Foroosh, "Panoptic-PolarNet: Proposal-free LiDAR point cloud panoptic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 13189–13198.
[58] X. Zhu et al., "Cylindrical and asymmetrical 3D convolution networks for LiDAR segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 9939–9948.

