SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences
Figure 1: Our dataset provides dense annotations for each scan of all sequences from the KITTI Odometry Benchmark [19].
Here, we show multiple scans aggregated using pose information estimated by a SLAM approach.
Abstract

Semantic scene understanding is important for various applications. In particular, self-driving cars need a fine-grained understanding of the surfaces and objects in their vicinity. Light detection and ranging (LiDAR) provides precise geometric information about the environment and is thus a part of the sensor suites of almost all self-driving cars. Despite the relevance of semantic scene understanding for this application, there is a lack of a large dataset for this task which is based on an automotive LiDAR.

In this paper, we introduce a large dataset to propel research on laser-based semantic segmentation. We annotated all sequences of the KITTI Vision Odometry Benchmark and provide dense point-wise annotations for the complete 360° field-of-view of the employed automotive LiDAR. We propose three benchmark tasks based on this dataset: (i) semantic segmentation of point clouds using a single scan, (ii) semantic segmentation using multiple past scans, and (iii) semantic scene completion, which requires anticipating the semantic scene in the future. We provide baseline experiments and show that there is a need for more sophisticated models to efficiently tackle these tasks. Our dataset opens the door for the development of more advanced methods, but also provides plentiful data to investigate new research directions.

1. Introduction

Semantic scene understanding is essential for many applications and an integral part of self-driving cars. In particular, the fine-grained understanding provided by semantic segmentation is necessary to distinguish drivable and non-drivable surfaces and to reason about functional properties, like parking areas and sidewalks. Currently, such understanding, represented in so-called high-definition maps, is mainly generated in advance using surveying vehicles. However, self-driving cars should also be able to drive in unmapped areas and adapt their behavior if there are changes in the environment.

Most self-driving cars currently use multiple different sensors to perceive the environment. Complementary sensor modalities enable them to cope with deficits or failures of particular sensors. Besides cameras, light detection and ranging (LiDAR) sensors are often used, as they provide precise distance measurements that are not affected by lighting.

Publicly available datasets and benchmarks are crucial for the empirical evaluation of research. They mainly fulfill three purposes: (i) they provide a basis to measure progress, since they allow results to be reproduced and compared, (ii) they uncover shortcomings of the current state of the art and therefore pave the way for novel approaches and research directions, and (iii) they make it possible to develop approaches without the need to first painstakingly collect and label data. While multiple
∗ indicates equal contribution
| Dataset | #scans¹ | #points² | #classes³ | Sensor | Annotation | Sequential |
|---|---|---|---|---|---|---|
| SemanticKITTI (Ours) | 23201 / 20351 | 4549 | 25 (28) | Velodyne HDL-64E | point-wise | ✓ |
| Oakland3d [36] | 17 | 1.6 | 5 (44) | SICK LMS | point-wise | ✗ |
| Freiburg [50, 6] | 77 | 1.1 | 4 (11) | SICK LMS | point-wise | ✗ |
| Wachtberg [6] | 5 | 0.4 | 5 (5) | Velodyne HDL-64E | point-wise | ✗ |
| Semantic3d [23] | 15 / 15 | 4009 | 8 (8) | Terrestrial Laser Scanner | point-wise | ✗ |
| Paris-Lille-3D [47] | 3 | 143 | 9 (50) | Velodyne HDL-32E | point-wise | ✗ |
| Zhang et al. [65] | 140 / 112 | 32 | 10 (10) | Velodyne HDL-64E | point-wise | ✗ |
| KITTI [19] | 7481 / 7518 | 1799 | 3 | Velodyne HDL-64E | bounding box | ✗ |

Table 1: Overview of other point cloud datasets with semantic annotations. Ours is by far the largest dataset with sequential information. ¹ Number of scans for train and test set. ² Number of points is given in millions. ³ Number of classes used for evaluation, with the number of annotated classes in brackets.
[Figure 3: bar chart of the number of labeled points per class (logarithmic axis, roughly 10^5 to over 10^8 points), with the classes grouped into the root categories ground, structure, vehicle, nature, human, and object.]
Figure 3: Label distribution. The number of labeled points per class and the root categories for the classes are shown. For
movable classes, we also show the number of points on non-moving (solid bars) and moving objects (hatched bars).
ent height. For three sequences, we had to manually add loop closure constraints to get correctly loop-closed trajectories, since this is essential to get consistent point clouds for annotation. The loop-closed poses allow us to load all overlapping point clouds for specific locations and visualize them together, as depicted in Figure 2.

We subdivide the sequence of point clouds into tiles of 100 m by 100 m. For each tile, we only load scans overlapping with the tile. This enables us to label all scans consistently even when we encounter temporally distant loop closures. To ensure consistency for scans overlapping with more than one tile, we show all points inside each tile and a small boundary overlapping with neighboring tiles. Thus, it is possible to continue labels from a neighboring tile.

Following best practices, we compiled a labeling instruction and provided instructional videos on how to label certain objects, such as cars and bicycles standing near a wall. Compared to image-based annotation, the annotation process with point clouds is more complex, since the annotator often needs to change the viewpoint. An annotator needs on average 4.5 hours per tile when labeling residential areas, which correspond to the most complex encountered scenery, and on average 1.5 hours for labeling a highway tile.

We explicitly did not use bounding boxes or other available annotations for the KITTI dataset, since we want to ensure that the labeling is consistent and the point-wise labels should only contain the object itself.

We provided regular feedback to the annotators to improve the quality and accuracy of the labels. Nevertheless, a single annotator also verified the labels in a second pass, i.e., corrected inconsistencies and added missing labels. In summary, the whole dataset comprises 518 tiles; over 1,400 hours of labeling effort have been invested, with an additional 10 to 60 minutes of verification and correction per tile, resulting in a total of over 1,700 hours.

3.2. Dataset Statistics

Figure 3 shows the distribution of the different classes, where we also included the root categories as labels on the x-axis. The ground classes, road, sidewalk, building, vegetation, and terrain are the most frequent classes. The class motorcyclist only occurs rarely, but still more than 100,000 points are annotated.

The unbalanced count of classes is common for datasets captured in natural environments, and some classes will always be under-represented, since they simply do not occur that often. Thus, an unbalanced class distribution is part of the problem that an approach has to master. Overall, the distribution and the relative differences between the classes are quite similar to other datasets, e.g., Cityscapes [10].

4. Evaluation of Semantic Segmentation

In this section, we provide the evaluation of several state-of-the-art methods for semantic segmentation of a single scan. We also provide experiments exploiting information provided by sequences of multiple scans.

4.1. Single Scan Experiments

Task and Metrics. In semantic segmentation of point clouds, we want to infer the label of each three-dimensional point. Therefore, the input to all evaluated methods is a list of coordinates of the three-dimensional points along with their remission, i.e., the strength of the reflected laser beam, which depends on the properties of the surface that was hit. Each method should then output a label for each point of a scan, i.e., one full turn of the rotating LiDAR sensor.

To assess the labeling performance, we rely on the commonly applied mean Jaccard Index or mean intersection-over-union (mIoU) metric [15] over all classes, given by

\[ \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}, \qquad (1) \]

where TP_c, FP_c, and FN_c correspond to the number of true positive, false positive, and false negative predictions for class c, and C is the number of classes.

As the classes other-structure and other-object have either only a few points or are otherwise too diverse with a high intra-class variation, we decided not to include these classes in the evaluation. Thus, we use 25 instead of 28 classes, ignoring outlier, other-structure, and other-object during training and inference.
Furthermore, we cannot expect to distinguish moving from non-moving objects with a single scan, since this Velodyne LiDAR cannot measure velocities like radars exploiting the Doppler effect. We therefore combine the moving classes with the corresponding non-moving classes, resulting in a total of 19 classes for training and evaluation.

State of the Art. Semantic segmentation or point-wise classification of point clouds is a long-standing topic [2], which was traditionally solved using a feature extractor, such as Spin Images [29], in combination with a traditional classifier, like support vector machines [1] or even semantic hashing [4]. Many approaches used Conditional Random Fields (CRFs) to enforce label consistency of neighboring points [56, 37, 36, 38, 62].

With the advent of deep learning approaches in image-based classification, the whole pipeline of feature extraction and classification has been replaced by end-to-end deep neural networks. Voxel-based methods, which transform the point cloud into a voxel grid and then apply convolutional neural networks (CNNs) with 3D convolutions for object classification [34] and semantic segmentation [26], were among the first investigated models, since they allowed exploiting architectures and insights known from images.

To overcome the limitations of the voxel-based representation, such as the exploding memory consumption when the resolution of the voxel grid increases, more recent approaches either upsample voxel predictions [53] using a CRF or use different representations, like more efficient spatial subdivisions [30, 44, 63, 59, 21], rendered 2D image views [7], graphs [31, 54], splats [51], or even directly the points [41, 40, 25, 22, 43, 28, 14].

Baseline approaches. We provide the results of six state-of-the-art architectures for the semantic segmentation of point clouds on our dataset: PointNet [40], PointNet++ [41], Tangent Convolutions [52], SPLATNet [51], Superpoint Graph [31], and SqueezeSeg (V1 and V2) [60, 61]. Furthermore, we investigate two extensions of SqueezeSeg: DarkNet21Seg and DarkNet53Seg.

PointNet [40] and PointNet++ [41] use the raw unordered point cloud data as input. The core of these approaches is max pooling to obtain an order-invariant operator that works surprisingly well for semantic segmentation of shapes and several other benchmarks. Due to this design, however, PointNet fails to capture the spatial relationships between the features. To alleviate this, PointNet++ [41] applies individual PointNets to local neighborhoods and uses a hierarchical approach to combine their outputs. This enables it to build complex hierarchical features that capture both local fine-grained and global contextual information.

Tangent Convolutions [52] also handle unstructured point clouds by applying convolutional neural networks directly on surfaces. This is achieved by assuming that the data is sampled from smooth surfaces and defining a tangent convolution as a convolution applied to the projection of the local surface at each point onto the tangent plane.

SPLATNet [51] takes an approach that is similar to the aforementioned voxelization methods and represents the point cloud in a high-dimensional sparse lattice. As with voxel-based methods, this scales poorly both in computation and in memory cost, and therefore the method exploits the sparsity of this representation by using bilateral convolutions [27], which only operate on occupied lattice parts.

Similarly to PointNet, Superpoint Graph [31] captures the local relationships by summarizing geometrically homogeneous groups of points into superpoints, which are later embedded by local PointNets. The result is a superpoint graph representation that is more compact and richer than the original point cloud and exploits contextual relationships between the superpoints.

SqueezeSeg [60, 61] also discretizes the point cloud in a way that makes it possible to apply 2D convolutions, exploiting the sensor geometry of a rotating LiDAR: all points of a single turn can be projected to an image by using a spherical projection (see the sketch below). A fully convolutional neural network is applied and the result is finally filtered with a CRF to smooth the predictions. Due to the promising results of SqueezeSeg and its fast training, we investigated how the labeling performance is affected by the number of model parameters. To this end, we used a different backbone based on the Darknet architecture [42] with 21 and 53 layers, and 25 and 50 million parameters, respectively. We furthermore eliminated the vertical downsampling used in the architecture.
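The spherical projection that such projection-based methods build on can be illustrated as follows. This is a simplified sketch rather than the exact SqueezeSeg preprocessing; the 64 × 2048 image size and the vertical field of view of roughly +3° to −25° are assumptions matching a 64-beam sensor such as the HDL-64E:

```python
import numpy as np

def spherical_projection(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project one scan (N x 4 array: x, y, z, remission) into an H x W range image."""
    xyz, rem = points[:, :3], points[:, 3]
    r = np.linalg.norm(xyz, axis=1)

    yaw = np.arctan2(xyz[:, 1], xyz[:, 0])               # azimuth in [-pi, pi]
    pitch = np.arcsin(xyz[:, 2] / np.maximum(r, 1e-8))   # elevation angle

    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    u = 0.5 * (1.0 - yaw / np.pi) * W                    # column from azimuth
    v = (fov_up - pitch) / (fov_up - fov_down) * H       # row from elevation

    u = np.clip(np.floor(u), 0, W - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int64)

    img = np.zeros((H, W, 2), dtype=np.float32)          # channels: range, remission
    order = np.argsort(r)[::-1]                          # write far points first, closest wins
    img[v[order], u[order]] = np.stack([r, rem], axis=1)[order]
    return img
```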
We modified the available implementations such that the methods could be trained and evaluated on our large-scale dataset. Note that most of these approaches have so far only been evaluated on shape [8] or RGB-D indoor datasets [48]. However, some of the approaches [40, 41] could only be run with considerable downsampling to 50,000 points due to memory limitations.

Results and Discussion. Table 2 shows the results of our baseline experiments for the various approaches, using either directly the point cloud information [40, 41, 51, 52, 31] or a projection of the point cloud [60]. The results show that the current state of the art for point cloud semantic segmentation falls short for the size and complexity of our dataset. We believe that this is mainly caused by the limited capacity of the used architectures (see Table 3), because the number of parameters of these approaches is much lower than the number of parameters used in leading image-based semantic segmentation networks. As mentioned above, we added DarkNet21Seg and DarkNet53Seg to test this hypothesis, and the results show that this simple modification improves the accuracy from 29.5% for SqueezeSeg to 47.4% for DarkNet21Seg and to 49.9% for DarkNet53Seg.

| Approach | mIoU | road | sidewalk | parking | other-ground | building | car | truck | bicycle | motorcycle | other-vehicle | vegetation | trunk | terrain | person | bicyclist | motorcyclist | fence | pole | traffic sign |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointNet [40] | 14.6 | 61.6 | 35.7 | 15.8 | 1.4 | 41.4 | 46.3 | 0.1 | 1.3 | 0.3 | 0.8 | 31.0 | 4.6 | 17.6 | 0.2 | 0.2 | 0.0 | 12.9 | 2.4 | 3.7 |
| SPGraph [31] | 17.4 | 45.0 | 28.5 | 0.6 | 0.6 | 64.3 | 49.3 | 0.1 | 0.2 | 0.2 | 0.8 | 48.9 | 27.2 | 24.6 | 0.3 | 2.7 | 0.1 | 20.8 | 15.9 | 0.8 |
| SPLATNet [51] | 18.4 | 64.6 | 39.1 | 0.4 | 0.0 | 58.3 | 58.2 | 0.0 | 0.0 | 0.0 | 0.0 | 71.1 | 9.9 | 19.3 | 0.0 | 0.0 | 0.0 | 23.1 | 5.6 | 0.0 |
| PointNet++ [41] | 20.1 | 72.0 | 41.8 | 18.7 | 5.6 | 62.3 | 53.7 | 0.9 | 1.9 | 0.2 | 0.2 | 46.5 | 13.8 | 30.0 | 0.9 | 1.0 | 0.0 | 16.9 | 6.0 | 8.9 |
| SqueezeSeg [60] | 29.5 | 85.4 | 54.3 | 26.9 | 4.5 | 57.4 | 68.8 | 3.3 | 16.0 | 4.1 | 3.6 | 60.0 | 24.3 | 53.7 | 12.9 | 13.1 | 0.9 | 29.0 | 17.5 | 24.5 |
| SqueezeSegV2 [61] | 39.7 | 88.6 | 67.6 | 45.8 | 17.7 | 73.7 | 81.8 | 13.4 | 18.5 | 17.9 | 14.0 | 71.8 | 35.8 | 60.2 | 20.1 | 25.1 | 3.9 | 41.1 | 20.2 | 36.3 |
| TangentConv [52] | 40.9 | 83.9 | 63.9 | 33.4 | 15.4 | 83.4 | 90.8 | 15.2 | 2.7 | 16.5 | 12.1 | 79.5 | 49.3 | 58.1 | 23.0 | 28.4 | 8.1 | 49.0 | 35.8 | 28.5 |
| DarkNet21Seg | 47.4 | 91.4 | 74.0 | 57.0 | 26.4 | 81.9 | 85.4 | 18.6 | 26.2 | 26.5 | 15.6 | 77.6 | 48.4 | 63.6 | 31.8 | 33.6 | 4.0 | 52.3 | 36.0 | 50.0 |
| DarkNet53Seg | 49.9 | 91.8 | 74.6 | 64.8 | 27.9 | 84.1 | 86.4 | 25.5 | 24.5 | 32.7 | 22.6 | 78.3 | 50.1 | 64.0 | 36.2 | 33.6 | 4.7 | 55.0 | 38.9 | 52.2 |

Table 2: Single scan results (19 classes) for all baselines on sequences 11 to 21 (test set). All methods were trained on sequences 00 to 10, except for sequence 08, which is used as the validation set.
PointNet 3 4 0.5
PointNet++ 6 16 5.9
SPGraph 0.25 6 5.2
TangentConv 0.4 6 3.0
SPLATNet 0.8 8 1.0
SqueezeSeg 1 0.5 0.015
SqueezeSegV2 1 0.6 0.02
DarkNet21Seg 25 2 0.055
DarkNet53Seg 50 3 0.1

Table 3: Approach statistics.

Figure 4: IoU vs. distance to the sensor (mean IoU [%] plotted over the distance to the sensor [m], from 10 m to 50 m).
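The distance analysis summarized in Figure 4 amounts to restricting the evaluation of Eq. (1) to points within a given range interval. A minimal sketch of such a binned evaluation (not the benchmark tooling), assuming per-point coordinates, predictions, and ground-truth labels are available:

```python
import numpy as np

def miou_per_distance_bin(points, pred, gt, num_classes,
                          edges=(10, 15, 20, 25, 30, 35, 40, 45, 50)):
    """Return (upper bin edge, mIoU) pairs, evaluating only points inside each range bin."""
    r = np.linalg.norm(points[:, :3], axis=1)             # distance to the sensor
    results = []
    for lo, hi in zip((0,) + tuple(edges[:-1]), edges):
        m = (r >= lo) & (r < hi)
        conf = np.bincount(gt[m] * num_classes + pred[m],
                           minlength=num_classes ** 2).reshape(num_classes, num_classes)
        tp = np.diag(conf)
        denom = conf.sum(axis=0) + conf.sum(axis=1) - tp   # TP + FP + FN per class
        seen = denom > 0                                   # average only classes present in the bin
        results.append((hi, (tp[seen] / denom[seen]).mean() if seen.any() else float("nan")))
    return results
```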
Another reason is that the point clouds generated by LiDAR are relatively sparse, especially as the distance to the sensor increases. This is partially addressed by SqueezeSeg, which exploits the way the rotating scanner captures the data to generate a dense range image, where each pixel corresponds roughly to a point in the scan.

These effects are further analyzed in Figure 4, where the mIoU is plotted w.r.t. the distance to the sensor. It shows that the results of all approaches get worse with increasing distance. This further confirms our hypothesis that sparsity is the main reason for worse results at large distances. However, the results also show that some methods, like SPGraph, are less affected by the distance-dependent sparsity, so combining the strengths of both paradigms might be a promising direction for future research.

Especially classes with few examples, like motorcyclists and trucks, seem to be more difficult for all approaches. But classes with only a small number of points in a single point cloud, like bicycles and poles, are also hard classes.

Finally, the best performing approach (DarkNet53Seg) with 49.9% mIoU is still far from achieving results that are on par with image-based approaches, e.g., 80% on the Cityscapes benchmark [10].

4.2. Multiple Scan Experiments

Task and Metrics. In this task, we allow methods to exploit information from a sequence of multiple past scans to improve the segmentation of the current scan. We furthermore want the methods to distinguish moving and non-moving classes, i.e., all 25 classes must be predicted, since this information should be visible in the temporal information of multiple past scans. The evaluation metric for this task is the same as in the single scan case, i.e., we evaluate the mean IoU of the current scan no matter how many past scans were used to compute the results.

Baselines. We exploit the sequential information by combining 5 scans into a single, large point cloud, i.e., the current scan at timestamp t and the 4 scans before at timestamps t−1, ..., t−4 (see the aggregation sketch below). We evaluate DarkNet53Seg and TangentConv, since these approaches can deal with a larger number of points without downsampling of the point clouds and could still be trained in a reasonable amount of time.
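One way to build such an aggregated input is to transform the past scans into the coordinate frame of the current scan using the loop-closed poses provided with the dataset. The following is a minimal sketch under that assumption; function and variable names are illustrative:

```python
import numpy as np

def aggregate_scans(scans, poses, t, k=4):
    """Merge scan t with its k predecessors, expressed in the frame of scan t.

    scans: list of (N_i x 4) arrays holding x, y, z, remission
    poses: list of 4 x 4 homogeneous matrices mapping scan i into the world frame
    """
    world_to_t = np.linalg.inv(poses[t])                        # world -> frame of scan t
    merged = []
    for i in range(max(0, t - k), t + 1):
        pts = scans[i]
        hom = np.hstack([pts[:, :3], np.ones((len(pts), 1))])   # homogeneous coordinates
        local = (world_to_t @ poses[i] @ hom.T).T[:, :3]        # into the frame of scan t
        merged.append(np.hstack([local, pts[:, 3:4]]))          # keep the remission channel
    return np.vstack(merged)
```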
Results and Discussion. Table 4 shows the per-class results for the movable classes and the mean IoU (mIoU) over all classes. For each method, the upper row shows the IoU for non-moving objects and the lower row the IoU for moving objects. The performance of the remaining static classes is similar to the single scan results, and we refer to the supplement for a table containing all classes.

| Approach | mIoU | car | truck | other-vehicle | person | bicyclist | motorcyclist |
|---|---|---|---|---|---|---|---|
| TangentConv [52] non-moving | 34.1 | 84.9 | 21.1 | 18.5 | 1.6 | 0.0 | 0.0 |
| TangentConv [52] moving | | 40.3 | 42.2 | 30.1 | 6.4 | 1.1 | 1.9 |
| DarkNet53Seg non-moving | 41.6 | 84.1 | 20.0 | 20.7 | 7.5 | 0.0 | 0.0 |
| DarkNet53Seg moving | | 61.5 | 37.8 | 28.9 | 15.2 | 14.1 | 0.2 |

Table 4: IoU results using a sequence of multiple past scans (in %). For each approach, the upper row corresponds to the IoU of the non-moving classes and the lower row to the moving classes.

The general trend that the projective methods perform better than the point-based methods is still apparent, which can also be attributed to the larger number of parameters, as in the single scan case. Both approaches show difficulties in separating moving and non-moving objects, which might be caused by our design decision to aggregate multiple scans into a single large point cloud. The results show that especially bicyclist and motorcyclist never get correctly assigned the non-moving class, which is most likely a consequence of the generally sparser object point clouds.

We expect that new approaches could explicitly exploit the sequential information by using multiple input streams to the architecture or even recurrent neural networks to account for the temporal information, which again might open a new line of research.

5. Evaluation of Semantic Scene Completion

After leveraging a sequence of past scans for semantic point cloud segmentation, we now show a scenario that makes use of future scans. Due to its sequential nature, our dataset provides the unique opportunity to be extended for the task of 3D semantic scene completion. Note that this is the first real-world outdoor benchmark for this task. Existing point cloud datasets cannot be used to address this task, as they do not allow for aggregating labeled point clouds that are sufficiently dense in both space and time.

In semantic scene completion, one fundamental problem is to obtain ground truth labels for real-world datasets. In the case of NYUv2 [48], CAD models were fit into the scene [45] using an RGB-D image captured by a Kinect sensor. New approaches often resort to proving their effectiveness on the larger, but synthetic, SUNCG dataset [49]. However, a dataset combining the scale of a synthetic dataset with the usage of real-world data is still missing.

In the case of our proposed dataset, the car carrying the LiDAR moves past 3D objects in the scene and thereby records their backsides, which are hidden in the initial scan due to self-occlusion. This is exactly the information needed for semantic scene completion, as it contains the full 3D geometry of all objects, while their semantics are provided by our dense annotations.

Dataset Generation. By superimposing an exhaustive number of future laser scans in a predefined region in front of the car, we can generate pairs of inputs and targets that correspond to the task of semantic scene completion. As proposed by Song et al. [49], our dataset for the scene completion task is a voxelized representation of the 3D scene. We select a volume of 51.2 m ahead of the car, 25.6 m to every side, and 6.4 m in height with a voxel resolution of 0.2 m, which results in a volume of 256 × 256 × 32 voxels to predict. We assign a single label to every voxel based on the majority vote over all labeled points inside a voxel. Voxels that do not contain any points are labeled as empty.

To compute which voxels belong to the occluded space, we check for every pose of the car which voxels are visible to the sensor by tracing a ray. Some of the voxels, e.g., those inside objects or behind walls, are never visible, so we ignore them during training and evaluation.

Overall, we extracted 19,130 pairs of input and target voxel grids for training, 815 for validation, and 3,992 for testing. For the test set, we only provide the unlabeled input voxel grid and withhold the target voxel grids. Figure 5 shows an example of an input and target pair.

Task and Metrics. In semantic scene completion, we are interested in predicting the complete scene inside a certain volume from a single initial scan. More specifically, we use as input a voxel grid, where each voxel is marked as empty or occupied, depending on whether or not it contains a laser measurement. For semantic scene completion, one needs to predict whether a voxel is occupied and its semantic label in the completed scene.

For evaluation, we follow the evaluation protocol of Song et al. [49] and compute the IoU for the task of scene completion, which only classifies a voxel as being occupied or empty, i.e., ignoring the semantic label, as well as the mIoU of Eq. (1) for the task of semantic scene completion over the same 19 classes that were used for the single scan semantic segmentation task (see Section 4).
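Generating such a target grid boils down to voxelizing the aggregated, labeled points into the 256 × 256 × 32 volume described above and taking a per-voxel majority vote. The following is a simplified sketch, not the benchmark generation pipeline; the occlusion handling by ray tracing is omitted, and the vertical offset of the volume is an assumption, since it is not specified in the text:

```python
import numpy as np

RES = 0.2                                    # voxel resolution in meters
DIMS = np.array([256, 256, 32])              # 51.2 m ahead, +/- 25.6 m sideways, 6.4 m height
ORIGIN = np.array([0.0, -25.6, -2.0])        # volume origin; the z offset is an assumption

def voxelize_labels(points, labels, num_classes, empty_label=0):
    """Assign every voxel the majority label of its points; voxels without points stay empty."""
    idx = np.floor((points[:, :3] - ORIGIN) / RES).astype(np.int64)
    inside = np.all((idx >= 0) & (idx < DIMS), axis=1)
    idx, labels = idx[inside], labels[inside]

    flat = (idx[:, 0] * DIMS[1] + idx[:, 1]) * DIMS[2] + idx[:, 2]
    counts = np.zeros((int(DIMS.prod()), num_classes), dtype=np.int32)  # dense for clarity
    np.add.at(counts, (flat, labels), 1)                                # per-voxel label histogram

    grid = np.full(int(DIMS.prod()), empty_label, dtype=np.int32)
    occupied = counts.sum(axis=1) > 0
    grid[occupied] = counts[occupied].argmax(axis=1)
    return grid.reshape(tuple(DIMS))
```

With grids like this, the completion IoU compares occupancy (grid != empty_label) between prediction and target, while the semantic mIoU applies Eq. (1) to the voxel labels.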
Figure 5: Left: Visualization of the incomplete input for the semantic scene completion benchmark. Note that we show the
labels only for better visualization, but the real input is a single raw voxel grid without any labels. Right: Corresponding
target output representing the completed and fully labeled 3D scene.
State of the Art. Early approaches addressed the task of scene completion either without predicting semantics [16], thereby not providing a holistic understanding of the scene, or by trying to fit a fixed number of mesh models to the scene geometry [20], which limits the expressiveness of the approach.

Song et al. [49] were the first to address the task of semantic scene completion in an end-to-end fashion. Their work spawned a lot of interest in the field, yielding models that combine the usage of color and depth information [33, 18], address the problem of sparse 3D feature maps by introducing submanifold convolutions [64], or increase the output resolution by deploying a multi-stage coarse-to-fine training scheme [12]. Other works experimented with new encoder-decoder CNN architectures as well as with improving the loss term by adding adversarial loss components [58].

| Approach | Completion (IoU) | Semantic Scene Completion (mIoU) |
|---|---|---|
| SSCNet [49] | 29.83 | 9.53 |
| TS3D [18] | 29.81 | 9.54 |
| TS3D [18] + DarkNet53Seg | 24.99 | 10.19 |
| TS3D [18] + DarkNet53Seg + SATNet | 50.60 | 17.70 |

Table 5: Semantic scene completion baselines.

Baseline Approaches. We report the results of four semantic scene completion approaches. In the first approach, we apply SSCNet [49] without the flipped TSDF as input feature. This has minimal impact on the performance, but significantly speeds up the training due to faster preprocessing [18]. Then we use the Two Stream (TS3D) approach [18], which makes use of the additional information from the RGB image corresponding to the input laser scan. To this end, the RGB image is first processed by a 2D semantic segmentation network, DeepLab v2 (ResNet-101) [9] trained on Cityscapes, to generate a semantic segmentation. The depth information from the single laser scan and the labels inferred from the RGB image are combined in an early fusion. Furthermore, we modify the TS3D approach in two steps: first, by directly using labels from the best LiDAR-based semantic segmentation approach (DarkNet53Seg), and second, by exchanging the 3D-CNN backbone for SATNet [33].

Results and Discussion. Table 5 shows the results of each of the baselines, whereas results for individual classes are reported in the supplement. The TS3D network, incorporating 2D semantic segmentation of the RGB image, performs similarly to SSCNet, which only uses depth information. However, using the best semantic segmentation working directly on the point cloud slightly outperforms SSCNet on semantic scene completion (TS3D + DarkNet53Seg). Note that the first three approaches are based on SSCNet's 3D-CNN architecture, which performs a 4-fold downsampling in the forward pass and thus renders them incapable of dealing with details of the scene. In our final approach, we exchange the SSCNet backbone of TS3D + DarkNet53Seg for SATNet [33], which is capable of dealing with the desired output resolution. Due to memory limitations, we use random cropping during training. During inference, we divide each volume into six equal parts, perform scene completion on them individually, and subsequently fuse them. This approach performs much better than the SSCNet-based approaches.

Apart from dealing with the target resolution, a challenge for current models is the sparsity of the laser input signal in the far field, as can be seen from Figure 5. To obtain a higher-resolution input signal in the far field, approaches would have to exploit more efficiently the information from the high-resolution RGB images provided along with each laser scan.

6. Conclusion and Outlook

In this work, we have presented a large-scale dataset showing unprecedented scale in point-wise annotation of point cloud sequences. We provide a range of different baseline experiments for three tasks: (i) semantic segmentation using a single scan, (ii) semantic segmentation using multiple scans, and (iii) semantic scene completion.

In future work, we plan to also provide instance-level annotation over the whole sequence, i.e., we want to distinguish different objects in a scan, but also identify the same object over time. This will enable the investigation of temporal instance segmentation over sequences. However, we also see potential for other new tasks based on our labeling effort, such as the evaluation of semantic SLAM.

Acknowledgments. We thank all students that helped with annotating the data. The work has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under FOR 1505 Mapping on Demand, BE 5996/1-1, GA 1927/2-2, and under Germany's Excellence Strategy, EXC-2070 – 390732324 (PhenoRob).
References

[1] Anuraag Agrawal, Atsushi Nakazawa, and Haruo Takemura. MMM-classification of 3D Range Data. In Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2009. 5
[2] Dragomir Anguelov, Ben Taskar, Vassil Chatalbashev, Daphne Koller, Dinkar Gupta, Geremy Heitz, and Andrew Ng. Discriminative Learning of Markov Random Fields for Segmentation of 3D Scan Data. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 169–176, 2005. 5
[3] Iro Armeni, Alexander Sax, Amir R. Zamir, and Silvio Savarese. Joint 2D-3D-Semantic Data for Indoor Scene Understanding. arXiv preprint, 2017. 2
[4] Jens Behley, Kristian Kersting, Dirk Schulz, Volker Steinhage, and Armin B. Cremers. Learning to Hash Logistic Regression for Fast 3D Scan Point Classification. In Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pages 5960–5965, 2010. 5
[5] Jens Behley and Cyrill Stachniss. Efficient Surfel-Based SLAM using 3D Laser Range Data in Urban Environments. In Proc. of Robotics: Science and Systems (RSS), 2018. 3
[6] Jens Behley, Volker Steinhage, and Armin B. Cremers. Performance of Histogram Descriptors for the Classification of 3D Laser Range Data in Urban Environments. In Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2012. 2, 3
[7] Alexandre Boulch, Joris Guerry, Bertrand Le Saux, and Nicolas Audebert. SnapNet: 3D point cloud semantic labeling with 2D deep segmentation networks. Computers & Graphics, 2017. 5
[8] Angel X. Chang, Thomas Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University and Princeton University and Toyota Technological Institute at Chicago, 2015. 2, 5
[9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40(4):834–848, 2018. 8
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016. 2, 3, 4, 6
[11] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017. 2
[12] Angela Dai, Daniel Ritchie, Martin Bokeloh, Scott Reed, Jürgen Sturm, and Matthias Nießner. ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018. 2, 8
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2009. 2
[14] Francis Engelmann, Theodora Kontogianni, Jonas Schult, and Bastian Leibe. Know What Your Neighbors Do: 3D Semantic Segmentation of Point Clouds. arXiv preprint, 2018. 5
[15] Mark Everingham, S.M. Ali Eslami, Luc van Gool, Christopher K.I. Williams, John Winn, and Andrew Zisserman. The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision (IJCV), 111(1):98–136, 2015. 4
[16] Michael Firman, Oisin Mac Aodha, Simon Julier, and Gabriel J. Brostow. Structured Prediction of Unobserved Voxels From a Single Depth Image. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 5431–5440, 2016. 7
[17] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual Worlds as Proxy for Multi-Object Tracking Analysis. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016. 3
[18] Martin Garbade, Yueh-Tung Chen, J. Sawatzky, and Juergen Gall. Two Stream 3D Semantic Scene Completion. In Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019. 8
[19] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361, 2012. 1, 2, 3
[20] Andreas Geiger and Chaohui Wang. Joint 3D Object and Layout Inference from a single RGB-D Image. In Proc. of the German Conf. on Pattern Recognition (GCPR), pages 183–195, 2015. 7
[21] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018. 5
[22] Fabian Groh, Patrick Wieschollek, and Hendrik Lensch. Flex-Convolution (Million-Scale Pointcloud Learning Beyond Grid-Worlds). In Proc. of the Asian Conf. on Computer Vision (ACCV), December 2018. 5
[23] Timo Hackel, Nikolay Savinov, Lubor Ladicky, Jan D. Wegner, Konrad Schindler, and Marc Pollefeys. SEMANTIC3D.NET: A new large-scale point cloud classification benchmark. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, volume IV-1-W1, pages 91–98, 2017. 2
[24] Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. SceneNN: A Scene Meshes Dataset with aNNotations. In Proc. of the Intl. Conf. on 3D Vision (3DV), 2016. 2
[25] Binh-Son Hua, Minh-Khoi Tran, and Sai-Kit Yeung. Pointwise Convolutional Neural Networks. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018. 5
[26] Jing Huang and Suya You. Point Cloud Labeling using 3D Convolutional Neural Network. In Proc. of the Intl. Conf. on Pattern Recognition (ICPR), 2016. 5
[27] Varun Jampani, Martin Kiefel, and Peter V. Gehler. Learning Sparse High Dimensional Filters: Image Filtering, Dense CRFs and Bilateral Neural Networks. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016. 5
[28] Mingyang Jiang, Yiran Wu, and Cewu Lu. PointSIFT: A SIFT-like Network Module for 3D Point Cloud Semantic Segmentation. arXiv preprint, 2018. 5
[29] Andrew E. Johnson and Martial Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 21(5):433–449, 1999. 5
[30] Roman Klokov and Victor Lempitsky. Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models. In Proc. of the IEEE Intl. Conf. on Computer Vision (ICCV), 2017. 5
[31] Loic Landrieu and Martin Simonovsky. Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018. 5, 6
[32] Wenbin Li, Sajad Saeedi, John McCormac, Ronald Clark, Dimos Tzoumanikas, Qing Ye, Yuzhong Huang, Rui Tang, and Stefan Leutenegger. InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset. In Proc. of the British Machine Vision Conference (BMVC), 2018. 2
[33] Shice Liu, Yu Hu, Yiming Zeng, Qiankun Tang, Beibei Jin, Yinhe Han, and Xiaowei Li. See and Think: Disentangling Semantic Scene Completion. In Proc. of the Conf. on Neural Information Processing Systems (NeurIPS), pages 261–272, 2018. 8
[34] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition. In Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2015. 5
[35] John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J. Davison. SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation? In Proc. of the IEEE Intl. Conf. on Computer Vision (ICCV), 2017. 2
[36] Daniel Munoz, J. Andrew Bagnell, Nicolas Vandapel, and Martial Hebert. Contextual Classification with Functional Max-Margin Markov Networks. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2009. 2, 5
[37] Daniel Munoz, Nicholas Vandapel, and Martial Hebert. Directional Associative Markov Network for 3-D Point Cloud Classification. In Proc. of the International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), pages 63–70, 2008. 5
[38] Daniel Munoz, Nicholas Vandapel, and Martial Hebert. Onboard Contextual Classification of 3-D Point Clouds with Learned High-order Markov Random Fields. In Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2009. 5
[39] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. In Proc. of the IEEE Intl. Conf. on Computer Vision (ICCV), 2017. 2, 3
[40] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017. 5, 6
[41] Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proc. of the Conf. on Neural Information Processing Systems (NeurIPS), 2017. 5, 6
[42] Joseph Redmon and Ali Farhadi. YOLOv3: An Incremental Improvement. arXiv preprint, 2018. 5
[43] Dario Rethage, Johanna Wald, Jürgen Sturm, Nassir Navab, and Federico Tombari. Fully-Convolutional Point Networks for Large-Scale Point Clouds. In Proc. of the European Conf. on Computer Vision (ECCV), 2018. 5
[44] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning Deep 3D Representations at High Resolutions. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017. 5
[45] Jason Rock, Tanmay Gupta, Justin Thorsen, JunYoung Gwak, Daeyun Shin, and Derek Hoiem. Completing 3D Object Shape from One Depth Image. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015. 7
[46] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio Lopez. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2016. 2
[47] Xavier Roynard, Jean-Emmanuel Deschaud, and Francois Goulette. Paris-Lille-3D: A large and high-quality ground-truth urban point cloud dataset for automatic segmentation and classification. Intl. Journal of Robotics Research (IJRR), 37(6):545–557, 2018. 2, 3
[48] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor Segmentation and Support Inference from RGBD Images. In Proc. of the European Conf. on Computer Vision (ECCV), 2012. 2, 5, 7
[49] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. Semantic Scene Completion from a Single Depth Image. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017. 7, 8
[50] Bastian Steder, Giorgio Grisetti, and Wolfram Burgard. Robust Place Recognition for 3D Range Data based on Point Features. In Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2010. 2
[51] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SPLATNet: Sparse Lattice Networks for Point Cloud Processing. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018. 5, 6
[52] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent Convolutions for Dense Prediction in 3D. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018. 5, 6, 7
[53] Lyne P. Tchapmi, Christopher B. Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. SEGCloud: Semantic Segmentation of 3D Point Clouds. In Proc. of the Intl. Conf. on 3D Vision (3DV), 2017. 5
[54] Gusi Te, Wei Hu, Zongming Guo, and Amin Zheng. RGCNN: Regularized Graph CNN for Point Cloud Segmentation. arXiv preprint, 2018. 5
[55] A. Torralba and A. Efros. Unbiased Look at Dataset Bias. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011. 2, 3
[56] Rudolph Triebel, Kristian Kersting, and Wolfram Burgard. Robust 3D Scan Point Classification using Associative Markov Networks. In Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), pages 2603–2608, 2006. 5
[57] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep Parametric Continuous Convolutional Neural Networks. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018. 3
[58] Yida Wang, David Joseph Tan, Nassir Navab, and Federico Tombari. Adversarial Semantic Scene Completion from a Single Depth Image. In Proc. of the Intl. Conf. on 3D Vision (3DV), pages 426–434, 2018. 8
[59] Zongji Wang and Feng Lu. VoxSegNet: Volumetric CNNs for Semantic Part Segmentation of 3D Shapes. arXiv preprint, 2018. 5
[60] Bichen Wu, Alvin Wan, Xiangyu Yue, and Kurt Keutzer. SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud. In Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2018. 5, 6
[61] Bichen Wu, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer. SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a LiDAR Point Cloud. In Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2019. 5, 6
[62] Xuehan Xiong, Daniel Munoz, J. Andrew Bagnell, and Martial Hebert. 3-D Scene Analysis via Sequenced Predictions over Points and Regions. In Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), pages 2609–2616, 2011. 5
[63] Wei Zeng and Theo Gevers. 3DContextNet: K-d Tree Guided Hierarchical Learning of Point Clouds Using Local and Global Contextual Cues. arXiv preprint, 2017. 5
[64] Jiahui Zhang, Hao Zhao, Anbang Yao, Yurong Chen, Li Zhang, and Hongen Liao. Efficient Semantic Scene Completion Network with Spatial Group Convolution. In Proc. of the European Conf. on Computer Vision (ECCV), pages 733–749, 2018. 8
[65] Richard Zhang, Stefan A. Candra, Kai Vetter, and Avideh Zakhor. Sensor Fusion for Semantic Segmentation of Urban Scenes. In Proc. of the IEEE Intl. Conf. on Robotics & Automation (ICRA), 2015. 2