HDMapNet: An Online HD Map Construction and Evaluation Framework

Qi Li1,∗, Yue Wang2,∗, Yilun Wang3 and Hang Zhao1

arXiv:2107.06307v4 [cs.CV] 18 Mar 2022

1 Qi Li and Hang Zhao are with Tsinghua University, Beijing, China. Hang Zhao is the corresponding author: [email protected]
2 Yue Wang is with Massachusetts Institute of Technology, MA, USA.
3 Yilun Wang is with Li Auto, Beijing, China.
∗ Equal contribution.

Abstract— Constructing HD semantic maps is a central component of autonomous driving. However, traditional pipelines require a vast amount of human effort and resources in annotating and maintaining the semantics in the map, which limits its scalability. In this paper, we introduce the problem of HD semantic map learning, which dynamically constructs the local semantics based on onboard sensor observations. Meanwhile, we introduce a semantic map learning method, dubbed HDMapNet. HDMapNet encodes image features from surrounding cameras and/or point clouds from LiDAR, and predicts vectorized map elements in the bird's-eye view. We benchmark HDMapNet on the nuScenes dataset and show that in all settings it performs better than baseline methods. Of note, our camera-LiDAR fusion-based HDMapNet outperforms existing methods by more than 50% in all metrics. In addition, we develop semantic-level and instance-level metrics to evaluate the map learning performance. Finally, we showcase that our method is capable of predicting a locally consistent map. By introducing the method and metrics, we invite the community to study this novel map learning problem.

I. INTRODUCTION

High-definition (HD) semantic maps are an essential module for autonomous driving. Traditional pipelines to construct such HD semantic maps involve capturing point clouds beforehand, building globally-consistent maps using SLAM, and annotating semantics in the maps. This paradigm, though producing accurate HD maps and adopted by many autonomous driving companies, requires a vast amount of human effort.

As an alternative, we investigate scalable and affordable autonomous driving solutions, e.g. minimizing human effort in annotating and maintaining HD maps. To that end, we introduce a novel semantic map learning framework that makes use of on-board sensors and computation to estimate vectorized local semantic maps. Of note, our framework does not aim to replace global HD map reconstruction; instead, it provides a simple way to predict local semantic maps for real-time motion prediction and planning.

We propose a semantic map learning method named HDMapNet, which produces vectorized map elements from images of the surrounding cameras and/or from point clouds like LiDARs. We study how to effectively transform perspective image features to bird's-eye view features when depth is missing. We put forward a novel view transformer that consists of both neural feature transformation and geometric projection. Moreover, we investigate whether point clouds and camera images complement each other in this task. We find that different map elements are not equally recognizable in a single modality. To take the best from both worlds, our best model combines point cloud representations with image representations. This model outperforms its single-modal counterparts by a significant margin in all categories. To demonstrate the practical value of our method, we generate a locally-consistent map using our model in Figure 6; the map is immediately applicable to real-time motion planning. Finally, we propose comprehensive ways to evaluate the performance of map learning. These metrics include both semantic-level and instance-level evaluations, as map elements are typically represented as object instances in HD maps. On the public nuScenes dataset, HDMapNet improves over existing methods by 12.1 IoU on semantic segmentation and 13.1 mAP on instance detection.

To summarize, our contributions include the following:
• We propose a novel online framework to construct HD semantic maps from sensory observations, together with a method named HDMapNet.
• We come up with a novel feature projection module from perspective view to bird's-eye view. This module models 3D environments implicitly and considers the camera extrinsics explicitly.
• We develop comprehensive evaluation protocols and metrics to facilitate future research.

II. RELATED WORK

Semantic map construction. Most existing HD semantic maps are annotated either manually or semi-automatically on LiDAR point clouds of the environment, merged from LiDAR scans collected by survey vehicles with high-end GPS and IMU. SLAM algorithms are the most commonly used algorithms to fuse LiDAR scans into a highly accurate and consistent point cloud. First, pairwise alignment algorithms like ICP [1], NDT [2] and their variants [3] are employed to match LiDAR data at two nearby timestamps using semantic [4] or geometric information [5]. Second, estimating accurate poses of the ego vehicle is formulated as a non-linear least-squares problem [6] or a factor graph [7], which is critical to build a globally consistent map. Yang et al. [8] presented a method for reconstructing maps at city scale based on pose graph optimization under the constraint of pairwise alignment factors. To reduce the cost of manual annotation of semantic maps, [9], [10] proposed several machine learning techniques to extract static elements from fused LiDAR point clouds and cameras. However, it is still laborious and costly to maintain an HD semantic map since it requires high precision and timely updates. In this paper, we argue that our proposed local semantic map learning task is a potentially more scalable solution for autonomous driving.
[Figure 1 diagram: the global map construction pipeline (LiDAR scans, IMU, GPS and wheel odometry → pairwise alignment → globally consistent point cloud → manual annotation → global HD semantic map, consumed with centimeter-level localization) contrasted with our local semantic map learning, where surrounding cameras and/or LiDAR feed HDMapNet to produce a local HD map.]

Fig. 1: In contrast to pre-annotating global semantic maps, we introduce a novel local map learning framework that makes use of on-board sensors to estimate local semantic maps.

Perspective view lane detection. The traditional perspective-view-based lane detection pipeline involves local image feature extraction (e.g. color, directional filters [11], [12], [13]), line fitting (e.g. Hough transform [14]), image-to-world projection, etc. With the advances of deep learning based image segmentation and detection techniques [15], [16], [17], [18], researchers have explored more data-driven approaches. Deep models were developed for road segmentation [19], [20], lane detection [21], [22], drivable area analysis [23], etc. More recently, models were built to give 3D outputs rather than 2D. Bai et al. [24] incorporated LiDAR signals so that image pixels can be projected onto the ground. Garnett et al. [25] and Guo et al. [26] used synthetic lane datasets to perform supervised training on the prediction of camera height and pitch, so that the output lanes sit in a 3D ground plane. Beyond detecting lanes, our work outputs a consistent local semantic map around the vehicle from surround cameras or LiDARs.

Cross-view learning. Recently, some efforts have been made to study cross-view learning to facilitate robots' surrounding sensing capability. Pan [27] used MLPs to learn the relationship between perspective-view feature maps and bird's-eye view feature maps. Roddick and Cipolla [28] applied 1D convolution on the image features along the horizontal axis to predict the bird's-eye view. Philion and Fidler [29] predicted the depth of monocular cameras and projected image features into the bird's-eye view using soft attention. Our work focuses on the crucial task of local semantic map construction; we use cross-view sensing methods to generate map elements in a vectorized form. Moreover, our model can be easily fused with LiDAR input to further improve its accuracy.

III. SEMANTIC MAP LEARNING

We propose semantic map learning, a novel framework that produces local high-definition semantic maps. It takes sensor inputs like camera images and LiDAR point clouds, and outputs vectorized map elements, such as lane dividers, lane boundaries and pedestrian crossings. We use I and P to denote the images and point clouds, respectively. Optionally, the framework can be extended to include other sensor signals like radars. We define M as the map elements to predict.

A. HDMapNet

Our semantic map learning model, named HDMapNet, predicts map elements M from a single frame of I and P with neural networks directly. An overview is shown in Figure 2. Four neural networks parameterize our model: a perspective view image encoder φI and a neural view transformer φV in the image branch, a pillar-based point cloud encoder φP, and a map element decoder φM. We denote our HDMapNet family as HDMapNet(Surr), HDMapNet(LiDAR), and HDMapNet(Fusion) if the model takes only surrounding images, only LiDAR, or both of them as input.

1) Image encoder: Our image encoder has two components, namely the perspective view image encoder and the neural view transformer.

Perspective view image encoder. Our image branch takes perspective view inputs from Nm surrounding cameras, covering the panorama of the scene. Each image Ii is embedded by a shared neural network φI to get a perspective view feature map F^pv_Ii ∈ R^(Hpv×Wpv×K), where Hpv, Wpv, and K are the height, width, and feature dimension respectively.

Neural view transformer. As shown in Figure 3, we first transform image features from the perspective view to the camera coordinate system and then to the bird's-eye view. The relation of any two pixels between the perspective view and the camera coordinate system is modeled by a multi-layer perceptron φVi:

F^c_Ii[h][w] = φ^hw_Vi (F^pv_Ii[1][1], ..., F^pv_Ii[Hpv][Wpv])    (1)

where φ^hw_Vi models the relation between the feature vector at position (h, w) in the camera coordinate system and every pixel on the perspective view feature map. We denote Hc and Wc as the top-down spatial dimensions of F^c_I. The bird's-eye view (ego coordinate system) feature F^bev_Ii ∈ R^(Hbev×Wbev×K) is obtained by transforming the features F^c_Ii using geometric projection with the camera extrinsics, where Hbev and Wbev are the height and width in the bird's-eye view. The final image feature F^bev_I is the average of the Nm camera features.
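To make the view transformation concrete, the sketch below shows one way such a neural view transformer could be implemented: a per-camera fully connected layer relates every perspective-view pixel to every cell of a top-down grid in the camera frame (Eq. 1), and a resampling step driven by the camera extrinsics places that grid onto the ego-centered BEV canvas. The module name, tensor shapes, and the precomputed sampling-grid shortcut are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralViewTransformer(nn.Module):
    """Sketch of Eq. (1): an MLP mixes perspective-view pixels into a top-down
    grid in the camera frame; extrinsics then warp that grid into the ego (BEV)
    frame. Shapes and the warp shortcut are illustrative assumptions."""
    def __init__(self, h_pv, w_pv, h_c, w_c):
        super().__init__()
        self.h_c, self.w_c = h_c, w_c
        # One weight matrix per camera in practice; it is shared across the K
        # feature channels, so it mixes spatial locations but keeps the feature dim.
        self.fc = nn.Linear(h_pv * w_pv, h_c * w_c)

    def forward(self, feat_pv, bev_grid_cam):
        # feat_pv: (B, K, H_pv, W_pv) perspective-view features of one camera.
        # bev_grid_cam: (B, H_bev, W_bev, 2) sampling locations of each ego BEV
        #   cell expressed in this camera's top-down grid, precomputed from the
        #   camera extrinsics and normalized to [-1, 1] for grid_sample.
        b, k, h, w = feat_pv.shape
        flat = feat_pv.flatten(2)                     # (B, K, H_pv*W_pv)
        feat_cam = self.fc(flat)                      # (B, K, H_c*W_c)
        feat_cam = feat_cam.view(b, k, self.h_c, self.w_c)
        # Geometric projection to the ego frame: resample the camera-frame grid.
        feat_bev = F.grid_sample(feat_cam, bev_grid_cam, align_corners=False)
        return feat_bev                               # (B, K, H_bev, W_bev)

# Fusing N_m cameras, averaged as described above:
# feat_bev = torch.stack([vt(f, g) for vt, f, g in zip(transformers, feats, grids)]).mean(0)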
Fig. 2: Model overview. HDMapNet works with either or both of images and point clouds, outputs semantic segmentation, instance embedding and directions, and finally produces a vectorized local semantic map. Top left: image branch. Bottom left: point cloud branch. Right: HD semantic map.

Fig. 3: Feature transformation. Left: the 6 input images in the perspective view. Middle: 6 feature maps in the camera coordinate system, which are obtained by extracting features using the image encoder and transforming these features with MLPs; each feature map (in different colors) covers a certain area. Right: the feature map (in orange) in the ego vehicle coordinate system; this is fused from the 6 feature maps and transformed to the ego vehicle coordinate system with the camera extrinsics.

2) Point cloud encoder: Our point cloud encoder φP is a variant of PointPillars [30] with dynamic voxelization [31], which divides the 3D space into multiple pillars and learns feature maps from the pillar-wise point clouds. The input is N lidar points in the point cloud. Each point p has three-dimensional coordinates and additional K-dimensional features, represented as fp ∈ R^(K+3). When projecting features from points to the bird's-eye view, multiple points can potentially fall into the same pillar. We define Pj as the set of points corresponding to pillar j. To aggregate features from the points in a pillar, a PointNet [32] (denoted as PN) is warranted, where

f^pillar_j = PN({fp | ∀p ∈ Pj}).    (2)

Then, pillar-wise features are further encoded through a convolutional neural network φpillar. We denote the resulting feature map in the bird's-eye view as F^bev_P.
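As a concrete illustration of Eq. (2) with dynamic voxelization, the sketch below assigns every point a pillar index computed from its x-y coordinates and aggregates per-point embeddings with an element-wise max per pillar, the symmetric aggregation used by PointNet. The grid resolution, ranges, feature widths, and helper names are assumptions for illustration only.

import torch
import torch.nn as nn

class PillarEncoder(nn.Module):
    """Sketch of dynamic voxelization + PointNet-style pillar aggregation (Eq. 2).
    Grid size, ranges and the scatter-max trick are illustrative choices."""
    def __init__(self, in_dim=3, hidden=64, bev_h=200, bev_w=400,
                 x_range=(-30.0, 30.0), y_range=(-15.0, 15.0)):
        super().__init__()
        # in_dim: 3 coordinates (+ K extra per-point features when available).
        self.point_mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.bev_h, self.bev_w = bev_h, bev_w
        self.x_range, self.y_range = x_range, y_range

    def forward(self, points):
        # points: (N, in_dim) lidar points of one sweep; columns 0..2 are x, y, z.
        xs, ys = points[:, 0], points[:, 1]
        # Dynamic voxelization: every point gets a pillar index, with no fixed
        # per-pillar point budget and no padding.
        col = ((xs - self.x_range[0]) / (self.x_range[1] - self.x_range[0]) * self.bev_w).long()
        row = ((ys - self.y_range[0]) / (self.y_range[1] - self.y_range[0]) * self.bev_h).long()
        keep = (col >= 0) & (col < self.bev_w) & (row >= 0) & (row < self.bev_h)
        points, row, col = points[keep], row[keep], col[keep]
        pillar_id = row * self.bev_w + col                        # (M,)

        feats = self.point_mlp(points)                            # (M, hidden) per-point embedding
        # PN({f_p | p in P_j}) realized as an element-wise max over each pillar.
        canvas = feats.new_zeros(self.bev_h * self.bev_w, feats.shape[1])
        canvas.scatter_reduce_(0, pillar_id.unsqueeze(1).expand_as(feats),
                               feats, reduce="amax", include_self=False)
        # Reshape to a BEV feature map; a CNN (phi_pillar) would run on top of this.
        return canvas.view(self.bev_h, self.bev_w, -1).permute(2, 0, 1)   # (hidden, H, W)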
3) Bird's-eye view decoder: The map is a complex graph network that includes instance-level and directional information of lane dividers and lane boundaries. Instead of a pixel-level representation, lane lines need to be vectorized so that they can be followed by self-driving vehicles. Therefore, our BEV decoder φM not only outputs semantic segmentation but also predicts instance embeddings and lane directions. A post-processing step is applied to cluster instances from the embeddings and vectorize them.

Overall architecture. The BEV decoder is a fully convolutional network (FCN) [33] with 3 branches, namely the semantic segmentation branch, the instance embedding branch, and the direction prediction branch. The input of the BEV decoder is the image feature map F^bev_I and/or the point cloud feature map F^bev_P; we concatenate them if both exist.

Semantic prediction. The semantic prediction module is a fully convolutional network (FCN) [33]. We use the cross-entropy loss for semantic prediction.

Instance embedding. Our instance embedding module seeks to cluster each bird's-eye view embedding. For ease of notation, we follow the exact definition in [34]: C is the number of clusters in the ground truth, Nc is the number of elements in cluster c, µc is the mean embedding of cluster c, ‖·‖ is the L2 norm, and [x]₊ = max(0, x) denotes the element-wise maximum. δv and δd are respectively the margins for the variance and distance losses. The clustering loss L is computed by:

L_var = (1/C) Σ_{c=1}^{C} (1/Nc) Σ_{j=1}^{Nc} [‖µc − f^instance_j‖ − δv]₊²    (3)

L_dist = (1/(C(C−1))) Σ_{cA ≠ cB ∈ C} [2δd − ‖µcA − µcB‖]₊²    (4)

L = α L_var + β L_dist.    (5)
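Equations (3)–(5) translate almost directly into code. A minimal PyTorch sketch over one BEV sample, looping over ground-truth instances for clarity rather than speed (argument names and the background convention are illustrative):

import torch

def discriminative_loss(embedding, instance_mask, delta_v=0.5, delta_d=3.0,
                        alpha=1.0, beta=1.0):
    """Sketch of Eqs. (3)-(5). embedding: (E, H, W) predicted instance embeddings;
    instance_mask: (H, W) integer instance ids, 0 for background."""
    ids = [i for i in instance_mask.unique().tolist() if i != 0]
    if len(ids) == 0:
        return embedding.sum() * 0.0
    means, l_var = [], 0.0
    for i in ids:
        f = embedding[:, instance_mask == i]            # (E, N_c) embeddings of cluster c
        mu = f.mean(dim=1)
        means.append(mu)
        # Pull term, Eq. (3): embeddings should lie within delta_v of their mean.
        l_var = l_var + (torch.clamp((mu[:, None] - f).norm(dim=0) - delta_v, min=0) ** 2).mean()
    l_var = l_var / len(ids)
    # Push term, Eq. (4): cluster means should stay at least 2 * delta_d apart.
    l_dist = 0.0
    if len(ids) > 1:
        mus = torch.stack(means)                        # (C, E)
        hinge = torch.clamp(2 * delta_d - torch.cdist(mus, mus), min=0) ** 2
        off_diag = ~torch.eye(len(ids), dtype=torch.bool)
        l_dist = hinge[off_diag].sum() / (len(ids) * (len(ids) - 1))
    return alpha * l_var + beta * l_dist                # Eq. (5)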
Direction prediction. Our direction module aims to predict the direction of the lane at each pixel C. The directions are discretized into Nd classes uniformly distributed on the unit circle. By classifying the direction D of the current pixel Cnow, the next pixel of the lane Cnext can be obtained as Cnext = Cnow + ∆step · D, where ∆step is a predefined step size. Since we don't know the direction of the lane, we cannot identify the forward and backward direction of each node. Instead, we treat both of them as positive labels. Concretely, the direction label of each lane node is an Nd-dimensional vector with 2 indices labeled as 1 and the others labeled as 0. Note that most of the pixels on the top-down map don't lie on lanes, which means they don't have directions. The direction vector of those pixels is a zero vector, and we never backpropagate through those pixels during training. We use softmax as the activation function for classification.
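A small sketch of how such a two-hot direction label could be built for a lane pixel, assuming Nd uniformly spaced bins on the unit circle; the bin count and helper name are illustrative assumptions.

import numpy as np

def direction_label(theta, n_dir=36):
    """Two-hot direction label for one lane pixel (illustrative sketch).
    theta: lane tangent angle in radians; the forward and backward bins are both 1."""
    bins = np.linspace(0.0, 2.0 * np.pi, n_dir, endpoint=False)
    label = np.zeros(n_dir, dtype=np.float32)
    # Nearest bin to theta, with wrap-around handled via the complex exponential.
    fwd = int(np.argmin(np.abs(np.angle(np.exp(1j * (bins - theta))))))
    bwd = (fwd + n_dir // 2) % n_dir            # the opposite direction
    label[fwd] = label[bwd] = 1.0
    return label  # off-lane pixels keep an all-zero label and are excluded from the loss

# Stepping along a lane during vectorization: C_next = C_now + step * D,
# where D = (cos(bins[k]), sin(bins[k])) for the predicted class k.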
Vectorization. During inference, we first cluster the instance embeddings using Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Then, non-maximum suppression (NMS) is used to reduce redundancy. Finally, the vector representations are obtained by greedily connecting the pixels with the help of the predicted direction.
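The post-processing could look roughly like the sketch below, which clusters foreground pixels by their predicted embeddings with scikit-learn's DBSCAN and then greedily chains the pixels of each cluster by repeatedly stepping along the predicted direction. The omitted NMS stage, the step size, the stopping rule and the neighborhood radius are simplified assumptions rather than the authors' exact procedure.

import numpy as np
from sklearn.cluster import DBSCAN

def vectorize(seg_mask, embedding, direction, step=2, eps=0.5):
    """Turn per-pixel outputs into polylines (illustrative sketch).
    seg_mask: (H, W) bool foreground mask; embedding: (H, W, E);
    direction: (H, W, 2) unit direction vectors decoded from the direction class."""
    ys, xs = np.nonzero(seg_mask)
    if len(xs) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(embedding[ys, xs])
    polylines = []
    for inst in set(labels) - {-1}:                      # -1 marks DBSCAN noise
        pix = np.stack([xs[labels == inst], ys[labels == inst]], axis=1)
        remaining = {tuple(p) for p in pix}
        cur = min(remaining)                             # arbitrary start pixel
        line = [cur]
        remaining.discard(cur)
        while remaining:
            # Step along the predicted direction, then snap to the nearest leftover pixel.
            guess = np.array(cur) + step * direction[cur[1], cur[0]]
            cand = min(remaining, key=lambda p: np.linalg.norm(np.array(p) - guess))
            if np.linalg.norm(np.array(cand) - np.array(cur)) > 3 * step:
                break                                    # gap too large: stop this polyline
            line.append(cand)
            remaining.discard(cand)
            cur = cand
        polylines.append(np.array(line))
    return polylines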
B. Evaluation

In this section, we propose evaluation protocols for semantic map learning, including semantic metrics and instance metrics.

1) Semantic metrics: The semantics of model predictions can be evaluated in the Eulerian fashion and the Lagrangian fashion. Eulerian metrics are computed on a dense grid and measure pixel-value differences. In contrast, Lagrangian metrics move with the shape and measure the spatial distances of shapes.

Eulerian metrics. We use intersection-over-union (IoU) as the Eulerian metric, which is given by

IoU(D1, D2) = |D1 ∩ D2| / |D1 ∪ D2|,    (6)

where D1, D2 ⊆ R^(H×W×D) are dense representations of shapes (curves rasterized on a grid); H and W are the height and width of the grid, D is the number of categories; |·| denotes the size of the set.

Lagrangian metrics. We are interested in structured outputs, namely curves consisting of connected points. To evaluate the spatial distances between the predicted curves and the ground-truth curves, we use the Chamfer distance (CD) between point sets sampled on the curves:

CD_Dir(S1, S2) = (1/|S1|) Σ_{x∈S1} min_{y∈S2} ‖x − y‖₂    (7)

CD(S1, S2) = CD_Dir(S1, S2) + CD_Dir(S2, S1)    (8)

where CD_Dir is the directional Chamfer distance and CD is the bi-directional Chamfer distance; S1 and S2 are the two sets of points on the curves.
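Eqs. (7)–(8) can be computed directly on sampled points; a minimal NumPy sketch (curve sampling left out) could be:

import numpy as np

def chamfer_directed(s1, s2):
    """Directional Chamfer distance, Eq. (7): mean over s1 of the distance to its
    nearest neighbor in s2. s1: (N, 2), s2: (M, 2) points sampled on the curves."""
    d = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1)   # (N, M) pairwise distances
    return d.min(axis=1).mean()

def chamfer(s1, s2):
    """Bi-directional Chamfer distance, Eq. (8)."""
    return chamfer_directed(s1, s2) + chamfer_directed(s2, s1)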
2) Instance metrics: We further evaluate the instance detection capability of our models. We use average precision (AP) similar to the one in object detection [35], given by

AP = (1/10) Σ_{r∈{0.1,0.2,...,1.0}} AP_r,    (9)

where AP_r is the precision at recall r. We collect all predictions and rank them in descending order according to their semantic confidences. Then, we classify each prediction based on the CD threshold: if the CD is lower than a predefined threshold, it is considered a true positive, otherwise a false positive. Finally, we obtain all precision-recall pairs and compute the APs accordingly.
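Under the matching rule just described, a simplified AP computation might look as follows; the greedy one-to-one matching of predictions to ground-truth curves and the interpolation at each recall level are assumptions about details the text does not spell out.

import numpy as np

def average_precision(preds, gts, cd_thresh):
    """Sketch of Eq. (9). preds: list of (confidence, points) predicted curves;
    gts: list of ground-truth point arrays; cd_thresh: CD threshold for a true
    positive. Uses the chamfer() helper sketched above."""
    preds = sorted(preds, key=lambda p: -p[0])           # rank by semantic confidence
    matched = set()
    tp = np.zeros(len(preds))
    for i, (_, pts) in enumerate(preds):
        # Greedily match to the closest unmatched ground-truth curve (assumption).
        cds = [chamfer(pts, g) if j not in matched else np.inf for j, g in enumerate(gts)]
        j = int(np.argmin(cds)) if cds else -1
        if j >= 0 and cds[j] < cd_thresh:
            tp[i] = 1.0
            matched.add(j)
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(preds)) + 1)
    recall = cum_tp / max(len(gts), 1)
    # Precision at recall r, averaged over r in {0.1, ..., 1.0}; precision is 0
    # for recall levels that are never reached.
    ap_r = [precision[recall >= r].max() if np.any(recall >= r) else 0.0
            for r in np.arange(0.1, 1.01, 0.1)]
    return float(np.mean(ap_r))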
IV. EXPERIMENTS

A. Implementation details

Tasks & Metrics. We evaluate our approach on the nuScenes dataset [36]. We focus on two sub-tasks: semantic map segmentation and instance detection. Due to the limited types of map elements in the nuScenes dataset, we consider three static map elements: lane boundary, lane divider, and pedestrian crossing.

Architecture. For the perspective view image encoder, we adopt EfficientNet-B0 [37] pre-trained on ImageNet [38], as in [29]. Then, we use a multi-layer perceptron (MLP) to convert the perspective view features to bird's-eye view features in the camera coordinate system. The MLP is shared channel-wise and does not change the feature dimension. For point clouds, we use a variant of PointPillars [39] with dynamic voxelization [31]. We use a PointNet [32] with a 64-dimensional layer to aggregate the points in a pillar. A ResNet [40] with three blocks is used as the BEV decoder.

Training details. We use the cross-entropy loss for the semantic segmentation, and the discriminative loss (Equation 5) for the instance embedding, where we set α = β = 1, δv = 0.5, and δd = 3.0. We use Adam [41] for model training, with a learning rate of 1e-3.

B. Baseline methods

For all baseline methods, we use the same image encoder and decoder as HDMapNet and only change the view transformation module.

Inverse Perspective Mapping (IPM). The most straightforward baseline is to map segmentation predictions to the bird's-eye view via IPM [42], [43].

IPM with bird's-eye view decoder (IPM(B)). Our second baseline is an extension of IPM. Rather than making predictions in the perspective view, we perform semantic segmentation directly in the bird's-eye view.

IPM with perspective view feature encoder and bird's-eye view decoder (IPM(CB)). The next extension is to perform feature learning in the perspective view while making predictions in the bird's-eye view.

Lift-Splat-Shoot. Lift-Splat-Shoot [29] estimates a distribution over depth in the perspective view images. Then, it converts 2D images into 3D point clouds with features and projects them into the ego vehicle frame.

View Parsing Network (VPN). VPN [27] proposes a simple view transformation module: a view relation module to model the relations between any two pixels and a view fusion module to fuse the features of pixels.

C. Results

We compare our HDMapNet against the baselines in Section IV-B. Table I shows the comparisons. First, our HDMapNet(Surr), which is the surrounding-camera-only method, outperforms all baselines. This suggests that our novel learning-based view transformation is indeed effective, without making impractical assumptions about a complex ground plane (IPM) or estimating the depth (Lift-Splat-Shoot). Second, our HDMapNet(LiDAR) is better than HDMapNet(Surr) on boundary but worse on divider and pedestrian crossing. This indicates that different categories are not equally recognizable in one modality. Third, our fusion model with both camera images and LiDAR point clouds achieves the best performance. It improves over the baselines and our camera-only method by 50% relatively.
[Figure 4 panels, left to right and top to bottom: Ground Truth, IPM, IPM(B), IPM(CB), Lift-Splat-Shoot, VPN, HDMapNet(Surr), HDMapNet(LiDAR), HDMapNet(Fusion).]

Fig. 4: Qualitative results on the validation set. Top left: the surrounding images and the ground-truth local semantic map annotations. IPM: the lane segmentation result in the perspective view and the bird's-eye view. Others: the semantic segmentation results and the vectorized instance detection results.
TABLE I: IoU scores (%) and CD (m) of semantic map segmentation. CD_P denotes the CD from prediction to label (related to precision) while CD_L denotes the CD from label to prediction (related to recall); CD is the average of the two. IoU: higher is better. CD: lower is better. *: the perspective view labels are projected from 2D HD maps.

                        |        Divider          |      Ped Crossing       |        Boundary         |       All Classes
Method                  | IoU   CD_P  CD_L  CD    | IoU   CD_P  CD_L  CD    | IoU   CD_P  CD_L  CD    | IoU   CD_P  CD_L  CD
IPM*                    | 14.4  1.149 2.232 2.193 | 9.5   1.232 3.432 2.482 | 18.4  1.502 2.569 1.849 | 14.1  1.294 2.744 2.175
IPM(B)                  | 25.5  1.091 1.730 1.226 | 12.1  0.918 2.947 1.628 | 27.1  0.710 1.670 0.918 | 21.6  0.906 2.116 1.257
IPM(CB)                 | 38.6  0.743 1.106 0.802 | 19.3  0.741 2.154 1.081 | 39.3  0.563 1.000 0.633 | 32.4  0.682 1.420 0.839
Lift-Splat-Shoot [29]   | 38.3  0.872 1.144 0.916 | 14.9  0.680 2.691 1.313 | 39.3  0.580 1.137 0.676 | 30.8  0.711 1.657 0.968
VPN [27]                | 36.5  0.534 1.197 0.919 | 15.8  0.491 2.824 2.245 | 35.6  0.283 1.234 0.848 | 29.3  0.436 1.752 1.337
HDMapNet(Surr)          | 40.6  0.761 0.979 0.779 | 18.7  0.855 1.997 1.101 | 39.5  0.608 0.825 0.624 | 32.9  0.741 1.267 0.834
HDMapNet(LiDAR)         | 26.7  1.134 1.508 1.219 | 17.3  1.038 2.573 1.524 | 44.6  0.501 0.843 0.561 | 29.5  0.891 1.641 1.101
HDMapNet(Fusion)        | 46.1  0.625 0.893 0.667 | 31.4  0.535 1.715 0.790 | 56.0  0.461 0.443 0.459 | 44.5  0.540 1.017 0.639
TABLE II: Instance detection results. {0.2, 0.5, 1.0} are the predefined thresholds of the Chamfer distance (e.g. a prediction is considered a true positive if the Chamfer distance between label and prediction is lower than that threshold). The mAP is the average of the three APs. AP & mAP: higher is better.

                        |           Divider            |        Ped Crossing          |          Boundary            |         All Classes
Method                  | [email protected] [email protected] AP@1.0 mAP   | [email protected] [email protected] AP@1.0 mAP   | [email protected] [email protected] AP@1.0 mAP   | [email protected] [email protected] AP@1.0 mAP
IPM(B)                  | 2.6    9.8    19.6   10.7  | 1.6    4.8    7.8    4.7   | 2.2    9.2    23.7   11.7  | 2.1    7.9    17.0   9.0
IPM(CB)                 | 10.2   25.0   36.8   24.0  | 2.0    7.8    12.2   7.3   | 10.1   27.9   45.5   27.8  | 7.4    20.2   31.5   19.7
Lift-Splat-Shoot [29]   | 9.1    23.8   35.9   22.9  | 0.9    5.4    8.9    5.1   | 8.5    22.9   41.2   24.2  | 6.2    17.4   28.7   17.4
VPN [27]                | 8.8    22.7   34.9   22.1  | 1.2    5.3    9.0    5.2   | 9.2    24.1   42.7   25.3  | 6.4    17.4   28.9   17.5
HDMapNet(Surr)          | 13.7   30.7   40.6   28.3  | 1.9    7.4    12.1   7.1   | 13.7   33.9   50.1   32.6  | 9.8    24.0   34.3   22.7
HDMapNet(LiDAR)         | 1.0    5.7    15.1   7.3   | 1.9    5.8    9.0    5.6   | 5.8    20.2   39.6   21.9  | 2.9    10.6   21.2   11.6
HDMapNet(Fusion)        | 15.0   32.6   46.0   31.2  | 6.4    13.6   17.8   12.6  | 26.7   51.5   65.6   47.9  | 16.0   32.6   43.1   30.6
Fig. 5: Qualitative results under bad weather conditions. The left two images show predictions at night. The right two show predictions on rainy days.
TABLE III: The semantic segmentation performance (IoU, %) of HDMapNet(Fusion) under different weather.

Weather   Night   Rainy   Normal
IoU       39.3    38.7    44.9

TABLE IV: Temporal fusion on HDMapNet(Surr) (IoU, %). "1" means temporal fusion is not used.

# of frames   1      2      4
IoU           32.9   35.8   36.4

Another interesting phenomenon is that various models behave differently in terms of CD. For example, VPN has the lowest CD_P in all categories, while it underperforms its counterparts on CD_L and has the worst overall CD. Instead, our HDMapNet(Surr) balances both CD_P and CD_L, achieving the best CD among all camera-only methods. This finding indicates that CD is complementary to IoU, as it exposes the precision and recall aspects of the models. This helps us understand the behaviors of different models from another perspective.

Instance map detection. In Figure 2 (instance detection branch), we show the visualization of embeddings using principal component analysis (PCA). Different lanes are assigned different colors even when they are close to each other or intersect. This confirms that our model learns instance-level information and can predict instance labels accurately. In Figure 2 (direction classification branch), we show the direction mask predicted by our direction branch. The direction is consistent and smooth. We show the vectorized curves produced after post-processing in Figure 4. In Table II, we present the quantitative results of instance map detection. HDMapNet(Surr) already outperforms the baselines, while HDMapNet(Fusion) is significantly better than all counterparts, e.g., it improves over IPM by 55.4%.

Sensor fusion. In this section, we further analyze the effect of sensor fusion for constructing HD semantic maps. As shown in Table I, for divider and pedestrian crossing, HDMapNet(Surr) outperforms HDMapNet(LiDAR), while for lane boundary, HDMapNet(LiDAR) works better. We hypothesize this is because there are elevation changes near the lane boundary, making it easy to detect in LiDAR point clouds. On the other hand, the color contrast of road dividers and pedestrian crossings is helpful information, making these two categories more recognizable in images; the visualizations in Figure 4 also confirm this. The strongest performance is achieved when combining LiDAR and cameras; the combined model outperforms both single-sensor models by a large margin. This suggests the two sensors carry complementary information.

Bad weather conditions. Here we assess the robustness of our model under extreme weather conditions. As shown in Figure 5, our model can generate complete lanes even when the lighting condition is bad, or when rain obscures the view. We speculate that the model can predict the shape of a lane based on partial observations when the roads are not completely visible. Although there is a performance drop in extreme weather conditions, the overall performance is still reasonable (Table III).

Temporal fusion. Here we experiment with temporal fusion strategies. We first conduct short-term temporal fusion by pasting feature maps of previous frames into the current one according to the ego poses. The feature maps are fused by max pooling and then fed into the decoder. As shown in Table IV, fusing multiple frames improves the IoU of the semantics. We further experiment with long-term temporal accumulation by fusing segmentation probabilities. As shown in Figure 6, our method produces consistent semantic maps with a larger field of view when fusing multiple frames.

Fig. 6: Long-term temporal accumulation by fusing occupancy probabilities over multiple frames.
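A minimal sketch of the short-term fusion step, assuming BEV feature maps and 2D ego poses (x, y, yaw) per frame: each previous map is resampled into the current ego frame with an affine warp and the maps are combined by element-wise max. The pose and axis conventions, the metric-to-grid scaling and the function name are illustrative assumptions, not the authors' implementation.

import math
import torch
import torch.nn.functional as F

def fuse_bev_frames(bev_feats, poses, bev_size_m):
    """Short-term temporal fusion sketch. bev_feats: list of (1, C, H, W) BEV
    feature maps, newest last; poses: matching list of (x, y, yaw) ego poses in
    a common world frame; bev_size_m: (width_m, height_m) metric extent of the
    BEV canvas. Earlier maps are warped into the newest ego frame and fused by max."""
    x0, y0, yaw0 = poses[-1]
    fused = bev_feats[-1]
    for feat, (x, y, yaw) in zip(bev_feats[:-1], poses[:-1]):
        dyaw = yaw - yaw0
        # Translation of the old origin expressed in the current ego frame (meters).
        dx = math.cos(yaw0) * (x - x0) + math.sin(yaw0) * (y - y0)
        dy = -math.sin(yaw0) * (x - x0) + math.cos(yaw0) * (y - y0)
        # 2x3 affine for grid_sample; translations are in normalized [-1, 1] units.
        theta = torch.tensor([[math.cos(dyaw), -math.sin(dyaw), -2.0 * dx / bev_size_m[0]],
                              [math.sin(dyaw),  math.cos(dyaw), -2.0 * dy / bev_size_m[1]]],
                             dtype=feat.dtype).unsqueeze(0)
        grid = F.affine_grid(theta, list(feat.shape), align_corners=False)
        fused = torch.maximum(fused, F.grid_sample(feat, grid, align_corners=False))
    return fused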
V. CONCLUSION

HDMapNet predicts HD semantic maps directly from camera images and/or LiDAR point clouds. The local semantic map learning framework could be a more scalable approach than the global map construction and annotation pipeline, which requires a significant amount of human effort. Even though our baseline method of semantic map learning does not produce map elements that are as accurate, it gives system developers another possible choice in the trade-off between scalability and accuracy.
REFERENCES

[1] P. J. Besl and N. D. McKay, "Method for registration of 3-D shapes," in Sensor Fusion IV: Control Paradigms and Data Structures, vol. 1611. International Society for Optics and Photonics, 1992, pp. 586–606.
[2] P. Biber and W. Straßer, "The normal distributions transform: A new approach to laser scan matching," in Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003), vol. 3. IEEE, 2003, pp. 2743–2748.
[3] A. Segal, D. Haehnel, and S. Thrun, "Generalized-ICP," in Robotics: Science and Systems, vol. 2, no. 4. Seattle, WA, 2009, p. 435.
[4] F. Yu, J. Xiao, and T. Funkhouser, "Semantic alignment of LiDAR data at city scale," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1722–1731.
[5] F. Pomerleau, F. Colas, and R. Siegwart, "A review of point cloud registration algorithms for mobile robotics," Foundations and Trends in Robotics, vol. 4, no. 1, pp. 1–104, 2015.
[6] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems. SIAM, 1995.
[7] F. Dellaert, "Factor graphs and GTSAM: A hands-on introduction," Georgia Institute of Technology, Tech. Rep., 2012.
[8] S. Yang, X. Zhu, X. Nian, L. Feng, X. Qu, and T. Ma, "A robust pose graph approach for city scale LiDAR mapping," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1175–1182.
[9] J. Jiao, "Machine learning assisted high-definition map creation," in 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), vol. 1. IEEE, 2018, pp. 367–373.
[10] L. Mi, H. Zhao, C. Nash, X. Jin, J. Gao, C. Sun, C. Schmid, N. Shavit, Y. Chai, and D. Anguelov, "HDMapGen: A hierarchical graph generative model of high definition maps," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 4227–4236.
[11] K.-Y. Chiu and S.-F. Lin, "Lane detection using color-based segmentation," in IEEE Proceedings. Intelligent Vehicles Symposium, 2005. IEEE, 2005, pp. 706–711.
[12] H. Loose, U. Franke, and C. Stiller, "Kalman particle filter for lane recognition on rural roads," in 2009 IEEE Intelligent Vehicles Symposium. IEEE, 2009, pp. 60–65.
[13] S. Zhou, Y. Jiang, J. Xi, J. Gong, G. Xiong, and H. Chen, "A novel lane detection based on geometrical model and Gabor filter," in 2010 IEEE Intelligent Vehicles Symposium. IEEE, 2010, pp. 59–64.
[14] J. Illingworth and J. Kittler, "A survey of the Hough transform," Computer Vision, Graphics, and Image Processing, vol. 44, no. 1, pp. 87–116, 1988.
[15] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, "Road scene segmentation from a single image," in ECCV, 2012.
[16] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ADE20K dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 633–641.
[17] A. Ess, T. Mueller, H. Grabner, and L. Van Gool, "Segmentation-based urban traffic scene understanding," in BMVC, vol. 1. Citeseer, 2009, p. 2.
[18] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in CVPR, 2016.
[19] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, "BDD100K: A diverse driving video database with scalable annotation tooling," arXiv preprint arXiv:1805.04687, vol. 2, no. 5, p. 6, 2018.
[20] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder, "The Mapillary Vistas dataset for semantic understanding of street scenes," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4990–4999.
[21] Z. Wang, W. Ren, and Q. Qiu, "LaneNet: Real-time lane detection networks for autonomous driving," arXiv preprint arXiv:1807.01726, 2018.
[22] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool, "Towards end-to-end lane detection: an instance segmentation approach," in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 286–291.
[23] X. Liu and Z. Deng, "Segmentation of drivable road using deep fully convolutional residual network with pyramid pooling," Cognitive Computation, vol. 10, no. 2, pp. 272–281, 2018.
[24] M. Bai, G. Mattyus, N. Homayounfar, S. Wang, S. K. Lakshmikanth, and R. Urtasun, "Deep multi-sensor lane detection," in IROS, 2018.
[25] N. Garnett, R. Cohen, T. Pe'er, R. Lahav, and D. Levi, "3D-LaneNet: End-to-end 3D multiple lane detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2921–2930.
[26] Y. Guo, G. Chen, P. Zhao, W. Zhang, J. Miao, J. Wang, and T. E. Choe, "Gen-LaneNet: A generalized and scalable approach for 3D lane detection," 2020.
[27] B. Pan, J. Sun, H. Y. T. Leung, A. Andonian, and B. Zhou, "Cross-view semantic segmentation for sensing surroundings," IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4867–4873, Jul 2020. [Online]. Available: http://dx.doi.org/10.1109/LRA.2020.3004325
[28] T. Roddick and R. Cipolla, "Predicting semantic map representations from images using pyramid occupancy networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11138–11147.
[29] J. Philion and S. Fidler, "Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D," 2020.
[30] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[31] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan, "End-to-end multi-view fusion for 3D object detection in LiDAR point clouds," in The Conference on Robot Learning (CoRL), 2019.
[32] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[33] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.
[34] B. De Brabandere, D. Neven, and L. Van Gool, "Semantic instance segmentation for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[35] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, "Microsoft COCO: Common objects in context," 2015.
[36] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," 2020.
[37] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," 2020.
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," 2015.
[39] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," 2019.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2017.
[42] L. Deng, M. Yang, H. Li, T. Li, B. Hu, and C. Wang, "Restricted deformable convolution-based road scene semantic segmentation using surround view cameras," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 10, pp. 4350–4362, Oct 2020. [Online]. Available: http://dx.doi.org/10.1109/TITS.2019.2939832
[43] T. Sämann, K. Amende, S. Milz, C. Witt, M. Simon, and J. Petzold, "Efficient semantic segmentation for visual bird's-eye view interpretation," 2018.
