HDMapNet: An Online HD Map Construction and Evaluation Framework

Qi Li1,∗, Yue Wang2,∗, Yilun Wang3 and Hang Zhao1

1 Qi Li and Hang Zhao are with Tsinghua University, Beijing, China. Hang Zhao is the corresponding author: [email protected]
2 Yue Wang is with Massachusetts Institute of Technology, MA, USA.
3 Yilun Wang is with Li Auto, Beijing, China.
∗ Equal contribution.
Abstract— Constructing HD semantic maps is a central component of autonomous driving. However, traditional pipelines require a vast amount of human effort and resources to annotate and maintain the semantics in the map, which limits its scalability. In this paper, we introduce the problem
of HD semantic map learning, which dynamically constructs the local semantics based on onboard sensor observations. Meanwhile, we introduce a semantic map learning method, dubbed HDMapNet. HDMapNet encodes image features from surrounding cameras and/or point clouds from LiDAR, and predicts vectorized map elements in the bird's-eye view. We benchmark HDMapNet on the nuScenes dataset and show that in all settings it performs better than baseline methods. Of note, our camera-LiDAR fusion-based HDMapNet outperforms existing methods by more than 50% in all metrics. In addition, we develop semantic-level and instance-level metrics to evaluate the map learning performance. Finally, we showcase that our method is capable of predicting a locally consistent map. By introducing the method and metrics, we invite the community to study this novel map learning problem.
I. INTRODUCTION

High-definition (HD) semantic maps are an essential module for autonomous driving. Traditional pipelines to construct such HD semantic maps involve capturing point clouds beforehand, building globally consistent maps using SLAM, and annotating semantics in the maps. This paradigm, though producing accurate HD maps and adopted by many autonomous driving companies, requires a vast amount of human effort.

As an alternative, we investigate scalable and affordable autonomous driving solutions, e.g., minimizing human effort in annotating and maintaining HD maps. To that end, we introduce a novel semantic map learning framework that makes use of on-board sensors and computation to estimate vectorized local semantic maps. Of note, our framework does not aim to replace global HD map reconstruction, but instead to provide a simple way to predict local semantic maps for real-time motion prediction and planning.

We propose a semantic map learning method named HDMapNet, which produces vectorized map elements from images of the surrounding cameras and/or from LiDAR point clouds. We study how to effectively transform perspective image features to bird's-eye-view features when depth is missing. We put forward a novel view transformer that consists of both neural feature transformation and geometric projection. Moreover, we investigate whether point clouds and camera images complement each other in this task. We find different map elements are not equally recognizable in a single modality. To take the best from both worlds, our best model combines point cloud representations with image representations. This model outperforms its single-modal counterparts by a significant margin in all categories. To demonstrate the practical value of our method, we generate a locally consistent map using our model in Figure 6; the map is immediately applicable to real-time motion planning. Finally, we propose comprehensive ways to evaluate the performance of map learning. These metrics include both semantic-level and instance-level evaluations, as map elements are typically represented as object instances in HD maps. On the public nuScenes dataset, HDMapNet improves over existing methods by 12.1 IoU on semantic segmentation and 13.1 mAP on instance detection.

To summarize, our contributions include the following:
• We propose a novel online framework to construct HD semantic maps from sensory observations, together with a method named HDMapNet.
• We come up with a novel feature projection module from perspective view to bird's-eye view. This module models 3D environments implicitly and considers the camera extrinsics explicitly.
• We develop comprehensive evaluation protocols and metrics to facilitate future research.

II. RELATED WORK

Semantic map construction. Most existing HD semantic maps are annotated either manually or semi-automatically on LiDAR point clouds of the environment, merged from LiDAR scans collected by survey vehicles with high-end GPS and IMU. SLAM algorithms are the most commonly used algorithms to fuse LiDAR scans into a highly accurate and consistent point cloud. First, pairwise alignment algorithms like ICP [1], NDT [2] and their variants [3] are employed to match LiDAR data at two nearby timestamps using semantic [4] or geometric information [5]. Second, estimating accurate poses of the ego vehicle is formulated as a non-linear least-squares problem [6] or a factor graph [7], which is critical to build a globally consistent map. Yang et al. [8] presented a method for reconstructing maps at city scale based on pose graph optimization under the constraint of pairwise alignment factors. To reduce the cost of manual annotation of semantic maps, [9], [10] proposed several machine learning techniques to extract static elements from fused LiDAR point clouds and cameras.
Fig. 1: In contrast to pre-annotating global semantic maps, we introduce a novel local map learning framework that makes use of on-board sensors to estimate local semantic maps. (Diagram: global map construction — LiDAR scans, GPS, IMU and wheel odometry fused by pairwise alignment into a globally consistent point cloud, then manually annotated into a global HD semantic map used with centimeter-level localization — contrasted with predicting a local HD map directly from on-board sensors.)
However, it is still laborious and costly to maintain an HD semantic map since it requires high precision and timely updates. In this paper, we argue that our proposed local semantic map learning task is a potentially more scalable solution for autonomous driving.

Perspective view lane detection. The traditional perspective-view-based lane detection pipeline involves local image feature extraction (e.g., color and directional filters [11], [12], [13]), line fitting (e.g., Hough transform [14]), image-to-world projection, etc. With the advances of deep-learning-based image segmentation and detection techniques [15], [16], [17], [18], researchers have explored more data-driven approaches. Deep models were developed for road segmentation [19], [20], lane detection [21], [22], drivable area analysis [23], etc. More recently, models were built to give 3D outputs rather than 2D. Bai et al. [24] incorporated LiDAR signals so that image pixels can be projected onto the ground. Garnett et al. [25] and Guo et al. [26] used synthetic lane datasets to perform supervised training on the prediction of camera height and pitch, so that the output lanes sit in a 3D ground plane. Beyond detecting lanes, our work outputs a consistent local semantic map around the vehicle from surround cameras or LiDARs.

Cross-view learning. Recently, some efforts have been made to study cross-view learning to facilitate robots' surrounding sensing capability. Pan [27] used MLPs to learn the relationship between perspective-view feature maps and bird's-eye-view feature maps. Roddick and Cipolla [28] applied 1D convolution on the image features along the horizontal axis to predict the bird's-eye view. Philion and Fidler [29] predicted the depth of monocular cameras and projected image features into the bird's-eye view using soft attention. Our work focuses on the crucial task of local semantic map construction, where we use cross-view sensing methods to generate map elements in a vectorized form. Moreover, our model can be easily fused with LiDAR input to further improve its accuracy.

III. SEMANTIC MAP LEARNING

We propose semantic map learning, a novel framework that produces local high-definition semantic maps. It takes sensor inputs like camera images and LiDAR point clouds, and outputs vectorized map elements, such as lane dividers, lane boundaries and pedestrian crossings. We use I and P to denote the images and point clouds, respectively. Optionally, the framework can be extended to include other sensor signals like radars. We define M as the map elements to predict.

A. HDMapNet

Our semantic map learning model, named HDMapNet, predicts map elements M from a single frame of I and P directly with neural networks. An overview is shown in Figure 2; four neural networks parameterize our model: a perspective view image encoder φI and a neural view transformer φV in the image branch, a pillar-based point cloud encoder φP, and a map element decoder φM. We denote our HDMapNet family as HDMapNet(Surr), HDMapNet(LiDAR), and HDMapNet(Fusion) if the model takes only surrounding images, only LiDAR, or both of them as input.
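To make the composition of these four modules concrete, the following PyTorch-style sketch wires them together. The layer choices, the resizing and per-camera pooling used as a stand-in for the view transformer, and fusing the two BEV feature maps by concatenation are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HDMapNetSketch(nn.Module):
    # Schematic composition of phi_I, phi_V, phi_P and phi_M; every module body is a placeholder.
    def __init__(self, k=64, h_bev=200, w_bev=400, n_classes=4):
        super().__init__()
        self.phi_I = nn.Conv2d(3, k, 3, padding=1)      # perspective-view image encoder (stand-in)
        self.phi_V = nn.Identity()                      # neural view transformer (stand-in; see Eq. (1) below)
        self.phi_P = nn.Conv2d(k, k, 3, padding=1)      # pillar-based point-cloud encoder (stand-in)
        self.phi_M = nn.Conv2d(2 * k, n_classes, 1)     # map-element decoder head (stand-in)
        self.h_bev, self.w_bev = h_bev, w_bev

    def forward(self, images, lidar_bev):
        # images: (B, N_cams, 3, H, W); lidar_bev: (B, K, H_bev, W_bev) pillar features.
        b, n = images.shape[:2]
        f_pv = self.phi_I(images.flatten(0, 1))                         # shared encoder over all cameras
        f_bev = self.phi_V(f_pv)                                        # perspective view -> BEV (placeholder)
        f_bev = F.adaptive_avg_pool2d(f_bev, (self.h_bev, self.w_bev))  # resize stand-in for the real projection
        f_bev = f_bev.view(b, n, -1, self.h_bev, self.w_bev).mean(1)    # aggregate per-camera features (assumption)
        fused = torch.cat([f_bev, self.phi_P(lidar_bev)], dim=1)        # HDMapNet(Fusion); single-sensor variants drop one branch
        return self.phi_M(fused)                                        # the real decoder also outputs embeddings and directions

model = HDMapNetSketch()
seg = model(torch.randn(1, 6, 3, 128, 352), torch.randn(1, 64, 200, 400))  # -> (1, 4, 200, 400)

A single-modality variant simply omits the missing branch before the decoder.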
1) Image encoder: Our image encoder has two components, namely a perspective view image encoder and a neural view transformer.

Perspective view image encoder. Our image branch takes perspective view inputs from $N_m$ surrounding cameras, covering the panorama of the scene. Each image $I_i$ is embedded by a shared neural network φI to get a perspective view feature map $F^{pv}_{I_i} \in \mathbb{R}^{H_{pv} \times W_{pv} \times K}$, where $H_{pv}$, $W_{pv}$, and $K$ are the height, width, and feature dimension, respectively.

Neural view transformer. As shown in Figure 3, we first transform image features from the perspective view to the camera coordinate system and then to the bird's-eye view. The relation of any two pixels between the perspective view and the camera coordinate system is modeled by a multi-layer perceptron φVi:

$F^{c}_{I_i}[h][w] = \phi^{hw}_{V_i}\left(F^{pv}_{I_i}[1][1], \ldots, F^{pv}_{I_i}[H_{pv}][W_{pv}]\right)$   (1)

where $\phi^{hw}_{V_i}$ models the relation between the feature vector at position $(h, w)$ in the camera coordinate system and every pixel on the perspective view feature map. We denote $H_c$ and $W_c$ as the top-down spatial dimensions of $F^{c}_{I}$. The bird's-eye-view (ego coordinate system) features $F^{bev}_{I_i} \in \mathbb{R}^{H_{bev} \times W_{bev} \times K}$ are obtained by transforming the features $F^{c}_{I_i}$ with the camera extrinsics.
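The following is a minimal sketch of Eq. (1): a per-camera MLP relates every perspective-view position to every top-down position in the camera frame. Sharing the weights across the K feature channels and the hidden width are assumptions made here for compactness, and the subsequent warp into the ego frame with the camera extrinsics is only indicated in a comment.

import torch
import torch.nn as nn

class NeuralViewTransformer(nn.Module):
    # Eq. (1): relate every perspective-view pixel to every camera-frame top-down cell via an MLP.
    def __init__(self, h_pv, w_pv, h_c, w_c):
        super().__init__()
        # Weights over (input position, output position) pairs, shared across the K channels (assumption).
        self.fc = nn.Sequential(
            nn.Linear(h_pv * w_pv, 512), nn.ReLU(inplace=True), nn.Linear(512, h_c * w_c)
        )
        self.h_c, self.w_c = h_c, w_c

    def forward(self, f_pv):
        # f_pv: (B, K, H_pv, W_pv) perspective-view features from the shared image encoder.
        b, k, h, w = f_pv.shape
        flat = f_pv.view(b, k, h * w)        # every output cell sees every input pixel, as in Eq. (1)
        f_cam = self.fc(flat)                # (B, K, H_c * W_c)
        return f_cam.view(b, k, self.h_c, self.w_c)
        # The camera-frame features would then be placed into the ego-centric BEV grid
        # using the camera extrinsics (geometric projection), which is not shown here.

vt = NeuralViewTransformer(h_pv=16, w_pv=44, h_c=32, w_c=64)
f_cam = vt(torch.randn(2, 64, 16, 44))       # -> (2, 64, 32, 64)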
Fig. 2: Model overview. HDMapNet works with either or both of images and point clouds, outputs semantic segmentation, instance embeddings and directions, and finally produces a vectorized local semantic map. Top left: image branch. Bottom left: point cloud branch. Right: HD semantic map.
TABLE IV: Semantic segmentation IoU with short-term temporal fusion.
# of frames    1      2      4
IoU            32.9   35.8   36.4
Fig. 6: Long-term temporal accumulation by fusing occupancy probabilities over multiple frames.

Second, HDMapNet(LiDAR) performs better than HDMapNet(Surr) in boundary but worse in divider and pedestrian crossing. This indicates different categories are not equally recognizable in one modality. Third, our fusion model with both camera images and LiDAR point clouds achieves the best performance. It improves over the baselines and our camera-only method by 50% relatively.

Another interesting phenomenon is that various models behave differently in terms of the CD. For example, VPN has the lowest CD_P in all categories, while it underperforms its counterparts on CD_L and has the worst overall CD. Instead, our HDMapNet(Surr) balances both CD_P and CD_L, achieving the best CD among all camera-only-based methods. This finding indicates that CD is complementary to IoU, as it shows the precision and recall aspects of models. This helps us understand the behaviors of different models from another perspective.
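As a reference for how CD_P and CD_L can be computed, the NumPy sketch below evaluates the two directed Chamfer distances between predicted and ground-truth curve points. Reading CD_P as the prediction-to-label direction and CD_L as the label-to-prediction direction is our assumption based on the precision/recall interpretation above; the exact matching and aggregation of the paper's metric may differ.

import numpy as np

def directed_chamfer(src: np.ndarray, dst: np.ndarray) -> float:
    # Mean distance from each point in `src` (N, 2) to its nearest neighbor in `dst` (M, 2).
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)  # (N, M) pairwise distances
    return float(d.min(axis=1).mean())

# Hypothetical predicted / ground-truth polyline points in BEV coordinates (meters).
pred = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.2]])
gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])

cd_p = directed_chamfer(pred, gt)   # prediction -> label direction (precision-like)
cd_l = directed_chamfer(gt, pred)   # label -> prediction direction (recall-like)
cd = cd_p + cd_l                    # symmetric Chamfer distance (one possible aggregation)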
Instance map detection. In Figure 2 (Instance detection branch), we show the visualization of embeddings using principal component analysis (PCA). Different lanes are assigned different colors even when they are close to each other or have intersections. This confirms our model learns instance-level information and can predict instance labels accurately. In Figure 2 (Direction classification branch), we show the direction mask predicted by our direction branch. The direction is consistent and smooth. We show the vectorized curve produced after post processing in Figure 4. In Table II, we present the quantitative results of instance map detection. HDMapNet(Surr) already outperforms baselines, while HDMapNet(Fusion) is significantly better than all counterparts, e.g., it improves over IPM by 55.4%.
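To illustrate how per-pixel instance embeddings can be grouped into separate lanes for visualization and vectorization, the sketch below clusters the foreground embeddings with DBSCAN and projects them onto three PCA components (usable as colors). The clustering algorithm and thresholds are placeholders, not necessarily the post-processing used in the paper.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

def cluster_instances(embedding, fg_mask, eps=0.5, min_samples=20):
    # embedding: (H, W, D) per-pixel embeddings; fg_mask: (H, W) boolean foreground mask.
    ys, xs = np.nonzero(fg_mask)
    feats = embedding[ys, xs]                                              # (N, D) foreground embeddings
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)   # -1 marks noise pixels
    inst_map = np.full(fg_mask.shape, -1, dtype=np.int32)
    inst_map[ys, xs] = labels                                              # per-pixel instance ids
    colors = PCA(n_components=3).fit_transform(feats)                      # 3-D projection for visualization
    return inst_map, labels, colors

# Example with random inputs (stand-ins for the decoder outputs).
emb = np.random.randn(200, 400, 16).astype(np.float32)
mask = np.zeros((200, 400), dtype=bool); mask[90:110, :] = True
inst_map, labels, colors = cluster_instances(emb, mask)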
Sensor fusion. In this section, we further analyze the effect of sensor fusion for constructing HD semantic maps. As shown in Table I, for divider and pedestrian crossing, HDMapNet(Surr) outperforms HDMapNet(LiDAR), while for lane boundary, HDMapNet(LiDAR) works better. We hypothesize this is because there are elevation changes near the lane boundary, making it easy to detect in LiDAR point clouds. On the other hand, the color contrast of road dividers and pedestrian crossings is helpful information, making these two categories more recognizable in images; the visualizations in Figure 4 also confirm this. The strongest performance is achieved when combining LiDAR and cameras; the combined model outperforms both single-sensor models by a large margin. This suggests the two sensors provide complementary information to each other.

Bad weather conditions. Here we assess the robustness of our model under extreme weather conditions. As shown in Figure 5, our model can generate complete lanes even when the lighting condition is bad, or when rain obscures the sight. We speculate that the model can predict the shape of the lane based on partial observations when the roads are not completely visible. Although there is a performance drop in extreme weather conditions, the overall performance is still reasonable (Table III).

Temporal Fusion. Here we experiment on temporal fusion strategies. We first conduct short-term temporal fusion by pasting feature maps of previous frames into the current frame according to the ego poses. The feature maps are fused by max pooling and then fed into the decoder. As shown in Table IV, fusing multiple frames can improve the IoU of the semantics. We further experiment on long-term temporal accumulation by fusing segmentation probabilities. As shown in Figure 6, our method produces consistent semantic maps with a larger field of view when fusing multiple frames.
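The short-term fusion step described above can be sketched as follows: BEV feature maps of previous frames are warped into the current ego frame using the relative ego pose and fused with the current features by element-wise max. The pose-to-pixel conversion, grid resolution, and grid_sample conventions below are illustrative assumptions rather than the exact implementation.

import math
import torch
import torch.nn.functional as F

def warp_to_current(feat_prev, rel_yaw, rel_t_xy, meters_per_cell=0.15):
    # Warp a previous BEV feature map (B, C, H, W) into the current ego frame.
    # rel_yaw (rad) and rel_t_xy (m) give the previous pose relative to the current one (assumed convention).
    b, _, h, w = feat_prev.shape
    cos, sin = math.cos(rel_yaw), math.sin(rel_yaw)
    tx = rel_t_xy[0] / (0.5 * w * meters_per_cell)   # affine_grid works in normalized [-1, 1] coordinates
    ty = rel_t_xy[1] / (0.5 * h * meters_per_cell)
    theta = torch.tensor([[cos, -sin, tx], [sin, cos, ty]], dtype=feat_prev.dtype)
    theta = theta.unsqueeze(0).repeat(b, 1, 1)
    grid = F.affine_grid(theta, list(feat_prev.shape), align_corners=False)
    return F.grid_sample(feat_prev, grid, align_corners=False, padding_mode="zeros")

def fuse_short_term(curr_feat, prev_feats, rel_poses):
    # Element-wise max over the current features and the pose-aligned previous ones.
    aligned = [curr_feat] + [warp_to_current(f, yaw, txy) for f, (yaw, txy) in zip(prev_feats, rel_poses)]
    return torch.stack(aligned, dim=0).max(dim=0).values

curr = torch.randn(1, 64, 200, 400)
prev = [torch.randn(1, 64, 200, 400)]
fused = fuse_short_term(curr, prev, rel_poses=[(0.02, (1.5, 0.0))])  # ~1.5 m of ego motion, slight turn

Long-term accumulation works analogously, but on segmentation (occupancy) probabilities rather than intermediate features.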
V. CONCLUSION

HDMapNet predicts HD semantic maps directly from camera images and/or LiDAR point clouds. The local semantic map learning framework could be a more scalable approach than the global map construction and annotation pipeline, which requires a significant amount of human effort. Even though our baseline method of semantic map learning does not produce map elements as accurate as offline-annotated maps, it gives system developers another possible choice in the trade-off between scalability and accuracy.

REFERENCES

[1] P. J. Besl and N. D. McKay, "Method for registration of 3-d shapes," in Sensor Fusion IV: Control Paradigms and Data Structures, vol. 1611. International Society for Optics and Photonics, 1992, pp. 586–606.
[2] P. Biber and W. Straßer, "The normal distributions transform: A new approach to laser scan matching," in Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003), vol. 3. IEEE, 2003, pp. 2743–2748.
[3] A. Segal, D. Haehnel, and S. Thrun, "Generalized-ICP," in Robotics: Science and Systems, vol. 2, no. 4. Seattle, WA, 2009, p. 435.
[4] F. Yu, J. Xiao, and T. Funkhouser, "Semantic alignment of lidar data at city scale," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1722–1731.
[5] F. Pomerleau, F. Colas, and R. Siegwart, "A review of point cloud registration algorithms for mobile robotics," Foundations and Trends in Robotics, vol. 4, no. 1, pp. 1–104, 2015.
[6] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems. SIAM, 1995.
[7] F. Dellaert, "Factor graphs and GTSAM: A hands-on introduction," Georgia Institute of Technology, Tech. Rep., 2012.
[8] S. Yang, X. Zhu, X. Nian, L. Feng, X. Qu, and T. Ma, "A robust pose graph approach for city scale lidar mapping," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1175–1182.
[9] J. Jiao, "Machine learning assisted high-definition map creation," in 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), vol. 1. IEEE, 2018, pp. 367–373.
[10] L. Mi, H. Zhao, C. Nash, X. Jin, J. Gao, C. Sun, C. Schmid, N. Shavit, Y. Chai, and D. Anguelov, "HDMapGen: A hierarchical graph generative model of high definition maps," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 4227–4236.
[11] K.-Y. Chiu and S.-F. Lin, "Lane detection using color-based segmentation," in IEEE Proceedings. Intelligent Vehicles Symposium, 2005. IEEE, 2005, pp. 706–711.
[12] H. Loose, U. Franke, and C. Stiller, "Kalman particle filter for lane recognition on rural roads," in 2009 IEEE Intelligent Vehicles Symposium. IEEE, 2009, pp. 60–65.
[13] S. Zhou, Y. Jiang, J. Xi, J. Gong, G. Xiong, and H. Chen, "A novel lane detection based on geometrical model and Gabor filter," in 2010 IEEE Intelligent Vehicles Symposium. IEEE, 2010, pp. 59–64.
[14] J. Illingworth and J. Kittler, "A survey of the Hough transform," Computer Vision, Graphics, and Image Processing, vol. 44, no. 1, pp. 87–116, 1988.
[15] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, "Road scene segmentation from a single image," in ECCV, 2012.
[16] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ADE20K dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 633–641.
[17] A. Ess, T. Mueller, H. Grabner, and L. Van Gool, "Segmentation-based urban traffic scene understanding," in BMVC, vol. 1. Citeseer, 2009, p. 2.
[18] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in CVPR, 2016.
[19] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, "BDD100K: A diverse driving video database with scalable annotation tooling," arXiv preprint arXiv:1805.04687, 2018.
[20] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder, "The Mapillary Vistas dataset for semantic understanding of street scenes," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4990–4999.
[21] Z. Wang, W. Ren, and Q. Qiu, "LaneNet: Real-time lane detection networks for autonomous driving," arXiv preprint arXiv:1807.01726, 2018.
[22] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool, "Towards end-to-end lane detection: an instance segmentation approach," in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 286–291.
[23] X. Liu and Z. Deng, "Segmentation of drivable road using deep fully convolutional residual network with pyramid pooling," Cognitive Computation, vol. 10, no. 2, pp. 272–281, 2018.
[24] M. Bai, G. Mattyus, N. Homayounfar, S. Wang, S. K. Lakshmikanth, and R. Urtasun, "Deep multi-sensor lane detection," in IROS, 2018.
[25] N. Garnett, R. Cohen, T. Pe'er, R. Lahav, and D. Levi, "3D-LaneNet: End-to-end 3D multiple lane detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2921–2930.
[26] Y. Guo, G. Chen, P. Zhao, W. Zhang, J. Miao, J. Wang, and T. E. Choe, "Gen-LaneNet: A generalized and scalable approach for 3D lane detection," 2020.
[27] B. Pan, J. Sun, H. Y. T. Leung, A. Andonian, and B. Zhou, "Cross-view semantic segmentation for sensing surroundings," IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4867–4873, Jul. 2020. [Online]. Available: http://dx.doi.org/10.1109/LRA.2020.3004325
[28] T. Roddick and R. Cipolla, "Predicting semantic map representations from images using pyramid occupancy networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11138–11147.
[29] J. Philion and S. Fidler, "Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D," 2020.
[30] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[31] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan, "End-to-end multi-view fusion for 3D object detection in LiDAR point clouds," in The Conference on Robot Learning (CoRL), 2019.
[32] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[33] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.
[34] B. De Brabandere, D. Neven, and L. Van Gool, "Semantic instance segmentation for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[35] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, "Microsoft COCO: Common objects in context," 2015.
[36] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," 2020.
[37] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," 2020.
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," 2015.
[39] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," 2019.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2017.
[42] L. Deng, M. Yang, H. Li, T. Li, B. Hu, and C. Wang, "Restricted deformable convolution-based road scene semantic segmentation using surround view cameras," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 10, pp. 4350–4362, Oct. 2020. [Online]. Available: http://dx.doi.org/10.1109/TITS.2019.2939832
[43] T. Sämann, K. Amende, S. Milz, C. Witt, M. Simon, and J. Petzold, "Efficient semantic segmentation for visual bird's-eye view interpretation," 2018.