HDMapNet: An Online HD Map Construction and Evaluation Framework

Qi Li1,∗, Yue Wang2,∗, Yilun Wang3 and Hang Zhao1

1 Qi Li and Hang Zhao are with Tsinghua University, Beijing, China. Hang Zhao is the corresponding author: [email protected]
2 Yue Wang is with Massachusetts Institute of Technology, MA, USA.
3 Yilun Wang is with Li Auto, Beijing, China.
∗ Equal contribution.
Abstract— Constructing HD semantic maps is a central component of autonomous driving. However, traditional pipelines require a vast amount of human effort and resources to annotate and maintain the semantics in the map, which limits its scalability. In this paper, we introduce the problem
of HD semantic map learning, which dynamically constructs the local semantics based on onboard sensor observations. Meanwhile, we introduce a semantic map learning method, dubbed HDMapNet. HDMapNet encodes image features from surrounding cameras and/or point clouds from LiDAR, and predicts vectorized map elements in the bird's-eye view. We benchmark HDMapNet on the nuScenes dataset and show that in all settings it performs better than baseline methods. Of note, our camera-LiDAR fusion-based HDMapNet outperforms existing methods by more than 50% in all metrics. In addition, we develop semantic-level and instance-level metrics to evaluate the map learning performance. Finally, we showcase that our method is capable of predicting a locally consistent map. By introducing the method and metrics, we invite the community to study this novel map learning problem.
I. INTRODUCTION

High-definition (HD) semantic maps are an essential module for autonomous driving. Traditional pipelines to construct such HD semantic maps involve capturing point clouds beforehand, building globally consistent maps using SLAM, and annotating semantics in the maps. This paradigm, though producing accurate HD maps and adopted by many autonomous driving companies, requires a vast amount of human effort.

As an alternative, we investigate scalable and affordable autonomous driving solutions, e.g., minimizing human effort in annotating and maintaining HD maps. To that end, we introduce a novel semantic map learning framework that makes use of on-board sensors and computation to estimate vectorized local semantic maps. Of note, our framework does not aim to replace global HD map reconstruction, but instead to provide a simple way to predict local semantic maps for real-time motion prediction and planning.

We propose a semantic map learning method named HDMapNet, which produces vectorized map elements from images of the surrounding cameras and/or from LiDAR point clouds. We study how to effectively transform perspective image features to bird's-eye-view features when depth is missing. We put forward a novel view transformer that consists of both neural feature transformation and geometric projection. Moreover, we investigate whether point clouds and camera images complement each other in this task. We find different map elements are not equally recognizable in a single modality. To take the best from both worlds, our best model combines point cloud representations with image representations. This model outperforms its single-modal counterparts by a significant margin in all categories. To demonstrate the practical value of our method, we generate a locally consistent map using our model in Figure 6; the map is immediately applicable to real-time motion planning. Finally, we propose comprehensive ways to evaluate the performance of map learning. These metrics include both semantic-level and instance-level evaluations, as map elements are typically represented as object instances in HD maps. On the public nuScenes dataset, HDMapNet improves over existing methods by 12.1 IoU on semantic segmentation and 13.1 mAP on instance detection.

To summarize, our contributions include the following:
• We propose a novel online framework to construct HD semantic maps from sensory observations, together with a method named HDMapNet.
• We come up with a novel feature projection module from perspective view to bird's-eye view. This module models 3D environments implicitly and considers the camera extrinsics explicitly.
• We develop comprehensive evaluation protocols and metrics to facilitate future research.

II. RELATED WORK

Semantic map construction. Most existing HD semantic maps are annotated either manually or semi-automatically on LiDAR point clouds of the environment, merged from LiDAR scans collected by survey vehicles with high-end GPS and IMU. SLAM algorithms are the most commonly used algorithms to fuse LiDAR scans into a highly accurate and consistent point cloud. First, pairwise alignment algorithms like ICP [1], NDT [2] and their variants [3] are employed to match LiDAR data at two nearby timestamps using semantic [4] or geometric information [5]. Second, estimating accurate poses of the ego vehicle is formulated as a non-linear least-squares problem [6] or a factor graph [7], which is critical to build a globally consistent map. Yang et al. [8] presented a method for reconstructing maps at city scale based on pose graph optimization under the constraint of pairwise alignment factors. To reduce the cost of manual annotation of semantic maps, [9], [10] proposed several machine learning techniques to extract static elements from fused LiDAR point clouds and cameras.
Fig. 1: In contrast to pre-annotating global semantic maps, we introduce a novel local map learning framework that makes use of on-board sensors to estimate local semantic maps. (Diagram: global map construction — LiDAR scans, GPS, IMU and wheel odometry fused by pairwise alignment into a globally consistent point cloud, then manually annotated into a global HD semantic map used with centimeter-level localization — contrasted with predicting a local HD map directly from on-board sensors.)
However, it is still laborious and costly to maintain an HD semantic map since it requires high precision and timely updates. In this paper, we argue that our proposed local semantic map learning task is a potentially more scalable solution for autonomous driving.

Perspective view lane detection. The traditional perspective-view-based lane detection pipeline involves local image feature extraction (e.g., color and directional filters [11], [12], [13]), line fitting (e.g., Hough transform [14]), image-to-world projection, etc. With the advances of deep-learning-based image segmentation and detection techniques [15], [16], [17], [18], researchers have explored more data-driven approaches. Deep models were developed for road segmentation [19], [20], lane detection [21], [22], drivable area analysis [23], etc. More recently, models were built to give 3D outputs rather than 2D. Bai et al. [24] incorporated LiDAR signals so that image pixels can be projected onto the ground. Garnett et al. [25] and Guo et al. [26] used synthetic lane datasets to perform supervised training on the prediction of camera height and pitch, so that the output lanes sit in a 3D ground plane. Beyond detecting lanes, our work outputs a consistent local semantic map around the vehicle from surround cameras or LiDARs.

Cross-view learning. Recently, some efforts have been made to study cross-view learning to facilitate robots' surrounding sensing capability. Pan [27] used MLPs to learn the relationship between perspective-view feature maps and bird's-eye-view feature maps. Roddick and Cipolla [28] applied 1D convolution on the image features along the horizontal axis to predict the bird's-eye view. Philion and Fidler [29] predicted the depth of monocular cameras and projected image features into the bird's-eye view using soft attention. Our work focuses on the crucial task of local semantic map construction, where we use cross-view sensing methods to generate map elements in a vectorized form. Moreover, our model can be easily fused with LiDAR input to further improve its accuracy.

III. SEMANTIC MAP LEARNING

We propose semantic map learning, a novel framework that produces local high-definition semantic maps. It takes sensor inputs like camera images and LiDAR point clouds, and outputs vectorized map elements, such as lane dividers, lane boundaries and pedestrian crossings. We use I and P to denote the images and point clouds, respectively. Optionally, the framework can be extended to include other sensor signals like radars. We define M as the map elements to predict.

A. HDMapNet

Our semantic map learning model, named HDMapNet, predicts map elements M from a single frame of I and P directly with neural networks. An overview is shown in Figure 2; four neural networks parameterize our model: a perspective view image encoder φI and a neural view transformer φV in the image branch, a pillar-based point cloud encoder φP, and a map element decoder φM. We denote our HDMapNet family as HDMapNet(Surr), HDMapNet(LiDAR), and HDMapNet(Fusion) if the model takes only surrounding images, only LiDAR, or both of them as input.
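To make the composition of these four modules concrete, the following PyTorch-style sketch wires them together. The layer choices, the resizing and per-camera pooling used as a stand-in for the view transformer, and fusing the two BEV feature maps by concatenation are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HDMapNetSketch(nn.Module):
    # Schematic composition of phi_I, phi_V, phi_P and phi_M; every module body is a placeholder.
    def __init__(self, k=64, h_bev=200, w_bev=400, n_classes=4):
        super().__init__()
        self.phi_I = nn.Conv2d(3, k, 3, padding=1)      # perspective-view image encoder (stand-in)
        self.phi_V = nn.Identity()                      # neural view transformer (stand-in; see Eq. (1) below)
        self.phi_P = nn.Conv2d(k, k, 3, padding=1)      # pillar-based point-cloud encoder (stand-in)
        self.phi_M = nn.Conv2d(2 * k, n_classes, 1)     # map-element decoder head (stand-in)
        self.h_bev, self.w_bev = h_bev, w_bev

    def forward(self, images, lidar_bev):
        # images: (B, N_cams, 3, H, W); lidar_bev: (B, K, H_bev, W_bev) pillar features.
        b, n = images.shape[:2]
        f_pv = self.phi_I(images.flatten(0, 1))                         # shared encoder over all cameras
        f_bev = self.phi_V(f_pv)                                        # perspective view -> BEV (placeholder)
        f_bev = F.adaptive_avg_pool2d(f_bev, (self.h_bev, self.w_bev))  # resize stand-in for the real projection
        f_bev = f_bev.view(b, n, -1, self.h_bev, self.w_bev).mean(1)    # aggregate per-camera features (assumption)
        fused = torch.cat([f_bev, self.phi_P(lidar_bev)], dim=1)        # HDMapNet(Fusion); single-sensor variants drop one branch
        return self.phi_M(fused)                                        # the real decoder also outputs embeddings and directions

model = HDMapNetSketch()
seg = model(torch.randn(1, 6, 3, 128, 352), torch.randn(1, 64, 200, 400))  # -> (1, 4, 200, 400)

A single-modality variant simply omits the missing branch before the decoder.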
1) Image encoder: Our image encoder has two components, namely a perspective view image encoder and a neural view transformer.

Perspective view image encoder. Our image branch takes perspective view inputs from $N_m$ surrounding cameras, covering the panorama of the scene. Each image $I_i$ is embedded by a shared neural network φI to get a perspective view feature map $F^{pv}_{I_i} \in \mathbb{R}^{H_{pv} \times W_{pv} \times K}$, where $H_{pv}$, $W_{pv}$, and $K$ are the height, width, and feature dimension, respectively.

Neural view transformer. As shown in Figure 3, we first transform image features from the perspective view to the camera coordinate system and then to the bird's-eye view. The relation of any two pixels between the perspective view and the camera coordinate system is modeled by a multi-layer perceptron φVi:

$F^{c}_{I_i}[h][w] = \phi^{hw}_{V_i}\left(F^{pv}_{I_i}[1][1], \ldots, F^{pv}_{I_i}[H_{pv}][W_{pv}]\right)$   (1)

where $\phi^{hw}_{V_i}$ models the relation between the feature vector at position $(h, w)$ in the camera coordinate system and every pixel on the perspective view feature map. We denote $H_c$ and $W_c$ as the top-down spatial dimensions of $F^{c}_{I}$. The bird's-eye-view (ego coordinate system) features $F^{bev}_{I_i} \in \mathbb{R}^{H_{bev} \times W_{bev} \times K}$ are obtained by transforming the features $F^{c}_{I_i}$ with the camera extrinsics.
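The following is a minimal sketch of Eq. (1): a per-camera MLP relates every perspective-view position to every top-down position in the camera frame. Sharing the weights across the K feature channels and the hidden width are assumptions made here for compactness, and the subsequent warp into the ego frame with the camera extrinsics is only indicated in a comment.

import torch
import torch.nn as nn

class NeuralViewTransformer(nn.Module):
    # Eq. (1): relate every perspective-view pixel to every camera-frame top-down cell via an MLP.
    def __init__(self, h_pv, w_pv, h_c, w_c):
        super().__init__()
        # Weights over (input position, output position) pairs, shared across the K channels (assumption).
        self.fc = nn.Sequential(
            nn.Linear(h_pv * w_pv, 512), nn.ReLU(inplace=True), nn.Linear(512, h_c * w_c)
        )
        self.h_c, self.w_c = h_c, w_c

    def forward(self, f_pv):
        # f_pv: (B, K, H_pv, W_pv) perspective-view features from the shared image encoder.
        b, k, h, w = f_pv.shape
        flat = f_pv.view(b, k, h * w)        # every output cell sees every input pixel, as in Eq. (1)
        f_cam = self.fc(flat)                # (B, K, H_c * W_c)
        return f_cam.view(b, k, self.h_c, self.w_c)
        # The camera-frame features would then be placed into the ego-centric BEV grid
        # using the camera extrinsics (geometric projection), which is not shown here.

vt = NeuralViewTransformer(h_pv=16, w_pv=44, h_c=32, w_c=64)
f_cam = vt(torch.randn(2, 64, 16, 44))       # -> (2, 64, 32, 64)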
Fig. 2: Model overview. HDMapNet works with either or both of images and point clouds, outputs semantic segmentation, instance embeddings and directions, and finally produces a vectorized local semantic map. Top left: image branch. Bottom left: point cloud branch. Right: HD semantic map.
TABLE IV: Semantic segmentation IoU with short-term temporal fusion.
# of frames    1      2      4
IoU            32.9   35.8   36.4
Fig. 6: Long-term temporal accumulation by fusing occupancy probabilities over multiple frames.

Second, HDMapNet(LiDAR) performs better than HDMapNet(Surr) in boundary but worse in divider and pedestrian crossing. This indicates different categories are not equally recognizable in one modality. Third, our fusion model with both camera images and LiDAR point clouds achieves the best performance. It improves over the baselines and our camera-only method by 50% relatively.

Another interesting phenomenon is that various models behave differently in terms of the CD. For example, VPN has the lowest CD_P in all categories, while it underperforms its counterparts on CD_L and has the worst overall CD. Instead, our HDMapNet(Surr) balances both CD_P and CD_L, achieving the best CD among all camera-only-based methods. This finding indicates that CD is complementary to IoU, as it shows the precision and recall aspects of models. This helps us understand the behaviors of different models from another perspective.
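As a reference for how CD_P and CD_L can be computed, the NumPy sketch below evaluates the two directed Chamfer distances between predicted and ground-truth curve points. Reading CD_P as the prediction-to-label direction and CD_L as the label-to-prediction direction is our assumption based on the precision/recall interpretation above; the exact matching and aggregation of the paper's metric may differ.

import numpy as np

def directed_chamfer(src: np.ndarray, dst: np.ndarray) -> float:
    # Mean distance from each point in `src` (N, 2) to its nearest neighbor in `dst` (M, 2).
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)  # (N, M) pairwise distances
    return float(d.min(axis=1).mean())

# Hypothetical predicted / ground-truth polyline points in BEV coordinates (meters).
pred = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.2]])
gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])

cd_p = directed_chamfer(pred, gt)   # prediction -> label direction (precision-like)
cd_l = directed_chamfer(gt, pred)   # label -> prediction direction (recall-like)
cd = cd_p + cd_l                    # symmetric Chamfer distance (one possible aggregation)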
Instance map detection. In Figure 2 (Instance detection branch), we show the visualization of embeddings using principal component analysis (PCA). Different lanes are assigned different colors even when they are close to each other or have intersections. This confirms our model learns instance-level information and can predict instance labels accurately. In Figure 2 (Direction classification branch), we show the direction mask predicted by our direction branch. The direction is consistent and smooth. We show the vectorized curve produced after post processing in Figure 4. In Table II, we present the quantitative results of instance map detection. HDMapNet(Surr) already outperforms baselines, while HDMapNet(Fusion) is significantly better than all counterparts, e.g., it improves over IPM by 55.4%.
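To illustrate how per-pixel instance embeddings can be grouped into separate lanes for visualization and vectorization, the sketch below clusters the foreground embeddings with DBSCAN and projects them onto three PCA components (usable as colors). The clustering algorithm and thresholds are placeholders, not necessarily the post-processing used in the paper.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

def cluster_instances(embedding, fg_mask, eps=0.5, min_samples=20):
    # embedding: (H, W, D) per-pixel embeddings; fg_mask: (H, W) boolean foreground mask.
    ys, xs = np.nonzero(fg_mask)
    feats = embedding[ys, xs]                                              # (N, D) foreground embeddings
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)   # -1 marks noise pixels
    inst_map = np.full(fg_mask.shape, -1, dtype=np.int32)
    inst_map[ys, xs] = labels                                              # per-pixel instance ids
    colors = PCA(n_components=3).fit_transform(feats)                      # 3-D projection for visualization
    return inst_map, labels, colors

# Example with random inputs (stand-ins for the decoder outputs).
emb = np.random.randn(200, 400, 16).astype(np.float32)
mask = np.zeros((200, 400), dtype=bool); mask[90:110, :] = True
inst_map, labels, colors = cluster_instances(emb, mask)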
Sensor fusion. In this section, we further analyze the effect of sensor fusion for constructing HD semantic maps. As shown in Table I, for divider and pedestrian crossing, HDMapNet(Surr) outperforms HDMapNet(LiDAR), while for lane boundary, HDMapNet(LiDAR) works better. We hypothesize this is because there are elevation changes near the lane boundary, making it easy to detect in LiDAR point clouds. On the other hand, the color contrast of road dividers and pedestrian crossings is helpful information, making these two categories more recognizable in images; the visualizations in Figure 4 also confirm this. The strongest performance is achieved when combining LiDAR and cameras; the combined model outperforms both single-sensor models by a large margin. This suggests the two sensors provide complementary information to each other.

Bad weather conditions. Here we assess the robustness of our model under extreme weather conditions. As shown in Figure 5, our model can generate complete lanes even when the lighting condition is bad, or when rain obscures the sight. We speculate that the model can predict the shape of the lane based on partial observations when the roads are not completely visible. Although there is a performance drop in extreme weather conditions, the overall performance is still reasonable (Table III).

Temporal Fusion. Here we experiment on temporal fusion strategies. We first conduct short-term temporal fusion by pasting feature maps of previous frames into the current frame according to the ego poses. The feature maps are fused by max pooling and then fed into the decoder. As shown in Table IV, fusing multiple frames can improve the IoU of the semantics. We further experiment on long-term temporal accumulation by fusing segmentation probabilities. As shown in Figure 6, our method produces consistent semantic maps with a larger field of view when fusing multiple frames.
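The short-term fusion step described above can be sketched as follows: BEV feature maps of previous frames are warped into the current ego frame using the relative ego pose and fused with the current features by element-wise max. The pose-to-pixel conversion, grid resolution, and grid_sample conventions below are illustrative assumptions rather than the exact implementation.

import math
import torch
import torch.nn.functional as F

def warp_to_current(feat_prev, rel_yaw, rel_t_xy, meters_per_cell=0.15):
    # Warp a previous BEV feature map (B, C, H, W) into the current ego frame.
    # rel_yaw (rad) and rel_t_xy (m) give the previous pose relative to the current one (assumed convention).
    b, _, h, w = feat_prev.shape
    cos, sin = math.cos(rel_yaw), math.sin(rel_yaw)
    tx = rel_t_xy[0] / (0.5 * w * meters_per_cell)   # affine_grid works in normalized [-1, 1] coordinates
    ty = rel_t_xy[1] / (0.5 * h * meters_per_cell)
    theta = torch.tensor([[cos, -sin, tx], [sin, cos, ty]], dtype=feat_prev.dtype)
    theta = theta.unsqueeze(0).repeat(b, 1, 1)
    grid = F.affine_grid(theta, list(feat_prev.shape), align_corners=False)
    return F.grid_sample(feat_prev, grid, align_corners=False, padding_mode="zeros")

def fuse_short_term(curr_feat, prev_feats, rel_poses):
    # Element-wise max over the current features and the pose-aligned previous ones.
    aligned = [curr_feat] + [warp_to_current(f, yaw, txy) for f, (yaw, txy) in zip(prev_feats, rel_poses)]
    return torch.stack(aligned, dim=0).max(dim=0).values

curr = torch.randn(1, 64, 200, 400)
prev = [torch.randn(1, 64, 200, 400)]
fused = fuse_short_term(curr, prev, rel_poses=[(0.02, (1.5, 0.0))])  # ~1.5 m of ego motion, slight turn

Long-term accumulation works analogously, but on segmentation (occupancy) probabilities rather than intermediate features.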
V. CONCLUSION

HDMapNet predicts HD semantic maps directly from camera images and/or LiDAR point clouds. The local semantic map learning framework could be a more scalable approach than the global map construction and annotation pipeline, which requires a significant amount of human effort. Even though our baseline method of semantic map learning does not produce map elements as accurate as offline-annotated maps, it gives system developers another possible choice in the trade-off between scalability and accuracy.

REFERENCES

[1] P. J. Besl and N. D. McKay, "Method for registration of 3-d shapes," in Sensor Fusion IV: Control Paradigms and Data Structures, vol. 1611. International Society for Optics and Photonics, 1992, pp. 586–606.
[2] P. Biber and W. Straßer, "The normal distributions transform: A new approach to laser scan matching," in Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003), vol. 3. IEEE, 2003, pp. 2743–2748.
[3] A. Segal, D. Haehnel, and S. Thrun, "Generalized-ICP," in Robotics: Science and Systems, vol. 2, no. 4. Seattle, WA, 2009, p. 435.
[4] F. Yu, J. Xiao, and T. Funkhouser, "Semantic alignment of lidar data at city scale," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1722–1731.
[5] F. Pomerleau, F. Colas, and R. Siegwart, "A review of point cloud registration algorithms for mobile robotics," Foundations and Trends in Robotics, vol. 4, no. 1, pp. 1–104, 2015.
[6] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems. SIAM, 1995.
[7] F. Dellaert, "Factor graphs and GTSAM: A hands-on introduction," Georgia Institute of Technology, Tech. Rep., 2012.
[8] S. Yang, X. Zhu, X. Nian, L. Feng, X. Qu, and T. Ma, "A robust pose graph approach for city scale lidar mapping," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1175–1182.
[9] J. Jiao, "Machine learning assisted high-definition map creation," in 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), vol. 1. IEEE, 2018, pp. 367–373.
[10] L. Mi, H. Zhao, C. Nash, X. Jin, J. Gao, C. Sun, C. Schmid, N. Shavit, Y. Chai, and D. Anguelov, "HDMapGen: A hierarchical graph generative model of high definition maps," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 4227–4236.
[11] K.-Y. Chiu and S.-F. Lin, "Lane detection using color-based segmentation," in IEEE Proceedings. Intelligent Vehicles Symposium, 2005. IEEE, 2005, pp. 706–711.
[12] H. Loose, U. Franke, and C. Stiller, "Kalman particle filter for lane recognition on rural roads," in 2009 IEEE Intelligent Vehicles Symposium. IEEE, 2009, pp. 60–65.
[13] S. Zhou, Y. Jiang, J. Xi, J. Gong, G. Xiong, and H. Chen, "A novel lane detection based on geometrical model and Gabor filter," in 2010 IEEE Intelligent Vehicles Symposium. IEEE, 2010, pp. 59–64.
[14] J. Illingworth and J. Kittler, "A survey of the Hough transform," Computer Vision, Graphics, and Image Processing, vol. 44, no. 1, pp. 87–116, 1988.
[15] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, "Road scene segmentation from a single image," in ECCV, 2012.
[16] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ADE20K dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 633–641.
[17] A. Ess, T. Mueller, H. Grabner, and L. Van Gool, "Segmentation-based urban traffic scene understanding," in BMVC, vol. 1. Citeseer, 2009, p. 2.
[18] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in CVPR, 2016.
[19] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, "BDD100K: A diverse driving video database with scalable annotation tooling," arXiv preprint arXiv:1805.04687, 2018.
[20] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder, "The Mapillary Vistas dataset for semantic understanding of street scenes," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4990–4999.
[21] Z. Wang, W. Ren, and Q. Qiu, "LaneNet: Real-time lane detection networks for autonomous driving," arXiv preprint arXiv:1807.01726, 2018.
[22] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool, "Towards end-to-end lane detection: an instance segmentation approach," in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 286–291.
[23] X. Liu and Z. Deng, "Segmentation of drivable road using deep fully convolutional residual network with pyramid pooling," Cognitive Computation, vol. 10, no. 2, pp. 272–281, 2018.
[24] M. Bai, G. Mattyus, N. Homayounfar, S. Wang, S. K. Lakshmikanth, and R. Urtasun, "Deep multi-sensor lane detection," in IROS, 2018.
[25] N. Garnett, R. Cohen, T. Pe'er, R. Lahav, and D. Levi, "3D-LaneNet: End-to-end 3D multiple lane detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2921–2930.
[26] Y. Guo, G. Chen, P. Zhao, W. Zhang, J. Miao, J. Wang, and T. E. Choe, "Gen-LaneNet: A generalized and scalable approach for 3D lane detection," 2020.
[27] B. Pan, J. Sun, H. Y. T. Leung, A. Andonian, and B. Zhou, "Cross-view semantic segmentation for sensing surroundings," IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4867–4873, Jul. 2020. [Online]. Available: http://dx.doi.org/10.1109/LRA.2020.3004325
[28] T. Roddick and R. Cipolla, "Predicting semantic map representations from images using pyramid occupancy networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11138–11147.
[29] J. Philion and S. Fidler, "Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D," 2020.
[30] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[31] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan, "End-to-end multi-view fusion for 3D object detection in LiDAR point clouds," in The Conference on Robot Learning (CoRL), 2019.
[32] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[33] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.
[34] B. De Brabandere, D. Neven, and L. Van Gool, "Semantic instance segmentation for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[35] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, "Microsoft COCO: Common objects in context," 2015.
[36] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," 2020.
[37] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," 2020.
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," 2015.
[39] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," 2019.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2017.
[42] L. Deng, M. Yang, H. Li, T. Li, B. Hu, and C. Wang, "Restricted deformable convolution-based road scene semantic segmentation using surround view cameras," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 10, pp. 4350–4362, Oct. 2020. [Online]. Available: http://dx.doi.org/10.1109/TITS.2019.2939832
[43] T. Sämann, K. Amende, S. Milz, C. Witt, M. Simon, and J. Petzold, "Efficient semantic segmentation for visual bird's-eye view interpretation," 2018.