Neural Map Prior for Autonomous Driving

Xuan Xiong1  Yicheng Liu1  Tianyuan Yuan2  Yue Wang3  Yilun Wang2  Hang Zhao2,1*

1 Shanghai Qi Zhi Institute   2 IIIS, Tsinghua University   3 MIT
Abstract

High-definition (HD) semantic maps are crucial in enabling autonomous vehicles to navigate urban environments. The traditional method of creating offline HD maps involves labor-intensive manual annotation processes, which are not only costly but also insufficient for timely updates. Recent studies have proposed an alternative approach that generates local maps using online sensor observations. However, this approach is limited by the sensor's perception range and its susceptibility to occlusions. In this study, we propose Neural Map Prior (NMP), a neural representation of global maps. This representation enables automatic global map updates and improves local map inference performance. Firstly, to incorporate the neural map prior into local map inference, we apply cross-attention, a mechanism that dynamically identifies correlations between current and prior features. Secondly, to update the global neural map prior, we use a gated-recurrent-unit-based fusion module that integrates the refined features of each traversal back into the global representation.

[Figure 1: offline HD map construction collects LiDAR scans and GPS/IMU data, performs alignment and manual annotation, and yields a globally consistent point cloud and a global semantic map; online local map prediction passes surrounding camera images through an encoder, BEV features, and a decoder to obtain a local semantic map; Neural Map Prior (ours) fuses the encoder's BEV features with prior features retrieved at the ego location before decoding (online local map inference) and performs an offline global map prior update backed by map tile storage.]

(a) The map prior update from sunny days helps with rainy days. (b) Global neural map prior.
Figure 2. Demonstration of NMP for autonomous driving in adverse weather conditions. Ground reflections during rainy days make
online HD map predictions harder, posing safety issues for an autonomous driving system. NMP helps to make better predictions, as it
incorporates prior information from other vehicles that have passed through the same area on sunny days.
onboard camera images [17, 36] and videos [4]. Relying solely on onboard sensors for model input poses a challenge, as the inputs and target map belong to different coordinate systems. Cross-view learning methodologies, such as those found in [5, 11, 21, 23, 25, 27, 33, 40], exploit scene geometric structures to bridge the gap between sensor inputs and BEV representations. Our proposed method capitalizes on the inherent spatial properties of BEV features as a neural map prior, making it compatible with a majority of BEV semantic map learning techniques. Consequently, this approach holds the potential to enhance online map prediction capabilities.

Neural Representations. Recently, advances have been made in neural representations [8, 14, 20, 22, 28, 37]. NeuralRecon [30] presents an approach for implicit neural 3D reconstruction that integrates the reconstruction and fusion processes, unlike traditional methods that first estimate depths and subsequently perform fusion offline. Similarly, our work learns a neural representation by employing the encoded image features to predict the map prior through a neural network.

3. Neural Map Prior

The aim of this work is to improve local map estimation performance by leveraging a global neural map prior. To achieve this, we propose a pipeline, depicted in Figure 3, which is specifically designed to concurrently train both the global map prior update and local map learning by integrating a fusion component. Moreover, we address the memory-intensive challenge associated with storing features of urban streets by introducing a sparse tile format for the global neural map prior, as detailed in Section 4.8.

Problem Setup. Our model operates on typical autonomous driving systems equipped with an array of onboard sensors, such as surround-view cameras and GPS/IMU, for precise localization. We assume a single-frame setting, similar to [11], which adopts a BEV encoder-decoder model for inferring local semantic maps. The BEV encoder is denoted as $F_{\text{enc}}$ and the decoder as $F_{\text{dec}}$. Additionally, we create and maintain a global neural map prior $p^g \in \mathbb{R}^{H_G \times W_G \times C}$, where $H_G$ and $W_G$ represent the height and width of the city, respectively. Each observation consists of input from the surrounding cameras $I$ and the ego vehicle's position in the global coordinate system $G_{\text{ego}} \in \mathbb{R}^{4 \times 4}$. We can transform the local coordinates of each pixel of the BEV, denoted as $l_{\text{ego}} \in \mathbb{R}^{H \times W \times 2}$ (where $H$ and $W$ denote the size of the BEV features), to a fixed global coordinate system using $G_{\text{ego}}$. This transformation results in $p_{\text{ego}} \in \mathbb{R}^{H \times W \times 2}$. Initially, we acquire the online BEV features $o = F_{\text{enc}}(I) \in \mathbb{R}^{H \times W \times C}$, where $C$ represents the network's hidden embedding size. We then query the global prior $p^g$ using the ego position $p_{\text{ego}}$ to obtain the local prior BEV features $p^l_{t-1} \in \mathbb{R}^{H \times W \times C}$. A fusion function is subsequently applied to the online BEV features and the local prior BEV features to yield refined BEV features:

f_{\text{refine}} = F_{\text{fuse}}(o, p^l_{t-1}), \quad f_{\text{refine}} \in \mathbb{R}^{H \times W \times C}.   (1)

Finally, the refined BEV features are decoded into the final map outputs by the decoder $F_{\text{dec}}$. Simultaneously, the global map prior $p^g$ is updated using $f_{\text{refine}}$. The global neural map prior acts as an external memory, capable of incrementally integrating new information and simultaneously offering knowledge output. This dual functionality ultimately leads to improved local map estimation performance.
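To make the per-frame procedure concrete, the following is a minimal PyTorch-style sketch of one inference-and-update step. The function and module names (`nmp_step`, `bev_encoder`, `fuse`, `decoder`) and the nearest-neighbour lookup are illustrative assumptions about how the flow above could be wired together, not the authors' implementation.

```python
import torch

def nmp_step(images, G_ego, bev_grid_local, bev_encoder, fuse, decoder, global_prior):
    """One NMP inference-and-update step (illustrative sketch).

    images:         surround-view camera images I
    G_ego:          (4, 4) ego-to-global transform
    bev_grid_local: (H, W, 2) local BEV cell coordinates l_ego on the ground plane
    global_prior:   (H_G, W_G, C) global neural map prior p^g, assumed here to share
                    the BEV grid resolution so global coordinates map directly to indices
    """
    # 1. Online BEV features from the current observation: o = F_enc(I)
    o = bev_encoder(images)                                  # (H, W, C)
    H, W, C = o.shape

    # 2. Transform local BEV coordinates to global coordinates: p_ego = G_ego * l_ego
    zeros = torch.zeros(H, W, 1)
    ones = torch.ones(H, W, 1)
    homo = torch.cat([bev_grid_local, zeros, ones], dim=-1)  # (H, W, 4), z = 0
    p_ego = (homo.reshape(-1, 4) @ G_ego.T).reshape(H, W, 4)[..., :2]

    # 3. Query the global prior at p_ego to obtain local prior features p^l_{t-1}
    #    (nearest-neighbour lookup for brevity; bilinear sampling would also work)
    ix = p_ego[..., 0].round().long().clamp(0, global_prior.shape[1] - 1)
    iy = p_ego[..., 1].round().long().clamp(0, global_prior.shape[0] - 1)
    prior_local = global_prior[iy, ix]                       # (H, W, C)

    # 4. Fuse current and prior features: f_refine = F_fuse(o, p^l_{t-1})
    f_refine = fuse(o, prior_local)

    # 5. Decode refined features into the local semantic map
    local_map = decoder(f_refine)

    # 6. Write the updated features back into the global prior at the same cells
    global_prior[iy, ix] = f_refine.detach()
    return local_map
```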
[Figure 3 diagram: BEV features $o$ from the encoder $F_{\text{enc}}$ serve as queries; prior features $p^l_{t-1}$, sampled from $p^g$ by $p_{\text{ego}}$, serve as keys and values; positional embeddings $PE_c$ and $PE_p$ are added before C2P attention; a GRU produces $f_{\text{refine}}$ and the updated $p^l_t$, which is written back to the selected map tile and passed to the decoder $F_{\text{dec}}$. Local inference is described in Sec. 3.1 and the global update in Sec. 3.2.]
Figure 3. The model architecture of NMP. The top yellow box illustrates the online HD map learning process, which takes images as input and processes them through a BEV encoder and decoder to generate map segmentation results. Within the green box, customized fusion modules (C2P attention and a GRU) integrate prior map features between the encoder and decoder; the fused features are subsequently decoded to produce the final map predictions. In the bottom blue box, the model queries map tiles that overlap with the current BEV features from storage. After the update, the refined features are written back to the previously extracted map tiles.
3.1. Local Map Learning

In order to accommodate the dynamic nature of road networks in the real world, advanced online map learning algorithms have recently been developed. These methods generate semantic map predictions based solely on data collected by onboard sensors. In contrast to earlier approaches, our proposed method incorporates neural priors to bolster accuracy. As road structures on maps are subject to change, it is imperative that recent observations take precedence over older ones. To emphasize the importance of current features, we introduce an asymmetric fusion strategy that combines current-to-prior attention and gated recurrent units.

Current-to-Prior (C2P) Cross Attention. We introduce the current-to-prior cross-attention mechanism, which employs a standard cross-attention approach [16] to operate between current and prior BEV features. Concretely, we divide each BEV feature into patches and add to them a set of learnable positional embeddings, which will be described subsequently. Current features produce queries, while prior features produce keys and values. A standard cross-attention is then applied, succeeded by a fully connected layer. Ultimately, we assemble the output queries to derive the refined BEV features, which maintain the same dimensions as the input current features. The resulting refined BEV features are expected to exhibit superior quality compared to both prior and current features.
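As a rough illustration of this fusion step, the sketch below applies multi-head cross-attention with current-feature tokens as queries and prior-feature tokens as keys and values, adding the learnable positional embeddings described in the next paragraph. Each BEV cell is treated as one token for brevity (the paper uses patches), and the module and parameter names are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class C2PFusion(nn.Module):
    """Current-to-prior cross-attention fusion (illustrative sketch)."""

    def __init__(self, dim=256, num_heads=8, bev_hw=(100, 100)):
        super().__init__()
        H, W = bev_hw
        # Grid-shaped learnable positional embeddings PE_c and PE_p
        self.pe_current = nn.Parameter(torch.zeros(H * W, dim))
        self.pe_prior = nn.Parameter(torch.zeros(H * W, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)

    def forward(self, current, prior):
        # current, prior: (B, H, W, C) BEV features
        B, H, W, C = current.shape
        q = current.reshape(B, H * W, C) + self.pe_current   # queries from current features
        kv = prior.reshape(B, H * W, C) + self.pe_prior      # keys/values from prior features
        refined, _ = self.attn(q, kv, kv)                    # standard cross-attention
        refined = self.fc(refined)                           # fully connected layer
        # Reassemble tokens into BEV features with the same shape as the input
        return refined.reshape(B, H, W, C)
```

In the full pipeline this refined output is then passed to the GRU-based update described in Section 3.2 before being written back into the global prior.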
Positional Embedding. It has been observed that the accuracy of predicted maps declines as the distance from the ego vehicle increases. To address this issue, we propose the integration of position embeddings, a set of grid-shaped learnable parameters, into the fusion module. The aim is to augment the spatial awareness of the fusion module regarding the feature positions, empowering it to learn to trust the current features closer to the ego vehicle and rely more on the prior features for distant locations. Specifically, two position embeddings are introduced: $PE_p \in \mathbb{R}^{H \times W \times C}$ for the prior features and $PE_c \in \mathbb{R}^{H \times W \times C}$ for the current features, added before the fusion module $F_{\text{fuse}}$. Here, $H$ and $W$ represent the height and width of the BEV features. These embeddings provide spatial awareness to the fusion module, effectively allowing it to assimilate information from varying feature distances and locations.

Table 1. Quantitative analysis of map segmentation. The performance of online map segmentation methods and their NMP versions on the nuScenes validation set. By adding prior knowledge, NMP consistently improves these methods. (* HDMapNet remains the same as in the original work, while LSS uses the same backbone as BEVFormer.)

Model              Divider   Crossing   Boundary   All (mIoU)
HDMapNet           41.04     16.23      40.93      32.73
HDMapNet + NMP     44.15     20.95      46.07      37.05
Δ mIoU             +3.11     +4.72      +5.14      +4.32
LSS*               45.19     26.90      47.27      39.78
LSS* + NMP         50.20     30.66      53.56      44.80
Δ mIoU             +5.01     +3.76      +6.29      +5.02
BEVFormer*         49.51     28.85      50.67      43.01
BEVFormer* + NMP   55.01     34.09      56.52      48.54
Δ mIoU             +5.50     +5.24      +5.95      +5.53

3.2. Global Map Prior Update

To update the global map prior with the refined features generated by the C2P attention module, an auxiliary module is introduced, devised to attain a balanced ratio between the current and prior features. This process is illustrated in Figure 3. Intuitively, the module regulates the updating rate of the global map prior. A high update rate may lead to corruption of the global map prior due to suboptimal local observations, while a low update rate may result in the global map prior's inability to promptly capture changes in road conditions. Therefore, we introduce a 2D convolutional variant of the Gated Recurrent Unit [6] into NMP, serving to balance the updating and forgetting ratio. Local map prior features $p^l_{t-1}$, updated at time $t-1$, are extracted from the global neural map prior $p^g_{t-1}$. The refined features generated by the C2P attention module are denoted as $o'$. Integrating $o'$ with the local prior features $p^l_{t-1}$, the GRU yields the new prior features $p^l_t$ at time $t$. Subsequently, these features are passed through the decoder to predict the local semantic map, and the global neural map prior $p^g_t$ is updated at the corresponding location by directly replacing it with $p^l_t$. Let $z_t$ denote the update gate, $r_t$ the reset gate, $\sigma$ the sigmoid function, $w_*$ the weights for 2D convolution, and $\odot$ the Hadamard product. Via the following operations, the GRU fuses $o'$ with the prior feature $p^l_{t-1}$.
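Written out under the standard 2D-convolutional GRU formulation, with $\hat{p}^l_t$ denoting the candidate features, these operations take the form below; this is a reconstruction consistent with the definitions above, not necessarily the paper's exact parameterization:

\begin{aligned}
z_t &= \sigma\!\left(w_z * [\,o',\, p^l_{t-1}\,]\right), \\
r_t &= \sigma\!\left(w_r * [\,o',\, p^l_{t-1}\,]\right), \\
\hat{p}^l_t &= \tanh\!\left(w_h * [\,o',\, r_t \odot p^l_{t-1}\,]\right), \\
p^l_t &= (1 - z_t) \odot p^l_{t-1} + z_t \odot \hat{p}^l_t,
\end{aligned}

where $*$ denotes 2D convolution and $[\,\cdot\,,\cdot\,]$ channel-wise concatenation.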
Table 6. Ablation on the fusion components (MA: Moving Average; Local PE: local positional embedding; CA: C2P cross-attention); mIoU on the nuScenes validation set.

Name   MA   GRU   Local PE   CA    Divider         Crossing        Boundary        All
A      -    -     -          -     49.51           28.85           50.67           43.01
B      ✓    -     -          -     52.19 (+2.68)   33.70 (+4.85)   55.34 (+4.67)   47.07 (+4.06)
C      -    ✓     -          -     53.22 (+3.71)   31.46 (+2.61)   55.93 (+5.26)   46.87 (+3.86)
D      -    -     -          ✓     53.25 (+3.74)   33.13 (+4.28)   55.15 (+4.48)   47.17 (+4.16)
E      -    ✓     ✓          -     52.96 (+3.45)   34.13 (+5.28)   56.14 (+5.47)   47.74 (+4.73)
F      -    ✓     -          ✓     55.05 (+3.74)   31.37 (+2.52)   56.19 (+5.52)   47.53 (+4.52)
G      -    ✓     ✓          ✓     55.01 (+5.50)   34.09 (+5.24)   56.52 (+5.85)   48.54 (+5.53)
The application of NMP in challenging conditions, including rain and night-time driving, leads to more substantial improvements compared to normal weather scenarios. This indicates that our perception model effectively leverages the necessary information from the NMP to handle bad weather situations. However, given the smaller sample size and the limited prior trip data available for these samples, the improvements were less significant under night-rain conditions.

4.6. Ablation Studies on Fusion Components

GRU, C2P Attention and Local Position Embedding. In this section, we evaluate the effectiveness of the components proposed in Section 3. For the sake of comparison, we introduce a simple fusion baseline, termed Moving Average (MA). In this context, the C2P attention and GRU are replaced with a moving-average fusion function. The corresponding update rule can be represented as follows:

p_t^l = \alpha o + (1-\alpha) o_{t-1}^l,   (3)

where $\alpha$ denotes a manually searched ratio, and other notations are defined in Section 3.2. Although both GRU and MA showcase comparable performance enhancements as updating modules, GRU is preferred because it eliminates the manual parameter search required by MA. Both GRU and CA act as effective feature fusion modules, resulting in substantial performance enhancements. The slight edge of C2P attention over GRU indicates that the transformer architecture holds a minor advantage in fusing prior feature contexts. Comparing C to E and F to G in Table 6, we observe that local PE increases the IoU of the pedestrian crossing by 2.67 and 2.72, respectively. This suggests that local PE improves feature fusion, particularly in the challenging category of pedestrian crossings. Local PE enables the model to extract additional information from the map prior, thereby complementing current observations. Comparing C to F and E to G in Table 6, C2P attention increases the IoU of the lane divider by 1.83 and 2.05, respectively, highlighting its effectiveness in handling lane structures. The attention mechanism extracts relevant features based on the spatial context, leading to a more accurate understanding of divider and boundary structures. Overall, the ablation study confirms the effectiveness of all three proposed components for feature fusion and updating.
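As a concrete reading of Eq. (3), the moving-average baseline is a single exponential blend with a hand-tuned ratio. The sketch below is illustrative only; the value of `alpha` is a placeholder, not the ratio used in the experiments.

```python
import torch

def moving_average_update(o, prior, alpha=0.5):
    """Eq. (3): blend current BEV features with the stored prior features
    using a manually searched ratio alpha (no learned gating, unlike the GRU)."""
    return alpha * o + (1.0 - alpha) * prior
```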
Map Resolution. We investigate the impact of different resolutions of the global neural prior on the effectiveness of online map learning in Table 7. High resolutions are preferred to preserve details on the map. However, there is a trade-off between storage and resolution. Our experiments achieved good performance with an appropriate resolution of 0.3 m.

Table 7. Ablation on the global map resolution. 0.3 m × 0.3 m is a good design choice that balances storage size and accuracy.

NMP Grid Resolution   Divider   Crossing   Boundary   All (mIoU)
Baseline              49.51     28.85      50.67      43.01
0.3 m × 0.3 m         53.22     31.46      55.93      46.87 (+3.86)
0.6 m × 0.6 m         52.42     31.63      54.74      46.26 (+3.25)
1.2 m × 1.2 m         51.36     30.24      52.78      44.79 (+1.78)
Figure 4. Qualitative results. From the first to the fifth row: ground truth, HDMapNet, BEVFormer, BEVFormer with Neural Map Prior, and GRU weights. We also visualize $z_t$, the attention map of the last step of the GRU fusion process. The model learns to selectively combine current and prior map features: specifically, when the prediction quality of the current frame is good, the network tends to learn a larger $z_t$, assigning more weight to the current features; when the prediction quality of the current frame is poor, usually at intersections or locations farther away from the ego vehicle, the network tends to learn a smaller $z_t$, assigning more weight to the prior features.
4.7. Dataset Re-split

Table 8. Performance on the Boston split. The original split contains unbalanced historical trips for the training and validation sets; the Boston split is more balanced.

Data Split        NMP   Divider   Crossing   Boundary   All (mIoU)
Boston split      ✗     26.35     15.32      25.06      22.24
Boston split      ✓     33.04     21.72      32.63      29.13
Δ mIoU                  +6.69     +6.40      +7.57      +6.89
Original split    ✗     49.51     28.85      50.67      43.01
Original split    ✓     55.01     34.09      56.52      48.54
Δ mIoU                  +5.50     +5.24      +5.95      +5.53

In the original split of the nuScenes dataset, some samples lack historical traversals. We adopt an approach similar to the one presented in Hindsight [38] to re-split the trips in Boston, and name it the Boston split. The Boston split ensures that each sample includes at least one historical trip, while the training and test samples are geographically disjoint. To estimate the proximity of two samples, we calculate the areal overlap, specifically the IoU in the bird's-eye view, between the fields of view of the two traversals. This approach results in 7354 training samples and 6504 test samples. The comparison of model performance on the original split and the Boston split is shown in Table 8. The improvement of NMP observed on the Boston split is greater than on the original split.
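The bird's-eye-view overlap check used above can be approximated with a simple polygon intersection. The sketch below uses shapely and assumes a hypothetical rectangular field of view per sample; the footprint dimensions and the overlap threshold are placeholders, not values from the paper.

```python
from shapely.geometry import Polygon
from shapely.affinity import rotate, translate

def fov_polygon(x, y, yaw_deg, length=60.0, width=30.0):
    """Approximate a sample's BEV field of view as an ego-centred rectangle
    placed at global position (x, y) with heading yaw_deg (hypothetical sizes)."""
    rect = Polygon([(-length / 2, -width / 2), (length / 2, -width / 2),
                    (length / 2, width / 2), (-length / 2, width / 2)])
    return translate(rotate(rect, yaw_deg), xoff=x, yoff=y)

def bev_iou(poly_a, poly_b):
    """Areal IoU between two BEV fields of view."""
    inter = poly_a.intersection(poly_b).area
    union = poly_a.union(poly_b).area
    return inter / union if union > 0 else 0.0

# Two traversals with IoU above a chosen threshold would be assigned to the
# same side of the train/test split to keep the splits geographically disjoint.
a = fov_polygon(100.0, 50.0, 10.0)
b = fov_polygon(110.0, 55.0, 15.0)
print(bev_iou(a, b))
```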
4.8. Map Tiles

We use map tiles as the storage format for our global neural map prior. In urban environments, buildings generally occupy a substantial portion of the area, whereas road-related regions account for a smaller part. To prevent the map's storage size from expanding excessively in proportion to the physical scale of the city, we design a storage structure that divides the city into sparse pieces indexed by their physical coordinates. It consumes 70% less memory space than dense tiles. Furthermore, each vehicle does not need to store the entire city map; instead, it can download map tiles on demand. The trained model remains fixed, but these map tiles are updated, integrated, and uploaded to the cloud asynchronously. As more trip data is collected over time, the map prior becomes broader and of better quality.
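A minimal sketch of such a sparse, coordinate-indexed tile store is given below. The class name `TileStore`, the tile edge length, the feature resolution, and the `download_tile` stub are illustrative assumptions rather than details specified in the paper.

```python
import numpy as np

class TileStore:
    """Sparse storage of the global neural map prior, keyed by a tile index
    computed from physical (global) coordinates. Illustrative sketch."""

    def __init__(self, tile_size_m=30.0, resolution_m=0.3, channels=256):
        self.tile_size_m = tile_size_m
        self.cells = int(tile_size_m / resolution_m)    # cells per tile edge
        self.channels = channels
        self.tiles = {}                                  # {(ix, iy): np.ndarray}

    def _key(self, x_m, y_m):
        # Physical coordinates -> tile index
        return (int(np.floor(x_m / self.tile_size_m)),
                int(np.floor(y_m / self.tile_size_m)))

    def get_tile(self, x_m, y_m):
        key = self._key(x_m, y_m)
        if key not in self.tiles:
            # Only road-covered tiles are ever materialized; the rest stay absent,
            # which is where the memory saving over a dense grid comes from.
            self.tiles[key] = self.download_tile(key)
        return self.tiles[key]

    def update_tile(self, x_m, y_m, features):
        # Write refined prior features back; uploading to the cloud would
        # happen asynchronously in a deployed system.
        self.tiles[self._key(x_m, y_m)] = features

    def download_tile(self, key):
        # Placeholder: fetch the tile from cloud storage, or start empty.
        return np.zeros((self.cells, self.cells, self.channels), dtype=np.float32)
```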
5. Conclusion

In this paper, we introduce a novel system, Neural Map Prior, which is designed to enhance online learning of HD semantic maps. NMP involves the joint execution of global map prior updates and local map inference for each frame in an incremental manner. A comprehensive analysis on the nuScenes dataset demonstrates that NMP improves online map inference performance, especially in challenging weather conditions and extended prediction horizons. Future work includes learning more semantic map elements and 3D maps.

Acknowledgments

This work is in part supported by Li Auto. We thank Binghong Yu for proofreading the final manuscript.
References

[1] Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Sensor fusion IV: control paradigms and data structures, volume 1611, pages 586–606. Spie, 1992. 2

[2] Peter Biber and Wolfgang Straßer. The normal distributions transform: A new approach to laser scan matching. In Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003)(Cat. No. 03CH37453), volume 3, pages 2743–2748. IEEE, 2003. 2

[3] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020. 5

[4] Yigit Baran Can, Alexander Liniger, Ozan Unal, Danda Paudel, and Luc Van Gool. Understanding bird's-eye view semantic hd-maps using an onboard monocular camera. arXiv preprint arXiv:2012.03040, 2020. 3

[5] Xuanyao Chen, Tianyuan Zhang, Yue Wang, Yilun Wang, and Hang Zhao. Futr3d: A unified sensor fusion framework for 3d detection. arXiv preprint arXiv:2203.10642, 2022. 3

[6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014. 5

[7] Frank Dellaert. Factor graphs and gtsam: A hands-on introduction. Technical report, Georgia Institute of Technology, 2012. 2

[8] Chiyu Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, and Thomas Funkhouser. Local implicit grid representations for 3d scenes. In CVPR, 2020. 3

[9] Jialin Jiao. Machine learning assisted high-definition map creation. In 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), volume 1, pages 367–373. IEEE, 2018. 2

[10] Charles L Lawson and Richard J Hanson. Solving least squares problems. SIAM, 1995. 2

[11] Qi Li, Yue Wang, Yilun Wang, and Hang Zhao. Hdmapnet: A local semantic map learning and evaluation framework. arXiv preprint arXiv:2107.06307, 2021. 1, 3, 5

[12] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022. 5

[13] Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. arXiv preprint arXiv:2208.14437, 2022. 1

[14] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In NeurIPS, 2020. 3

[15] Yicheng Liu, Yue Wang, Yilun Wang, and Hang Zhao. Vectormapnet: End-to-end vectorized hd map learning. arXiv preprint arXiv:2206.08920, 2022. 1, 5

[16] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021. 4

[17] Chenyang Lu, Marinus Jacobus Gerardus van de Molengraft, and Gijs Dubbelman. Monocular semantic occupancy grid mapping with convolutional variational encoder–decoder networks. IEEE Robotics and Automation Letters, 4(2):445–452, 2019. 3

[18] Gellert Mattyus, Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Enhancing road maps by parsing aerial images around the world. In Proceedings of the IEEE international conference on computer vision, pages 1689–1697, 2015. 2

[19] Gellért Máttyus, Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Hd maps: Fine-grained road segmentation by parsing ground and aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3611–3619, 2016. 2

[20] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, 2019. 3

[21] Bowen Pan, Jiankai Sun, Ho Yin Tiga Leung, Alex Andonian, and Bolei Zhou. Cross-view semantic segmentation for sensing surroundings. IEEE Robotics and Automation Letters, 5(3):4867–4873, 2020. 3

[22] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. In CVPR, 2019. 3

[23] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In European Conference on Computer Vision, pages 194–210. Springer, 2020. 3, 5

[24] François Pomerleau, Francis Colas, Roland Siegwart, et al. A review of point cloud registration algorithms for mobile robotics. Foundations and Trends® in Robotics, 4(1):1–104, 2015. 2

[25] Thomas Roddick and Roberto Cipolla. Predicting semantic map representations from images using pyramid occupancy networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11138–11147, 2020. 3

[26] Guodong Rong, Byung Hyun Shin, Hadi Tabatabaee, Qiang Lu, Steve Lemke, Mārtiņš Možeiko, Eric Boise, Geehoon Uhm, Mark Gerow, Shalin Mehta, et al. Lgsvl simulator: A high fidelity simulator for autonomous driving. arXiv preprint arXiv:2005.03778, 2020. 2

[27] Avishkar Saha, Oscar Mendez, Chris Russell, and Richard Bowden. Translating images into maps. In 2022 International Conference on Robotics and Automation (ICRA), pages 9200–9206. IEEE, 2022. 3

[28] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization. In ICCV, 2019. 3
[29] Aleksandr Segal, Dirk Haehnel, and Sebastian Thrun. Generalized-icp. In Robotics: science and systems, volume 2, page 435. Seattle, WA, 2009. 2

[30] Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. Neuralrecon: Real-time coherent 3d reconstruction from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15598–15607, 2021. 3

[31] Shenlong Wang, Min Bai, Gellert Mattyus, Hang Chu, Wenjie Luo, Bin Yang, Justin Liang, Joel Cheverie, Sanja Fidler, and Raquel Urtasun. Torontocity: Seeing the world with a million eyes. arXiv preprint arXiv:1612.00423, 2016. 2

[32] Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Holistic 3d scene understanding from a single geo-tagged image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3964–3972, 2015. 2

[33] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022. 3

[34] Bin Yang, Ming Liang, and Raquel Urtasun. Hdnet: Exploiting hd maps for 3d object detection. In Conference on Robot Learning, pages 146–155. PMLR, 2018. 2

[35] Sheng Yang, Xiaoling Zhu, Xing Nian, Lu Feng, Xiaozhi Qu, and Teng Ma. A robust pose graph approach for city scale lidar mapping. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1175–1182. IEEE, 2018. 2

[36] Weixiang Yang, Qi Li, Wenxi Liu, Yuanlong Yu, Yuexin Ma, Shengfeng He, and Jia Pan. Projecting your view attentively: Monocular road scene layout estimation via cross-view transformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15536–15545, 2021. 3

[37] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. In NeurIPS, 2020. 3

[38] Yurong You, Katie Z Luo, Xiangyu Chen, Junan Chen, Wei-Lun Chao, Wen Sun, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Hindsight is 20/20: Leveraging past traversals to aid 3d perception. arXiv preprint arXiv:2203.11405, 2022. 7

[39] Fisher Yu, Jianxiong Xiao, and Thomas Funkhouser. Semantic alignment of lidar data at city scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1722–1731, 2015. 2

[40] Brady Zhou and Philipp Krähenbühl. Cross-view transformers for real-time map-view semantic segmentation. arXiv preprint arXiv:2205.02833, 2022. 3