0% found this document useful (0 votes)
3 views

Neural Map Prior for Autonomous Driving

The document presents the Neural Map Prior (NMP), a novel approach for creating and updating high-definition semantic maps for autonomous driving, addressing the limitations of traditional manual mapping methods. NMP utilizes a neural representation that integrates data from multiple vehicles to enhance local map inference and supports timely updates, significantly improving map prediction performance even in challenging conditions. Experimental results demonstrate its compatibility with various mapping architectures and its effectiveness in enhancing the accuracy of local semantic maps.

Uploaded by

yuhuanzjj
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views

Neural Map Prior for Autonomous Driving

The document presents the Neural Map Prior (NMP), a novel approach for creating and updating high-definition semantic maps for autonomous driving, addressing the limitations of traditional manual mapping methods. NMP utilizes a neural representation that integrates data from multiple vehicles to enhance local map inference and supports timely updates, significantly improving map prediction performance even in challenging conditions. Experimental results demonstrate its compatibility with various mapping architectures and its effectiveness in enhancing the accuracy of local semantic maps.

Uploaded by

yuhuanzjj
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

Neural Map Prior for Autonomous Driving

Xuan Xiong1 Yicheng Liu1 Tianyuan Yuan2 Yue Wang3 Yilun Wang2 Hang Zhao2,1 *
1 2 3
Shanghai Qi Zhi Institute IIIS, Tsinghua University MIT

Abstract Offline Global Map Construction


arXiv:2304.08481v2 [cs.CV] 14 Jun 2023

LiDAR Manual
Scans Alignment Annotation
High-definition (HD) semantic maps are crucial in en- IMU
abling autonomous vehicles to navigate urban environ- GPS
Data Collection Globally Consistent Point Cloud Global Semantic Map
ments. The traditional method of creating offline HD
maps involves labor-intensive manual annotation pro- Online Local Map Prediction Local Semantic Map

cesses, which are not only costly but also insufficient for
Encoder Decoder
timely updates. Recent studies have proposed an alterna-
BEV features
tive approach that generates local maps using online sen- Surrounding Camera Images
sor observations. However, this approach is limited by the
sensor’s perception range and its susceptibility to occlu- Neural Map Prior (Ours)
Local Semantic Map
sions. In this study, we propose Neural Map Prior (NMP), refined BEV features
a neural representation of global maps. This representation Encoder Fusion Decoder

automatically updates itself and improves the performance


Surrounding Camera Images
of local map inference. Specifically, we utilize two ap-
proaches to achieve this. Firstly, to integrate a strong map Query
Global Neural Map Prior

prior into local map inference, we apply cross-attention, a Ego Location Online Local Map Inference

mechanism that dynamically identifies correlations between Offline Global Map Prior Update

current and prior features. Secondly, to update the global Update Prior features

neural map prior, we utilize a learning-based fusion mod- Map Tile

ule that guides the network in fusing features from previous


traversals. Our experimental results, based on the nuScenes Figure 1. Comparison of semantic map construction methods.
dataset, demonstrate that our framework is highly compat- Traditional offline semantic mapping pipelines (the first row) in-
ible with various map segmentation and detection architec- volve a complex manual annotation pipeline and do not support
tures. It significantly improves map prediction performance, timely map updates. Online HD semantic map learning methods
even in challenging weather conditions and situations with (the second row) rely entirely on onboard sensor observations and
a longer perception range. To the best of our knowledge, are susceptible to occlusions. We propose the Neural Map Prior
this is the first learning-based system for creating a global (NMP, the third row), an innovative neural representation of global
map prior. maps designed to aid onboard map prediction. NMP is incremen-
tally updated as it continuously integrates new observations from
a fleet of autonomous vehicles.
1. Introduction
Despite the high precision of these offline mapping solu-
Autonomous vehicles require high-definition (HD) se- tions, their scalability is constrained, and they do not sup-
mantic maps to accurately predict the future trajectories port timely updates in response to changing road conditions.
of other agents and to navigate urban environments safely. As a result, autonomous vehicles may operate based on out-
However, the majority of these vehicles rely on costly and dated maps, which could compromise driving safety.
labor-intensive pre-annotated offline HD maps. These maps
Recent research has explored alternative methods for
are constructed through a complex pipeline involving multi-
constructing HD semantic maps using onboard sensor
ple LiDAR scanning trips with survey vehicles, global point
observations, such as camera images and LiDAR point
cloud alignment, and manual annotation of map elements.
clouds [11,13,15]. These methods typically use deep learn-
* Corresponding at: [email protected] ing techniques to infer map elements in real-time, thus ad-
dressing the issue of map updates associated with offline tively enhance their map prediction performance.
maps. Nevertheless, the quality of these inferred maps is
generally inferior when compared to pre-constructed global 3. We conduct a comprehensive evaluation of our method
maps. And this quality can degrade further under unfavor- on the nuScenes dataset, considering different map el-
able weather conditions or in occluded scenarios. The com- ements and four map segmentation/detection architec-
parison of different semantic map construction methods is tures. The results demonstrate consistent and signifi-
provided in Figure 1. cant improvements. Moreover, our approach demon-
strates substantial progress in challenging scenarios,
In this study, we present Neural Map Prior (NMP), a
such as bad weather conditions and longer perception
novel hybrid mapping solution that combines the best of
ranges.
both worlds. NMP leverages neural representations to build
and update a global map prior, thereby enhancing local map
inference performance. The NMP methodology consists of 2. Related Works
two important stages: global map prior update and lo-
cal map inference. The global map prior is automatically LiDAR SLAM-Based Mapping. Autonomous driving sys-
developed by aggregating sensor data from a fleet of self- tem requires an understanding of road map elements, in-
driving cars. Onboard sensor data and the global map prior cluding lanes, pedestrian crossing, and traffic signs, to nav-
are then integrated into the local map inference process, igate the world. Such map elements are typically provided
which subsequently refines the map prior. These procedures by pre-annotated High-Definition (HD) semantic maps in
are interconnected in a feedback loop that grows stronger as existing pipelines [26]. Most current HD semantic maps are
more data are collected from vehicles traversing the roads manually or semi-automatically annotated on LiDAR point
daily. One example is shown in Figure 2. clouds of the environment, merged from LiDAR scans col-
lected from survey vehicles equipped with high-end GPS
Technically, the global NMP is defined as sparse map
and IMU. SLAM algorithms are the most commonly used
tiles, where each tile corresponds to a specific real-world lo-
algorithms to fuse LiDAR scans into a highly accurate and
cation and starts in an empty state. For each online observa-
consistent point cloud. First, to match LiDAR data at two
tion from an autonomous vehicle, a neural network encoder
nearby timestamps, pairwise alignment algorithms such as
first extracts local bird’s-eye view (BEV) features. These
ICP [1], NDT [2], and their variants [29] are employed, us-
features are then refined using the corresponding NMP prior
ing either semantic [39] or geometry information [24]. Sec-
features, derived from the global NMP’s map tile. The re-
ond, accurately estimating the poses of the ego vehicles is
fined BEV feature enables us to better infer the local se-
critical for building a globally consistent map and is formu-
mantic map and update the global NMP. As the autonomous
lated as either a non-linear least-square problem [10] or a
vehicles traverse through various scenes, the local map in-
factor graph [7]. Yang et al. [35] presented a method for re-
ference phase and the global map prior update step mutu-
constructing city-scale maps based on pose graph optimiza-
ally reinforce each other. This iterative process results in
tion under the constraint of pairwise alignment factors. To
improved quality of the predicted local semantic map and
reduce the cost of manual annotation of semantic maps, Jian
maintains a more complete and up-to-date global NMP.
et al. [9] proposed several machine-learning techniques for
We demonstrate that our NMP can be easily applied to
extracting static elements from fused LiDAR point clouds
various state-of-the-art HD semantic map learning methods,
and cameras. However, maintaining an HD semantic map
effectively enhancing their accuracy. Through experiments
remains a laborious and costly process due to the require-
conducted on the nuScenes dataset, our pipeline showcases
ment for high precision and timely updates. In this paper,
remarkable performance improvements, including a +4.32
we propose using neural map priors as a novel mapping
mIoU for HDMapNet, +5.02 mIoU for LSS, +5.50 mIoU
paradigm to replace human-curated HD maps, supporting
for BEVFormer, and +3.90 mAP for VectorMapNet.
timely updates to the global map prior and enhancing local
To summarize, our contributions are as follows:
map learning, potentially making it a more scalable solution
1. We propose a novel mapping paradigm, Neural Map for autonomous driving.
Prior, which integrates the maintenance of offline Semantic Map Learning. Semantic map learning consti-
global maps and the inference of online local maps. tutes a fundamental challenge in real-world map construc-
Notably, the computational and memory resources re- tion and has been formulated as a semantic segmentation
quired by our approach’s local map inference are com- problem in [18]. Various approaches have been employed
parable to previous methods. to address this issue, including aerial images in [19], Li-
DAR point clouds in [34], and HD panoramas in [31].
2. We propose current-to-prior attention and Gated Re- To enhance fine-grained segmentation performance, crowd-
current Unit modules. These are adaptable to main- sourcing tags have been proposed in [32]. Recent stud-
stream HD semantic map learning methods and effec- ies have concentrated on deciphering BEV semantics from
Vehicle A

Detect Update Upload


as prior
History 2018-08-30 Sunny Map Tile

Vehicle B

Extract
w/o prior
Detect Download
Extract
Current 2018-09-18 Rainy with prior Map Tile
Map Tiles Storage
(a) The map prior update from sunny days help with rainy days. (b) Global Neural Map Prior

Figure 2. Demonstration of NMP for autonomous driving in adverse weather conditions. Ground reflections during rainy days make
online HD map predictions harder, posing safety issues for an autonomous driving system. NMP helps to make better predictions, as it
incorporates prior information from other vehicles that have passed through the same area on sunny days.

onboard camera images [17, 36] and videos [4]. Rely- Problem Setup. Our model operates on typical au-
ing solely on onboard sensors for model input poses a chal- tonomous driving systems equipped with an array of
lenge, as the inputs and target map belong to different co- onboard sensors, such as surround-view cameras and
ordinate systems. Cross-view learning methodologies, such GPS/IMU, for precise localization. We assume a single-
as those found in [5, 11, 21, 23, 25, 27, 33, 40], exploit scene frame setting, similar to [11], which adopts a BEV encoder-
geometric structures to bridge the gap between sensor in- decoder model for inferring local semantic maps. The BEV
puts and BEV representations. Our proposed method capi- encoder is denoted as Fenc , and the decoder is denoted as
talizes on the inherent spatial properties of BEV features as Fdec . Additionally, we create and maintain a global neural
a neural map prior, making it compatible with a majority of map prior pg ∈ RHG ×WG ×C , where HG and WG represent
BEV semantic map learning techniques. Consequently, this the height and width of the city, respectively. Each obser-
approach holds the potential to enhance online map predic- vation consists of input from the surrounding cameras I and
tion capabilities. the ego vehicle’s position in the global coordinate system
Neural Representations. Recently, advances have been Gego ∈ R4×4 . We can transform the local coordinate of
made in neural representations [8, 14, 20, 22, 28, 37]. Neu- each pixel of the BEV, denoted as lego ∈ RH×W ×2 (where
ralRecon [30] presents an approach for implicit neural 3D H and W denote the size of the BEV features), to a fixed
reconstruction that integrates reconstruction and fusion pro- global coordinate system using Gego . This transformation
cesses. Unlike traditional methods that first estimate depths results in pego ∈ RH×W ×2 . Initially, we acquire the online
and subsequently perform fusion offline. Similarly, our BEV features o = Fenc (I) ∈ RH×W ×C , where C repre-
work learns neural representation by employing the en- sents the network’s hidden embedding size. We then query
coded image features to predict the map prior through a the global prior pg using the ego position pego to obtain the
neural network. local prior BEV features plt−1 ∈ RH×W ×C . A fusion func-
tion is subsequently applied to the online BEV features and
the local prior BEV features to yield refined BEV features:
3. Neural Map Prior
The aim of this work is to improve local map estimation f_{\text {refine}} = F_{\text {fuse}} (o, p^l_{t-1}), f_{\text {refine}}\in \mathbb {R}^{H \times W \times C}. (1)
performance by leveraging a global neural map prior. To
achieve this, we propose a pipeline, depicted in Figure 3, Finally, the refined BEV features are decoded into the fi-
which is specifically designed to concurrently train both nal map outputs by the decoder Fdec . Simultaneously, the
the global map prior update and local map learning with global map prior pg is updated using frefine . The global neu-
integrating a fusion component. Moreover, we address ral network prior acts as an external memory, capable of
the memory-intensive challenge associated with storing incrementally integrating new information and simultane-
features of urban streets by introducing a sparse tile format ously offering knowledge output. This dual functionality
for the global neural map prior, as detailed in Section 4.8. ultimately leads to improved local map estimation perfor-
mance.
Local Inference

Fusion 𝐹fuse
Encoder Decoder
𝐹enc BEV features 𝑜 as query 𝐹dec
𝑃𝐸𝑐 𝑝𝑡𝑙
Positional
Embeddings C2P Attention GRU 𝑓refine
𝑃𝐸𝑝
as key and value
𝑝𝑡𝑙
𝑙
prior features 𝑝𝑡−1

Map Tile

Fusion Module
𝑙 Local Inference in
𝑝𝑡−1 Sampled from 𝑝 & by 𝑝ego Updated with 𝑝𝑡𝑙 Sec.3.1
Global Update in
Sec.3.2
Global update
Selected Map Tile from 𝑝 ! Shared Paths

Figure 3. The model architecture of NMP. The top yellow box illustrates the online HD map learning process, which takes images as
input and processes them through a BEV encoder and decoder to generate map segmentation results. Within the green box, customized
fusion modules—comprising C2P attention and GRU—are designed to effectively integrate prior map features between the encoder and
decoder, subsequently decoded to produce the final map predictions. In the bottom blue box, the model queries map tiles that overlap with
the current BEV feature from storage. After the update, the neural map is returned to the previously extracted map tiles.

3.1. Local Map Learning quality compared to both prior and current features.
In order to accommodate the dynamic nature of road
Positional Embedding. It has been observed that the ac-
networks in the real world, advanced online map learning
curacy of predicted maps declines as the distance from the
algorithms have recently been developed. These methods
ego vehicle increases. To address this issue, we propose
generate semantic map predictions based solely on data
the integration of position embeddings, a set of grid-shaped
collected by onboard sensors. In contrast to earlier ap-
learnable parameters, into the fusion module. The aim is
proaches, our proposed method incorporates neural priors
to augment the spatial awareness of the fusion module re-
to bolster accuracy. As road structures on maps are subject
garding the feature positions, empowering it to learn to trust
to change, it is imperative that recent observations take
the current features closer to the ego vehicle and rely more
precedence over older ones. To emphasize the importance
on the prior features for distant locations. Specifically, two
of current features, we introduce an asymmetric fusion
position embeddings are introduced: P Ep ∈ RH×W ×C for
strategy that combines current-to-prior attention and gated
the prior features and P Ec ∈ RH×W ×C for the current fea-
recurrent units.
tures, respectively, before the fusion module Ffuse . Here,
H and W represent the height and width of the BEV fea-
Current-to-Prior (C2P) Cross Attention. We introduce
tures. These embeddings provide spatial awareness to the
the current-to-prior cross-attention mechanism, which
fusion module, effectively allowing it to assimilate infor-
employs a standard cross-attention approach [16] to operate
mation from varying feature distances and locations.
between current and prior BEV features. Concretely, We
divide each BEV feature into patches and add them with
3.2. Global Map Prior Update
a set of learnable positional embeddings, which will be
described subsequently. Current features produce queries, To update the global map prior with the refined features
while prior features produce keys and values. A standard generated by the C2P attention module, an auxiliary mod-
cross-attention is then applied, succeeded by a fully con- ule is introduced, devised to attain a balanced ratio between
nected layer. Ultimately, we assemble the output queries to the current and prior features. This process is illustrated
derive the refined BEV features, which maintain the same in Figure 3. Intuitively, the module regulates the updat-
dimensions as the input current features. The resulting ing rate of the global map prior. A high update rate may
refined BEV features are expected to exhibit superior lead to corruption of the global map prior due to suboptimal
local observations, while a low update rate may result in Table 1. Quantitative analysis of map segmentation. The per-
the global map prior’s inability to promptly capture changes formance of online map segmentation methods and their NMP
in road conditions. Therefore, we introduce a 2D convolu- versions on the nuScenes validation set. By adding prior knowl-
edge, NMP consistently improves these methods. (* HDMapNet
tional variant of the Gated Recurrent Unit [6] module into
remains the same as in the original work, while LSS uses the same
NMP, serving to balance the updating and forgetting ratio. backbone as BEVFormer.)
Local map prior features plt−1 , updated at t−1, are extracted
from the global neural map prior pgt−1 . The refined features mIoU
′ Model
generated by the C2P attention module are denoted as o . Divider Crossing Boundary All

l
Integrating o with the local prior features pt−1 , the GRU HDMapNet 41.04 16.23 40.93 32.73
yields the new prior features plt at time t. Subsequently, HDMapNet + NMP 44.15 20.95 46.07 37.05
these features are passed through the decoder to predict the △ mIoU +3.11 +4.72 +5.14 +4.32
local semantic map and the global neural map prior pgt is LSS∗ 45.19 26.90 47.27 39.78
updated at the corresponding location by directly replacing LSS∗ + NMP 50.20 30.66 53.56 44.80
them with plt . Let zt denote the update gate, rt the reset △ mIoU +5.01 +3.76 +6.29 +5.02
gate, σ the sigmoid function, w∗ the weight for 2D con- BEVFormer∗ 49.51 28.85 50.67 43.01
volution, and ⊙ the Hadamard product. Via the following BEVFormer∗ + NMP 55.01 34.09 56.52 48.54
′ △ mIoU +5.50 +5.24 +5.95 +5.53
operations, the GRU fuses o with the prior feature plt−1 :

ments:road boundary, lane divider, and pedestrian crossing


\begin {split} &z_t = \sigma (\text {Conv2D}([p^l_{t-1}, o^{'}], w_z)) \\ &r_t = \sigma (\text {Conv2D}([p^l_{t-1}, o^{'}], w_r)) \\ &\tilde {p^l_{t}} = \tanh (\text {Conv2D}([r_t \odot p^l_{t-1}, o^{'}], w_h)) \\ & p^l_{t} = (1 - z_t) \odot p^l_{t-1} + z_t \odot \tilde {p^l_{t}} \end {split}
on the nuScenes dataset.
(2)
4.1. Implementation Details
Base models. We primarily perform our experiments us-
ing the BEVFormer model [12] (the version excluding the
Within the GRU, the update gate zt and reset gate rt are temporal aspect), selected for its strength in BEV feature
instrumental in determining the fusion of information from extraction abilities and its exceptional performance in map
the previous traversal (i.e., prior feature plt−1 ) with the cur- semantic segmentation. To validate the broad applicability

rent BEV feature o . Furthermore, they govern the incorpo- of our methods, as shown in Table 1 and 2, we incorpo-

ration of information from the current BEV feature o into rate our NMP paradigm into four recently proposed camera-
the global map prior feature plt . GRU enables the model to based map semantic segmentation and detection methods,
better adapt to various road conditions and mapping scenar- which serve as our baseline models: HDMapNet [11], LSS
ios more effectively. [23], BEVFormer [12], and VectorMapNet [15]. Each of
these methods implements distinct 2D–3D feature-lifting
4. Experiments strategies: MLP-based unprojection is adopted by HDMap-
Net, depth-based unprojection by LSS, geometry-aware
Datasets. We validate our NMP on the nuScenes transformer-like models by BEVFormer, and homography-
dataset [3], a large-scale autonomous driving benchmark based unprojection by VectorMapNet. For the comparisons
that includes multiple traversals with precise localization presented in Table 4 and Table 7, we only use the GRU
and annotated HD map semantic labels. The NuScenes fusion module.
dataset contains 700 scenes in the train, 150 in the val, and
150 in the test. Data were collected using a 32-beam Li-
DAR operating at 20 Hz and six cameras offering a 360- C2P Attention. For all linear layers within the current-to-
degree field of view at 12 Hz. Annotations for keyframes prior attention module, we set the dimension of the features
are provided at 2 Hz. Each scene has a duration of 20 sec- to 256. For patching, we use a patch size of 10 × 10, corre-
onds, resulting in 28,130 and 6,019 frames for the training sponding to a 3m × 3m area in BEV. This setting preserves
and validation sets, respectively. local spatial information while conserving parameters.

Metric. We assess the quality of HD semantic learn-


ing using two metrics: Mean Intersection over Union Global Map Resolution. We use a default map resolution
(mIoU) and Mean Average Precision (mAP), as presented of 0.3m for the rasterized neural map priors for all experi-
in HDMapNet [11]. In accordance with the methodology ments and conduct an ablation study on the resolution in
detailed in HDMapNet, we evaluate three static map ele- Table 7.
Table 2. Quantitative analysis of vectorized map detection. By tioned four base models: HDMapNet, LSS, BEVFormer,
adding prior knowledge, the NMP boosts the performance of Vec- and VectorMapNet. We use the same hyperparameter set-
torMapNet. tings as in their original designs. During training, we freeze
Average Precision
all the modules before the BEV features and only train the
Model C2P attention module, the GRU, the local PE, and the de-
APDivider APCrossing APBoundary mAP
VectorMapNet 47.3 36.1 39.3 40.9 coder. For testing, all samples are arranged chronologically.
VectorMapNet + NMP 49.6 42.9 41.9 44.8 As evidenced in Table 1 and Table 2, NMP consistently im-
△ AP +2.3 +6.8 +2.6 +3.9
proves map segmentation and detection performance com-
Table 3. Comparison of model performance at different BEV pared to the baseline models. Qualitative results are shown
ranges. As the perception range increases, the online method per- in Figure 4. These findings suggest that NMP is a generic
formance declines.; NMP significantly improves the results. approach that can potentially be applied to other mapping
frameworks.
mIoU
BEV Range + NMP
Divider Crossing Boundary All 4.3. Neural Map Prior Helps to See Further
X 49.51 28.85 50.67 43.01
60m × 30m
✓ 55.01 34.09 56.52 48.54
Conventional maps used in autonomous driving systems
△ mIoU +5.50 +5.24 +5.95 +5.53 provide crucial information about roads extending beyond
X 43.41 29.07 56.57 43.01 the line of sight, aiding in navigation, planning, and in-
100m × 100m
✓ 49.51 32.67 59.94 47.37 formed decision-making. However, the recent adoption of
△ mIoU +6.10 +3.60 +3.60 +4.36 onboard cameras for online map prediction as an alterna-
160m × 100m
X 41.21 26.42 51.74 39.79 tive approach has introduced a limitation in the prediction
✓ 46.85 29.25 57.22 44.44 range. This limitation arises due to the low resolution of
△ mIoU +5.64 +2.83 +5.48 +4.65
distant areas in the captured images. To overcome this lim-
itation, our proposed NMP enables an extended reach for
Table 4. Comparison of intra-trip fusion and inter-trip fusion. online map prediction. Specifically, the NMP leverages
prior history information generated by other trips, encap-
mIoU
Intra or Inter Trips
Divider Crossing Boundary All
sulating rich contextual details about the scenes and signifi-
cantly augmenting the capabilities of online map prediction.
Baseline 49.51 28.85 50.67 43.01
This enhancement is demonstrated in Table 3, which con-
Intra-trip fusion 51.87 30.34 53.74 45.31(+2.30)
Inter-trip fusion 53.41 31.92 55.15 46.82(+3.81) sistently shows improved segmentation results compared to
the baseline methods across various BEV ranges, including
60m × 30m, 100m × 100m, and 160m × 100m.
Table 5. Performance in adverse weather conditions.
4.4. Inter-trip Fusion is better than Intra-trip Fu-
mIoU
Weather + NMP sion
Divider Crossing Boundary All
X 50.25 26.90 44.54 40.56 In Table 4, we show the effectiveness of intra-trip fu-
Rain sion versus inter-trip fusion for map construction. Intra-trip
✓ 54.64 30.62 54.19 46.48
△ mIoU +4.39 +3.72 +9.65 +5.92 refers to the scenario where the fusion is limited to a single
traversal, while the inter-trip fusion model uses map priors
X 51.02 21.17 48.99 40.39
Night generated from multiple traversals at the same location. The
✓ 54.66 33.78 55.92 48.12
△ mIoU +3.64 +12.61 +6.93 +7.73 findings indicate that the integration of multi-traversal prior
information is more helpful for accurate map construction,
X 55.76 00.00 47.60 34.45 highlighting the significance of using multiple traversals.
NightRain
✓ 61.22 00.00 50.84 37.35
△ mIoU +5.46 +00.00 +3.24 +2.90 4.5. Neural Map Prior is more helpful under Ad-
X 49.27 29.49 52.11 43.62 verse Weather Conditions
Normal
✓ 53.46 35.27 57.75 48.82
△ mIoU +4.19 +5.78 +5.64 +5.20
Autonomous vehicles face challenges when driving in
bad weather conditions or low light conditions, such as rain
or night driving, which may impede accurate road informa-
tion identification. However, our method, NMP, captures
4.2. Neural Map Prior Helps Online Map Inference
and retains the road’s appearance under optimal weather
In this section, we show that the effectiveness of NMP is and lighting conditions, thereby equipping the vehicle with
agnostic to various model architectures and evaluation met- enhanced and reliable information for precise road percep-
rics. To illustrate this, we integrate NMP into the aforemen- tion during current trips. As shown in Table 5, the applica-
Table 6. Ablation on the fusion components. MA stands for Moving Average. Local PE stands for the positional embedding proposed in
§ 3.1. CA stands for the C2P Attention proposed in § 3.1 and GRU stands for gated recurrent units proposed in § 3.2.

Component mIoU
Name MA GRU Local PE CA Divider Crossing Boundary All
A 49.51 28.85 50.67 43.01
B ✓ 52.19(+2.68) 33.70(+4.85) 55.34(+4.67) 47.07(+4.06)
C ✓ 53.22(+3.71) 31.46(+2.61) 55.93(+5.26) 46.87(+3.86)
D ✓ 53.25(+3.74) 33.13(+4.28) 55.15(+4.48) 47.17(+4.16)
E ✓ ✓ 52.96(+3.45) 34.13(+5.28) 56.14(+5.47) 47.74(+4.73)
F ✓ ✓ 55.05(+3.74) 31.37(+2.52) 56.19(+5.52) 47.53(+4.52)
G ✓ ✓ ✓ 55.01(+5.50) 34.09(+5.24) 56.52(+5.85) 48.54(+5.53)

Table 7. Ablation on the global map resolution. 0.3m × 0.3m is where α denotes a manually searched ratio, and other nota-
a good design choice that balances storage size and accuracy. tions are defined in Section 3.2. Although both GRU and
MA showcase comparable performance enhancements as
mIoU
NMP Grid Resolution
Divider Crossing Boundary All updating modules, GRU is preferred due to the elimination
of manual parameter searches required in MA. Both GRU
Baseline 49.51 28.85 50.67 43.01
0.3m × 0.3m 53.22 31.46 55.93 46.87(+3.86) and CA act as effective feature fusion modules, resulting in
0.6m × 0.6m 52.42 31.63 54.74 46.26(+3.25) substantial performance enhancements. The slight edge of
1.2m × 1.2m 51.36 30.24 52.78 44.79(+1.78) C2P attention over GRU indicates that the transformer ar-
chitecture holds a minor advantage in fusing prior feature
Table 8. Performance on Boston split. The original split con- contexts. Comparing C to E and F to G in Table 6, we ob-
tains unbalanced historical trips for the training and validation serve that local PE increases the IoU of the crossing by 2.67
sets; Boston split is more balanced. and 2.72, respectively. This suggests that local PE has im-
proved feature fusion, particularly in the challenging cate-
mIoU
Data Split + NMP gory of pedestrian crossings. Local PE enables the model to
Divider Crossing Boundary All
extract additional information from the map prior, thereby
X 26.35 15.32 25.06 22.24
Boston Split complementing current observations. In comparisons of C
✓ 33.04 21.72 32.63 29.13
△ mIoU +6.69 +6.40 +7.57 +6.89 to F and E to G in Table 6, C2P Attention increases the
X 49.51 28.85 50.67 43.01 IoU of the lane divider by 1.83 and 2.05, respectively, high-
Original Split
✓ 55.01 34.09 56.52 48.54 lighting its effectiveness in handling lane structures. The
△ mIoU +5.50 +5.24 +5.95 +5.53 attention mechanism extracts relevant features based on the
spatial context, leading to a more accurate understanding
of dividers and boundary structures. Overall, the ablation
tion of NMP in challenging conditions, including rain and study confirms the effectiveness of all three proposed com-
night-time driving, leads to more substantial improvements ponents for feature fusion and updating.
compared to normal weather scenarios. This indicates that
our perception model effectively leverages the necessary in-
formation from the NMP to handle bad weather situations. Map Resolution. We investigate the impact of different
However, given the smaller sample size and the limited prior resolutions of global neural priors on the effectiveness of
trip data available for this sample, the improvements were online map learning in Table 7. High resolutions are pre-
less significant under night-rain conditions. ferred to preserve details on the map. However, there is a
trade-off between storage and resolution. Our experiments
4.6. Ablation Studies on Fusion Components achieved good performance with an appropriate resolution
of 0.3m.
GRU, C2P Attention and Local Position Embedding.
In this section, we evaluate the effectiveness of the com- 4.7. Dataset Re-split
ponents proposed in Section 3. For the sake of comparison,
we introduce a simple fusion baseline, termed Moving Av- In the original split of the nuScenes dataset, some sam-
erage (MA). In this context, the C2P Attention and GRU ples lack historical traversals. We adopt an approach sim-
are replaced with a moving average fusion function. The ilar to the one presented in Hindsight [38], to re-split the
corresponding update rule can be represented as follows: trips in Boston, and name it as Boston split. The Boston
split ensures that each sample includes at least one histori-
p_t^l = \alpha o + (1-\alpha )o_{t-1}^l, (3) cal trip, while the training and test samples are geographi-
Ground Truth

HDMapNet

BEVFormer

Ours

GRU Weights

Figure 4. Qualitative results. From the first to the fifth row: Ground truth, HDMapNet, BEVFormer, BEVFormer with Neural Map Prior
and GRU weights. We also visualize zt , the attention map of the last step of the GRU fusion process. The model learns to selectively
combine current and prior map features: specifically, when the prediction quality of the current frame is good, the network tends to learn a
larger zt , assigning more weight to the current feature; when the prediction quality of the current frame is poor, usually at intersections or
locations farther away from the ego-vehicle, the network tends to learn a smaller zt for the prior feature.

cally disjoint. To estimate the proximity of two samples, we these map tiles are updated, integrated, and uploaded to the
calculate the areal overlap, specifically IoU in the bird’s- cloud asynchronously. As more trip data is collected over
eye view, between the field of view of the two traversals. time, the map prior becomes broader and of better quality.
This approach results in 7354 training samples and 6504
test samples. The comparison of model performance on the 5. Conclusion
original split and Boston split is shown in Table 8. The im-
provement of NMP observed on the Boston split is greater In this paper, we introduce a novel system, Neural Map
compared to the original split. Prior, which is designed to enhance online learning of HD
semantic maps. NMP involves the joint execution of global
4.8. Map Tiles map prior updates and local map inference for each frame
We use map tile as the storage format for our global in an incremental manner. A comprehensive analysis on
neural map prior. In urban environments, buildings gener- the nuScenes dataset demonstrates that NMP improves on-
ally occupy a substantial portion of the area, whereas road- line map inference performance, especially in challenging
related regions account for a smaller part. To prevent the weather conditions and extended prediction horizons. Fu-
map’s storage size from expanding excessively in propor- ture work includes learning more semantic map elements
tion to the physical scale of the city, we design a storage and 3D maps.
structure that divides the city into sparse pieces indexed by
their physical coordinates. It consumes 70% less memory Acknowledgments
space than dense tiles. Furthermore, each vehicle does not
need to store the entire city map; instead, it can download This work is in part supported by Li Auto. We thank
map tiles on demand. The trained model remains fixed, but Binghong Yu for proofreading the final manuscript.
References [15] Yicheng Liu, Yue Wang, Yilun Wang, and Hang Zhao. Vec-
tormapnet: End-to-end vectorized hd map learning. arXiv
[1] Paul J Besl and Neil D McKay. Method for registration of preprint arXiv:2206.08920, 2022. 1, 5
3-d shapes. In Sensor fusion IV: control paradigms and data [16] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng
structures, volume 1611, pages 586–606. Spie, 1992. 2 Zhang, Stephen Lin, and Baining Guo. Swin transformer:
[2] Peter Biber and Wolfgang Straßer. The normal distribu- Hierarchical vision transformer using shifted windows. In
tions transform: A new approach to laser scan matching. Proceedings of the IEEE/CVF International Conference on
In Proceedings 2003 IEEE/RSJ International Conference Computer Vision, pages 10012–10022, 2021. 4
on Intelligent Robots and Systems (IROS 2003)(Cat. No. [17] Chenyang Lu, Marinus Jacobus Gerardus van de Molen-
03CH37453), volume 3, pages 2743–2748. IEEE, 2003. 2 graft, and Gijs Dubbelman. Monocular semantic occu-
[3] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, pancy grid mapping with convolutional variational encoder–
Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- decoder networks. IEEE Robotics and Automation Letters,
ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- 4(2):445–452, 2019. 3
modal dataset for autonomous driving. In Proceedings of [18] Gellert Mattyus, Shenlong Wang, Sanja Fidler, and Raquel
the IEEE/CVF conference on computer vision and pattern Urtasun. Enhancing road maps by parsing aerial images
recognition, pages 11621–11631, 2020. 5 around the world. In Proceedings of the IEEE international
[4] Yigit Baran Can, Alexander Liniger, Ozan Unal, Danda conference on computer vision, pages 1689–1697, 2015. 2
Paudel, and Luc Van Gool. Understanding bird’s-eye view [19] Gellért Máttyus, Shenlong Wang, Sanja Fidler, and Raquel
semantic hd-maps using an onboard monocular camera. Urtasun. Hd maps: Fine-grained road segmentation by pars-
arXiv preprint arXiv:2012.03040, 2020. 3 ing ground and aerial images. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
[5] Xuanyao Chen, Tianyuan Zhang, Yue Wang, Yilun Wang,
pages 3611–3619, 2016. 2
and Hang Zhao. Futr3d: A unified sensor fusion framework
[20] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Se-
for 3d detection. arXiv preprint arXiv:2203.10642, 2022. 3
bastian Nowozin, and Andreas Geiger. Occupancy networks:
[6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Learning 3d reconstruction in function space. In CVPR,
Yoshua Bengio. Empirical evaluation of gated recurrent 2019. 3
neural networks on sequence modeling. arXiv preprint
[21] Bowen Pan, Jiankai Sun, Ho Yin Tiga Leung, Alex Ando-
arXiv:1412.3555, 2014. 5
nian, and Bolei Zhou. Cross-view semantic segmentation
[7] Frank Dellaert. Factor graphs and gtsam: A hands-on intro- for sensing surroundings. IEEE Robotics and Automation
duction. Technical report, Georgia Institute of Technology, Letters, 5(3):4867–4873, 2020. 3
2012. 2 [22] Jeong Joon Park, Peter Florence, Julian Straub, Richard
[8] Chiyu Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Newcombe, and Steven Lovegrove. DeepSDF: Learning
Huang, Matthias Nießner, and Thomas Funkhouser. Local Continuous Signed Distance Functions for Shape Represen-
implicit grid representations for 3d scenes. In CVPR, 2020. tation. In CVPR, 2019. 3
3 [23] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding
[9] Jialin Jiao. Machine learning assisted high-definition map images from arbitrary camera rigs by implicitly unprojecting
creation. In 2018 IEEE 42nd Annual Computer Software to 3d. In European Conference on Computer Vision, pages
and Applications Conference (COMPSAC), volume 1, pages 194–210. Springer, 2020. 3, 5
367–373. IEEE, 2018. 2 [24] François Pomerleau, Francis Colas, Roland Siegwart, et al.
[10] Charles L Lawson and Richard J Hanson. Solving least A review of point cloud registration algorithms for mobile
squares problems. SIAM, 1995. 2 robotics. Foundations and Trends® in Robotics, 4(1):1–104,
2015. 2
[11] Qi Li, Yue Wang, Yilun Wang, and Hang Zhao. Hdmapnet:
[25] Thomas Roddick and Roberto Cipolla. Predicting seman-
A local semantic map learning and evaluation framework.
tic map representations from images using pyramid occu-
arXiv preprint arXiv:2107.06307, 2021. 1, 3, 5
pancy networks. In Proceedings of the IEEE/CVF Confer-
[12] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- ence on Computer Vision and Pattern Recognition, pages
hao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: 11138–11147, 2020. 3
Learning bird’s-eye-view representation from multi-camera [26] Guodong Rong, Byung Hyun Shin, Hadi Tabatabaee, Qiang
images via spatiotemporal transformers. arXiv preprint Lu, Steve Lemke, Mārtiņš Možeiko, Eric Boise, Geehoon
arXiv:2203.17270, 2022. 5 Uhm, Mark Gerow, Shalin Mehta, et al. Lgsvl simulator:
[13] Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng A high fidelity simulator for autonomous driving. arXiv
Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: preprint arXiv:2005.03778, 2020. 2
Structured modeling and learning for online vectorized hd [27] Avishkar Saha, Oscar Mendez, Chris Russell, and Richard
map construction. arXiv preprint arXiv:2208.14437, 2022. Bowden. Translating images into maps. In 2022 Inter-
1 national Conference on Robotics and Automation (ICRA),
[14] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and pages 9200–9206. IEEE, 2022. 3
Christian Theobalt. Neural sparse voxel fields. In NeurIPS, [28] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Mor-
2020. 3 ishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-Aligned
Implicit Function for High-Resolution Clothed Human Dig-
itization. In ICCV, 2019. 3
[29] Aleksandr Segal, Dirk Haehnel, and Sebastian Thrun.
Generalized-icp. In Robotics: science and systems, vol-
ume 2, page 435. Seattle, WA, 2009. 2
[30] Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and
Hujun Bao. Neuralrecon: Real-time coherent 3d reconstruc-
tion from monocular video. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
pages 15598–15607, 2021. 3
[31] Shenlong Wang, Min Bai, Gellert Mattyus, Hang Chu, Wen-
jie Luo, Bin Yang, Justin Liang, Joel Cheverie, Sanja Fidler,
and Raquel Urtasun. Torontocity: Seeing the world with a
million eyes. arXiv preprint arXiv:1612.00423, 2016. 2
[32] Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Holis-
tic 3d scene understanding from a single geo-tagged image.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 3964–3972, 2015. 2
[33] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang,
Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d:
3d object detection from multi-view images via 3d-to-2d
queries. In Conference on Robot Learning, pages 180–191.
PMLR, 2022. 3
[34] Bin Yang, Ming Liang, and Raquel Urtasun. Hdnet: Exploit-
ing hd maps for 3d object detection. In Conference on Robot
Learning, pages 146–155. PMLR, 2018. 2
[35] Sheng Yang, Xiaoling Zhu, Xing Nian, Lu Feng, Xiaozhi
Qu, and Teng Ma. A robust pose graph approach for city
scale lidar mapping. In 2018 IEEE/RSJ International Con-
ference on Intelligent Robots and Systems (IROS), pages
1175–1182. IEEE, 2018. 2
[36] Weixiang Yang, Qi Li, Wenxi Liu, Yuanlong Yu, Yuexin
Ma, Shengfeng He, and Jia Pan. Projecting your view at-
tentively: Monocular road scene layout estimation via cross-
view transformation. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
15536–15545, 2021. 3
[37] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan
Atzmon, Basri Ronen, and Yaron Lipman. Multiview neu-
ral surface reconstruction by disentangling geometry and ap-
pearance. In NeurIPS, 2020. 3
[38] Yurong You, Katie Z Luo, Xiangyu Chen, Junan Chen, Wei-
Lun Chao, Wen Sun, Bharath Hariharan, Mark Campbell,
and Kilian Q Weinberger. Hindsight is 20/20: Leverag-
ing past traversals to aid 3d perception. arXiv preprint
arXiv:2203.11405, 2022. 7
[39] Fisher Yu, Jianxiong Xiao, and Thomas Funkhouser. Seman-
tic alignment of lidar data at city scale. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 1722–1731, 2015. 2
[40] Brady Zhou and Philipp Krähenbühl. Cross-view transform-
ers for real-time map-view semantic segmentation. arXiv
preprint arXiv:2205.02833, 2022. 3

You might also like