
Ground-Fusion: A Low-cost Ground SLAM System Robust to Corner Cases

Jie Yin, Ang Li, Wei Xi, Wenxian Yu, and Danping Zou*

arXiv:2402.14308v1 [cs.RO] 22 Feb 2024

Abstract— We introduce Ground-Fusion, a low-cost sensor fusion simultaneous localization and mapping (SLAM) system for ground vehicles. Our system features efficient initialization, effective sensor anomaly detection and handling, real-time dense color mapping, and robust localization in diverse environments. We tightly integrate RGB-D images, inertial measurements, wheel odometer and GNSS signals within a factor graph to achieve accurate and reliable localization both indoors and outdoors. To ensure successful initialization, we propose an efficient strategy that comprises three different methods: stationary, visual, and dynamic, tailored to handle diverse cases. Furthermore, we develop mechanisms to detect sensor anomalies and degradation, handling them adeptly to maintain system accuracy. Our experimental results on both public and self-collected datasets demonstrate that Ground-Fusion outperforms existing low-cost SLAM systems in corner cases. We release the code and datasets at https://github.com/SJTU-ViSYS/Ground-Fusion.

[Figure 1: system overview diagram. Block labels include Adaptive Initialization (motion check; stationary, visual, and dynamic initialization), Visual Feature Verification (flow back, wheel-aided MCC, depth verification), Wheel Anomaly Handling (anomaly filtration, angular fitting), GNSS Processing (yaw calibration, anchor setting, invalid satellite removal, tightly-coupled fusion), and Dense Color Mapping, operating on RGB + depth input.]
Fig. 1. The system adopts an adaptive initialization strategy based on the robot's motion state. Potential sensor faults will be detected and handled accordingly. Real-time dense color mapping is supported to facilitate navigation tasks.

I. INTRODUCTION

Ground robots find extensive applications in logistics, catering, and industrial production. In many scenarios, it is critical to reliably navigate ground robots within both indoor [1] and outdoor environments [2]. Simultaneous Localization and Mapping (SLAM) technology plays a key role in robot navigation. While LiDAR-based SLAM systems excel in many scenarios, their high costs are not suitable for low-cost applications, where SLAM systems using affordable sensors such as Visual-Inertial Odometry (VIO) are preferred. However, VIO may exhibit reduced accuracy in specific motion modes that introduce unobserved degrees of freedom (DoFs), as discussed in [3]. To tackle this challenge, VINS-RGBD [4] integrated RGBD images with inertial data to avoid scale issues and achieve better pose estimation. Similarly, VINS-on-Wheels [5] introduced a wheel-odometer-assisted VIO system to maintain a consistent metric scale. [6] [7] further loosely integrate GNSS signals into visual-inertial-wheel odometry. Additionally, [8] tightly combined GNSS raw measurements with the VIO system for drift-free global state estimation. [9] harnessed a sky-pointing camera for NLOS mitigation, thereby enhancing localization in urban canyons.

Nevertheless, our previous investigations [10] have revealed that the robustness of existing SLAM systems in challenging scenarios needs to be further improved. To address this concern, this paper focuses on two aspects: robust system initialization and corner case addressing. To ensure a successful initialization, we propose an effective strategy that includes three different methods: stationary, visual, and dynamic, which are designed to handle various situations. In addition, we discuss the possible sensor faults that might happen in corner cases [11] and handle them accordingly. We have conducted extensive experiments to evaluate our method, and the results show the robustness of our method in different scenarios. We highlight the main contributions of this work as follows:

• We implement a low-cost SLAM system that tightly couples GNSS-RGBD-IMU-Wheel sensors, which can work reliably both indoors and outdoors by fully exploiting each sensor to enable robust initialization in diverse cases.
• We propose effective strategies to detect and handle sensor faults that may arise in sensor fusion systems, including visual failures, wheel anomalies, and GNSS degradation, thereby greatly enhancing robustness.
• We present a SLAM dataset serving as a new benchmark for challenging corner cases.

All authors are with Shanghai Jiao Tong University. *Corresponding Author: Danping Zou ([email protected]). This work was supported by National Key R&D Program (2022YFB3903802), NSFC (62073214), and Midea Group's 3D Vision Project.

II. RELATED WORK

A. Initialization of a multi-sensor SLAM system

Multi-sensor SLAM systems, particularly those that are tightly coupled, are heavily reliant on high-quality initialization due to its profound impact on system robustness and accuracy. VINS-Mono [12] employs a hybrid initialization strategy that performs vision-only Structure from Motion (SfM) and a subsequent visual-IMU alignment; the scale and gravity direction are then further optimized. This method requires sufficient IMU excitation and visual parallax for successful initialization.
However, the movement of a ground robot can easily experience unobservable DoFs, such as uniform motion, which makes the scale of the VIO system unobservable. To mitigate this issue, [4] utilizes depth information as the source of scale information for initialization. Similarly, [13] uses wheel odometer measurements to refine the gravity vector, utilizing visual SfM for scale estimation. Moreover, [14] proposes a Zero Velocity Update (ZUPT) aided initialization method with stationary and dynamic phases to achieve better robustness. However, the initialization of these systems is still not robust enough to severe sensor faults. Building upon these insights, we introduce an adaptive initialization strategy encompassing three distinct approaches tailored to different scenarios in this work.

B. Corner cases

Visual challenge: We classify vision failures into three categories. The first type is the insufficient-features problem, typically caused by a lack of texture or inadequate light. This can lead to significant drift in visual systems [12], [15]. The second is characterized by no valid feature points due to significant occlusion or aggressive motion. This can result in unsuccessful system initialization or tracking failure, as demonstrated in [11]. The third type refers to dynamic environments with numerous moving objects. The moving points on these objects can greatly degrade localization accuracy. Currently, there exist some effective semantic-based methods [16] [17] that tackle this challenge well. In this study, we focus on cost-efficient geometric approaches that do not utilize GPUs. Specifically, we reject dynamic features through feature filtration and depth validation.

Wheel odometer challenge: There are two types of wheel odometer challenges: inaccurate angular velocity and wheel anomaly. To tackle the former issue, [18] integrated measurements from wheel encoders and gyroscopes to formulate a relative motion constraint. For the latter challenge, they detect the wheel anomaly with visual information: if more than half of the previous features are considered outliers during visual-wheel optimization, wheel slippage is detected. In response, they reset the current frame's initial state and eliminate the current wheel measurements to mitigate the anomaly. However, a significant decrease in feature points could also arise from environmental changes rather than wheel-related factors, so this strategy is likely to mistakenly eliminate reliable wheel odometer observations. Considering this limitation, [19] introduced three techniques for actively detecting abnormal chassis movements with the help of motion constraints and IMU measurements. We integrate insights from these efforts to enhance our system's capacity to address wheel odometer challenges.

GNSS degradation: Typically, there are three kinds of GNSS challenges: low-speed movements, fewer than four satellites, and no GNSS signals. Early work employed the Receiver Autonomous Integrity Monitoring (RAIM) scheme [20] to assess the integrity performance of GNSS systems, detecting and mitigating errors within the GNSS receiver based on residuals. As a GNSS-Visual-IMU tightly-coupled system, GVINS [8] proposes strategies to address GNSS-related issues in challenging scenarios. GVINS remains robust in the less-than-four-satellite case due to its tightly-coupled framework; in GNSS-denied environments, GVINS degenerates into a conventional VIO. Building upon these insights, our system first filters out unreliable satellites with threshold-based methods. Furthermore, our system monitors low-speed states and ensures that GNSS-related factors are not optimized in these scenarios.

IMU saturation: He et al. [21] address IMU saturation issues for drones with highly aggressive motions, while this situation is rarely encountered by safety-critical ground robots. In our study, we consider short-term IMU observations within a sliding window to be reliable and free from faults.

C. Ground robot datasets

A high-quality dataset can stimulate progress in SLAM. While there have been a few datasets captured by ground robots [22] [23] [24], many of them are outdated and not challenging enough for current advanced SLAM systems. Our previous work [10] presents a challenging dataset, M2DGR¹, which has been met with great interest from SLAM researchers. Therefore, we further extend the M2DGR dataset by adding extra low-cost sensors, including a wheel encoder, a 2D LiDAR and a depth camera. This not only allows us to test our proposed system, but also launches a novel benchmark for future research on multi-sensor SLAM.

¹ https://github.com/SJTU-ViSYS/M2DGR

III. METHODOLOGY

Our Ground-Fusion system tightly integrates RGB images, depth information, inertial information, and wheel odometer measurements within the optimization framework. All the sensory measurements are maintained in a sliding window for real-time performance. The system consists of adaptive initialization, a multi-sensor state estimator with corner case addressing, and a dense mapping module. Before introducing each module, we clarify the notations and frames used in this paper: (·)^w denotes the world frame, (·)^b the body (IMU) frame, and (·)^c the camera frame. q^w_b and p^w_b represent the rotation and translation from the body frame to the world frame, respectively. b_j and c_j are the body frame and the camera frame when capturing the j-th image.

A. Adaptive initialization

Our initialization module consists of three alternative strategies for local odometry initialization: a dynamic method, a visual method, and a stationary one. The system determines whether there is sufficient motion excitation based on the GLRT (Generalized Likelihood Ratio Test) [25] method, which is formulated as:

G = \frac{1}{m} \sum_{t \in \psi} \left( \frac{1}{\sigma_a^2} \left\| \tilde{a}_t - g \frac{\bar{a}}{\|\bar{a}\|} \right\|^2 + \frac{1}{\sigma_\omega^2} \left\| \tilde{\omega}_t \right\|^2 \right)    (1)

where {\tilde{a}_t, \tilde{\omega}_t} are the raw measurements of the IMU, m is the number of IMU measurements within the sliding window, ψ denotes the window range, and \bar{a} is the average acceleration.
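For illustration, the GLRT statistic of Eq. (1) can be evaluated with a few lines of NumPy. The sketch below is not taken from the released Ground-Fusion code; the array layout, the helper name, and the gravity constant are our own assumptions.

```python
import numpy as np

def glrt_statistic(acc, gyro, sigma_a, sigma_w, g=9.81):
    """Evaluate Eq. (1) over one sliding window of raw IMU samples.

    acc, gyro        : (m, 3) arrays of accelerometer / gyroscope readings.
    sigma_a, sigma_w : accelerometer / gyroscope noise standard deviations.
    """
    acc = np.asarray(acc, dtype=float)
    gyro = np.asarray(gyro, dtype=float)
    a_bar = acc.mean(axis=0)                  # average acceleration over the window
    a_dir = a_bar / np.linalg.norm(a_bar)     # direction of the averaged acceleration
    acc_dev = acc - g * a_dir                 # deviation of each sample from gravity along that direction
    return float(np.mean(np.sum(acc_dev**2, axis=1) / sigma_a**2
                         + np.sum(gyro**2, axis=1) / sigma_w**2))
```

A robot at rest keeps the accelerometer close to the gravity direction and the gyroscope near zero, so G stays small; the thresholds β and γ introduced next then separate stationary, slow, and aggressive motion.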
The GLRT value roughly divides the motion state into the following three categories, where the threshold values for β and γ are determined experimentally:

\text{State} \approx \begin{cases} \text{Stationary} & (G < \beta) \\ \text{Slow Motion} & (\beta \le G \le \gamma) \\ \text{Aggressive Motion} & (G > \gamma) \end{cases}    (2)

Next, we further validate the motion state and describe the corresponding initialization method for each kind of motion.

Stationary: If G is below the value β, we introduce wheel and visual observations to further confirm whether the system is static. We utilize the wheel median integration method to predict the pose:

R_{j+1}^{o} = R_{j}^{o} \, \mathrm{Exp}\!\left( \omega_{j+1}^{o} \Delta t \right), \qquad p_{j+1}^{w} = p_{j}^{w} + v_{j+1}^{w} \Delta t    (3)

where Δt is the time difference between odometer frames o_j and o_{j+1}, \omega_{j+1}^{o} = \frac{1}{2}\left( \omega_{j}^{o} + \tilde{\omega}_{j+1}^{o} \right), and v_{j+1}^{w} = \frac{1}{2}\left( v_{j}^{w} + \tilde{v}_{j+1}^{w} \right). Assuming there are n odometer frames between consecutive images c_k and c_{k+1}, the norm of the wheel preintegration pose between them can be expressed as:

W = \left\| p_{j+n}^{w} - p_{j}^{w} \right\|^2    (4)

Additionally, we can extract feature points from the latest frame and match them with images in the sliding window. The average visual parallax can then be formulated as:

V = \frac{1}{m} \sum_{i \in [0, m-1]} \sum_{j \in [0, r-1]} \left\| p_{i}^{j} - p_{m-1}^{j} \right\|^2    (5)

where r is the number of matched feature points between the i-th image and the latest image.

In the initialization phase, if at least two of the stationary criteria {G < β, W < η, V < θ} are met (all threshold values are determined experimentally), we consider the vehicle to be static. Otherwise, we treat the vehicle as in motion and use the methods in the following paragraphs for initialization. In the confirmed stationary case, we establish the first camera frame as the local world frame and align its z-axis with the gravity direction. Subsequently, all other poses within the sliding window are aligned with the first pose, while the velocity is set to zero. The system states p, v, q are set as a constant block during optimization. Stationary detection and ZUPT are applied not only in the initialization phase, but also throughout the optimization process.

Slow motion: In the slow motion case, the camera pose p^w_{c_k}, q^w_{c_k} between two frames can be computed by solving a PnP (Perspective-n-Point) problem. Since the RGB-D camera can directly measure the depth information, the IMU pose can be calculated without a scale parameter by:

q_{b_k}^{w} = q_{c_k}^{w} \otimes \left( q_{c_k}^{b_k} \right)^{-1}, \qquad p_{b_k}^{w} = p_{c_k}^{w} - R_{b_k}^{w} p_{c_k}^{b_k}    (6)

where the extrinsic parameters p^b_c, q^b_c are provided offline.

Combining the above states with the IMU pre-integration term γ, we can calibrate the gyroscope bias by minimizing the following least-squares function:

\min_{\delta b_\omega} \sum_{k \in \mathcal{B}} \left\| \left( q_{b_{k+1}}^{c_0} \right)^{-1} \otimes q_{b_k}^{c_0} \otimes \gamma_{b_{k+1}}^{b_k} \right\|^2, \qquad \gamma_{b_{k+1}}^{b_k} \approx \hat{\gamma}_{b_{k+1}}^{b_k} \otimes \begin{bmatrix} 1 \\ \frac{1}{2} J_{b_\omega}^{\gamma} \delta b_\omega \end{bmatrix}    (7)

where \mathcal{B} represents all frame indexes in the window. Since the scale is known, the subsequent parameter initialization only contains the velocities and the gravity vector:

\mathcal{X}_I = \left[ v_{b_0}^{b_0}, v_{b_1}^{b_1}, \cdots, v_{b_n}^{b_n}, g^{c_0} \right]    (8)

Considering two consecutive IMU frames b_k and b_{k+1}, we have the following equations:

\alpha_{b_{k+1}}^{b_k} = R_{c_0}^{b_k} \left( p_{b_{k+1}}^{c_0} - p_{b_k}^{c_0} - R_{b_k}^{c_0} v_{b_k}^{b_k} \Delta t + \frac{1}{2} g^{c_0} \Delta t^2 \right), \qquad \beta_{b_{k+1}}^{b_k} = R_{c_0}^{b_k} \left( R_{b_{k+1}}^{c_0} v_{b_{k+1}}^{b_{k+1}} - R_{b_k}^{c_0} v_{b_k}^{b_k} + g^{c_0} \Delta t \right)    (9)

Combining equation (6) and equation (9), we can solve for the initial values of \mathcal{X}_I. Finally, the gravity vector obtained from this linear initialization step is further refined.

Aggressive motion: In highly aggressive motion, the visual features may be unstable due to motion blur or little overlap, causing the visual SfM-based initialization to be unreliable. By contrast, the wheel odometer measurements make the velocity and scale observable, and the pose can be calculated by wheel integration. Consequently, we opt not to employ visual SfM for pose estimation as described in Equation (6); instead, we employ a wheel-aided initialization method, in which the camera pose in equation (9) is replaced by the wheel odometer pose to solve for the gyroscope bias. To establish a consistent reference frame, we define the world frame using the first wheel frame, aligning its z-axis with that of the wheel frame. In comparison to the approach used in [13], which exclusively employs the wheel odometer for scale refinement, our method eliminates the redundant SfM component, fully harnessing the wheel odometer for a more efficient initialization process. It is worth noting that while this initialization method does not rely on visual information, once the system is successfully initialized, the visual factor can still be integrated into the tightly coupled optimization process when the system identifies effective feature points.

After successful local initialization by any of the above three methods, we perform a three-step global initialization, which is adapted from [8].
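To show how the pieces of the adaptive initialization fit together, the following sketch (our own illustration with hypothetical helpers, thresholds left symbolic) classifies the motion state with Eq. (2) and applies the two-out-of-three stationary vote over {G < β, W < η, V < θ} from Eqs. (3)-(5).

```python
import numpy as np

def classify_motion(G, beta, gamma):
    """Eq. (2): coarse motion state from the GLRT value."""
    if G < beta:
        return "stationary-candidate"
    return "slow" if G <= gamma else "aggressive"

def wheel_translation_norm(wheel_positions):
    """Eq. (4): squared norm of the wheel-integrated translation (Eq. (3))
    accumulated between two consecutive images."""
    p_first = np.asarray(wheel_positions[0], dtype=float)
    p_last = np.asarray(wheel_positions[-1], dtype=float)
    return float(np.sum((p_last - p_first) ** 2))

def is_stationary(G, W, V, beta, eta, theta):
    """Two-of-three vote over the stationary criteria {G < beta, W < eta, V < theta};
    V is the average visual parallax of Eq. (5), all thresholds tuned experimentally."""
    votes = int(G < beta) + int(W < eta) + int(V < theta)
    return votes >= 2
```

Depending on the outcome, the system then runs the stationary, visual (slow-motion), or wheel-aided dynamic initialization described above.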
B. Multi-sensor state estimator with corner case handling

We formulate state estimation as a maximum a posteriori (MAP) problem. We follow the factor graph framework of [5], which maintains a sliding window, and further extend it to a GNSS-RGBD-IMU-Wheel fusion system. The calculation of residuals and Jacobians can refer to the previous literature [8] [4] [5]. Next, we mainly introduce how our system processes sensor measurements to become more robust to corner cases.

Wheel anomalies: The wheel odometer measurement can be formulated as \tilde{v}_t^{o} = v_t^{o} + n_t^{v}, \tilde{\omega}_t^{o} = \omega_t^{o} + n_t^{\omega}. Here {\tilde{v}_t, \tilde{\omega}_t} are the raw measurements of the wheel odometer, and v^{o} = [v_x^{o}, v_y^{o}, 0]^T, \omega^{o} = [0, 0, \omega_z^{o}]^T denote the velocity and angular velocity in the wheel odometer frame.

The error of the wheel odometer mainly comes from inaccurate angular velocity estimation and sudden chassis anomalies, such as wheel slippage and collision. Since the IMU's angular velocity measurement is more reliable and has a higher frame rate than the wheel odometer, we replace the original wheel odometer measurement with the IMU angular velocity using a linear fitting method:

\omega_z^{o} = \left[ R_b^{o} \left( \omega_m^{b} + \frac{\omega_n^{b} - \omega_m^{b}}{t_n - t_m} (t - t_m) - b_\omega \right) \right]_z    (10)

where t_m and t_n are the two IMU timestamps closest to the current wheel measurement.

To detect wheel anomalies, we compare the pre-integration of both the IMU and the wheel odometer measurements between the current frame and the second latest frame. If the difference of their resulting poses' norms surpasses the threshold ε = 0.015, we regard it as a wheel anomaly. In this case, we refrain from incorporating the current wheel odometer observations into the subsequent optimization process.

Vision anomalies: Our system employs the KLT sparse optical flow algorithm [26] for tracking feature points, as adapted from [12]. Three visual challenges include the no-valid-feature problem during initialization, the insufficient-feature issue during localization, and dynamic environments. The first one has been addressed in Sec. III-A, while the second one can be mitigated by tightly-coupled wheel odometer and IMU data. To address dynamic objects, we further introduce two strategies: feature filtration and depth validation.

For feature filtration, we first adopt the flow-back method by reversing the order of the two frames for optical flow backtracking. Only the feature points that are successfully tracked during both iterations and exhibit an adjacent distance below a specified threshold are retained for further processing. Moreover, we introduce a wheel-assisted moving consistency check (MCC) approach. Our system utilizes the wheel-preintegration pose and previous optimized poses. For a feature observed for the first time in the i-th image and subsequently observed in m other images within the sliding window, we define the average reprojection residual r_k of the feature observations as:

r_k = \frac{1}{m} \sum_{j \neq i} \left\| u_k^{c_i} - \pi\!\left( T_b^{c} \, T_w^{b_i} \, T_{b_j}^{w} \, T_c^{b} \, P_k^{c_j} \right) \right\|    (11)

where u_k^{c_i} represents the observation of the k-th feature in the i-th frame, and P_k^{c_j} is the 3D location of the k-th feature in the j-th frame. The function π denotes the camera projection model. When the value of r_k exceeds a preset threshold, the k-th feature is considered dynamic and is removed from further processing. This approach offers the advantage of preemptively eliminating potential error-tracking features in the current image frame before the optimization phase.

For depth validation, we begin by associating the depth measurements acquired from the depth camera with each pixel representing a feature point. In cases where the depth measurement surpasses the effective range of the depth camera, the pixel is temporarily left empty. Subsequently, we employ the triangulation method on the RGB image to calculate the depth of the feature points, thus filling in all the pixel depths. Additionally, for those feature points where the disparity between the depth measured by the depth camera and the depth computed through triangulation is below a predefined threshold, we record their indexes and fix their depth values to a constant during the optimization phase.

GNSS anomalies: Three GNSS challenge scenarios include too few satellites, no satellite signal, and low-speed movement. In the first two cases, [8] has proven that with the help of the tightly-coupled GNSS-Visual-Inertial framework, limited reliable satellites can still be effective in improving the global state estimation, and the GVIO system degrades into a VIO system when no GNSS signals are observed. In this work, our system first filters out unreliable satellites with excessive pseudo-range and Doppler uncertainty, those with an insufficient number of tracking times, and those at a low elevation angle. In low-speed scenarios with a GNSS receiver velocity below the threshold v_ths = 0.3 m/s (the noise level of the Doppler shift), we do not involve GNSS factors in the optimization to prevent GNSS noise from corrupting the state estimation.
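The GNSS gating above can be summarized by a small sketch. The record fields and function names below are illustrative assumptions rather than the actual interface of the released code, and the thresholds other than v_ths = 0.3 m/s are left as parameters.

```python
def select_reliable_satellites(sats, max_psr_std, max_dopp_std,
                               min_track_count, min_elev_deg):
    """Drop satellites with excessive pseudo-range/Doppler uncertainty,
    too few tracking epochs, or a low elevation angle."""
    return [s for s in sats
            if s["psr_std"] <= max_psr_std
            and s["dopp_std"] <= max_dopp_std
            and s["track_count"] >= min_track_count
            and s["elev_deg"] >= min_elev_deg]

def add_gnss_factors(receiver_speed, v_ths=0.3):
    """Skip GNSS factors whenever the receiver speed is below the Doppler
    noise level (0.3 m/s), so GNSS noise cannot corrupt the state estimate."""
    return receiver_speed >= v_ths
```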
IV. EXPERIMENTS

A. Benchmark tests

Localization performance: OpenLORIS-Scene [24] is a SLAM dataset collected by a ground robot with an RGBD camera, an IMU and a wheel odometer. Ground-Fusion is tested against cutting-edge SLAM systems on three scenarios of OpenLORIS-Scene [24], namely office (7 seqs), home (5 seqs) and corridor (3 seqs). Table I shows that Ground-Fusion performs well on these scenarios.

TABLE I
AVERAGE ATE RMSE (m) OF SLAM SYSTEMS ON CHALLENGING SEQUENCES OF OPENLORIS-SCENE [24] DATASET

Scene    | ORBSLAM3 [15] | DSO [27] | VINS-Mono [12] | InfiniT [28] | Elastic-F [29] | Ours
office   | 0.52          | 0.75     | 0.19           | 0.16         | 0.13           | 0.18
home     | 1.11          | 0.67     | 0.54           | 0.58         | 1.45           | 0.46
corridor | 2.98          | 10.66    | 3.41           | 3.43         | 17.76          | 0.71

Initialization performance: We conduct initialization tests on the Ground-Challenge dataset [11] with complex sequences in corner cases². To evaluate the efficiency of system initialization, we measure the time required for each system to complete the initialization process, defined as the difference between the timestamp of the first observation received by the system and that of the first output pose. In terms of initialization quality, we evaluate the Absolute Trajectory Error (ATE) RMSE [22] for each system, focusing on the initial 10 seconds of each sequence. We select several challenging sequences from Ground-Challenge [11] for the initialization test. These sequences include Office3, which features changing light conditions; Darkroom2, recorded in a dark room; and Wall2, captured in front of a textureless wall. Additionally, Motionblur3 exhibits severe camera motion blur during aggressive movement, Occlusion4 involves severe camera occlusion, and Static1 was recorded in a stationary state with wheel suspension.

² https://github.com/sjtuyinjie/Ground-Challenge

We evaluate our method against baseline methods on these challenging sequences. Results in Table II demonstrate that our method excels in both quality and efficiency of initialization. Specifically, in sequence Office3, our system adaptively chooses the visual initialization method, requiring the least amount of time. In visually challenging scenarios with limited textures (Darkroom2 and Wall2) and visual failure (Motionblur3 and Occlusion4), our system adopts the dynamic initialization method and initializes successfully with high efficiency, while some baseline systems fail due to inadequate feature points for visual SfM. In the stationary case Static1, our system initializes successfully with the stationary method, while baseline methods fail due to insufficient motion excitation.

TABLE II
INITIALIZATION TIME COST (s) AND ATE RMSE (m) OF SLAM SYSTEMS ON SAMPLE SEQUENCES

Sequence    | VINS-Mono [12] | VINS-RGBD [4] | VIW-Fusion  | Ours
Office3     | 2.09 / 0.02    | 2.10 / 0.02   | 2.42 / 0.02 | 1.37 / 0.02
Darkroom2   | 1.30 / 2.06    | 2.52 / 0.04   | 1.61 / 0.04 | 1.36 / 0.03
Wall2       | 2.50 / 0.08    | 1.27 / 0.07   | 2.00 / 0.05 | 1.37 / 0.03
Motionblur3 | fail           | fail          | 1.30 / 0.11 | 1.30 / 0.09
Occlusion4  | fail           | fail          | fail        | 5.52 / 0.14
Static1     | fail           | fail          | fail        | 1.42 / 0.00

[Figure 2: estimated trajectories of ESKF, VINS-Mono, VINS-RGBD, VIW-Fusion, Ground-Fusion and ground truth on the x-y plane. Panels: (a) Loop2, (b) Corridor1.]
Fig. 2. Estimated and ground-truth (GT) trajectories on part of the sample sequences are visualized on the x-y plane.

[Figure 3: relative pose error and effective feature count over time for VINS-Mono, VINS-RGBD, ESKF, VIW-Fusion and Ours. Panels: (a) Office3, (b) Wall2, (c) Motionblur3, (d) Hall1 (with Ours-Yolo, Classic MCC, Ours-Baseline and Ours variants).]
Fig. 3. The relative pose errors (m) of each method and the number of effective feature points over time on some visually challenging sequences are plotted.

Visual challenges: We further introduce two new sequences for evaluation: sequence Hall1 was recorded in a highly dynamic hall with many people moving around; sequence Rotation3 involves the robot rotating with little translation, which can affect the triangulation process in the visual front-end. We evaluated cutting-edge SLAM systems along with our method on the aforementioned sequences. The evaluated algorithms include VINS-Mono [12], VINS-RGBD [4], VIW-Fusion³ and TartanVO [30] (a learning-based VO). Additionally, we implement an ESKF-based IMU-wheel fusion odometry as a baseline without visual input. The results are shown in Table III, and trajectories for some sample sequences are visualized in Figure 2. Overall, Ground-Fusion achieves the best localization results across all tested sequences.

³ https://github.com/TouchDeeper/VIW-Fusion

TABLE III
ATE RMSE (m) OF SLAM SYSTEMS ON SAMPLE SEQUENCES

Sequence    | VINS-Mono [12] | VINS-RGBD [4] | VIW-Fusion | TartanVO [30] | ESKF | Ours
Office3     | 0.34           | 0.31          | 0.18       | 1.52          | 0.15 | 0.14
Darkroom2   | 86.06          | 0.82          | 0.53       | 1.18          | 0.38 | 0.22
Wall2       | 1.21           | 1.00          | 0.15       | 2.76          | 0.20 | 0.12
Hall1       | 7.06           | 94.27         | 0.85       | 3.08          | 1.29 | 0.36
Rotation3   | 29.12          | 0.19          | 0.18       | 0.13          | 0.14 | 0.08
Motionblur3 | 9.37           | 32.31         | 0.78       | 1.61          | 0.44 | 0.26
Occlusion4  | —              | —             | —          | 2.35          | 0.16 | 0.15
Roughroad3  | 0.17           | 25.52         | 0.14       | 0.41          | 0.17 | 0.11
Slope1      | 9.41           | 2.84          | 0.65       | 3.13          | 3.13 | 0.64
Loop2       | 6.09           | 3.44          | 9.23       | 13.18         | 6.31 | 2.28
Corridor1   | 4.48           | 0.85          | 1.12       | 2.05          | 1.55 | 0.74
Static1     | —              | —             | —          | 0.01          | 2.87 | 0.01

To illustrate the visual challenges, we plot the relative pose error (RPE) of each method and the number of valid visual feature points over time in Figure 3. The results show that insufficient feature points greatly degrade the performance of the visual front-end. For instance, at 25 seconds in sequence Office3, the feature point number suddenly drops to zero due to tracking loss, causing a notable increase in RPE for VINS-Mono [12] and VINS-RGBD [4]. Similarly, VINS-Mono struggles to estimate depth by triangulation during pure rotation (Rotation3), resulting in significant drift. In such scenarios, our system still performs well due to the tightly-coupled wheel odometer. In Occlusion4, with no valid feature points observed, most systems, including VIW-Fusion, fail to initialize. By contrast, our system initializes using the wheel-aided dynamic approach and outperforms the wheel-IMU fusion ESKF baseline in localization accuracy.

For the dynamic environment (Hall1), VINS-Mono [12] suffers from significant drift. We conducted a comparison between a method using a YOLO [31] module to remove features within the bounding boxes of predefined moving categories (e.g. people) and a baseline method with a classic MCC using the optimized pose. In Figure 3 (d), while YOLO significantly reduces the number of feature points, some of the removed points are not actually on dynamic objects; consequently, the RPE does not show a significant decrease compared to the baseline. When incorporating the wheel-aided MCC method, the system effectively eliminates genuine dynamic points, leading to a significant improvement in positioning accuracy. To further validate the efficacy of the wheel-aided MCC, we conducted ablation tests on all seven visual challenge sequences, showing an average decrease of 0.07 m in ATE RMSE compared to the baseline method.
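For reference, the wheel-aided MCC evaluated above amounts to applying Eq. (11) per feature. The sketch below is our own illustration, assuming 4x4 homogeneous transforms and a user-supplied pinhole projection function project(); it is not the released implementation.

```python
import numpy as np

def mcc_residual(uv_i, points_cj, T_w_bj, T_bi_w, T_b_c, project):
    """Average reprojection residual of Eq. (11) for one feature.

    uv_i      : pixel observation of the feature in the i-th frame.
    points_cj : list of 3D positions P^{c_j}_k of the feature in the other frames.
    T_w_bj    : list of body poses T^w_{b_j} (4x4) from wheel preintegration /
                previously optimized states, one per observation.
    T_bi_w    : inverse pose of the i-th body frame, T^{b_i}_w (4x4).
    T_b_c     : camera-to-body extrinsic T^b_c (4x4); project() is the camera model pi.
    """
    T_c_b = np.linalg.inv(T_b_c)
    residuals = []
    for P_cj, T_wbj in zip(points_cj, T_w_bj):
        P_h = np.append(np.asarray(P_cj, dtype=float), 1.0)   # homogeneous point in frame c_j
        P_ci = T_c_b @ T_bi_w @ T_wbj @ T_b_c @ P_h            # transfer into camera frame c_i
        residuals.append(np.linalg.norm(np.asarray(uv_i) - project(P_ci[:3])))
    return float(np.mean(residuals))

# A feature is flagged as dynamic (and dropped before optimization)
# when its average residual exceeds the preset threshold:
# dynamic = mcc_residual(...) > threshold
```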
Wheel odometer challenges: In sequences Roughroad3 and Slope1, the robot moves on a rough road and a steep slope, respectively; in Loop2, the robot traverses a hall with a carpeted aisle loop, where the wheels slip considerably; sequence Corridor1 is a zigzag route in a long corridor. Table III shows that our method achieves the best performance on all these sequences. We further conducted ablation tests to verify the effect of using the IMU angular velocity as a substitute for the wheel angular velocity. We selected two sequences with sharp turns, Corridor1 and Loop2. The results in Table IV indicate that the IMU-odometer measurements contribute to better localization accuracy.

TABLE IV
ATE RMSE (m) OF SLAM SYSTEMS ON SELECTED SEQUENCES

Sequence  | Odom  | IMU-Odom | Baseline | Ours
Corridor1 | 2.17  | 1.03     | 0.89     | 0.66
Loop2     | 17.60 | 6.53     | 4.66     | 1.53
Anomaly   | 0.78  | 0.77     | 0.62     | 0.07
Static    | 3.54  | 3.51     | 2.35     | 0.01

Moreover, we test our method on sequences with wheel anomalies. In the Anomaly sequence, the robot body moves as the carpet beneath it is pulled away, while the robot wheels do not move. Conversely, in the Static sequence the robot is suspended, so the robot body does not move even when the wheels are moving. The results of different methods on the two sequences are shown in Table IV, where "baseline" denotes Ground-Fusion without wheel anomaly detection, while "ours" denotes our full method. As depicted in Figure 4 (a), a wheel anomaly is evident between 20 s and 40 s. Our full method adeptly eliminates the erroneous wheel odometer readings here. Figure 4 (b) shows that only our full method matches the ground-truth trajectory well. Similarly, our method effectively detects the wheel anomaly in the Static sequence.

[Figure 4: (a) relative pose error and anomaly value over time for Odom, IMU-Odom, Baseline and Ours, with the detection threshold marked; (b) x-y trajectories of the same methods against GT.]
Fig. 4. (a) Wheel anomaly analysis and (b) trajectories of different methods in the Anomaly sequence.

B. Outdoor Tests

We further evaluate our method in large-scale outdoor environments as follows. We built a ground robot for data collection, with all the sensors well synchronized and calibrated. We recorded sequences in various scenarios⁴ and choose the three most challenging sequences for this paper: in sequence Lowspeed, the ground vehicle moved at a low speed and made several stops; sequence Tree was recorded under dense tree cover, causing occlusion of the GNSS satellites; in sequence Switch, the vehicle transitioned from outdoors to indoors, and then returned outdoors again.

⁴ https://github.com/SJTU-ViSYS/M2DGR-plus

GNSS challenge: We evaluate our method under GNSS challenges against baseline methods, with their localization results shown in Table V. Overall, Ground-Fusion outperforms the baseline methods in all these cases. In Lowspeed, when the robot is stationary, GVINS fails to localize due to Doppler noise, and VINS-GW⁵ also drifts, while Ground-Fusion works robustly by removing unreliable GNSS measurements. As shown in Figure 5 (a), in Switch, VINS-GW experiences severe drift due to the loosely-coupled GNSS signals that deteriorate significantly as the robot approaches indoor areas, while both GVINS and our method remain unaffected due to their tightly-coupled integration. In sequence Tree, GVINS falters due to unstable visual features, while our method remains robust thanks to the tightly-coupled wheel and depth measurements.

⁵ https://github.com/Wallong/VINS-GPS-Wheel

TABLE V
ATE RMSE (m) OF SLAM SYSTEMS ON SAMPLE SEQUENCES

Method        | Lowspeed | Switch | Tree
Raw Odom      | 8.88     | 4.95   | 2.88
SPP           | 2.54     | 6.60   | 3.03
GVINS [8]     | fail     | 1.40   | fail
VINS-RGBD [4] | 4.72     | 1.70   | 2.27
VINS-GW       | 20.68    | 33.61  | 3.26
Ours          | 0.63     | 1.32   | 0.55

Zero velocity update (ZUPT): We visualize the three stationary detection methods together with the GT distance on the Lowspeed sequence in Figure 5 (b). The figure shows that a single sensor might misclassify the motion state; for instance, the wheel method fails to detect the stationary state at approximately 110 seconds. By contrast, our scheme combines three sensors, presenting reliable detection of the stationary state. Quantitatively, the ATE RMSE in Lowspeed decreased by 0.05 m after ZUPT.

[Figure 5: (a) relative pose error and satellite count over time for Odom, GVINS, VIWG and Ours; (b) stationary-detection values of the three criteria over time.]
Fig. 5. (a) ATE RMSE (m) of SLAM systems on wheel anomaly sequences. (b) The solid lines denote the value of each method, and dashed lines denote their corresponding thresholds. Grey shading denotes areas where at least two stationary conditions are satisfied.

V. CONCLUSION

This paper presents a tightly-coupled RGBD-Wheel-IMU-GNSS SLAM system to achieve reliable localization for ground vehicles. Our system features robust initialization through three strategies. We have also devised effective anomaly detection and handling methods to address corner cases, with experimental results demonstrating the superiority of our system.
REFERENCES

[1] J. Yin, C. Liang, X. Li, Q. Xu, H. Wang, T. Fan, Z. Wu, and Z. Zhang, "Design, sensing and control of service robotic system for intelligent navigation and operation in internet data centers," in 2023 IEEE 19th International Conference on Automation Science and Engineering (CASE). IEEE, 2023, pp. 1-8.
[2] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309-1332, 2016.
[3] A. Martinelli, "Closed-form solution of visual-inertial structure from motion," International Journal of Computer Vision, vol. 106, pp. 138-152, 2013.
[4] Z. Shan, R. Li, and S. Schwertfeger, "Rgbd-inertial trajectory estimation and mapping for ground robots," Sensors, vol. 19, no. 10, p. 2251, 2019.
[5] K. J. Wu, C. X. Guo, G. Georgiou, and S. I. Roumeliotis, "Vins on wheels," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 5155-5162.
[6] T. Hua, L. Pei, T. Li, J. Yin, G. Liu, and W. Yu, "M2c-gvio: motion manifold constraint aided gnss-visual-inertial odometry for ground vehicles," Satellite Navigation, vol. 4, no. 1, pp. 1-15, 2023.
[7] J. Yin, H. Jiang, J. Wang, D. Yan, and H. Yin, "A robust and efficient ekf-based gnss-visual-inertial odometry," in 2023 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2023, pp. 1-5.
[8] S. Cao, X. Lu, and S. Shen, "Gvins: Tightly coupled gnss-visual-inertial fusion for smooth and consistent state estimation," IEEE Transactions on Robotics, 2022.
[9] J. Yin, T. Li, H. Yin, W. Yu, and D. Zou, "Sky-gvins: a sky-segmentation aided gnss-visual-inertial system for robust navigation in urban canyons," Geo-spatial Information Science, vol. 0, no. 0, pp. 1-11, 2023.
[10] J. Yin, A. Li, T. Li, W. Yu, and D. Zou, "M2dgr: A multi-sensor and multi-scenario slam dataset for ground robots," IEEE Robotics and Automation Letters, 2022.
[11] J. Yin, H. Yin, C. Liang, H. Jiang, and Z. Zhang, "Ground-challenge: A multi-sensor slam dataset focusing on corner cases for ground robots," in 2023 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2023, pp. 1-5.
[12] T. Qin, P. Li, and S. Shen, "Vins-mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004-1020, 2018.
[13] J. Liu, W. Gao, and Z. Hu, "Visual-inertial odometry tightly coupled with wheel encoder adopting robust initialization and online extrinsic calibration," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 5391-5397.
[14] L. Gui, C. Zeng, S. Dauchert, J. Luo, and X. Wang, "A zupt aided initialization procedure for tightly-coupled lidar inertial odometry based slam system," Journal of Intelligent & Robotic Systems, vol. 108, no. 3, p. 40, 2023.
[15] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, "Orb-slam3: An accurate open-source library for visual, visual-inertial, and multimap slam," IEEE Transactions on Robotics, 2021.
[16] J. Liu, X. Li, Y. Liu, and H. Chen, "Rgb-d inertial odometry for a resource-restricted robot in dynamic environments," IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 9573-9580, 2022.
[17] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "Ds-slam: A semantic visual slam towards dynamic environments," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1168-1174.
[18] M. Quan, S. Piao, M. Tan, and S.-S. Huang, "Tightly-coupled monocular visual-odometric slam using wheels and a mems gyroscope," IEEE Access, vol. 7, pp. 97374-97389, 2019.
[19] G. Peng, Z. Lu, S. Chen, D. He, and L. Xinde, "Pose estimation based on wheel speed anomaly detection in monocular visual-inertial slam," IEEE Sensors Journal, vol. 21, no. 10, pp. 11692-11703, 2020.
[20] S. Hewitson and J. Wang, "Gnss receiver autonomous integrity monitoring (raim) performance analysis," GPS Solutions, vol. 10, pp. 155-170, 2006.
[21] D. He, W. Xu, N. Chen, F. Kong, C. Yuan, and F. Zhang, "Point-lio: Robust high-bandwidth light detection and ranging inertial odometry," Advanced Intelligent Systems, p. 2200459, 2023.
[22] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of rgb-d slam systems," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012, pp. 573-580.
[23] K. Y. Leung, Y. Halpern, T. D. Barfoot, and H. H. Liu, "The utias multi-robot cooperative localization and mapping dataset," The International Journal of Robotics Research, vol. 30, no. 8, pp. 969-974, 2011.
[24] X. Shi, D. Li, P. Zhao, Q. Tian, Y. Tian, Q. Long, C. Zhu, J. Song, F. Qiao, L. Song, et al., "Are we ready for service robots? the openloris-scene datasets for lifelong slam," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 3139-3145.
[25] I. Skog, P. Handel, J.-O. Nilsson, and J. Rantakokko, "Zero-velocity detection—an algorithm evaluation," IEEE Transactions on Biomedical Engineering, vol. 57, no. 11, pp. 2657-2666, 2010.
[26] B. D. Lucas, T. Kanade, et al., An Iterative Image Registration Technique with an Application to Stereo Vision. Vancouver, 1981, vol. 81.
[27] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611-625, 2017.
[28] O. Kähler, V. A. Prisacariu, C. Y. Ren, X. Sun, P. Torr, and D. Murray, "Very high frame rate volumetric integration of depth images on mobile devices," IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 11, pp. 1241-1250, 2015.
[29] T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker, and A. Davison, "Elasticfusion: Dense slam without a pose graph," in Robotics: Science and Systems, 2015.
[30] W. Wang, Y. Hu, and S. Scherer, "Tartanvo: A generalizable learning-based vo," in Conference on Robot Learning. PMLR, 2021, pp. 1761-1772.
[31] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
