Improving Repeatability of Experiments by Automatic Evaluation of SLAM Algorithms
Abstract— The development of good experimental methodologies for robotics often takes inspiration from general principles of experimental practice. Repeatability prescribes that experiments should involve several trials in order to guarantee that results are not achieved by chance, but are systematic, and statistically significant trends can be identified. In this paper, we propose an approach to improve the repeatability of experiments performed in robotics. In particular, we focus on the domain of SLAM (Simultaneous Localization And Mapping) and we introduce a system that exploits simulations to generate a large number of test data on which SLAM algorithms are automatically evaluated in order to obtain consistent results, according to the principle of repeatability.
I. INTRODUCTION

Development of good experimental methodologies for robotics is a topic that has attracted increasing interest [1]. The discussion has evolved from early methodological proposals [2], [3] to a tangible impact on publications, with special issues [4] and special kinds of articles (reproducible articles or R-articles) [5]. Several practical solutions have been advanced to support good experimental methodologies, ranging from the use of datasets [6], [7], to the development of platforms for benchmarking [8], [9], and to the definition of robotic competitions [10], [11].
Among the several aspects that are involved in good experimental methodologies, the principles of reproducibility and repeatability are central. They refer to two similar but not fully overlapping characteristics of experimental practice [12]. Reproducibility is the possibility to verify, in an independent way, the results of an experiment. This means that experimenters, other than those claiming the validity of the results, should be able to achieve the same results when starting from the same initial conditions, using the same type of instruments and parameters, and adopting the same experimental techniques. Repeatability, instead, refers to the fact that a single result is not sufficient to ensure the success of an experiment. A successful experiment must involve a number of trials, possibly performed at different times and in different places, in order to guarantee that results have not been achieved by chance, but are systematic, and that statistically significant trends can be identified.

In this paper we propose an approach that enhances the repeatability of experiments performed in robotics. In particular, we focus on the domain of SLAM (Simultaneous Localization And Mapping) [13] and we first show that the performance of SLAM algorithms presents some variability when the algorithms are applied to data collected with different runs in the same environment. This aspect is often disregarded, and SLAM algorithms are usually evaluated on data acquired with single runs in different environments. We then introduce a system that exploits simulations to generate a large number of test data on which SLAM algorithms are automatically evaluated in order to obtain consistent results, according to the principle of repeatability. Our system is finally validated by showing that a SLAM algorithm applied to the test data we generate exhibits a performance very similar to that obtained when the algorithm is applied to data coming from real robots.

This paper is organized as follows. The next section motivates the contribution we provide. Section III illustrates our proposed method and the system we developed, which is experimentally validated in Section IV. Section V concludes the paper.
1 F. Amigoni and V. Castelli are with the Artificial Intelligence and Robotics Laboratory, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano, Italy [email protected], [email protected]
2 M. Luperto is with the Applied Intelligent Systems Laboratory, Università degli Studi di Milano, Via Festa del Perdono 7, 20122 Milano, Italy [email protected]

II. MOTIVATION

In this section, we motivate the approach we present in this paper to improve the repeatability of experiments in robotics. We focus on a specific, yet significant and widely studied, domain, that of SLAM. Broadly speaking, in a SLAM problem, a robot should localize itself within a map of the environment that it is building at the same time, on the basis of data coming from its sensors, typically laser range scanners and encoders.

We consider a well-known SLAM algorithm based on particle filters, called GMapping [14]. In brief, it maintains a predefined number of hypotheses (particles) about the map of the environment and the pose of the robot, which are continuously updated according to the information provided by new observations (laser range scans and odometry readings). The selection of the particles that should be maintained or eliminated at each update step is based on a maximum likelihood probabilistic approach, so that particles that are less likely to represent the current knowledge of the robot (including the observations) tend to be replaced.

We also consider a commonly used metric to evaluate the performance of SLAM algorithms [15], which provides a measure of the translational and rotational components of the localization error, calculated by comparing the trajectory of the robot as reconstructed by a SLAM algorithm with the ground truth trajectory. The details of the metric are explained in Section III-B.
buildings of the MIT university campus, from [18]. Their size ranges from 1000 m² to 30 000 m². Finally, we complete ℰ with our dataset of 64 floor plans [19], 26 offices and 38 schools, whose size ranges from 100 m² to 10 000 m².

For each environment E ∈ ℰ we collect, using simulations, a set of test data D_E to be fed to GMapping. Note that, in addition to reducing the costs of data collection, simulations also easily provide the ground truth of the trajectories followed by the robot, which, as discussed in Section III-B, is required by the metric we employ.
Simulations are performed in Stage, using the ROS GMapping² and Navigation packages³. Mapping is performed using 40 particles and processing a new scan whenever the robot travels 1 m, rotates 0.25 rad, or 5 s have passed since the last update of the map. We employ a virtual robot equipped with a two-dimensional laser range scanner with a field of view of 270°, an angular resolution of 0.5°, and a range of 30 m. In our simulations, we assume that the virtual robot has a translational odometry error of up to 0.01 m/m and a rotational odometry error of up to 2°/rad, which provide a reasonable approximation of the odometry accuracy of real wheeled robots. The actual amount of error is randomly chosen by the simulator (at the start of each run) with uniform probability in the ranges [−0.005, +0.005] m/m and [−1, +1]°/rad, respectively. Although Stage, as any simulator, does not fully capture all the aspects of the real world, its use allows us to generate data easily. Moreover, as shown in Section IV, these data are quite similar (for the purpose of evaluating GMapping performance) to those obtained with real robots.
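As a small illustration of this noise model, the per-run error parameters can be drawn as in the sketch below; the function name is ours and only mirrors the bounds reported above, it is not part of Stage or of the ROS packages:

import random

def draw_odometry_error(trans_bound=0.005, rot_bound=1.0):
    """Draw the odometry error parameters for one exploration run:
    a translational error in [-trans_bound, +trans_bound] m/m and a
    rotational error in [-rot_bound, +rot_bound] deg/rad, both uniform."""
    trans_err = random.uniform(-trans_bound, trans_bound)  # m per m traveled
    rot_err = random.uniform(-rot_bound, rot_bound)        # deg per rad turned
    return trans_err, rot_err

# One draw per run, kept fixed for the whole run.
trans_err, rot_err = draw_odometry_error()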
Given an environment E and a starting pose for the robot (close to the center of the environment, the same for all runs), a run R_E explores E using the frontier-based exploration approach of [20], according to which the robot moves to the closest frontier, where a frontier is a region on the boundary between known and unknown space, collecting laser range scans and odometry readings at each time step (every 100 ms in our case). These (timestamped) data are both fed to the ROS GMapping node and stored in a ROS bag file⁴. A run R_E ends when two consecutive snapshots of the grid map produced by GMapping (taken every 120 s for small environments and every 600 s for large environments) are similar enough, according to the mean square error metric that evaluates the difference between the two images. Empirically, this automated criterion for termination amounts to fully mapping the environment E in almost all runs.
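A minimal sketch of this termination test, assuming the two snapshots are occupancy grids of equal size stored as NumPy arrays; the threshold value is a placeholder of ours, not the one used in our system:

import numpy as np

def exploration_finished(prev_snapshot, curr_snapshot, mse_threshold=1e-3):
    """Return True when two consecutive grid-map snapshots are similar
    enough, according to the mean square error between them."""
    prev = np.asarray(prev_snapshot, dtype=float)
    curr = np.asarray(curr_snapshot, dtype=float)
    mse = np.mean((curr - prev) ** 2)
    return mse < mse_threshold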
Note that the exploration process finds frontiers on the map of the environment incrementally built by GMapping. As a consequence of this, and of the localization errors simulated in Stage, exploration runs follow different paths in the environment. These paths include several loop closures, although our online exploration approach does not actively seek to find them. (The use of offline approaches that optimize loop closures is an interesting future direction of work.)

At the end of each run R_E we thus have the set D(R_E) of data (laser range scans and odometry readings) collected along the path followed to cover environment E. These data are fed to GMapping that, at the end of the exploration, produces the grid map M(R_E) and the estimated poses of the robot x_{1:T_{R_E}} from time step 1 to T_{R_E} (the time step at which the exploration run R_E ended). The process is automatically iterated until a number of runs |ℛ_E| (where ℛ_E is the set of runs performed in E) are performed for each environment E, as we discuss in Section III-C. Eventually, for each environment E ∈ ℰ, we have the set D_E = {D(R_E) for all R_E ∈ ℛ_E} of test data and the corresponding results produced by GMapping, namely the set of grid maps {M(R_E) for all R_E ∈ ℛ_E} and the set of estimated poses {x_{1:T_{R_E}} for all R_E ∈ ℛ_E}. (We note that the test data D_E could be used to evaluate other SLAM algorithms without the need to re-run the simulations.)

We point out that the test data D_E are relative to the particular configuration of the virtual robot (and of its sensors) that we have considered. For example, changing the field of view or the range of the laser range scanner leads to a set of data that could be different. However, generating a new set of test data D_E for an environment E is relatively cheap (for example, in a large environment, like that of Fig. 6b, an exploration run requires, on average, 43.8 minutes). Similarly, the data {M(R_E)} and {x_{1:T_{R_E}}}, representing the results of GMapping, depend on the configuration of the algorithm (and, of course, on the test data D_E).
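For concreteness, the data associated with one run can be pictured as a simple container like the following (an illustrative sketch with names of our choosing; it does not reproduce the actual format of the ROS bag files or of the GMapping output):

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np

Pose = Tuple[float, float, float]  # (x, y, theta), an element of SE(2)

@dataclass
class RunData:
    """Test data D(R_E) collected during one exploration run R_E,
    together with the corresponding GMapping results."""
    scans: List[np.ndarray]                 # timestamped laser range scans
    odometry: List[Pose]                    # odometry readings
    ground_truth_poses: List[Pose]          # from the simulator (Section III-B)
    grid_map: Optional[np.ndarray] = None   # M(R_E), the final occupancy grid
    estimated_poses: List[Pose] = field(default_factory=list)  # x_{1:T_{R_E}}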
2 http://wiki.ros.org/gmapping Let i,j = xj xi be the relative transformation that moves
3 http://wiki.ros.org/navigation the pose xi onto xj and let i,j⇤
= x⇤j x⇤i .
4 http://wiki.ros.org/Bags Finally, let be a set of N pairs of relative transformations
The localization error performance metric is defined as:

\[
\varepsilon(\Delta) = \frac{1}{N}\sum_{i,j}\left(\delta_{i,j} \ominus \delta^{*}_{i,j}\right)^{2}
= \frac{1}{N}\sum_{i,j}\left[\mathrm{trans}\left(\delta_{i,j} \ominus \delta^{*}_{i,j}\right)^{2} + \mathrm{rot}\left(\delta_{i,j} \ominus \delta^{*}_{i,j}\right)^{2}\right]
= \varepsilon_t(\Delta) + \varepsilon_r(\Delta),
\]

where the sums are over the elements of Δ, ⊖ is the inverse of the standard motion composition operator, and trans(·) and rot(·) are used to separate the translational and rotational components of the error.
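The following sketch shows how ε_t(Δ) and ε_r(Δ) can be computed for planar poses represented as (x, y, θ); it follows the displayed (squared) form of the metric and is our illustration, not the reference implementation of [15]:

import math

def ominus(a, b):
    """Inverse of the standard motion composition: expresses pose a in the
    frame of pose b (poses are (x, y, theta) tuples)."""
    ax, ay, at = a
    bx, by, bt = b
    dx, dy = ax - bx, ay - by
    c, s = math.cos(bt), math.sin(bt)
    return (c * dx + s * dy,
            -s * dx + c * dy,
            math.atan2(math.sin(at - bt), math.cos(at - bt)))

def localization_error(pairs):
    """pairs: iterable of (delta_ij, delta_ij_star), i.e., the set Delta.
    Returns (eps_t, eps_r), the translational and rotational components."""
    eps_t = eps_r = 0.0
    n = 0
    for delta, delta_star in pairs:
        ex, ey, eth = ominus(delta, delta_star)  # delta (-) delta*
        eps_t += ex ** 2 + ey ** 2               # trans(.)^2
        eps_r += eth ** 2                        # rot(.)^2
        n += 1
    return eps_t / n, eps_r / n

# A pair is built from estimated poses x_i, x_j and ground truth x*_i, x*_j:
#   delta      = ominus(x_j, x_i)
#   delta_star = ominus(x_star_j, x_star_i)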
In order to apply this metric, we need to address two issues. First, the metric as defined above is intrinsically devoted to evaluating a single run R_E in the environment E. Second, the metric requires defining the set Δ of pairs of relative transformations.

Addressing the first issue amounts to moving from the evaluation of the performance measured on a single run in an environment E (namely, on test data D(R_E)) to the evaluation of the performance measured on all the runs in E (namely, on test data D_E). In principle, one would like to evaluate the expected localization error of a generic exploration run in an environment E.

Definition III.2. Let p_{Δ_E} be the probability of observing the set Δ_E of relative transformations during an exploration run in an environment E.
The mean translational localization error in E, E[ε_t(E)], is the expected value of the translational component of the localization error over all the possible exploration runs on E:

\[
\mathrm{E}[\varepsilon_t(E)] = \sum_{\Delta_E} \varepsilon_t(\Delta_E) \cdot p_{\Delta_E}.
\]

The standard deviation of the translational localization error in E, σ[ε_t(E)], is:

\[
\sigma[\varepsilon_t(E)] = \sqrt{\mathrm{E}[\varepsilon_t(E)^2] - \mathrm{E}[\varepsilon_t(E)]^2}.
\]

Similarly, the mean rotational localization error in E, E[ε_r(E)], is the expected value of the rotational component of the localization error over all the possible exploration runs on E:

\[
\mathrm{E}[\varepsilon_r(E)] = \sum_{\Delta_E} \varepsilon_r(\Delta_E) \cdot p_{\Delta_E}.
\]

The standard deviation of the rotational localization error of environment E, σ[ε_r(E)], is:

\[
\sigma[\varepsilon_r(E)] = \sqrt{\mathrm{E}[\varepsilon_r(E)^2] - \mathrm{E}[\varepsilon_r(E)]^2}.
\]

We approximate the above quantities with their sampled versions, since the weak law of large numbers guarantees their convergence to the theoretical definitions as the number of exploration runs |ℛ_E| in an environment E increases [24].

Definition III.3. Let ℰ be a set of environments, E ∈ ℰ one of these environments, and ℛ_E the set of exploration runs performed on E.
The sample mean and sample standard deviation of the translational localization error in E are:

\[
\varepsilon_t(E) = \frac{\sum_{R_E \in \mathcal{R}_E} \varepsilon_t(\Delta_{R_E})}{|\mathcal{R}_E|},
\qquad
s(\varepsilon_t(E)) = \sqrt{\frac{\sum_{R_E \in \mathcal{R}_E} \left[\varepsilon_t(\Delta_{R_E}) - \varepsilon_t(E)\right]^2}{|\mathcal{R}_E|}}.
\]

The sample mean and sample standard deviation of the rotational localization error in E are defined as:

\[
\varepsilon_r(E) = \frac{\sum_{R_E \in \mathcal{R}_E} \varepsilon_r(\Delta_{R_E})}{|\mathcal{R}_E|},
\qquad
s(\varepsilon_r(E)) = \sqrt{\frac{\sum_{R_E \in \mathcal{R}_E} \left[\varepsilon_r(\Delta_{R_E}) - \varepsilon_r(E)\right]^2}{|\mathcal{R}_E|}}.
\]

We now turn to the second issue discussed above, namely the determination of the set Δ = {⟨δ_{i,j}, δ*_{i,j}⟩} of pairs of relative transformations. In each pair, the relative transformation δ_{i,j} between two poses, as estimated by the SLAM algorithm, is associated with the relative transformation δ*_{i,j} between the corresponding ground truth poses. In [15], human expertise is exploited to determine the N pairs of relative transformations in Δ. After a run, a human operator analyzes the pairs of laser range scans acquired by the robot to determine which ones refer to the same part of the environment and manually aligns them. (This is done in order to cope with the difficulty of collecting ground truth trajectories in real-world scenarios.) The amount of displacement required for the alignment is stored as the ground truth of the relative transformation δ*_{i,j} between the poses x_i and x_j from which the laser range scans have been acquired. The human operator can match laser range scans at semantically relevant places (e.g., loop closures), providing ground truth for global consistency. Clearly, this method does not scale efficiently as the numbers of laser range scans, runs, and environments increase.

Since we are using simulations, we can assume to have the ground truth trajectories followed by the robot. Hence, we propose a new way to determine Δ_{R_E} of Definition III.3 that is independent of human intervention. Although, in principle, Δ_{R_E} could contain all the possible pairs of relative transformations (i.e., for all i and j in 1:T_{R_E}), this solution is impractical, because the size of Δ_{R_E} would be quadratic in the number of poses on the robot's trajectory. We propose to build Δ_{R_E} by randomly sampling a set of relative transformations, whose size trades off sampling quality against computational complexity. The procedure is based on the central limit theorem to approximate the sampling distribution with a normal distribution [25]. The quality of the sampling is relative to the accuracy of the estimation of the localization error. More precisely, we set the confidence level and the margin of error of the estimation and
we determine the number of relative transformations sampled for estimating the localization error as:

\[
N = \frac{z_{\alpha/2}^{2}\, s^{2}}{d^{2}}, \qquad (1)
\]

where s² is the usual unbiased estimator of the population variance, d is the margin of error, α is the complement of the desired confidence level, and z_{α/2} is its associated z-score.
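Equation (1) can be evaluated with the Python standard library as in the sketch below; the helper name is ours and the variance in the example call is made up for illustration:

from math import ceil
from statistics import NormalDist

def required_sample_size(sample_variance, margin_of_error, confidence=0.99):
    """Equation (1): how many relative transformations to sample so that the
    localization error estimate meets the given margin of error at the given
    confidence level."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z-score z_{alpha/2}
    return ceil(z ** 2 * sample_variance / margin_of_error ** 2)

# Illustrative call: 99% confidence, +/- 0.02 m margin, made-up variance.
n_pairs = required_sample_size(sample_variance=0.01, margin_of_error=0.02)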
To validate our approach, we empirically verify the distribution normality assumption on a representative set of environments. Fig. 3 shows the sample distribution of the translational localization error ε_t(Δ) in two of these environments. The distributions are obtained by repeatedly extracting 200 different samples of relative transformations, imposing a 99% confidence level and a margin of error of ±0.02 m. It is evident that the shape of the distributions is approximately normal.

Fig. 3: Distribution of the translational localization error ε_t(Δ) in two environments.

The above process is sound if we assume the relative transformations to be independent and identically distributed random variables. In principle, this may not be the case for all pairs of relative transformations; for example, relative transformations that involve pairs of poses that are close to each other are similar and not independent. However, the number of possible relative transformations is so large that, given any two random relative transformations, the likelihood that they are dependent can be assumed negligible for all practical purposes.
To show that sampling relative transformations leads to a metric that actually captures the quality of SLAM results, Fig. 4 shows a good and a bad map of the same environment, with the bad map being visibly broken, with a room that is significantly misaligned. This visual difference is correctly reflected by the metric: the translational and rotational localization errors of the good map are 0.54 m and 0.02 rad, respectively, while those of the bad map are 2.42 m and 0.29 rad, respectively.
In summary, given data relative to all runs ℛ_E performed in the environment E, we calculate the mean and standard deviation of ε_t(E) and ε_r(E), namely of the two components of the localization error, according to Definition III.3.

As discussed at the end of Section III-A, the values of the mean and standard deviation of ε_t(·) and ε_r(·) depend on the virtual robot configuration. For example, for the environment of Fig. 4, the value of ε_t(·) is 0.68 m if the range of the laser range scanner is 30 m and 0.91 m if the range is 15 m. (The intuitive explanation is that, with a reduced range, the robot travels a longer distance and the error increases.)

other. However, Chebyshev's weak law of large numbers guarantees the convergence of the sample mean to the true mean under the assumption that the covariances tend to be zero on average [24]. Then, we assume the distribution of the sample mean to be approximately normal and we exploit the same formulation of Equation (1) to obtain |ℛ_E|.

Given E, the estimation of the sample size |ℛ_E| is performed as follows. The process starts with an initial estimate of the variance of the localization error, obtained from a small sample of 10 runs. We use this value to compute an initial estimate of the number of required runs. We then perform that number of runs and compute a new estimate of the variance and its associated sample size, iteratively repeating the process until the newly estimated sample size is not larger than the number of already performed runs. In our case, we end up with different values of |ℛ_E| for different environments E, with an average of |ℛ_E| = 36 (and a total of about 3,600 simulated exploration runs in Stage).

Note that the sample size N_t required for an accurate estimate of the translational localization error may differ from the sample size N_r required to accurately estimate the rotational localization error; in this case, we consider the maximum of N_t and N_r.
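The iterative choice of |ℛ_E| can be sketched as follows. Here perform_run is a placeholder for the simulation pipeline of Section III-A (it must return the translational and rotational localization errors of one new run), and required_sample_size is the Equation (1) helper sketched above, applied with the same margin of error to both components for simplicity:

import statistics

def estimate_number_of_runs(perform_run, margin_of_error, confidence=0.99,
                            initial_runs=10):
    """Add exploration runs until the sample size suggested by Equation (1)
    no longer exceeds the number of runs already performed."""
    trans_errors, rot_errors = [], []
    for _ in range(initial_runs):
        et, er = perform_run()
        trans_errors.append(et)
        rot_errors.append(er)

    while True:
        # statistics.variance is the unbiased estimator s^2;
        # take the maximum of N_t and N_r, as noted above.
        needed = max(
            required_sample_size(statistics.variance(trans_errors),
                                 margin_of_error, confidence),
            required_sample_size(statistics.variance(rot_errors),
                                 margin_of_error, confidence),
        )
        if needed <= len(trans_errors):
            return len(trans_errors)
        while len(trans_errors) < needed:
            et, er = perform_run()
            trans_errors.append(et)
            rot_errors.append(er)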
IV. EXPERIMENTAL VALIDATION

In this section, we show the effectiveness of the proposed approach and we validate it.

Fig. 5 shows the translational localization errors of GMapping in all the runs performed in the 100 environments of ℰ (the rotational localization errors are similar). The variability of the performance in any given environment and the presence of some outliers are evident, reinforcing the need

TABLE II: Translational ([m]) and rotational ([rad]) components of the localization error for the dataset of [26].
TABLE III: Translational and rotational components of the localization error for the AIRLab experiment.

                translational error [m]      rotational error [rad]
                ε_t(·)      s(ε_t(·))        ε_r(·)      s(ε_r(·))
    Robocom     0.086       0.026            0.066       0.010
    simulation  0.101       0.019            0.022       0.004
Fig. 7: Robocom (left). The map built by Robocom in the AIRLab (center). The map built in a simulation run (right).

with data collected with our simulator is comparable to that obtained with data collected with real robots. This outcome suggests the validity of our simulation-based approach to automatically evaluate SLAM algorithms.

V. CONCLUSIONS

In this paper we have presented an approach to address the limited repeatability of experiments performed to evaluate SLAM algorithms. The proposed system exploits simulations to generate a large amount of test data with relatively small effort and automates the evaluation of SLAM algorithms. The validation has shown that GMapping performs similarly on test data collected by real robots and on test data generated with our approach. Note that the availability of several test data, relative to different runs in the same environment and to different environments, also promotes the reproducibility of experimental results.
While we have considered a specific algorithm (GMapping) and a specific simulator (Stage), most modules of our system could be generalized, with small adjustments, to other SLAM algorithms and other simulators and, in principle, to other domains. Preliminary results obtained with Karto SLAM⁵ seem to confirm the findings of this paper. A drawback of the proposed approach, as it is currently structured, is that the test data it generates depend on the configuration of the virtual robot and of its sensors. Making the approach more platform-independent is one of the challenges for future work.

⁵ http://wiki.ros.org/slam_karto

REFERENCES

[1] F. Amigoni and V. Schiaffonati, "Models and experiments in robotics," in Springer Handbook of Model-Based Science, L. Magnani and T. Bertolotti, Eds. Springer, 2017, pp. 799–815.
[2] F. Bonsignorio, J. Hallam, and A. del Pobil, "GEM guidelines," http://www.heronrobots.com/EuronGEMSig/downloads/GemSigGuidelinesBeta.pdf, last visited July 2018.
[3] F. Amigoni, M. Reggiani, and V. Schiaffonati, "An insightful comparison between experiments in mobile robotics and in science," Auton Robot, vol. 27, no. 4, pp. 313–325, 2009.
[4] F. Bonsignorio and A. del Pobil, "Toward replicable and measurable robotics research [from the guest editors]," IEEE RAM, vol. 22, no. 3, pp. 32–35, 2015.
[5] F. Bonsignorio, "A new kind of article for reproducible research in intelligent robotics [from the field]," IEEE RAM, vol. 24, no. 3, pp. 178–182, 2017.
[6] A. Howard and N. Roy, "The robotics data set repository (radish)," 2003. [Online]. Available: http://radish.sourceforge.net/
[7] G. Fontana, M. Matteucci, and D. Sorrenti, "Rawseeds: Building a benchmarking toolkit for autonomous robotics," in Methods and Experimental Techniques in Computer Engineering, F. Amigoni and V. Schiaffonati, Eds. Springer, 2014, pp. 55–68.
[8] J. Weisz, Y. Huang, F. Lier, S. Sethumadhavan, and P. Allen, "RoboBench: Towards sustainable robotics system benchmarking," in Proc. ICRA, 2016, pp. 3383–3389.
[9] D. Pickem, P. Glotfelter, L. Wang, M. Mote, A. Ames, E. Feron, and M. Egerstedt, "The Robotarium: A remotely accessible swarm robotics research testbed," in Proc. ICRA, 2017, pp. 1699–1706.
[10] F. Amigoni, E. Bastianelli, J. Berghofer, A. Bonarini, G. Fontana, N. Hochgeschwender, L. Iocchi, G. Kraetzschmar, P. Lima, M. Matteucci, P. Miraldo, D. Nardi, and V. Schiaffonati, "Competitions for benchmarking," IEEE RAM, vol. 22, no. 3, pp. 53–61, 2015.
[11] L. Iocchi, D. Holz, J. Ruiz-del-Solar, K. Sugiura, and T. van der Zant, "RoboCup@Home: Analysis and results of evolving competitions for domestic and service robots," Artif Intell, vol. 229, pp. 258–281, 2015.
[12] I. Hacking, Representing and Intervening. Cambridge University Press, 1983.
[13] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. The MIT Press, 2005.
[14] G. Grisetti, C. Stachniss, and W. Burgard, "Improved techniques for grid mapping with Rao-Blackwellized particle filters," IEEE T Robot, vol. 23, no. 1, pp. 34–46, 2007.
[15] R. Kümmerle, B. Steder, C. Dornhege, M. Ruhnke, G. Grisetti, C. Stachniss, and A. Kleiner, "On measuring the accuracy of SLAM algorithms," Auton Robot, vol. 27, no. 4, pp. 387–407, 2009.
[16] J. Santos, D. Portugal, and R. Rocha, "An evaluation of 2D SLAM techniques available in Robot Operating System," in Proc. SSRR, 2013, pp. 1–6.
[17] R. Bormann, F. Jordan, W. Li, J. Hampp, and M. Hägele, "Room segmentation: Survey, implementation, and analysis," in Proc. ICRA, 2016, pp. 1019–1026.
[18] E. Whiting, J. Battat, and S. Teller, "Generating a topological model of multi-building environments from floorplans," in Proc. CAADFutures, 2007, pp. 115–128.
[19] M. Luperto, A. Quattrini Li, and F. Amigoni, "A system for building semantic maps of indoor environments exploiting the concept of building typology," in Proc. RoboCup, 2013, pp. 504–515.
[20] B. Yamauchi, "A frontier-based approach for autonomous exploration," in Proc. CIRA, 1997, pp. 146–151.
[21] B. Balaguer, S. Carpin, and S. Balakirsky, "Towards quantitative comparisons of robot algorithms: Experiences with SLAM in simulation and real world systems," in IROS Workshop on Performance Evaluation and Benchmarking for Intelligent Robots and Systems, 2007.
[22] S. Schwertfeger and A. Birk, "Map evaluation using matched topology graphs," Auton Robot, vol. 40, no. 5, pp. 761–787, 2016.
[23] T. Collins, J. Collins, and D. Ryan, "Occupancy grid mapping: An empirical evaluation," in Proc. MED, 2007, pp. 1–6.
[24] S. Karlin and H. Taylor, A First Course in Stochastic Processes. Academic Press, 1975.
[25] P. Billingsley, Probability and Measure. Wiley, 1995.
[26] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IROS, 2012, pp. 573–580.