Jürgen Sturm1 Stéphane Magnenat2 Nikolas Engelhard3 François Pomerleau2 Francis Colas2 Daniel Cremers1 Roland Siegwart2 Wolfram Burgard3
Abstract— We provide a large dataset containing RGB-D image sequences and the ground-truth camera trajectories with the goal of establishing a benchmark for the evaluation of visual SLAM systems. Our dataset contains the color and depth images of a Microsoft Kinect sensor and the ground-truth trajectory of camera poses. The data was recorded at full frame rate (30 Hz) and sensor resolution (640×480). The ground-truth trajectory was obtained from a high-accuracy motion-capture system with eight high-speed tracking cameras (100 Hz). Further, we provide the accelerometer data from the Kinect. Finally, we propose an evaluation criterion for measuring the quality of the estimated camera trajectory of visual SLAM systems.
I. INTRODUCTION
Simultaneous localization and mapping (SLAM) has a long history in robotics and computer-vision research [11], [6], [1], [15], [7], [4]. Different sensor modalities have been explored in the past, including 2D laser scanners [12], [3], 3D scanners [14], [16], monocular cameras [13], [7], [9], [19], [20], and stereo systems [8]. Recently, low-cost RGB-D sensors became available, of which the most prominent one is the Microsoft Kinect. Such sensors provide both color images and dense depth maps at video frame rates. Henry et al. [5] were the first to use the Kinect sensor in a 3D SLAM system. Others have followed [2], and we expect to see more approaches using RGB-D data for visual SLAM in the near future.

Various datasets and benchmarks have been proposed for laser- and camera-based SLAM, such as the Freiburg, Intel, and New College datasets [18], [17]. However, until now no suitable dataset or benchmark existed that can be used to evaluate, measure, and compare the performance of RGB-D SLAM systems. As we consider objective evaluation methods to be highly important for measuring progress in the field (and demonstrating it in a verifiable way), we decided to provide such a dataset. To the best of our knowledge, this is the first RGB-D dataset for visual SLAM benchmarking.

Fig. 1: The office environment and the experimental setup in which the RGB-D dataset with ground-truth camera poses was recorded. (a) Typical office scene. (b) Motion capture system. (c) Microsoft Kinect sensor with reflective markers. (d) Checkerboard with reflective markers used for calibration.

1 Jürgen Sturm and Daniel Cremers are with the Computer Vision and Pattern Recognition Group, Computer Science Department, Technical University of Munich, Germany. {sturmju,cremers}@in.tum.de
2 S. Magnenat, F. Pomerleau, F. Colas, and R. Siegwart are with the Autonomous Systems Lab, ETH Zurich, Switzerland. {stephane.magnenat,francis.colas}@mavt.ethz.ch and [email protected]
3 Nikolas Engelhard and Wolfram Burgard are with the Autonomous Intelligent Systems Lab, Computer Science Department, University of Freiburg, Germany. {engelhar,burgard}@informatik.uni-freiburg.de

II. EXPERIMENTAL SETUP AND DATA ACQUISITION

We acquired a large set of data recordings containing both the RGB-D data from the Kinect and the ground-truth estimates from the mocap system. We moved the Kinect along different trajectories in typical office environments (see Fig. 1a). The recordings differ in their translational and angular velocities (fast/slow movements) and in the size of the environment (one desk, several desks, whole room). We also acquired data for three specific trajectories for debugging purposes, i.e., we moved the Kinect (more or less) individually along the x/y/z-axes and rotated it individually around the x/y/z-axes.

We captured both the color and depth images from an off-the-shelf Microsoft Kinect sensor using PrimeSense's OpenNI driver. All data was logged at the full resolution (640×480) and full frame rate (30 Hz) of the sensor on a Linux laptop running Ubuntu 10.10 and ROS Diamondback. Further, we recorded IMU data from the accelerometer in the Kinect at 500 Hz and also read out the internal sensor parameters from the Kinect factory calibration.

Further, we obtained the camera trajectory by using an external motion-capture system from MotionAnalysis running at 100 Hz (see Fig. 1b). We attached reflective targets to the Kinect (see Fig. 1c) and used a modified checkerboard for calibration (Fig. 1d) to obtain the transformation between the optical frame of the Kinect sensor and the coordinate system of the motion-capture system. Finally, we also video-taped all recordings with an external video camera to capture the camera motion and the environment from a different viewpoint.

The original data has been recorded as ROS bag files. In total, we collected 50 GB of Kinect data, divided into nine separate sequences. The dataset is available online under the Creative Commons Attribution license at

https://cvpr.in.tum.de/research/datasets/rgbd-dataset

In addition to details on the data formats, the website provides videos for simple visual inspection of the dataset.
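Since each sequence ships as a ROS bag, the recordings can be replayed or parsed offline with the standard rosbag Python API. The following is a minimal sketch of iterating over the logged color and depth images; the file name and topic names are assumptions on our part (they are not specified above), so the actual names should be checked with `rosbag info` on a downloaded sequence.

```python
# Minimal sketch: iterate over the color/depth images of one recorded
# sequence with the ROS rosbag API. The file name and topic names below
# are assumptions -- inspect the bag with `rosbag info` for the real ones.
import rosbag

bag = rosbag.Bag('rgbd_sequence.bag')  # hypothetical file name
for topic, msg, t in bag.read_messages(
        topics=['/camera/rgb/image_color', '/camera/depth/image']):
    # msg is a sensor_msgs/Image; t is the timestamp at which it was logged
    print('%s  %.6f  %dx%d' % (topic, t.to_sec(), msg.width, msg.height))
bag.close()
```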
III. EVALUATION

For evaluating visual SLAM algorithms on our dataset, we propose a metric similar to the one introduced by Kümmerle et al. [10]. The general idea is to compute the relative error between the true and the estimated motion w.r.t. the optical frame of the RGB camera. As we have ground-truth pose information for all time indices, we propose to compute the error as the sum of squared distances between the relative poses at time i and time i + Δ, i.e.,

\mathrm{error} = \sum_{i=1}^{n} \left[ (\hat{x}_{i+\Delta} \ominus \hat{x}_i) \ominus (x_{i+\Delta} \ominus x_i) \right]^2 \quad (1)

where i = 1, ..., n are the time indices at which ground-truth information is available, Δ is a free parameter that corresponds to the time scale, x_i is the ground-truth pose at time index i, x̂_i is the estimated pose at time index i, and ⊖ stands for the inverse motion composition operator. If the estimated trajectory has missing values, i.e., there are time steps i_{j_1}, ..., i_{j_m} for which no pose x̂_i could be estimated, the ratio of missing poses m/n should be stated as well.

All data necessary to evaluate our measure is present in the dataset. We plan to release a Python script that computes these measures automatically, given the estimated trajectory and the respective dataset. To prevent (future) approaches from being over-fitted to the dataset, we recorded all scenes twice and held back the ground-truth trajectory for these secondary recordings. With this, we plan to provide a comparative offline evaluation benchmark for visual SLAM systems.
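To make Eq. (1) concrete, the sketch below computes the metric with NumPy, representing poses as 4×4 homogeneous transforms so that the inverse motion composition a ⊖ b becomes inv(b)·a. Reducing the residual transform to a scalar via its translational part is our assumption; the formulation above leaves the exact norm on SE(3) open, and this is not the announced evaluation script.

```python
# Hedged sketch of the relative-error metric of Eq. (1) using NumPy.
# Poses are 4x4 homogeneous transforms; a "ominus" b is implemented as
# inv(b) @ a. Using the translational part of the residual as the distance
# is an assumption, not fixed by the paper.
import numpy as np

def relative_error(gt_poses, est_poses, delta=1):
    """Sum of squared relative-pose errors between time i and i + delta.

    gt_poses, est_poses: time-aligned lists of 4x4 pose matrices.
    """
    error = 0.0
    for i in range(len(gt_poses) - delta):
        # x_{i+delta} "ominus" x_i for the estimate and the ground truth
        rel_est = np.linalg.inv(est_poses[i]) @ est_poses[i + delta]
        rel_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]
        # residual motion between the two relative transforms
        residual = np.linalg.inv(rel_gt) @ rel_est
        error += np.linalg.norm(residual[:3, 3]) ** 2
    return error
```

For a trajectory with missing estimates, one would evaluate only over the indices where a pose x̂_i exists and report the ratio m/n separately, as described above.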
IV. CONCLUSIONS

In this paper, we have presented a novel RGB-D dataset for benchmarking visual SLAM algorithms. The dataset contains color images, depth maps, and associated ground-truth camera pose information. Further, we proposed an evaluation metric that can be used to assess the performance of a visual SLAM system. We thus propose a benchmark that allows researchers to objectively evaluate visual SLAM systems. Our next step is to evaluate our own system [2] on this dataset in order to provide a baseline for future implementations and evaluations. In this way, we hope to detect (and resolve) potential problems present in our current dataset, such as calibration and synchronization issues between the Kinect and our mocap system, as well as the effects of motion blur and the rolling shutter of the Kinect. Furthermore, we want to investigate ways to measure the performance of a SLAM system not only in terms of the accuracy of the estimated camera trajectory, but also in terms of the quality of the resulting map of the environment.

REFERENCES

[1] F. Dellaert. Square Root SAM. In Proc. of Robotics: Science and Systems (RSS), Cambridge, MA, USA, 2005.
[2] N. Engelhard, F. Endres, J. Hess, J. Sturm, and W. Burgard. Real-time 3D visual SLAM with a hand-held RGB-D camera. In Proc. of the RGB-D Workshop on 3D Perception in Robotics at the European Robotics Forum, Västerås, Sweden, 2011.
[3] G. Grisetti, C. Stachniss, and W. Burgard. Improved techniques for grid mapping with Rao-Blackwellized particle filters. IEEE Transactions on Robotics (T-RO), 23:34–46, 2007.
[4] G. Grisetti, C. Stachniss, and W. Burgard. Non-linear constraint network optimization for efficient map learning. IEEE Transactions on Intelligent Transportation Systems, 10(3):428–439, 2009.
[5] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In Proc. of the Intl. Symp. on Experimental Robotics (ISER), Delhi, India, 2010.
[6] H. Jin, P. Favaro, and S. Soatto. Real-time 3-D motion and structure of point features: Front-end system for vision-based control and interaction. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2000.
[7] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Proc. of the IEEE and ACM Intl. Symposium on Mixed and Augmented Reality (ISMAR), Nara, Japan, 2007.
[8] K. Konolige, M. Agrawal, R.C. Bolles, C. Cowan, M. Fischler, and B.P. Gerkey. Outdoor mapping and navigation using stereo vision. In Intl. Symp. on Experimental Robotics (ISER), 2007.
[9] K. Konolige and J. Bowman. Towards lifelong visual maps. In Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pages 1156–1163, 2009.
[10] R. Kümmerle, B. Steder, C. Dornhege, M. Ruhnke, G. Grisetti, C. Stachniss, and A. Kleiner. On measuring the accuracy of SLAM algorithms. Autonomous Robots, 27:387–407, 2009.
[11] F. Lu and E. Milios. Globally consistent range scan alignment for environment mapping. Autonomous Robots, 4(4):333–349, 1997.
[12] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proc. of the AAAI National Conference on Artificial Intelligence, Edmonton, Canada, 2002.
[13] D. Nistér. Preemptive RANSAC for live structure and motion estimation. Machine Vision and Applications, 16:321–329, 2005.
[14] A. Nüchter, K. Lingemann, J. Hertzberg, and H. Surmann. 6D SLAM – 3D mapping outdoor environments. Journal of Field Robotics, 24:699–722, August 2007.
[15] E. Olson, J. Leonard, and S. Teller. Fast iterative optimization of pose graphs with poor initial estimates. In Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA), 2006.
[16] B. Pitzer, S. Kammel, C. DuHadway, and J. Becker. Automatic reconstruction of textured 3D models. In Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA), 2010.
[17] M. Smith, I. Baldwin, W. Churchill, R. Paul, and P. Newman. The New College vision and laser data set. Intl. Journal of Robotics Research (IJRR), 28(5):595–599, 2009.
[18] C. Stachniss, P. Beeson, D. Hähnel, M. Bosse, J. Leonard, B. Steder, R. Kümmerle, C. Dornhege, M. Ruhnke, G. Grisetti, and A. Kleiner. Laser-based SLAM datasets and benchmarks at http://openslam.org.
[19] H. Strasdat, J. M. M. Montiel, and A. Davison. Scale drift-aware large scale monocular SLAM. In Proc. of Robotics: Science and Systems (RSS), Zaragoza, Spain, 2010.
[20] J. Stühmer, S. Gumhold, and D. Cremers. Real-time dense geometry from a handheld camera. In Proc. of the DAGM Symposium on Pattern Recognition (DAGM), Darmstadt, Germany, 2010.