
2011 Canadian Conference on Computer and Robot Vision

Visual Odometry Using 3-Dimensional Video Input


Mark Fiala
Computing Science, Ryerson University, Toronto, Canada
[email protected]

Alex Ufkes
Computing Science, Ryerson University, Toronto, Canada
[email protected]
February 5, 2011

Abstract

Using visual odometry, a robot can track its trajectory from video input. This allows more accurate ego-motion estimation than classical odometry, which relies on measuring wheel motion. The Microsoft Kinect sensor provides 3D imagery, similar to a laser or LIDAR scanner, which can be used for visual odometry with a single sensor. This differs from the usual implementations that require stereo vision with two or more standard image sensors. The system has an advantage over a laser scanner in that it provides a video image as well as depth information, so that matching with feature detectors such as SIFT or SURF is possible. Visual odometry is performed by matching 3D points that have 2D descriptors. This paper presents and implements a visual odometry system for a mobile robot that utilizes feature detection and tracking combined with a low cost 3D video sensor, Microsoft's Kinect.

Keywords: visual odometry, pose tracking, computer vision, sensor fusion.

1 Introduction

Mobile robots benefit from real time measurement of their motion (odometry) to allow them to follow assigned trajectories. Typically this is accomplished with wheel encoders which measure how far each wheel moves. Wheel slippage, loose terrain, and dust and grit on the floor and wheels all cause error in ego-motion estimation when only wheel motion is measured. This is especially true after rotation. Even in indoor situations with clean floors and wheels the position error accumulates rapidly, and it grows even faster in outdoor robotics on sand, mud, etc. Visual odometry uses computer vision, observing the environment from cameras mounted on the robot, to measure motion more accurately.

Ego-motion estimation using computer vision has been used in monocular applications ([3],[19]) where scale is unimportant. Typically, however, robot navigation requires the scale of the motion, which in turn requires 3D knowledge of the tracked features [15]. The approach was popularized by the Mars rovers [6], allowing longer automatic movement commands despite deep martian sand. Tracking features in 3D requires stereo vision: the features followed to calculate robot motion are those whose 3D depth is known, i.e. those seen in both images. A monocular system cannot be used unless there is a priori knowledge of identifiable features.

Visual odometry systems usually rely on stereo and therefore have two stages of matching: matching features between different cameras at the same instant in time, and matching these features to a later frame. Greater system robustness could be achieved if the first matching stage were not necessary.

In this paper we propose a visual odometry system that can estimate the 3D pose of a mobile platform using monocular video data and associated 3D depth data, as provided by Microsoft's Kinect sensor [1]. Stereo matching is thus avoided, and matching is only performed between images from different times. We utilize standard feature detectors, SIFT [13] in our case, but match between 3D points to calculate the pose change directly.

This work differs from the visual SLAM (Simultaneous Localization and Mapping) demos created shortly after the release of the Kinect sensor: those demos use existing monocular visual SLAM and only create a world model from the 3D depth of points after the camera positions are already determined, i.e. the 3D information from the Kinect is used only after the motion is calculated. Our system uses the 3D information in the actual relative pose estimation between frames.

Figure 1: Front view of a forward-facing Kinect sensor mounted on a Pioneer-2DX mobile robot platform.

1.1 Visual Odometry

The relative pose between two robot positions can be determined from the change in 3D position of a set of points. The image features typically used for visual odometry have been corner detectors, such as the Shi and Tomasi or Harris corners. These approximate the 'matchability' of image points by looking for points of maximal change in a 'cornerness' measure computed from a matrix of sums of spatial derivatives. Nister et al [17] use such a corner detector to find a list of features in each image, and variance normalized correlation to match image patches around these features. They use the 5-point algorithm and pre-emptive RANSAC to estimate relative poses between monocular frames, and the 3-point algorithm to compute camera pose in both monocular and stereo setups. Kaess et al [11] use sub-pixel calculation of corner locations to improve 3D position, whereas Cheng et al [6] focus on efficient algorithms for corner detection. Howard et al [10] use Harris or FAST corners and match patches with SAD (Sum of Absolute Differences) patch matching.

Visual odometry can use the image information only, or can combine it with wheel motion. The former is also useful for visual SLAM problems such as hand-held cameras, flying robots or UAVs, as well as mobile robots. Stereo methods relying on traditional wheel odometry to seed the visual estimate can be found in [16], [11] and [20]. These methods generally involve predicting motion using non-visual odometry for a rough motion estimate, and then refining that estimate using visual data. Barfoot et al [20] use least squares fitting of image features to refine information from non-visual sensors. [14] utilize inertial sensors and a Kalman filter along with vision.

Outliers and false matches are a problem and can lead to pose errors, which are problematic if the overall path is important since subsequent estimates will contain this error. [11] calculate rotation first using features at infinite distance; RANSAC [8] is applied using hypotheses generated from two point matches. A single point match is then sufficient for translation estimation (RANSAC is used again, with hypotheses from one match at a time). Olson et al [18] use an orientation sensor to reduce error growth, since orientation error is more damaging to overall position error. Matthies et al [15] use 3D Gaussian distributions to model triangulation error.

Modern feature detectors (arriving since the year 2000) such as SIFT [13] and SURF [4] can be used instead of a corner detector and patch matching. The paper by Se et al [20], mentioned above, uses SIFT features. Furgale et al [9] use SURF; notably, they created a real time GPU based SURF implementation.

Approaches utilizing omni-directional camera configurations can be seen in [12], [5] and [7]. These approaches rely on images taken by a single camera at different orientations to create a panoramic or spherical image of an environment from which features can be detected and matched and/or tracked. In general, omnidirectional or wide angle imagery can produce more accurate orientation estimates.

Our paper does not attempt to recognize previously visited locations, which would be used for localization in previously visited areas. This technique is used in SLAM systems to increase accuracy for situations where a robot revisits areas multiple times, as in the work of Zhu et al [21].

Being able to recognize a location in a map is similar to an image search problem and different from visual odometry alone; it would require a large online database of features to be searched to keep checking for a return to familiar locations. Furgale et al [9] use stored features so a robot's path may be repeated, but break a robot's map into sections so that matching with a database is only done on a smaller local set. A concept in their work is that of not attempting to recognize previous locations or create a metric map, but of focusing on the practical task of being able to make a robot repeat its path or return from an excursion.

2 The Kinect Sensor

Our system uses a single sensor, the Kinect camera made for the Microsoft Xbox gaming console. It allows the capture of true 3D feature points which are used to calculate robot motion.

Figure 2: Front view of Microsoft's Kinect sensor, highlighting the RGB camera, the infrared detector, and the IR laser.

The Kinect sensor contains two imagers: a color visible light camera and an infrared depth sensor. Each has a horizontal field of view (FoV) of 57 degrees and a vertical FoV of 43 degrees, and returns depth and visible data at a resolution of 640x480 at a rate of 30Hz. The infrared camera is located 2.5 cm away from the visible light camera, and so for accurate calibration the stereo disparity between these cameras should be used.

Depth information is obtained by a structured infrared laser that emits a pattern of dots; the parallax shift of the dots is measured for each pixel in the infrared detector. The depth data is stored in blocks of 11 bits per pixel, for up to 2048 levels of depth sensitivity. The Kinect's workings are detailed in [1].

The Kinect is accessed directly via its USB interface, without the need for the actual Xbox game console that it was intended to be connected to. We used the interface from the OpenKinect project [2], which provides libraries and code to access the two image streams (visible and IR depth) on the Kinect's USB interface.
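As an aside, a minimal capture sketch using the Python bindings from the OpenKinect project is shown below; the module name freenect and its sync_get_video/sync_get_depth helpers are assumptions based on that wrapper (names may differ between versions), and the depth frame holds raw 11-bit codes rather than metric depth:

    # Sketch: grab one visible frame and one raw depth frame through the
    # OpenKinect (libfreenect) Python wrapper. Function names are assumed from
    # that wrapper and may vary by version.
    import numpy as np
    import freenect

    def grab_frames():
        rgb, _ = freenect.sync_get_video()    # 480x640x3 uint8 RGB image
        depth, _ = freenect.sync_get_depth()  # 480x640 raw 11-bit depth codes
        return np.asarray(rgb), np.asarray(depth)

    if __name__ == "__main__":
        rgb, depth = grab_frames()
        print("visible:", rgb.shape, "depth:", depth.shape,
              "raw depth range:", int(depth.min()), "-", int(depth.max()))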
2.1 Calibration and Point Mapping

The following describes and justifies the simple linear calibration performed to map points between the visible and IR depth images, and describes how a 3D point is calculated from a 2D image point.

Figure 3: Examples of images captured by the Kinect sensor. Left: visible image (turned to greyscale). Right: infrared depth image.

Fig. 3 shows the imagery captured by our system from the Kinect. To map a point between the visible and infrared images, a mapping is necessary to account for the different focal lengths and image centers of the two sensors. This calibration was necessary for each individual Kinect sensor. The pinhole model can be represented by the camera matrices K_vis and K_ir for the visible and IR depth cameras respectively. The focal lengths were measured to be approximately fx_vis = fy_vis = 520 and fx_ir = fy_ir = 475 across the three Kinect sensors we tested. However, it was found necessary to calibrate at least the ratios between the focal lengths and the image centers, as well as to approximate the stereo disparity for a given depth, to provide a good mapping between objects in both images. A perfect pinhole model was used in our implementation, neglecting radial and thin prism distortion.

Under the following assumptions the mapping can be separated into the two simple linear expressions in Eqn. 1: 1) the cameras have no rotation relative to each other, 2) there is no skew factor (element k12 in either matrix), and 3) stereo disparity effects are neglected. The four parameters (A, B, C, D) in Eqn. 1 were calculated with a least squares fit over a number of points where the same world point was identified in both images. Regarding the stereo disparity between the two cameras, the two image sensors are b = 2.5cm apart and the focal length of the IR depth camera is approximately fx_ir = fy_ir = 475, so at a distance of z = 3m there is a disparity of d = fx_ir * b / z = 475 * 0.025 / 3 ≈ 4 pixels. If we assume all objects are at this distance, and the point matches between visible and IR images during calibration are for objects at this distance, then this disparity will be absorbed into (A, B, C, D).

For objects that deviate from this distance, such as an object at z = 1.5m, the disparity will be d = 475 * 0.025 / 1.5 ≈ 7.9 pixels, and for an object further away at z = 5m the disparity will be d = 475 * 0.025 / 5 ≈ 2.4 pixels. This results in a horizontal error of 7.9 - 4 = 3.9 and 2.4 - 4 = -1.6 pixels respectively. This error is acceptable since the IR depth image does not have sharp edges, due to its recognition of a projected IR dot pattern. SIFT or SURF features on objects that are at least 5 pixels away from a depth discontinuity, for mostly fronto-parallel object planes, will not produce an error in distance. SIFT or SURF features that are on the borders will likely just be filtered out in the RANSAC fitting stage. Furthermore, features on an edge discontinuity could be false features anyway, formed not by object structure but by the foreground and background together, and will (should) not be matched anyway.
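To make the calibration step concrete, the following is a small numpy sketch (not the authors' code) of the least squares fit of (A, B, C, D) from matched visible/IR pixel pairs, together with the disparity arithmetic above; the input arrays and depths are illustrative:

    # Sketch: fit the linear visible-to-IR mapping u_ir = A*u_vis + B,
    # v_ir = C*v_vis + D by least squares from matched pixel pairs, and compute
    # the disparity error relative to the 3 m calibration distance.
    import numpy as np

    def fit_linear_mapping(uv_vis, uv_ir):
        """uv_vis, uv_ir: Nx2 arrays of corresponding pixel coordinates."""
        A, B = np.polyfit(uv_vis[:, 0], uv_ir[:, 0], 1)   # u_ir = A*u_vis + B
        C, D = np.polyfit(uv_vis[:, 1], uv_ir[:, 1], 1)   # v_ir = C*v_vis + D
        return A, B, C, D

    def disparity_pixels(z_m, fx_ir=475.0, baseline_m=0.025):
        # Horizontal shift between the two sensors for a point at depth z.
        return fx_ir * baseline_m / z_m

    if __name__ == "__main__":
        d_ref = disparity_pixels(3.0)                     # about 4 pixels
        for z in (1.5, 3.0, 5.0):
            err = disparity_pixels(z) - d_ref
            print("depth %.1f m: %+.1f pixels of mapping error" % (z, err))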

u_ir = A · u_vis + B,    v_ir = C · v_vis + D    (1)

Fig. 4 shows the parameters we obtained for three Kinect sensors. The first, K0, is the one used in the experiments. The variation in the parameters demonstrates the necessity of performing this calibration for each Kinect sensor.

Figure 4: Table of calibration parameters used in our first order approximation for mapping points between the visible and IR depth images.

The depth value at a point in the IR depth image can be mapped to a 3D point using the unit vector direction given by the inverse of the K_ir matrix in Eqn. 3 and the value of the depth image d(u_ir, v_ir). The procedure is therefore: locate a feature at position (u_vis, v_vis) in the visible image, convert it to a position in the IR depth image using Eqn. 1, sample the depth d(u_ir, v_ir), and calculate the 3D coordinate (X_c, Y_c, Z_c) (in camera coordinates) using Eqn. 4.

K_ir = \begin{bmatrix} f_x^{ir} & 0 & u_0^{ir} \\ 0 & f_y^{ir} & v_0^{ir} \\ 0 & 0 & 1 \end{bmatrix}    (2)

K_ir = \begin{bmatrix} 475 & 0 & 320 \\ 0 & 475 & 240 \\ 0 & 0 & 1 \end{bmatrix}    (3)

\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = d(u_{ir}, v_{ir}) \, K_{ir}^{-1} \begin{bmatrix} u_{ir} \\ v_{ir} \\ 1 \end{bmatrix}    (4)
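A minimal sketch of this pixel-to-3D procedure is given below; the identity default for (A, B, C, D) and the assumption that the depth map has already been converted to metric units are illustrative simplifications, not the paper's configuration:

    # Sketch of the Eqn. 1-4 pipeline: visible-image pixel -> IR depth pixel ->
    # 3D point in camera coordinates. Assumes a depth map already in metres.
    import numpy as np

    K_IR = np.array([[475.0,   0.0, 320.0],
                     [  0.0, 475.0, 240.0],
                     [  0.0,   0.0,   1.0]])
    K_IR_INV = np.linalg.inv(K_IR)

    def backproject(u_vis, v_vis, depth_map, A=1.0, B=0.0, C=1.0, D=0.0):
        """Return (Xc, Yc, Zc) for a visible-image pixel, or None if no depth."""
        # Eqn. 1: map the visible-image pixel into the IR depth image.
        u_ir = int(round(A * u_vis + B))
        v_ir = int(round(C * v_vis + D))
        h, w = depth_map.shape
        if not (0 <= u_ir < w and 0 <= v_ir < h):
            return None
        z = depth_map[v_ir, u_ir]              # sampled depth d(u_ir, v_ir)
        if z <= 0:                             # no reading (IR shadow, etc.)
            return None
        # Eqn. 4: scale the ray K_ir^-1 [u_ir, v_ir, 1]^T by the depth value.
        return z * (K_IR_INV @ np.array([u_ir, v_ir, 1.0]))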
N is the number of points. A matrix I can be created
0

  with the points as column vectors. A set Ik (1 < k ≤


fxir 0 u0ir N ) of similar points differing only by some noise from
Kir =  0 fyir v0ir  (2) Ij , is rotated by a rotation matrix Rt to form Jk =
0 0 1 Rt Ik , also arranged in a matrix of column vectors J.

475 0 320
 The dot product of an original point in the set and
Kir =  0 475 240  (3) its rotated, noisy match is Jk · Ikt . The sum of all
0 0 1 the dot products is the trace of the the 3x3 matrix
  A = JIt . If we were to rotate the points in J by the
Xc 0
correct R then the trace of I It = RJIt = RA would
 Yc  = Kir −1
d(uir , vir ) (4) be at a maximum. If the points Ik are (xik , yik , zik )
Zc and Jk are (xjk , yjk , zjk ) then A can be calculated
with the sums in Eqn. 5.
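As an illustration of this step, a short OpenCV sketch is given below; the paper does not specify its matcher, so the brute-force matcher and the 0.75 Lowe ratio test used here are assumptions, not the authors' implementation:

    # Sketch: SIFT detection and frame-to-frame matching with OpenCV. The
    # matcher and ratio-test threshold are common defaults, not the paper's.
    import cv2

    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2)

    def match_frames(gray_prev, gray_curr, ratio=0.75):
        """Return matched 2D pixel coordinates between two greyscale frames."""
        kp1, des1 = sift.detectAndCompute(gray_prev, None)
        kp2, des2 = sift.detectAndCompute(gray_curr, None)
        pts_prev, pts_curr = [], []
        for m, n in matcher.knnMatch(des1, des2, k=2):
            if m.distance < ratio * n.distance:      # keep unambiguous matches
                pts_prev.append(kp1[m.queryIdx].pt)
                pts_curr.append(kp2[m.trainIdx].pt)
        return pts_prev, pts_curr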
A set of 3D matches is obtained for each pair of successive frames frame_i and frame_i+1. A rotation matrix R and translation vector T are calculated (the translation is in real world units, accurate to scale). These are calculated using up to N (200 in our implementation) RANSAC random trials, where 4 3D matches are used to calculate a hypothesis R_h and T_h. The full set of 3D matches between frame_i and frame_i+1 is tested with R_h, T_h to see if the points align. A threshold of 1.5cm in three dimensional distance was used to classify a 3D point match as an inlier.

The trial with the largest number of inliers is chosen and a refined estimate of the pose is calculated. The refined rotation R_ref is obtained by the common practice of applying an SVD decomposition to the 3D vectors from each inlier point to the centroid of the inliers, as explained in the next paragraphs. The refined T_ref is obtained from the difference between the centroids of the two point sets.

Consider a set of 3D points I_k (1 ≤ k ≤ N), where N is the number of points. A matrix I can be created with the points as column vectors. A second set of points, differing from I_k only by some noise, is rotated by a rotation matrix R_t to form the points J_k, also arranged in a matrix of column vectors J. The dot product of an original point in the set and its rotated, noisy match is J_k · I_k^t. The sum of all the dot products is the trace of the 3x3 matrix A = J I^t. If we were to rotate the points in J by the correct R, then the trace of I I^t = R J I^t = R A would be at a maximum. If the points I_k are (x_ik, y_ik, z_ik) and J_k are (x_jk, y_jk, z_jk), then A can be calculated with the sums in Eqn. 5.

The point sets I_k are the vectors obtained by subtracting the centroid of the inlier 3D point set of frame_i from each inlier point. Likewise, J_k is created from the inlier points in frame_i+1.

Finding the rotation matrix is the problem of maximizing the trace (Eqn. 6). Using the Singular Value Decomposition (SVD) of A in Eqn. 7 allows us to create a matrix V U^t that maximizes this sum of dot products (Eqn. 8); this is our desired rotation matrix in Eqn. 10, which we use as the refined estimate for the inlier 3D matches. Since the SVD operation provides two orthonormal matrices U, V, the product V U^t will also be orthonormal and hence a proper rotation matrix.

A = \begin{bmatrix} \sum_k x_{ik} x_{jk} & \sum_k x_{ik} y_{jk} & \sum_k x_{ik} z_{jk} \\ \sum_k y_{ik} x_{jk} & \sum_k y_{ik} y_{jk} & \sum_k y_{ik} z_{jk} \\ \sum_k z_{ik} x_{jk} & \sum_k z_{ik} y_{jk} & \sum_k z_{ik} z_{jk} \end{bmatrix}    (5)

trace = tr[RA]    (6)

SVD(A) \rightarrow A = U D V^t    (7)

trace = tr[RA] = tr[R U D V^t]    (8)

trace_max = tr[V U^t U D V^t] = tr[V D V^t] = tr[D]    (9)

R = V U^t    (10)
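To make the pose estimation concrete, the following is a compact numpy sketch of the procedure described above (four-point RANSAC hypotheses, a 1.5 cm inlier threshold, and the centroid/SVD refinement of Eqns. 5-10); it is an illustrative reimplementation, with the random sampling details assumed rather than taken from the paper:

    # Sketch: estimate R, T between matched 3D point sets P (frame i) and
    # Q (frame i+1) with RANSAC over 4-point hypotheses, then refine on the
    # inliers using the centroid/SVD method of Eqns. 5-10.
    import numpy as np

    def rigid_fit(P, Q):
        """Least-squares R, T such that Q is approximately R*P + T (rows are points)."""
        cP, cQ = P.mean(axis=0), Q.mean(axis=0)
        A = (Q - cQ).T @ (P - cP)            # 3x3 correlation of centred points
        U, D, Vt = np.linalg.svd(A)
        R = U @ Vt
        if np.linalg.det(R) < 0:             # guard against a reflection
            U[:, -1] *= -1
            R = U @ Vt
        T = cQ - R @ cP                      # translation from the centroids
        return R, T

    def ransac_pose(P, Q, trials=200, thresh=0.015, seed=0):
        """Return refined (R, T, inlier_mask); assumes at least 4 good matches."""
        rng = np.random.default_rng(seed)
        best = np.zeros(len(P), dtype=bool)
        for _ in range(trials):
            idx = rng.choice(len(P), size=4, replace=False)
            Rh, Th = rigid_fit(P[idx], Q[idx])
            err = np.linalg.norm(P @ Rh.T + Th - Q, axis=1)
            inliers = err < thresh           # 1.5 cm 3D distance threshold
            if inliers.sum() > best.sum():
                best = inliers
        R, T = rigid_fit(P[best], Q[best])   # refined estimate on all inliers
        return R, T, best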
A verification step prevented the relative pose from being used if the number of inliers was below a minimum (15 matches were used) or if the translation or rotation was spuriously large (the robot travelled at less than 30 cm per second, so 50 cm per frame was used as a translation threshold, and 60 degrees was considered a liberal limit on rotation between two frames).

In our experiments we allowed full 6-DOF motion, despite the fact that the robot was moving on a flat floor in all but one case (the 'zig-zag' sequence on the ramps, Figs. 8, 9). In most cases the recovered trajectory was along the Z = 0 plane (our horizontal floor plane); however, one could obtain better accuracy if the pose estimates were limited to a 3-DOF case for indoor robots whose motion can be so constrained. We demonstrate two examples in Section 4.2 where the Kinect sensor is not attached to the robot and is moved around by hand.
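A short sketch of how these checks gate the chaining of successive estimates into a trajectory is given below; the 4x4 homogeneous bookkeeping and the composition convention are reasonable assumptions, not details taken from the paper:

    # Sketch: reject spurious frame-to-frame estimates using the thresholds
    # above, then chain the accepted relative poses into global camera poses.
    import numpy as np

    MIN_INLIERS = 15
    MAX_TRANSLATION = 0.50                    # metres per frame
    MAX_ROTATION = np.deg2rad(60.0)           # radians per frame

    def accept(R, T, n_inliers):
        angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
        return (n_inliers >= MIN_INLIERS and
                np.linalg.norm(T) <= MAX_TRANSLATION and
                angle <= MAX_ROTATION)

    def chain(relative_poses):
        """relative_poses: iterable of (R, T, n_inliers) mapping frame i to i+1."""
        pose = np.eye(4)                      # camera-to-world of the first frame
        trajectory = [pose.copy()]
        for R, T, n in relative_poses:
            if accept(R, T, n):               # otherwise keep the previous pose
                step = np.eye(4)
                step[:3, :3], step[:3, 3] = R, T
                pose = pose @ np.linalg.inv(step)
            trajectory.append(pose.copy())
        return trajectory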
4 Experiments

4.1 Mobile Robot Experiments

The Pioneer mobile robot with the Kinect sensor was tested in 12 different runs in indoor situations. For the first we have ground truth error for the start and end locations. Fig. 5 shows a run where the robot travelled 21 m in a rectangular path around a set of lab tables. The position was 1.2 m off at the end; a zoomed in view of the end is shown in Fig. 6. For the other experiments it was not easy to get ground truth, as it was difficult to remote control the robot to pass accurately over desired landmarks.

Figure 5: Recovered trajectory of the mobile robot in a rectangular path.

Figure 6: Closer view of the start and end of Fig. 5. The accumulated error is visible as the difference between the start and end camera positions.

This same sequence was analyzed with a tighter tolerance on the RANSAC fit. This provided a path more aligned with the walls (X, Y axes); however, a rotation error falsely shows the path deviating off in a diagonal direction near the end (Fig. 7). The bottom of the image shows the 3D path from the side; as in most of the sequences, the vertical error is not large relative to the path length.

Figure 7: Recovered trajectory of the mobile robot in a rectangular path, with a tighter RANSAC tolerance.

The 'zig-zag' sequence shown in Fig. 8 involved the robot going down a shallow ramp, driving level, and then climbing a shallow ramp. The ramp angle can be seen in Fig. 9.

Figure 8: Zig-zag sequence (see end view in Fig. 9).

Figure 9: End view of the zig-zag sequence in Fig. 8.

The system was also tested on a straight line run along a long hallway (50 m), shown in Fig. 10; the resulting path is shown in Fig. 11. There are some orientation errors at a few spots in the sequence which warped the overall path, which was otherwise mostly straight.

Figure 10: Long straight hallway test (see Fig. 11).

Figure 11: Long straight hallway test.

Another experiment in the same room had a larger error, seen in Fig. 12. This was one of the few sequences that had an error in the vertical direction; the path ends up about 80cm below the floor by the end of the sequence.

Figure 12: Figure-8 trajectory.

Fig. 13 shows two runs of approximately 25 m each. The left run is mostly correct except that the two end segments should be more parallel (due to a computer crash the robot did not complete a full loop). The right image shows a failed test in which the robot travelled forwards and backwards in a hallway in roughly the same direction, with the failure likely occurring at the point of turning around.

Figure 13: An incomplete loop (the robot did not complete the loop, only 3 sides were captured) and a hallway test.

4.2 Hand-held 6-DOF Experiments

Our main focus is on mobile robot navigation, but we did conduct two tests where the Kinect sensor was not attached to the robot and was instead moved around by hand. The first sequence (Fig. 14) has the Kinect moved around in a figure-8 path; this has error that is somewhat worse than the robot mounted tests. It should be noted that with some landmark storage, as in Zhu et al [21], this error could be repaired for mapping tasks.

In the second sequence, shown in Fig. 15, the Kinect sensor was moved around to spell 'NCART' (the laboratory name)¹. The final 'T' appears to have some error.

Figure 14: The Kinect sensor moved by hand around the lab on a 36 m figure-8 path. Some orientation error occurs about half way through.

Figure 15: The Kinect sensor held in hand and moved around to spell 'NCART'.

5 Conclusion

A visual odometry system for a mobile robot was demonstrated that utilizes the 3D data available from a 3D video sensor, Microsoft's Kinect. SIFT features were detected in the visible image, their 3D coordinates were determined from the depth image, and a RANSAC algorithm used the 3D matches to solve for the rotation and translation between image frames. The successive R, T measurements were chained together to recover the robot's trajectory.

Our prototype system demonstrated satisfactory performance, considering various imperfections in the system. It demonstrates the use of both 2D and 3D information for ego-motion estimation. The system differs from Kinect visual SLAM systems in that the 3D data is used in the camera pose estimation itself, instead of being added afterwards to create a 3D environment model.

Further tests need to be done to measure the error more accurately; however, qualitative evaluation demonstrates the limits of the current system. The system is fragile to a single frame-to-frame relative pose being incorrect, particularly in moments of sharp rotation. Future work will see features matched across more than adjacent frames to reduce the probability of large errors being made. More accurate calibration of the Kinect sensor data will yield better results still.

Finally, future work includes changing the SIFT feature processing to a GPU accelerated SIFT or SURF implementation to allow real time visual odometry on the robot.

¹ This was inspired by G. Roth's research group webpage before the year 2000, where a recovered camera path spelling the group name was the laboratory logo.

References

[1] http://en.wikipedia.org/wiki/kinect.

[2] http://openkinect.org/wiki/main_page.

[3] Y. Aloimonos. Visual navigation: Flies, bees and UGVs. In Visual Navigation: From Biological Systems to Unmanned Ground Vehicles, Advances in Computer Vision, volume 2, 1997.

[4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, pages 346–359, 2008.

[5] R. Bunschoten and B. Kröse. Visual odometry from an omnidirectional vision system. In Proceedings - IEEE International Conference on Robotics and Automation, pages 577–583, 2003.

[6] Y. Cheng, M. Maimone, and L. Matthies. Visual odometry on the Mars exploration rovers. In Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics, pages 903–910, 2005.

[7] P. Corke, D. Strelow, and S. Singh. Omnidirectional visual odometry for a planetary rover. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4007–4012, 2004.

[8] M. Fischler and R. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. In Communications of the ACM, pages 381–395, 1981.

[9] P. Furgale and T. Barfoot. Stereo mapping and localization for long-range path following on rough terrain. In IEEE International Conference on Robotics and Automation, pages 4410–4416, 2010.

[10] A. Howard. Real-time stereo visual odometry for autonomous ground vehicles. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3946–3952, 2008.

[11] M. Kaess, K. Ni, and F. Dellaert. Flow separation
for fast and robust stereo odometry. In Proceedings -
IEEE International Conference on Robotics and Au-
tomation, pages 3539–3544, 2009.
[12] A. Levin and R. Szeliski. Visual odometry and map
correlation. In Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition, pages 1611–1618, 2004.
[13] D. Lowe. Object recognition from local scale-
invariant features. In Proceedings of the IEEE In-
ternational Conference on Computer Vision, pages
1150–1157, 1999.
[14] R. Mannadiar. Efficient online egomotion estimation using visual and inertial readings. In Proc. of CRV'09 (Canadian Conference on Computer and Robot Vision), pages 244–251, 2009.
[15] L. Matthies and S. Shafer. Error modelling in stereo navigation. IEEE Journal of Robotics and Automation, RA-3(3), 1987.
[16] K. Ni and F. Dellaert. Stereo tracking and three-
point/one-point algorithms - a robust approach in
visual odometry. In Proceedings - International Con-
ference on Image Processing, pages 2777–2780, 2006.
[17] D. Nistér, O. Naroditsky, and J. Bergen. Visual odom-
etry. In Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recog-
nition, pages 1652–1659, 2004.
[18] C. Olson, L. Matthies, M. Schoppers, and M. Mai-
mone. Stereo ego-motion improvements for robust
rover navigation. In Proceedings - IEEE Inter-
national Conference on Robotics and Automation,
pages 1099–1104, 2001.
[19] F. Pagel. Robust monocular egomotion estimation based on an IEKF. In Proc. of CRV'09 (Canadian Conference on Computer and Robot Vision), pages 213–220, 2009.
[20] S. Se, T. Barfoot, and P. Jasiobedzki. Visual mo-
tion estimation and terrain modeling for planetary
rovers. In European Space Agency, (Special Publica-
tion) ESA SP, (603), pages 785–792, 2005.
[21] Z. Zhu, T. Oskiper, S. Samarasekera, R. Kumar, and
H. Sawhney. Ten-fold improvement in visual odom-
etry using landmark matching. In Proceedings of the
IEEE International Conference on Computer Vision,
page art. no. 4409062, 2007.

