Figure 2: Front view of Microsoft's Kinect sensor, highlighting the RGB camera, the infrared detector, and the IR laser.

The Kinect sensor contains two imagers: a color visible light camera and an infrared depth sensor. Each has a horizontal field of view (FoV) of 57 degrees and a vertical FoV of 43 degrees, and returns depth and visible data at a resolution of 640x480 at a rate of 30 Hz. The infrared camera is located 2.5 cm away from the visible light camera, so for accurate calibration the stereo disparity between these cameras should be used.

Depth information is obtained by a structured infrared laser that emits a pattern of dots; the parallax shift of the dots is measured for each pixel in the infrared detector. The depth data is stored in blocks of 11 bits per pixel, for up to 2048 levels of depth sensitivity. The Kinect's workings are detailed in [1].

The Kinect is accessed directly via the USB interface, without the need for the Xbox game console it was intended to be connected to. We used the interface from the OpenKinect project [2], which provides libraries and code to access the two image streams (visible and IR depth) on the Kinect's USB interface.

2.1 Calibration and Point mapping

The following describes and justifies the simple linear calibration performed to map points between the visible and IR depth images, and describes how a 3D point is calculated from a 2D image point.

Figure 3: Examples of images captured by the Kinect sensor. Left: visible image (turned to greyscale). Right: infrared depth image.

Fig. 3 shows the imagery captured by our system from the Kinect. To map a point between the visible and infrared images, a mapping is necessary to account for the different focal lengths and image centers of the two sensors. This calibration was necessary for each individual Kinect sensor. The pinhole model is represented in the camera matrices Kvis and Kir for the visible and IR depth cameras respectively. The focal lengths were measured to be approximately fxvis = fyvis = 520 and fxir = fyir = 475 across the three Kinect sensors we tested. However, it was found necessary to calibrate at least the ratios between the focal lengths and the image centers, as well as to approximate the stereo disparity for a given depth, in order to provide a good mapping between objects in both images. A perfect pinhole model was used in our implementation, neglecting radial and thin prism distortion.

Under the following assumptions the mapping can be separated into the two simple linear expressions in Eqn. 1: 1) the cameras have no rotation relative to each other, 2) there is no skew factor (element k12 in either matrix), and 3) stereo disparity effects are neglected. The four parameters (A, B, C, D) in Eqn. 1 were calculated with a least-squares fit over a number of points where the same world point was identified in both images. Regarding the stereo disparity between the two cameras: the two image sensors are b = 2.5 cm apart and the focal length of the IR depth camera is approximately fxir = fyir = 475, so at a distance of z = 3 m there is a disparity of d = fxir * b/z = 475 * 0.025/3 ≈ 4 pixels. If we assume all objects are at this distance, and that the point matches between the visible and IR images during calibration are for objects at this distance, then this disparity will be absorbed into (A, B, C, D). For objects that deviate from this distance, such as an object at z = 1.5 m, the disparity will be d = 475 * 0.025/1.5 ≈ 7.9 pixels, while for an object further away at z = 5 m the disparity will be d = 475 * 0.025/5 ≈ 2.4 pixels. This results in a horizontal error of 7.9 − 4 = 3.9 and 2.4 − 4 = −1.6 pixels respectively. This error is acceptable since the IR depth image does not have sharp edges, owing to its recognition of a projected IR dot pattern. SIFT or SURF features on objects that are at least 5 pixels away from a depth discontinuity, on mostly fronto-parallel object planes, will therefore not produce an error in distance. SIFT or SURF features that do lie on such borders will likely be filtered out in the RANSAC fitting stage. Furthermore, features on an edge discontinuity could be false features anyway, formed not by object features but by the combination of foreground and background, and will (should) not be matched.
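As a concrete illustration of the least-squares fit of (A, B, C, D), the following minimal sketch (our own, not the paper's code; NumPy and the array and function names are assumptions) fits the two independent linear mappings of Eqn. 1 from a list of hand-identified correspondences:

```python
import numpy as np

def fit_linear_mapping(uv_vis, uv_ir):
    """Fit u_ir = A*u_vis + B and v_ir = C*v_vis + D (Eqn. 1) by least squares.

    uv_vis, uv_ir : (N, 2) arrays of pixel coordinates of the same world
    points located in the visible and IR depth images respectively.
    """
    ones = np.ones(len(uv_vis))
    # Each axis is fitted independently: [u_vis, 1] . [A, B]^T = u_ir
    A, B = np.linalg.lstsq(np.column_stack([uv_vis[:, 0], ones]),
                           uv_ir[:, 0], rcond=None)[0]
    C, D = np.linalg.lstsq(np.column_stack([uv_vis[:, 1], ones]),
                           uv_ir[:, 1], rcond=None)[0]
    return A, B, C, D

def map_vis_to_ir(u_vis, v_vis, A, B, C, D):
    # Apply Eqn. 1 to map a visible-image pixel into the IR depth image.
    return A * u_vis + B, C * v_vis + D
```

Because the correspondences are collected at roughly one working depth (around 3 m here), the constant disparity at that depth is absorbed into the offsets B and D, as argued above.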
uir = A · uvis + B,    vir = C · vvis + D    (1)

Fig. 4 shows the parameters we obtained for three Kinect sensors. The first, K0, is the one used in the experiments. The variation in the parameters demonstrates the necessity of performing this calibration for each Kinect sensor.

Figure 4: Table of calibration parameters used in our first order approximation for mapping points between the visible and IR depth images.

A point in the IR depth image can be mapped to a 3D point using the unit ray direction given by the inverse of the Kir matrix in Eqn. 3 and the value of the depth image d(uir, vir). The procedure is therefore: locate a feature at position (uvis, vvis) in the visible image, convert it to a position in the IR depth image using Eqn. 1, sample the depth d(uir, vir), and calculate the 3D coordinate (Xc, Yc, Zc) (in camera coordinates) using Eqn. 4.
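A minimal sketch of this three-step procedure is given below. It is not the paper's implementation: Eqns. 3 and 4 are not reproduced in this excerpt, so the standard pinhole back-projection is used in their place; the depth image is assumed to be already converted to metres; and the image centre (320, 240) is an assumed value for the 640x480 sensor. Only the focal length fxir = fyir = 475 comes from the text.

```python
import numpy as np

# Approximate IR intrinsics: fxir = fyir = 475 as measured above;
# the image centre (320, 240) is an assumption for the 640x480 sensor.
K_ir = np.array([[475.0,   0.0, 320.0],
                 [  0.0, 475.0, 240.0],
                 [  0.0,   0.0,   1.0]])

def point_from_feature(u_vis, v_vis, depth_m, K_ir, A, B, C, D):
    """Return the 3D camera-frame point (Xc, Yc, Zc) for a visible-image feature.

    depth_m : HxW array of metric depths d(u_ir, v_ir) from the IR sensor.
    """
    # Step 1: map the feature into the IR depth image with Eqn. 1.
    u_ir, v_ir = A * u_vis + B, C * v_vis + D
    # Step 2: sample the depth at the mapped location.
    z = depth_m[int(round(v_ir)), int(round(u_ir))]
    # Step 3: back-project along the ray given by inv(K_ir) (standard pinhole form).
    Xc, Yc, Zc = z * (np.linalg.inv(K_ir) @ np.array([u_ir, v_ir, 1.0]))
    return np.array([Xc, Yc, Zc])
```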
Each matched feature is converted to a 3D point (Xc, Yc, Zc) using the depth information and the infrared camera calibration information as described in the previous section. A set of 3D matches is thus provided for each pair of successive frames framei and framei+1. A rotation matrix R and translation vector T are calculated (the translation is in real-world units, accurate to scale). These are calculated using up to N (200 in our implementation) RANSAC random trials, in which 4 3D matches are used to calculate a hypothesis Rh and Th. The full set of 3D matches between framei and framei+1 is then tested with Rh, Th to see if the points align. A threshold of 1.5 cm in three-dimensional distance was used to classify a 3D point match as an inlier.

The trial with the largest number of inliers is chosen and a refined estimate of the pose is calculated. The refined rotation Rref is obtained from the common practice of using the SVD decomposition of the 3D vectors from each inlier point to the centroid of the inliers, as explained in the next paragraph. The refined translation Tref is obtained from the difference between the centroids of the two point sets.

Consider two sets of corresponding 3D points Ik and Jk (1 ≤ k ≤ N), where N is the number of inliers and each point is taken relative to the centroid of its set. A matrix A can be created from these:
      | Σk xik·xjk   Σk xik·yjk   Σk xik·zjk |
  A = | Σk yik·xjk   Σk yik·yjk   Σk yik·zjk |                  (5)
      | Σk zik·xjk   Σk zik·yjk   Σk zik·zjk |

  trace = tr[RA]                                                (6)

  SVD(A):  A = U D V^T                                          (7)

  trace = tr[RA] = tr[R U D V^T]                                (8)

  trace_max = tr[V U^T U D V^T] = tr[V D V^T] = Σ diag(D)       (9)

  R = V U^T                                                     (10)
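A minimal sketch of this refinement step is shown below (our own code, not the paper's; the determinant check guarding against a reflection is a standard addition that the text does not mention):

```python
import numpy as np

def refine_pose(points_i, points_j):
    """Refined (R, T) from inlier 3D matches so that points_j ~ R @ points_i + T."""
    ci, cj = points_i.mean(axis=0), points_j.mean(axis=0)
    I, J = points_i - ci, points_j - cj      # each set expressed relative to its centroid
    A = I.T @ J                              # 3x3 correlation matrix of Eqn. 5
    U, D, Vt = np.linalg.svd(A)              # A = U D V^T, Eqn. 7
    R = Vt.T @ U.T                           # R = V U^T maximises tr[RA], Eqn. 10
    if np.linalg.det(R) < 0:                 # guard against a reflection (assumption)
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    T = cj - R @ ci                          # Tref from the difference of centroids
    return R, T
```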
A verification step prevented the relative pose from being used if the number of inliers was below a minimum (15 matches were used) or if the translation or rotation was spuriously large (the robot travelled at less than 30 cm per second, so 50 cm per frame was used as a translation threshold, and 60 degrees was considered a liberal limit on the rotation between two frames).
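The RANSAC trials and the verification step just described might look roughly like the following sketch (our own illustration; refine_pose is the SVD routine sketched above, and the parameter and function names are ours):

```python
import numpy as np

def ransac_pose(pts_i, pts_j, n_trials=200, inlier_thresh=0.015,
                min_inliers=15, max_trans=0.5, max_rot_deg=60.0):
    """Relative pose between frame_i and frame_i+1 from (N, 3) arrays of 3D matches.

    Returns (R, T), or None if the verification step rejects the estimate.
    """
    rng = np.random.default_rng()
    best = np.zeros(len(pts_i), dtype=bool)
    for _ in range(n_trials):
        idx = rng.choice(len(pts_i), size=4, replace=False)    # 4 matches per hypothesis
        Rh, Th = refine_pose(pts_i[idx], pts_j[idx])
        err = np.linalg.norm(pts_j - (pts_i @ Rh.T + Th), axis=1)
        inliers = err < inlier_thresh                          # 1.5 cm 3D distance test
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < min_inliers:                               # too few inliers
        return None
    R, T = refine_pose(pts_i[best], pts_j[best])               # refined estimate
    angle = np.degrees(np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)))
    if np.linalg.norm(T) > max_trans or angle > max_rot_deg:   # spuriously large motion
        return None
    return R, T
```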
In our experiments we allowed full 6-DOF motion, despite the fact that the robot was moving on a flat floor in all but one case (the 'zig-zag' sequence on the ramps, Figs. 8 and 9). In most cases the recovered trajectory was along the Z = 0 plane (our horizontal floor plane); however, one could obtain better accuracy in indoor robot cases by limiting the pose estimates to a 3-DOF motion model. We demonstrate two examples in Section 4.2 where the Kinect sensor is not attached to the robot and is moved around by hand.
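If one did want the 3-DOF restriction mentioned above, a simple post-hoc projection of each relative pose onto planar motion could look like the sketch below. This is our own illustration, not something implemented in the paper; it assumes the floor is the Z = 0 plane, as stated above.

```python
import numpy as np

def project_to_planar(R, T):
    """Project a 6-DOF relative pose onto planar 3-DOF motion (yaw plus x, y translation)."""
    yaw = np.arctan2(R[1, 0], R[0, 0])       # heading change about the vertical (Z) axis
    c, s = np.cos(yaw), np.sin(yaw)
    R_planar = np.array([[c,  -s,  0.0],
                         [s,   c,  0.0],
                         [0.0, 0.0, 1.0]])
    T_planar = np.array([T[0], T[1], 0.0])   # drop the vertical component
    return R_planar, T_planar
```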
Figure 5: Recovered trajectory of mobile robot in rectangular path.

Figure 6: Closer view of the start and end of Fig. 5. The accumulated error is visible as the difference between the start and end camera positions.
4 Experiments
4.1 Mobile Robot Experiments
The Pioneer mobile robot with the Kinect sensor was tested in 12 different runs in indoor situations. For the first run we have ground-truth error for the start and end locations. Fig. 5 shows a run where the robot travelled 21 m in a rectangular path around a set of lab tables. The trajectory was off by 1.2 m at the end; a zoomed-in view of the end is shown in Fig. 6. For the other experiments it was not easy to obtain ground truth, as it was difficult to remotely control the robot to pass accurately over desired landmarks.
This same sequence was analyzed with a tighter tolerance on the RANSAC fit. This provided a path more closely aligned with the walls (the X and Y axes); however, a rotation error falsely shows the path deviating along a diagonal near the end (Fig. 7). The bottom of the figure shows the 3D path from the side; as in most of the sequences, the vertical error is not large relative to the path length.

Figure 7: Recovered trajectory of mobile robot in rectangular path.
The ’zig-zag’ sequence shown in Fig. 8 involved
the robot going down a shallow ramp, driving level,
and then climbing a shallow ramp. The ramp angle
can be seen in Fig. 9.
Figure 10: Long straight hallway test (see Fig. 11).

Another experiment in the same room had a larger error, seen in Fig. 12. This was one of the few sequences that had an error in the vertical direction; the path ends up about 80 cm below the floor by the end of the sequence.

Fig. 13 shows two runs of approximately 25 m each. The left run is mostly correct except that the ...

4.2 Hand-held 6-DOF Experiments

Our main focus is on mobile robot navigation, but we did conduct two tests where the Kinect sensor was not attached to the robot and was instead moved around by hand. The first sequence (Fig. 14) has the Kinect moved around in a figure-8 path; this has error that is a bit worse than the robot-mounted tests. It should be noted that with some landmark storage, as in Zhu et al. [21], this error could be repaired for mapping tasks.
In the second sequence, in Fig. 15, the Kinect sensor was moved around to spell 'NCART' (the laboratory name). The final 'T' appears to have some error.

Figure 14: The Kinect sensor moved by hand around the lab on a 36 m figure-8 path. Some orientation error occurs about half way through.

Figure 15: The Kinect sensor held in hand and moved around to spell 'NCART'.

... moments of sharp rotation. Future work will see features matched across more than adjacent frames, to reduce the probability of large errors being made. More accurate calibration of the Kinect sensor data will yield better results still. Finally, future work includes changing the SIFT feature processing to a GPU-accelerated SIFT or SURF implementation, to allow real-time visual odometry on the robot.
References
[1] http://en.wikipedia.org/wiki/kinect.
[4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. Computer Vision and Image Understanding (CVIU), 110(3):346–359, 2008.
[11] M. Kaess, K. Ni, and F. Dellaert. Flow separation
for fast and robust stereo odometry. In Proceedings -
IEEE International Conference on Robotics and Au-
tomation, pages 3539–3544, 2009.
[12] A. Levin and R. Szeliski. Visual odometry and map
correlation. In Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition, pages 1611–1618, 2004.
[13] D. Lowe. Object recognition from local scale-
invariant features. In Proceedings of the IEEE In-
ternational Conference on Computer Vision, pages
1150–1157, 1999.
[14] R. Mannadiar. Efficient online egomotion estimation using visual and inertial readings. In Proc. of CRV'09 (Canadian Conference on Computer and Robot Vision), pages 244–251, 2009.
[15] L. Matthies and S. Shafer. Error modelling in stereo navigation. IEEE Journal of Robotics and Automation, RA-3(3), 1987.
[16] K. Ni and F. Dellaert. Stereo tracking and three-
point/one-point algorithms - a robust approach in
visual odometry. In Proceedings - International Con-
ference on Image Processing, pages 2777–2780, 2006.
[17] D. Nistér, O. Naroditsky, and J. Bergen. Visual odom-
etry. In Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recog-
nition, pages 1652–1659, 2004.
[18] C. Olson, L. Matthies, M. Schoppers, and M. Mai-
mone. Stereo ego-motion improvements for robust
rover navigation. In Proceedings - IEEE Inter-
national Conference on Robotics and Automation,
pages 1099–1104, 2001.
[19] F. Pagel. Robust monocular egomotion estimation based on an IEKF. In Proc. of CRV'09 (Canadian Conference on Computer and Robot Vision), pages 213–220, 2009.
[20] S. Se, T. Barfoot, and P. Jasiobedzki. Visual motion estimation and terrain modeling for planetary rovers. In European Space Agency (Special Publication) ESA SP-603, pages 785–792, 2005.
[21] Z. Zhu, T. Oskiper, S. Samarasekera, R. Kumar, and H. Sawhney. Ten-fold improvement in visual odometry using landmark matching. In Proceedings of the IEEE International Conference on Computer Vision, art. no. 4409062, 2007.