
Journal of Virtual Reality and Broadcasting, Volume 13(2016), no. 1

Real-time depth camera tracking with CAD models and ICP


Otto Korkalo
VTT Technical Research Centre of Finland
P.O. Box 1000, FI-02044 VTT, Finland
[email protected]

Svenja Kahn
Department of Virtual and Augmented Reality
Fraunhofer IGD
Darmstadt, Germany

Abstract

In recent years, depth cameras have been widely utilized in camera tracking for augmented and mixed reality. Many of the studies focus on methods that generate the reference model simultaneously with the tracking and allow operation in unprepared environments. However, methods that rely on predefined CAD models have their advantages. In such methods, the measurement errors are not accumulated to the model, they are tolerant to inaccurate initialization, and the tracking is always performed directly in the reference model's coordinate system. In this paper, we present a method for tracking a depth camera with existing CAD models and the Iterative Closest Point (ICP) algorithm. In our approach, we render the CAD model using the latest pose estimate and construct a point cloud from the corresponding depth map. We construct another point cloud from the currently captured depth frame, and find the incremental change in the camera pose by aligning the point clouds. We utilize a GPGPU-based implementation of the ICP which efficiently uses all the depth data in the process. The method runs in real time, it is robust to outliers, and it does not require any preprocessing of the CAD models. We evaluated the approach using the Kinect depth sensor, and compared the results to a 2D edge-based method, to a depth-based SLAM method, and to the ground truth. The results show that the approach is more stable than the edge-based method and suffers less from drift than the depth-based SLAM.

Keywords: Augmented reality, Mixed reality, Tracking, Pose estimation, Depth camera, Kinect, CAD model, ICP

Digital Peer Publishing Licence
Any party may pass on this Work by electronic means and make it available for download under the terms and conditions of the current version of the Digital Peer Publishing Licence (DPPL). The text of the licence may be accessed and retrieved via Internet at http://www.dipp.nrw.de/.

1 Introduction

Augmented reality (AR) provides an intuitive way to show relevant information to guide a user in complex tasks like maintenance, inspection, construction and navigation [Azu97, vKP10]. In AR, the image streams are superimposed in real time with virtual information that is correctly aligned with the captured scene in 3D. For example, assembly instructions can be virtually attached to an object of interest in the real world, or an object of the real world can be highlighted in the augmented camera image [HF11]. In augmented assembly, it is also important to visualize the quality of the work: the user may have forgotten to install a part, the part may have been installed in a wrong position, or a wrong part may have been used. For this purpose, the real scene and its digital counterpart have to be compared to find the possible 3D differences between them [KBKF13]. Furthermore, diminished reality is a technique where the user's view is altered by removing real objects from the images and possibly replacing them with virtual content [MF01].


For example, in AR assisted decoration, existing furniture is removed and replaced with digital furniture to aid in planning a new room layout.

AR, diminished reality and other related applications require that the position and the orientation (pose) of the camera (the user's view) can be estimated and tracked precisely in real time. The most common approach is to analyze the captured 2D images, and various optical tracking methods have been proposed, from easily detectable fiducial markers to natural image features [ZDB08, LF05]. Simultaneous localization and mapping (SLAM) approaches are attractive since they do not require any preparation of the environment in order to operate. Instead, the scene model is reconstructed from the image observations while simultaneously tracking the camera [BBS07, KM07, DRMS07]. However, in most AR applications, the camera pose has to be defined exactly in the reference object's coordinate frame, and model-based tracking solutions are desirable. Model-based tracking methods aim to fit features (typically edges) extracted from the camera image to 2D projections of the 3D model of the reference target in order to estimate the 6-DoF transformation between them [LF05].

A common requirement of 2D image-based camera pose estimation approaches is that the captured scene needs to provide features which are visible in the 2D camera image and which can be analyzed in order to estimate the camera pose. For example, due to a lack of detectable 2D features, it is very difficult to estimate the camera pose if the captured scene has untextured monochromatic surfaces or if the lighting conditions are difficult. Strong shadows are indistinguishable from actual edges, reflections of light disturb the feature detection, and dim illumination increases the noise level.

In recent years, 2D imaging has been complemented by the development of depth cameras. They operate at up to 30 frames per second, and measure each pixel's distance from the camera to the object in the real world [HLCH12, GRV+13]. While initially very expensive and rather inaccurate, technological advancements have led to the development of cheap and more precise depth cameras for the consumer mass market. Depth sensors have become commodity hardware, and their availability, price and size are nowadays close to conventional 2D cameras.

Depth cameras have clear advantages in terms of camera pose estimation and tracking. They are tolerant to common problems that appear in monocular camera tracking, including changes in illumination, repetitive textures and lack of features. Typical depth camera technologies (time-of-flight, structured light) rely on active illumination, so they can also operate in low light conditions. The appearance of the depth maps depends mainly on the 3D geometry of the scene, and thus depth cameras are attractive devices for camera tracking. Recent research on depth camera based tracking focuses mainly on SLAM and other approaches that create the reference model during the operation. Such trackers can perform in unprepared environments, but they still have drawbacks compared to trackers that utilize predefined models.

In this paper, we present and evaluate a model-based tracking method for depth cameras that utilizes predefined CAD models to obtain the camera pose. We take advantage of the precise CAD models commonly available in industrial applications, and apply the iterative closest point (ICP) algorithm for registering the latest camera pose with the incoming depth frame. We use a direct method, where all the depth data is used without explicit feature extraction. With a GPGPU implementation of the ICP, the method is fast and runs at real-time frame rates. The main benefits of the proposed approach are:

• In contrast to monocular methods, the approach is robust with both textured and non-textured objects and with monochromatic surfaces. The approach does not require any explicit feature extraction from the (depth) camera frames.

• In contrast to depth-based SLAM methods, measurement and tracking errors are not accumulated, the method is faster, and it always tracks directly in the reference target's coordinate system. The approach is robust to differences between the CAD model and the real target geometry. Thus, it can be used in applications such as difference detection for quality inspection.

• Virtually any 3D CAD model can be used for tracking. The only requirement is that the model needs to be rendered, and that the corresponding depth map has to be retrieved from the depth buffer for the tracking pipeline.

The remainder of this paper is structured as follows: in Section 2, we give an overview of model-based optical tracking methods as well as methods utilizing depth cameras. In Section 3, we detail our CAD model-based depth camera tracking approach.


Section 4 provides an evaluation of the method. We describe the datasets and the evaluation criteria, and compare the results to the ground truth, to a 2D edge-based method, and to a depth-based SLAM method. In Section 5 we present the results, and experiment with the factors that affect the performance of the approach. Finally, in Section 6, the results are discussed and a brief description of future work is presented.

2 Related work

2.1 Real-time model-based tracking of monocular cameras

Edges are relatively invariant to illumination changes, and they are easy to detect from the camera images. There are multiple studies that focus on model-based monocular tracking using edges. In the typical approach, the visible edges of the 3D CAD model are projected to the camera image using the camera pose from a previous time step, and aligned with the edges that are extracted from the latest camera frame. The change of the pose between the two consecutive frames is found by minimizing the reprojection error of the edges. One of the first real-time edge-based implementations was presented in [Har93], where a set of control points is sampled from the model edges and projected to the image. The algorithm then searches for strong image gradients in the camera frame along the direction of the control point normals. The maximum gradient is considered to be the correspondence for the current control point projection. Finally, the camera pose is updated by minimizing the sum of squared differences between the point correspondences.

The method presented in [Har93] is sensitive to outliers (e.g. multiple strong edges along the search line, partial occlusions), and a wrong image gradient maximum may be assigned to a control point, leading to a wrong pose estimate. Many papers propose improvements to the method. In [DC02], robust M-estimators were used to lower the importance of outliers in the optimization loop, a RANSAC scheme was applied e.g. in [AZ95, BPS05], and a multiple hypothesis assignment was used in conjunction with a robust estimator e.g. in [WVS05]. In [KM06], a particle filter was used to find the globally optimal pose. The system was implemented using a GPU, which enabled fast rendering of visible edges as well as efficient likelihood evaluation of each particle. Edge-based methods have also been realized with point features. In [VLF04], 3D points lying on the model surface were integrated into the pose estimation loop together with the edges.
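To make the classical control-point scheme concrete, the sketch below shows the 1D search for the strongest image gradient along a control point's normal. It is a simplified, hypothetical reconstruction of the approach described above (the image wrapper, function and variable names are ours), not code from any of the cited systems.

```cpp
#include <cmath>
#include <vector>

// Minimal gradient-magnitude image wrapper (row-major), assumed for illustration.
struct GradientImage {
    int width = 0, height = 0;
    std::vector<float> data;
    float at(int x, int y) const { return data[y * width + x]; }
    bool inside(int x, int y) const { return x >= 0 && y >= 0 && x < width && y < height; }
};

struct Point2 { float x, y; };

// Step along the control point normal and return the signed offset of the
// strongest gradient response; the squared offsets are what the edge-based
// pose update minimizes.
float searchAlongNormal(const GradientImage& grad, Point2 p, Point2 normal, int range)
{
    const float len = std::sqrt(normal.x * normal.x + normal.y * normal.y);
    const Point2 n{normal.x / len, normal.y / len};

    float bestOffset = 0.0f, bestResponse = -1.0f;
    for (int s = -range; s <= range; ++s) {
        const int x = static_cast<int>(std::lround(p.x + s * n.x));
        const int y = static_cast<int>(std::lround(p.y + s * n.y));
        if (!grad.inside(x, y)) continue;
        if (grad.at(x, y) > bestResponse) {
            bestResponse = grad.at(x, y);
            bestOffset = static_cast<float>(s);
        }
    }
    return bestOffset;
}
```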
2.2 Real-time depth camera tracking

The Kinect sensor was the first low-cost device to capture accurate depth maps at real-time frame rates. After it was released, many researchers used the sensor for real-time depth-based and RGB-D based SLAM. Many of the studies incorporate the iterative closest point (ICP) algorithm in the inter-frame pose update. In ICP based pose update, the 3D point pairing is a time consuming task, and several variants have been proposed to reduce the computational load for real-time performance. In KinectFusion [NIH+11], an efficient GPU implementation of the ICP algorithm was used for the pose update in depth-based SLAM. The ICP variant of KinectFusion utilizes projective data association and a point-to-plane error metric. With a parallelized GPU implementation, all of the depth data can be used efficiently without explicitly selecting the point correspondences for the ICP. In [TAC11], a bi-objective cost function combining depth and photometric data was used in ICP for visual odometry. As in KinectFusion, the method uses an efficient direct approach where the cost is evaluated for every pixel without explicit feature selection. The SLAM approach presented in [BSK+13] represents the scene geometry with a signed distance function, and finds the change in the camera pose parameters by minimizing the error directly between the distance function and the observed depth, leading to a faster and more accurate result compared to KinectFusion.

SLAM and visual odometry typically utilize the entire depth images in tracking, and the reference model is reconstructed from the complete scene. In object tracking, however, the reference model is separated from the background and the goal is to track a moving target in a possibly cluttered environment, with less (depth) information and fewer geometrical constraints. In [CC13], a particle filter is used for real-time RGB-D based object tracking. The approach uses both photometric and geometrical features in a parallelized GPU implementation, and uses point coordinates, normals and color for likelihood evaluation. ICP was used in [PLW11] for inter-frame tracking of objects that are reconstructed from the scene on-line. Furthermore, the result from ICP is refined by using the 3D edges of


the objects similarly to [DC02].


Although SLAM enables straightforward deploy-
ment of an augmented reality system, model-based
methods still have their advantages compared to
SLAM. Especially in industrial AR applications, it is
important that the camera pose is determined exactly
in the target object’s coordinate system so that the vir-
tual content can be rendered in exactly the correct po-
sition in the image. As SLAM methods track the cam-
era in the first frame’s coordinate system, they may
drift due to wrong initialization or inaccuracies in the
reconstructed model. The depth measurements are dis-
turbed by lens and depth distortions, and for example,
Kinect devices suffer from strong non-linear depth distortions as described in [HKH12]. In SLAM methods, the measurement errors will eventually accumulate, which may cause the tracker to drift. Model-based approaches, however, solve the camera pose directly in the reference target's coordinate system and allow the camera pose estimate to "slide" to the correct result. Scene geometry also sets limitations on the performance of depth-based SLAM methods. In [MIK+12], it was found that with Kinect devices, the minimum size of object details in the reconstruction is approximately 10 mm, which also represents the minimum radius of curvature in the scene that can be captured. Thus, highly concave scenes and sharp edges may be problematic for depth-based SLAM. In model-based tracking, the reference CAD model is accurate and does not depend on the measurement accuracy or the object geometry. Thus, the tracking errors are distributed more evenly compared to SLAM.

Figure 1: Top left: The raw depth frame captured from the Kinect sensor. Top right: The artificial depth map rendered using the Kinect's intrinsics and the pose from the previous time step. Bottom left: The difference image of the rendered depth map and the raw depth frame before the pose update. Bottom right: The corresponding difference image after the pose update. The colorbar units are in mm.

3 CAD model-based depth camera tracking

3.1 Overview of the approach

The goal of model-based depth camera tracking is to estimate the pose of the camera relative to a target object of the real world at every time step by utilizing a reference model of the target in the process. We use a 3D CAD model of the target as a reference. The main idea of our approach is to construct a 3D point cloud from the latest incoming raw depth frame, and align it with a point cloud that we generate from the reference model using the sensor intrinsics and extrinsics from the previous time step. The incremental change in the sensor pose is then multiplied to the pose of the last time step. Figure 1 illustrates the principle. We utilize ICP for finding the transformation between the point clouds. The ICP implementation is a modified version of KinFu, an open source implementation of KinectFusion available in the PCL library [RC11]. In the following we revise the method and detail the modifications we made to the original implementation. The block diagram of the method is shown in Figure 2.

Figure 2: Block diagram of the model-based depth camera tracking approach. The change in the depth sensor pose is estimated by aligning the captured depth frame with the depth frame obtained by rendering the reference model with the previous time step's pose estimate. Lens distortion compensation and bilateral smoothing of the raw depth frame (marked with *) are optional steps in the processing pipeline.

3.2 Camera model and notations

The depth camera is modeled with the conventional pinhole camera model. The sensor intrinsics are denoted with K, which is a 3 × 3 upper triangular matrix containing the sensor's focal lengths and principal point. We denote the sensor extrinsics (pose) with P = [R | t], where R is the 3 × 3 camera orientation matrix and t is the camera position vector.

We denote a 3D point cloud with a set of 3D vertices V = {v_1, v_2, ...} where v_i = (x_i, y_i, z_i)^T, and similarly, we denote a set of point normal vectors with N = {n_1, n_2, ...}. To indicate the reference coordinate system of a point cloud, we use superscript g for the global coordinate frame (i.e. the reference model's coordinate system) and c for the camera coordinate frame. Subscripts s and d refer to the source and destination point sets used in ICP, respectively.
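The notation above maps directly onto a few small data structures. The following C++ sketch (using Eigen; the struct and function names are our own and not part of the paper's implementation) shows one way to represent the intrinsics K, the pose P = [R | t], a point cloud with its normals, and the camera-to-global transformation used throughout the rest of the pipeline.

```cpp
#include <Eigen/Dense>
#include <vector>

// Pinhole intrinsics K: focal lengths and principal point.
struct Intrinsics {
    float fx, fy, cx, cy;
};

// Sensor extrinsics P = [R | t]: orientation R and position t.
struct Pose {
    Eigen::Matrix3f R = Eigen::Matrix3f::Identity();
    Eigen::Vector3f t = Eigen::Vector3f::Zero();
};

// A point cloud V = {v_1, v_2, ...} with per-vertex normals N = {n_1, n_2, ...},
// stored in arrays of the same size as the depth image (invalid entries are NaN).
struct Cloud {
    std::vector<Eigen::Vector3f> vertices;
    std::vector<Eigen::Vector3f> normals;
};

// Map a vertex from the camera frame (superscript c) to the global frame
// (superscript g, the reference model's coordinate system): v^g = R v^c + t.
inline Eigen::Vector3f cameraToGlobal(const Pose& P, const Eigen::Vector3f& v_c)
{
    return P.R * v_c + P.t;
}
```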


3.3 Generating and preprocessing the depth maps

The process starts by capturing a raw depth frame from the sensor and applying two optional steps: lens distortion correction and noise reduction by filtering. For compensating the lens distortions, we use a standard polynomial lens distortion model. A bilateral filter is used to smooth the depth frame while keeping the depth discontinuities sharp. In the original implementation, bilateral filtering was used to prevent the noisy measurements from being accumulated in the reconstructed model, but the lens distortions were ignored. In our experiments, we evaluated the approach with both options turned on and off. The captured depth map is converted into a three-level image pyramid. At each pyramid level l, the down-scaled depth image pixels are back projected to 3D space to construct 3D point clouds V_s^{c,l} in the camera coordinate frame. Additionally, normals N_s^{c,l} are calculated for the vertices. The point clouds and normals are stored into arrays of the same size as the depth image at the current image pyramid level.

We render the reference CAD model from the previous time step's camera view using the latest depth camera pose estimate P_{k-1} and the depth sensor intrinsics K in the process. The frame size is set to the size of the raw depth frames. We read the corresponding depth map from the depth buffer, and construct a depth image pyramid similarly to the raw depth maps. We construct 3D point clouds V_d^{c,l} for each pyramid level l, and calculate the corresponding normals N_d^{c,l}. Finally, we transform the point clouds to the global coordinate system to obtain V_d^{g,l}, and rotate the normals accordingly.

We run the lens distortion compensation on the CPU, and as in the original implementation, the rest of the preprocessing steps are performed on the GPU using the CUDA language.
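As a minimal sketch of the back-projection and normal estimation described above, the CPU-side function below converts one pyramid-level depth image into a point cloud with normals, reusing the Intrinsics and Cloud types from the Section 3.2 sketch. The paper's pipeline performs these steps in CUDA on the GPU; this plain C++ version (with our own function name and conventions, e.g. depth in meters and 0 marking a missing measurement) is for illustration only.

```cpp
#include <Eigen/Dense>
#include <limits>
#include <vector>

// Intrinsics and Cloud are the structures from the Section 3.2 sketch.

// Back-project a depth image at one pyramid level into a 3D point cloud in the
// camera frame, and estimate a normal for each vertex from its right and lower
// neighbors.
void depthToCloud(const std::vector<float>& depth, int w, int h,
                  const Intrinsics& K, Cloud& cloud)
{
    const float nan = std::numeric_limits<float>::quiet_NaN();
    cloud.vertices.assign(static_cast<size_t>(w) * h, Eigen::Vector3f::Constant(nan));
    cloud.normals.assign(static_cast<size_t>(w) * h, Eigen::Vector3f::Constant(nan));

    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            const float z = depth[y * w + x];
            if (z <= 0.0f) continue;                          // missing range measurement
            cloud.vertices[y * w + x] =
                Eigen::Vector3f((x - K.cx) * z / K.fx,        // back-projection through K
                                (y - K.cy) * z / K.fy, z);
        }

    // Normal from the cross product of the vectors to the right and lower neighbors.
    for (int y = 0; y + 1 < h; ++y)
        for (int x = 0; x + 1 < w; ++x) {
            const Eigen::Vector3f& v  = cloud.vertices[y * w + x];
            const Eigen::Vector3f& vx = cloud.vertices[y * w + x + 1];
            const Eigen::Vector3f& vy = cloud.vertices[(y + 1) * w + x];
            if (!v.allFinite() || !vx.allFinite() || !vy.allFinite()) continue;
            cloud.normals[y * w + x] = (vx - v).cross(vy - v).normalized();
        }
}
```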
3.4 Incremental pose update with ICP

The change of the camera pose between two consecutive time steps k-1 and k is estimated by finding the rigid 6-DoF transformation P' = [R' | t'] that aligns the source point cloud V_s^g with the destination point cloud V_d^g. The procedure is done iteratively using ICP at different pyramid levels, starting from the coarsest level and proceeding to the full-scale point clouds. At each ICP iteration, the point cloud V_s^{c,l} is transformed to the world frame with the latest estimate of P_k, and the result V_s^{g,l} is compared with the point cloud V_d^{g,l} to evaluate the alignment error. The error is minimized to get the incremental change P', which is accumulated to P_k. Initially, P_k is set to P_{k-1}. A different number of ICP iterations is used for each pyramid level; in the original implementation of KinFu, the number of iterations is set to L = {10, 5, 4} (starting from the coarsest level). In addition to that, we experimented with only one ICP run for each pyramid level, and set L = {1, 1, 1}.
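The coarse-to-fine accumulation can be summarized by the skeleton below. It reuses the types from the Section 3.2 sketch and assumes a helper icpStep that performs one linearized point-to-plane update (a possible realization of that step is sketched after Equation 3); the structure is our own illustration, not the KinFu code itself.

```cpp
#include <Eigen/Dense>
#include <array>

// Pose, Cloud and Intrinsics are the structures from the Section 3.2 sketch.

// One ICP iteration at a given pyramid level: pair the points, build the
// linearized point-to-plane system and return the incremental transform P'.
Pose icpStep(const Cloud& source, const Cloud& dest, const Pose& P_k,
             const Pose& P_prev, const Intrinsics& K, int level);

// Coarse-to-fine pose update: start from the previous pose P_{k-1} and
// accumulate the incremental change P' at every iteration.
Pose updatePose(const std::array<Cloud, 3>& sourcePyr,   // from the raw depth frame
                const std::array<Cloud, 3>& destPyr,     // from the rendered CAD model
                const Pose& P_prev, const Intrinsics& K)
{
    const std::array<int, 3> iterations = {10, 5, 4};    // L = {10, 5, 4}, coarsest first
    Pose P_k = P_prev;                                    // initially P_k = P_{k-1}

    for (int level = 2; level >= 0; --level)              // level 2 = coarsest, 0 = full scale
        for (int i = 0; i < iterations[2 - level]; ++i) {
            const Pose inc = icpStep(sourcePyr[level], destPyr[level], P_k, P_prev, K, level);
            P_k.R = inc.R * P_k.R;                         // accumulate P' into P_k
            P_k.t = inc.R * P_k.t + inc.t;
        }
    return P_k;
}
```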
KinFu utilizes a point-to-plane error metric to compute the cost of the difference between the point clouds. The points of the source and destination point clouds are matched to find a set of point pairs. For each point pair, the distance between the source point and the corresponding destination point's tangent plane is calculated. Then the difference between the point clouds is defined as the sum of squared distances:

\sum_i \bigl( (R' v_{s,i} + t' - v_{d,i}) \cdot n_{d,i} \bigr)^2 . \qquad (1)

The rotation matrix R' is linearized around the previous pose estimate to construct a linear least squares problem.


Assuming small incremental changes in the rotation, the linear approximation of R' becomes

\tilde{R}' = \begin{pmatrix} 1 & -\gamma & \beta \\ \gamma & 1 & -\alpha \\ -\beta & \alpha & 1 \end{pmatrix} , \qquad (2)

where \alpha, \beta and \gamma are the rotations around the x, y and z axes, respectively. Denoting r' = (\alpha, \beta, \gamma)^T, the error can be written as

\sum_i \bigl( (v_{s,i} - v_{d,i}) \cdot n_{d,i} + r' \cdot (v_{s,i} \times n_{d,i}) + t' \cdot n_{d,i} \bigr)^2 . \qquad (3)

The minimization problem is solved by calculating the partial derivatives of Equation 3 with respect to the transformation parameters r' and t' and setting them to zero. The equations are collected into a linear system of the form Ax = b, where x consists of the transformation parameters, b is the residual and A is a 6 × 6 symmetric matrix. The system is constructed on the GPU, and solved using Cholesky decomposition on the CPU.
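The sketch below shows how Equation 3 leads to the 6 × 6 system Ax = b: each valid point pair contributes a row J = [v_s × n_d, n_d] and a point-to-plane residual, the normal equations are accumulated, and x = (α, β, γ, t')^T is obtained with a Cholesky-type factorization. This is a CPU/Eigen illustration of the computation that KinFu performs on the GPU; point pairing and validity checks are omitted, and the names are ours.

```cpp
#include <Eigen/Dense>
#include <vector>

struct PointPair {                       // one matched source/destination pair
    Eigen::Vector3f vs, vd, nd;          // source vertex, destination vertex and normal
};

// Accumulate the normal equations of the linearized point-to-plane cost
// (Equation 3) and solve for x = (alpha, beta, gamma, tx, ty, tz).
Eigen::Matrix<float, 6, 1> solveIncrement(const std::vector<PointPair>& pairs)
{
    Eigen::Matrix<float, 6, 6> A = Eigen::Matrix<float, 6, 6>::Zero();
    Eigen::Matrix<float, 6, 1> b = Eigen::Matrix<float, 6, 1>::Zero();

    for (const PointPair& p : pairs) {
        Eigen::Matrix<float, 6, 1> J;
        J.head<3>() = p.vs.cross(p.nd);            // derivative w.r.t. r' = (alpha, beta, gamma)
        J.tail<3>() = p.nd;                        // derivative w.r.t. t'
        const float r = (p.vd - p.vs).dot(p.nd);   // point-to-plane residual
        A += J * J.transpose();                    // 6 x 6 symmetric system
        b += J * r;
    }
    return A.ldlt().solve(b);                      // Cholesky-type (LDLT) factorization
}

// Rotation increment from the small-angle approximation of Equation 2.
Eigen::Matrix3f incrementalRotation(const Eigen::Matrix<float, 6, 1>& x)
{
    Eigen::Matrix3f R;
    R <<  1.0f, -x(2),  x(1),
          x(2),  1.0f, -x(0),
         -x(1),  x(0),  1.0f;
    return R;
}
```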
To define the point pairs between the source and the destination point clouds, KinFu utilizes projective data association. At each ICP iteration, the points of V_s^{g,l} are transformed to the camera coordinate system of the previous time step, and projected to the image domain:

u = \mathrm{proj}\bigl( K \cdot R_{k-1}^{-1} \cdot (v_s - t_{k-1}) \bigr) , \qquad (4)

where proj(·) is the perspective projection including the dehomogenization of the points. The set of tentative point correspondences is then defined between the points of V_s^{g,l} and the points of V_d^{g,l} that correspond to the image pixel coordinates u.

The tentative point correspondences are checked for outliers by calculating their Euclidean distance and the angle between their normal vectors. If the points are too distant from each other, or the angle is too large, the point pair is ignored in the ICP update. In our experiments, we used a 50 mm threshold for the distance and a 20 degree threshold for the angle. The Kinect cannot produce range measurements from some materials such as reflective surfaces, under heavy sunlight, outside its operating range and from occluded surfaces, and such source points are ignored too. Furthermore, we ignore destination points that have an infinite depth value, i.e. the depth map pixels onto which no object points are projected when rendering the depth map.
are projected when rendering the depth map. to the ground truth pose, and let them run as long as
The proposed tracking approach simplifies the use the estimated position and orientation remained within
of 3D CAD models in visual tracking since there is the predefined limits. Otherwise the tracker was con-


Otherwise the tracker was considered to be drifting, and its pose was reset back to the ground truth. The tracker's pose was reset if the absolute error between the estimated position and the ground truth was more than 20 cm, or if the angle difference was more than 10 degrees.

Figure 3: The reference CAD models used to evaluate the proposed approach. Top and bottom left: Target 1 consists of several convex objects attached to a common plane. The model is partially textured and partially plain white. Middle: Target 2 is a car's dashboard. The model differs from its real counterpart in the steering wheel, the gear stick and the middle console. Right: Target 3 does not have geometry in the vertical dimension, and the ICP based approach is not fully constrained by the target.

Due to lens and depth distortions as well as noise in the depth measurements, the hand-eye calibration between the Faro measurement arm and the Kinect device is inaccurate. The result depends on the calibration data, and a calibration obtained with close range measurements may give inaccurate results with long range data and vice versa. Thus, we estimated the isometric transformation between the resulting trajectories and the ground truth, and generated a corrected ground truth trajectory for each sequence individually. For the final results, we repeated the tests using the corrected ground truth trajectories as reference.

4.2 Evaluation criteria

4.2.1 Absolute accuracy

We measured the accuracy of the trackers by calculating the mean of the absolute differences between the estimated sensor positions and the (corrected) ground truth over the test sequences. Similarly, we measured the error in orientation, and calculated the mean of the absolute differences between the angles. We define the angle error as the angle difference between the quaternion representations of the orientations. We calculated the corresponding standard deviations for evaluating the jitter, and used the number of required tracker resets as a measure of robustness.
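For illustration, the two error quantities can be computed as below (Eigen assumed; function names are ours). The angle error is the rotation angle of the relative quaternion between the estimated and ground-truth orientations, and the position error is the Euclidean distance between the estimated and ground-truth positions; their means and standard deviations over a sequence give the accuracy and jitter figures reported later in Tables 2 and 3.

```cpp
#include <Eigen/Dense>
#include <Eigen/Geometry>

// Angle error: rotation angle of the relative quaternion, in degrees.
float angleErrorDeg(const Eigen::Quaternionf& estimated,
                    const Eigen::Quaternionf& groundTruth)
{
    return estimated.angularDistance(groundTruth) * 180.0f / 3.14159265f;
}

// Position error: Euclidean distance between estimated and ground-truth positions.
float positionError(const Eigen::Vector3f& estimated,
                    const Eigen::Vector3f& groundTruth)
{
    return (estimated - groundTruth).norm();
}
```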
4.2.2 3D reprojection errors

In AR applications, it is essential that the rendered model is aligned accurately with the view, and the reprojection error is typically used to measure the accuracy of vision-based trackers. In 2D analysis, the reprojection error is calculated by summing up the squared differences between the observed and reprojected model points in the image domain after the camera pose update. We use a similar approach in 3D, and calculate the differences between the observed and rendered depth maps. We define two error metrics using the depth: error metric A and error metric B.

The error metric A is the difference between the depth map rendered using the ground truth pose and the depth map rendered using the estimated pose. This measures the absolute accuracy of the tracker.


It takes into account the range measurement errors, but cannot distinguish the inaccuracies in the hand-eye calibration from the real positioning errors. The error metric can also be used to evaluate the monocular edge-based method. The error metric is defined for the pixels where either the first or the second input depth map has a valid value.

The error metric B is the difference between the depth map rendered using the estimated pose and the raw depth map captured from the camera. The error metric is similar to the 2D reprojection error, and it describes how well the model is aligned with the captured depth images. The lens distortions and errors in the range measurements may cause inaccurate pose estimation, to which this error metric is not sensitive. However, it is important for AR applications as it measures how accurately the virtual objects can be overlaid over the (depth) images. The error metric is defined only for the pixels where both input depth maps have valid values.

The error metrics are illustrated in Figure 4. For the evaluation, we calculated difference images using the error metrics A and B, and visualized the results using histograms. Each histogram bin contains the number of positive and negative differences at a bin size of 2 mm. We normalized the histograms so that the maximum value of the bins was set to one, and the other bins were scaled respectively. To emphasize the distribution of the errors, we ignored coarse outliers (absolute differences over 50 mm) from the histograms, and report their ratio in the difference images in tables.

Figure 4: 3D error metrics used in the evaluation. Left: The difference between the depth map rendered using the ground truth pose and the depth map rendered using the estimated pose (error metric A). Right: The difference between the depth map rendered with the estimated pose and the raw depth frame (error metric B). The colorbar units are in mm.
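The following sketch shows one way to compute such a difference image, its 2 mm-bin histogram normalized to a maximum of one, and the ratio of coarse outliers above 50 mm. The validity rule switches between metric A (either map valid; pixels with only one valid value are counted as coarse outliers here, which is our assumption) and metric B (both maps valid); all names and parameters are ours.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct DepthErrorStats {
    std::vector<float> histogram;   // normalized so that the largest bin equals one
    float outlierRatio = 0.0f;      // fraction of |difference| > 50 mm
};

// Per-pixel difference of two depth maps in millimeters (0 = invalid pixel).
DepthErrorStats depthError(const std::vector<float>& d1, const std::vector<float>& d2,
                           bool requireBoth,            // true: metric B, false: metric A
                           float binSizeMm = 2.0f, float outlierMm = 50.0f)
{
    const int bins = 2 * static_cast<int>(outlierMm / binSizeMm) + 1;   // -50 .. +50 mm
    std::vector<float> hist(bins, 0.0f);
    int valid = 0, outliers = 0;

    for (size_t i = 0; i < d1.size(); ++i) {
        const bool v1 = d1[i] > 0.0f, v2 = d2[i] > 0.0f;
        if (requireBoth ? !(v1 && v2) : !(v1 || v2)) continue;
        ++valid;
        if (!(v1 && v2)) { ++outliers; continue; }   // one-sided pixel: counted as outlier (assumption)
        const float diff = d1[i] - d2[i];
        if (std::fabs(diff) > outlierMm) { ++outliers; continue; }
        int bin = static_cast<int>((diff + outlierMm) / binSizeMm);
        bin = std::min(bin, bins - 1);
        hist[bin] += 1.0f;
    }

    float maxBin = 1.0f;
    for (float h : hist) maxBin = std::max(maxBin, h);
    for (float& h : hist) h /= maxBin;               // scale the largest bin to one

    return {hist, valid > 0 ? static_cast<float>(outliers) / valid : 0.0f};
}
```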
4.2.3 Computational performance

We evaluated the computational load of the different approaches by measuring the time required to perform the main steps of the pose update. The evaluation was conducted with a desktop computer (Intel i7-870 3 GHz with an Nvidia GTS 450 graphics card) and with a laptop (Intel i7-3740QM 2.7 GHz with an Nvidia NVS 5200M). The results are shown in Table 1. The timing results for the model-based approach with other parameterizations are discussed in Section 5.5.

Processing step                          Timing
Model-based method
  Constructing the artificial depth map  12 %
  Preprocessing the raw depth            11 %
  Preprocessing the artificial depth     11 %
  Updating the pose                      66 %
  Total, desktop PC                      60 ms
  Total, laptop PC                       160 ms
KinFu
  Preprocessing the raw depth            10 %
  Updating the pose                      50 %
  Volume integration                     35 %
  Raycasting the artificial depth        5 %
  Total, desktop PC                      130 ms
  Total, laptop PC                       240 ms
Edge-based method
  Edge shader and sampling               50 %
  Finding point correspondences          29 %
  Updating the pose                      21 %
  Total, laptop PC                       15 ms

Table 1: Timing results for the camera pose update with the different methods. The model-based tracker and KinFu were evaluated with the laptop (Intel i7-3740QM 2.7 GHz with Nvidia NVS 5200M) and the desktop PC (Intel i7-870 3 GHz with Nvidia GTS 450). The edge-based method was evaluated with the laptop only.

4.3 Datasets

4.3.1 Target 1

Target 1 has seven objects attached to a common plane: two pyramids, two half spheres and two boxes. The size of the plane is approximately 1 × 1.5 m, and the objects are from 10 to 12 cm in height. The target has variance in shape in every dimension, and the objects have sharp edges and corners.


Thus, it constrains both the depth-based and the monocular edge-based tracking methods. Furthermore, the object has textured and non-textured parts. The surface material gives a good response to the Kinect, but in some experiments the camera was moved very close to the target and part of the depth measurements was lost (the minimum distance for range measurements with the Kinect is approximately 40 cm). We captured three sequences from Target 1 as follows:

Sequence 1.1 The sequence starts such that the whole target is in the camera view. The camera is moved from side to side four times so that the optical center is directed to the center of the target. In the last part of the sequence, the camera is moved closer to the target, and the range measurements are partially lost.

Sequence 1.2 The sequence starts on the right side of the target so that approximately half of the target is visible. The camera is moved closer to the target and the range measurements are partially lost. Finally, the camera is moved from side to side twice.

Sequence 1.3 The sequence starts on the left side of the target so that approximately half of the target is visible. The camera is moved closer to the target and is rotated from side to side (yaw angle). Finally, the camera is moved back and forth. During the sequence, the camera is moved close to the target, and the range measurements are partially lost.

4.3.2 Target 2

Target 2 is a car dashboard of regular size and material. Compared to the reference CAD model, the target does not have the steering wheel, and the gear stick and the middle console are different. Similarly to Target 1, Target 2 has variance in shape in every dimension as well as relatively sharp edges. We captured two sequences from Target 2 as follows:

Sequence 2.1 The sequence starts such that the dashboard is completely in the camera view. The camera is moved closer to the left side, and then around the gear stick to the right side of the target. During the sequence, there is no notable change in the roll or pitch angles of the camera orientation.

Sequence 2.2 The sequence starts such that the camera is pointing to the right side of the target and is relatively close in distance. The camera is moved around the gear stick so that the target fills the camera view almost completely. Then, the camera is moved back to the right side and pulled back so that the whole target becomes visible in the camera. During the sequence, there is no notable change in the roll or pitch angles of the camera orientation.

4.3.3 Target 3

Target 3 is a plastic object with a matte, light red surface. The shape of the object is smooth and curved, and it has no vertical changes in geometry. Thus, the ICP is not constrained in every dimension. The target is also challenging for the 2D edge-based tracker, since the object's outer contour is the only edge that can be used in the registration process. We captured the following sequence from Target 3:

Sequence 3.1 The sequence starts from the right side such that the target is completely in the camera view and the camera is directed towards the center of the target. The camera is moved to the left side so that the target is kept completely in the camera view, and the distance to the target remains constant. During the sequence, there is no notable change in the roll or pitch angles of the camera orientation.

5 Results

5.1 Sequence 1.1

All trackers perform robustly in Sequence 1.1. Figure 5 shows the absolute errors of the trajectories (positions) given by the different methods. Neither the model-based tracker nor KinFu is reset during the test, and the monocular edge-based tracker is reset twice. The absolute translation error of the model-based tracker remains mostly under 20 mm. Compared to the model-based method, the edge-based tracker is on average more accurate but suffers more from jitter and occasional drifting. The translation error of KinFu is small in the beginning but increases as the tracking proceeds, and reaches a maximum of approximately 40 mm near frame 250. The mean error of the model-based tracker is 14.4 mm and the standard deviation 5.9 mm (Table 2).

Figure 5: Absolute error of the estimated camera position using different tracking methods. Red curves refer
to the model-based tracker, green to KinFu and blue to the edge-based method. Vertical lines denote tracker
resets. Y-axis indicates the error value at each frame in mm, and x-axis is the frame number.


Figure 6: The distribution of the errors computed using the error metric A. The coarse outliers (absolute value
more than 50 mm) are ignored. The histograms are normalized so that their maximum values are set to one,
and the other values are scaled respectively.

Figure 7: The distribution of the errors computed using the error metric B. The coarse outliers (absolute value
more than 50 mm) are ignored. The histograms are normalized so that their maximum values are set to one,
and the other values are scaled respectively.

The corresponding values for the KinFu and edge-based trackers are 20.2 mm (10.7 mm) and 18.7 mm (26.4 mm), respectively. The angle errors behave similarly to the translation errors, and the rest of the results are shown in Table 3.

The distributions of the reprojection errors computed using the error metric A are shown in Figure 6. The error distribution of each tracker is symmetric. The model-based and the edge-based methods slightly overestimate the distance to the target, whereas KinFu on average underestimates the distance. The model-based approach has the narrowest and KinFu the broadest distribution of errors. Table 4 shows the ratio of coarse outliers (absolute differences over 50 mm) in the difference images. The ratios of outliers for the model-based tracker and KinFu are similar (4.6 % and 4.2 %, respectively), and for the edge-based method 7.3 %.

To evaluate how accurately virtual data could be registered with the raw depth video, we calculated the reprojection errors for the model-based method and KinFu using the error metric B. The error histograms in Figure 7 show that the errors of the model-based tracker are symmetrically distributed around zero. The ratio of coarse outliers is 1.1 % (Table 5). The error distribution of the KinFu tracker is centered around +6 mm, and the shape is skewed towards positive values. The ratio of outliers is 5.3 %.

5.2 Sequences 1.2 and 1.3

Compared to Sequence 1.1, the model-based tracker performs more accurately in Sequences 1.2 and 1.3.


In Sequence 1.2, the mean absolute error of the position is 5.3 mm and the standard deviation 3.8 mm. In Sequence 1.3, the corresponding values are 9.0 mm and 5.0 mm, respectively. The tracker is reset three times during Sequence 1.3 and can track Sequence 1.2 completely without resets. In Sequences 1.2 and 1.3, the camera is moved closer to the target and the depth data is partially lost.

Presumably KinFu suffers from the incomplete depth data: its mean absolute error and standard deviation in Sequence 1.2 are more than doubled compared to Sequence 1.1, and almost tripled in Sequence 1.3. The numbers of resets of KinFu are six and three in Sequences 1.2 and 1.3, respectively. In Sequence 1.2, the resets occur close to frame 400, where the camera is close to the target and approximately half of the depth pixels are lost. The accuracy of the edge-based method decreases slightly too. It is reset seven times during Sequence 1.2 and eleven times in Sequence 1.3. In Sequence 1.3, between frames 150 and 200, all of the trackers are reset multiple times. During that time interval, the camera is moved close to the target and approximately half of the depth pixels are lost. Additionally, the camera is rotated relatively fast around its yaw axis. Tables 2 and 3 show the rest of the results.

         Model-based   KinFu         Edge-based
Seq 1.1  14.4 (5.9)    20.2 (10.7)   18.7 (26.4)
Seq 1.2  5.3 (3.8)     43.4 (26.3)   26.0 (37.0)
Seq 1.3  9.0 (5.0)     54.1 (36.0)   26.4 (38.6)
Seq 2.1  7.2 (3.2)     15.7 (10.4)   75.6 (47.0)
Seq 2.2  6.8 (3.2)     16.8 (6.5)    67.3 (34.8)
Seq 3.1  50.5 (28.4)   24.5 (8.8)    50.4 (47.6)

Table 2: Mean absolute errors and standard deviations of the estimated sensor position (in mm).

         Model-based   KinFu        Edge-based
Seq 1.1  0.6 (0.3)     1.0 (0.5)    1.0 (1.2)
Seq 1.2  0.6 (0.4)     2.7 (1.6)    1.8 (2.3)
Seq 1.3  0.5 (0.5)     2.6 (1.5)    1.6 (1.9)
Seq 2.1  0.5 (0.3)     0.9 (0.6)    3.5 (2.0)
Seq 2.2  0.5 (0.2)     1.0 (0.5)    4.6 (2.0)
Seq 3.1  1.8 (1.2)     1.4 (0.5)    3.0 (2.8)

Table 3: Mean absolute errors and standard deviations of the estimated sensor orientation (in degrees).

         Model-based   KinFu    Edge-based
Seq 1.1  4.6 %         4.2 %    7.3 %
Seq 1.2  1.9 %         11.6 %   9.4 %
Seq 1.3  3.1 %         11.6 %   11.0 %
Seq 2.1  4.4 %         13.9 %   44.2 %
Seq 2.2  4.8 %         8.3 %    41.7 %
Seq 3.1  25.8 %        5.2 %    18.4 %

Table 4: The ratio of outliers in the difference images calculated using the error metric A.

         Model-based   KinFu
Seq 1.1  1.1 %         5.3 %
Seq 1.2  0.9 %         5.5 %
Seq 1.3  0.7 %         11.6 %
Seq 2.1  35.9 %        58.3 %
Seq 2.2  34.9 %        47.2 %
Seq 3.1  8.4 %         5.2 %

Table 5: The ratio of outliers in the difference images calculated with the error metric B.

The distributions of the reprojection errors in Figures 6 and 7 are similar to Sequence 1.1. Also, the ratios of outliers in Tables 4 and 5 are consistent with the tracking errors. Figure 8 has example images of the evaluation process in Sequence 1.2. As shown in the images, the depth data is incomplete and partially missing since the sensor is closer to the target than its minimum sensing range. Both model-based approaches are able to maintain the tracks accurately, but the drift of KinFu is clearly visible.

5.3 Sequences 2.1 and 2.2

The CAD model of Target 2 differs from its real counterpart, and there are coarse outliers in the depth data of Sequences 2.1 and 2.2. The translation errors in Figure 5 show that both the model-based tracker and KinFu perform robustly, and the trackers are not reset during the tests. The edge-based method suffers from drift and is reset five times in both experiments. Tables 2 and 3 as well as Figure 5 show that the accuracy of the model-based method is comparable to the first three experiments, and that the approach is the most accurate of the methods.


The error histograms based on the error metric A are shown in Figure 6. The results of the model-based tracker are similar to the first three experiments, and the errors are distributed symmetrically with close to zero mean. The error distributions of KinFu and the edge-based method are more widespread, and the drift of the edge-based method is especially visible. For the model-based tracker and KinFu, the ratios of outliers in the reprojection errors are similar to Target 1, and for the edge-based method the ratio clearly increases. The error histograms based on the error metric B show that the model-based tracker performs consistently, and that the reprojected model was aligned to the captured depth frames without bias. The KinFu tracker has a more widespread error distribution. Table 5 shows that there are more coarse outliers in the results of KinFu as well. Note that due to differences between the reference CAD model and its real counterpart, the number of outliers is relatively high for both methods.

The images in Figure 8 show tracking examples from Sequence 2.1. The difference images computed using the error metric B show that the model-based tracker aligns the observed depth maps accurately with the rendered model, and the real differences are clearly distinguishable from the images. With KinFu, the real differences and positioning errors are mixed. The error metric A shows that the model-based approach is close to the ground truth and major errors are present only around the edges of the target.

Figure 8: Tracker performance evaluation examples in different scenarios. Top row images are from frame 150 of Sequence 1.2 and bottom row images are from frame 250 of Sequence 2.1. Top row images 1-2 (from the left): Results of the model-based method calculated with the 3D error metrics A and B. Top row images 3-4: Corresponding results for KinFu. Top row image 5: The result of the edge-based method calculated with the 3D error metric A. Bottom row images are ordered similarly to the top row. The colorbar units are in mm.

5.4 Sequence 3.1

Target 3 does not constrain the ICP in the vertical dimension, and the model-based tracker fails to track the camera. Figure 5 shows that the model-based tracker drifts immediately after the initial reset, and that there are only a few sections in the experiment where the tracker is stable (but still off from the ground truth trajectory). Since the model-based tracker was drifting, we did not compensate the bias in the hand-eye calibration for any of the methods (see Section 4.1). The edge-based tracker performs better and is able to track the camera for most of the frames, although it was reset seven times during the test. KinFu performs equally well compared to the previous experiments, and is able to track the camera over the whole sequence without significant drift. The result is unexpected since KinFu's camera pose estimation is based on the ICP. We assume that noisy measurements are accumulated to the 3D reconstruction, and that these inaccuracies in the model constrain the ICP in the vertical dimension.

5.5 Factors affecting the accuracy

In AR applications, it is essential that the tracking system performs without lag and as close to real-time frame rates as possible.


When a more computationally intensive method is used for the tracking, a lower frame rate is achieved and a wider baseline between successive frames needs to be matched in the pose update. We evaluated the effect of lens distortions, raw data filtering and the number of ICP iterations separately on the accuracy in Sequences 1.1 and 2.1. Each of them increases the computational time and all of them are optional. Table 6 shows the results. Compared to the results shown in Table 2 (lens distortion compensation off, bilateral filtering off, number of ICP iterations set to L = {10, 5, 4}), it can be seen that the bilateral filtering step does not improve the accuracy, and can be ignored for the model-based tracking approach. Lens distortion compensation improved the accuracy only slightly in Sequence 1.1, but by approximately 26 % in Sequence 2.1. Reducing the number of ICP iterations causes no notable change in Sequence 1.1 and decreases the accuracy by 7 % in Sequence 2.1. With the laptop PC, the lens distortion compensation (computed on the CPU) takes approximately 7 ms, and the tracker takes 50 ms with ICP iterations L = {1, 1, 1} versus 160 ms with L = {10, 5, 4}. Bilateral filtering (computed on the GPU) does not add notable computational load.

         Filtered     Undistorted  Iteration test
Seq 1.1  14.5 (6.0)   13.9 (5.4)   14.5 (5.8)
Seq 2.1  7.2 (3.3)    5.3 (2.6)    7.7 (3.9)

Table 6: Mean absolute error and standard deviation of the estimated sensor position (in mm) with different tracking options using the model-based tracker. "Filtered" refers to experiments where the bilateral filtering of the raw depth frames was turned on, "Undistorted" to experiments with (spatial) lens distortion compensation, and "Iteration test" to experiments where the ICP was run only once at each pyramid level.

In addition to noise and lens distortions, the Kinect suffers from depth distortions that depend on the measured range and that are unevenly distributed in the image domain [HKH12]. We calculated the mean positive and negative residual images over Sequence 1.3 using the error metric B and the model-based tracker. We thresholded the images to ±10 mm to emphasize the sensor depth measurement errors and to suppress the pose estimation errors. Figure 9 shows the error images, which are similar to the observations in [HKH12]. We did not evaluate the effect of the range measurement errors quantitatively, but in applications that require very precise tracking, the compensation of such errors should be considered.

Figure 9: The spatial distribution of the positive (left image) and negative (right image) depth differences between the depth map rendered with the pose estimate given by the model-based tracker and the raw depth map captured from the camera (error metric B). The images were constructed by calculating the mean errors for every pixel over Sequence 1.3. To emphasize the sensor inaccuracies, the results were thresholded to ±10 mm. The error distribution is similar to the image presented in [HKH12]. The colorbar units are in mm.
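As a small illustration of how the residual images of Figure 9 can be produced, the sketch below accumulates per-pixel mean positive and negative metric-B differences over a sequence and clamps them to ±10 mm; the function name and data layout are our own assumptions.

```cpp
#include <algorithm>
#include <vector>

// Accumulate per-pixel mean positive and negative depth residuals (error
// metric B) over a sequence of difference images, then clamp to +/-10 mm for
// visualization. 'diffs' holds one difference image per frame (0 = invalid).
void meanResidualImages(const std::vector<std::vector<float>>& diffs, int pixels,
                        std::vector<float>& meanPos, std::vector<float>& meanNeg)
{
    std::vector<int> nPos(pixels, 0), nNeg(pixels, 0);
    meanPos.assign(pixels, 0.0f);
    meanNeg.assign(pixels, 0.0f);

    for (const std::vector<float>& frame : diffs)
        for (int i = 0; i < pixels; ++i) {
            if (frame[i] > 0.0f) { meanPos[i] += frame[i]; ++nPos[i]; }
            if (frame[i] < 0.0f) { meanNeg[i] += frame[i]; ++nNeg[i]; }
        }

    for (int i = 0; i < pixels; ++i) {
        if (nPos[i]) meanPos[i] = std::min(meanPos[i] / nPos[i], 10.0f);    // threshold at +10 mm
        if (nNeg[i]) meanNeg[i] = std::max(meanNeg[i] / nNeg[i], -10.0f);   // threshold at -10 mm
    }
}
```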

6 Discussion and conclusion

We proposed a method for real-time CAD model-based depth camera tracking that uses ICP for the pose update. We evaluated the method with three real-life reference targets and with six datasets, and compared the results to depth-based SLAM, to a 2D edge-based method and to the ground truth.

The results show that the method is more robust than the 2D edge-based method and suffers less from jitter. Compared to depth-based SLAM, the method is more accurate and has less drift. Despite incomplete range measurements, noise, and inaccuracies in the Kinect depth measurements, the 3D reprojection errors are distributed evenly and are close to zero mean. For applications that require minimal lag and fast frame rates, it seems sufficient to run the ICP iterations only once for each pyramid level. This does not affect the accuracy or jitter, but speeds up the processing time significantly. In our experiments, filtering the raw depth frames did not improve the tracking accuracy, but for applications that require very precise tracking, the lens distortions should be compensated. Additionally, the Kinect sensor suffers from depth measurement errors. The distribution of the errors in the image domain is complex, and a depth camera model that compensates the errors pixel-wise (e.g. [HKH12]) should be considered.

The ICP may not converge to the global optimum if the target object does not have enough geometrical constraints (the problem has been discussed e.g. in [GIRL03]).


[GIRL03]). This leads to wrong pose estimates and may lead to drift. We envision that the method could
drift, and limits the use of the method to objects that be improved by making partial 3D shape reconstruc-
have variance in shape in all three dimensions. How- tions online, and appending the results to the CAD
ever, in our experiments, KinFu was more stable with model for more constraining geometry. Other sugges-
such object and did not drift during the tests. The exact tion for improvement is to complete the method with
reason for this behavior is unclear to us, but we assume an edge-based approach to prevent the tracker from
that the inaccuracies and noise in range measurements drifting. For example, a 3D cube fully constraints the
are accumulated to the reference model constraining ICP as long as three faces are seen by the camera. But
the tracker. if the camera is moved so that only one face is visible,
We excluded the tracker initialization from this pa- only the distance to the model is constrained. How-
per. In practical applications, the automated initializa- ever, the edge information would be still constraining
tion is required, and to initialize the camera pose one the camera pose.
may apply methods developed for RGB-D based 3D
object detection (e.g. [HLI+ 13]) or methods that rely
on depth information only (e.g. [SX14]). As the ICP 7 Acknowledgments
aligns the model and the raw depth frames in a com-
mon coordinate system, the model-based method (as The authors would like to thank professor Tapio Takala
well as the edge-based method) is forgiving to inaccu- from Aalto University, Finland for valuable com-
rate initialization. The maximum acceptable pose er- ments, and Alain Boyer from VTT Technical Research
ror in the initialization stage depends on the reference Centre of Finland for language revision.
model geometry. Detailed surfaces with a lot of repet-
itive geometry may guide the ICP to local minimum,
but smooth and dominant structures allow the tracker
to slide towards the correct pose. References
Although we did not evaluate the requirements for
the size of the reference model’s appearance in the [AZ95] Martin Armstrong and Andrew Zisser-
camera view, some limitations can be considered. The man, Robust object tracking, Asian Con-
projection of small or distant objects occupy relatively ference on Computer Vision, vol. I, 1995,
small proportion of the depth frames, and the relative pp. 58–61, ISBN 9810071884.
noise level of the depth measurements increases. Thus,
the geometrical constraints may become insufficient [Azu97] Ronald T. Azuma, A survey of aug-
for successful camera pose estimation. Additionally, mented reality, Presence: Teleopera-
if the camera is moved fast or rotated quickly between tors and Virtual Environments 6 (1997),
the consecutive frames, the initial camera pose from no. 4, 355–385, ISSN 1054-7460, DOI
the previous time step may differ significantly from 10.1162/pres.1997.6.4.355.
the current pose. Thus, small or distant objects may
be treated completely as outliers, and the pose update [BBS07] Gabriele Bleser, Mario Becker, and
would fail. The exact requirements for the reference Didier Stricker, Real-time vision-based
model’s visual extent in the camera view depend on tracking and reconstruction, Journal of
the size of the objects and how the camera is moved. Real-Time Image Processing 2 (2007),
Similar methods as suggested for automatic initializa- no. 2, 161–175, ISSN 1861-8200, DOI
tion could be used in background process to reinitialize 10.1007/s11554-007-0034-0.
the pose whenever it has been lost.
With the proposed approach, virtually any CAD model can be used for depth camera tracking. It is only required that the model can be rendered efficiently from the desired camera pose and that the corresponding depth map can be retrieved from the depth buffer. Models that do not have variance in shape in every dimension do not completely constrain the ICP, which may lead to drift. We envision that the method could be improved by making partial 3D shape reconstructions online and appending the results to the CAD model to obtain more constraining geometry. Another suggestion for improvement is to complement the method with an edge-based approach to prevent the tracker from drifting. For example, a 3D cube fully constrains the ICP as long as three of its faces are seen by the camera, but if the camera is moved so that only one face is visible, only the distance to the model is constrained. The edge information, however, would still constrain the camera pose.
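To make the rendering requirement concrete, the sketch below converts a depth buffer read back from an OpenGL-style renderer (non-linear values in [0, 1] between the near and far clipping planes) into metric depth and back-projects it into a point cloud with a pinhole camera model. The intrinsics and clipping planes are illustrative placeholders, not the calibration used in our experiments.

```python
import numpy as np

def depth_buffer_to_metric(zbuf, near, far):
    """Convert non-linear OpenGL depth buffer values in [0, 1] to metric
    eye-space depth, assuming a standard perspective projection."""
    z_ndc = 2.0 * zbuf - 1.0
    depth = (2.0 * near * far) / (far + near - z_ndc * (far - near))
    depth[zbuf >= 1.0] = 0.0          # background pixels carry no geometry
    return depth

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a metric depth map into a 3D point cloud using the
    pinhole intrinsics of the (virtual or real) depth camera."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.dstack([x, y, depth]).reshape(-1, 3)
    return pts[depth.reshape(-1) > 0]  # drop invalid/background pixels

# Hypothetical Kinect-like intrinsics and clipping planes for illustration;
# zbuf would be obtained with e.g. glReadPixels(..., GL_DEPTH_COMPONENT, ...).
zbuf = np.ones((480, 640), dtype=np.float32)   # empty render (all background)
depth = depth_buffer_to_metric(zbuf, near=0.4, far=8.0)
cloud = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(cloud.shape)
```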

7 Acknowledgments

The authors would like to thank Professor Tapio Takala from Aalto University, Finland, for valuable comments, and Alain Boyer from VTT Technical Research Centre of Finland for the language revision.

References

[AZ95] Martin Armstrong and Andrew Zisserman, Robust object tracking, Asian Conference on Computer Vision, vol. I, 1995, pp. 58–61, ISBN 9810071884.

[Azu97] Ronald T. Azuma, A survey of augmented reality, Presence: Teleoperators and Virtual Environments 6 (1997), no. 4, 355–385, ISSN 1054-7460, DOI 10.1162/pres.1997.6.4.355.

[BBS07] Gabriele Bleser, Mario Becker, and Didier Stricker, Real-time vision-based tracking and reconstruction, Journal of Real-Time Image Processing 2 (2007), no. 2, 161–175, ISSN 1861-8200, DOI 10.1007/s11554-007-0034-0.
[BPS05] Gabriele Bleser, Yulian Pastarmov, and Didier Stricker, Real-time 3D camera tracking for industrial augmented reality applications, WSCG '2005: Full Papers: The 13-th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2005 in co-operation with Eurographics: University of West Bohemia, Plzen, Czech Republic (Václav Skala, ed.), 2005, HDL 11025/10951, pp. 47–54, ISBN 80-903100-7-9.

[BSK+13] Erik Bylow, Jürgen Sturm, Christian Kerl, Fredrik Kahl, and Daniel Cremers, Real-time camera tracking and 3D reconstruction using signed distance functions, Robotics: Science and Systems (RSS) Conference 2013, vol. 9, 2013, ISBN 978-981-07-3937-9.

[CC13] Changhyun Choi and Henrik I. Christensen, RGB-D object tracking: a particle filter approach on GPU, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2013, DOI 10.1109/IROS.2013.6696485, pp. 1084–1091.

[DC02] Tom Drummond and Roberto Cipolla, Real-time visual tracking of complex structures, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002), no. 7, 932–946, ISSN 0162-8828, DOI 10.1109/TPAMI.2002.1017620.

[DRMS07] Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse, MonoSLAM: Real-time single camera SLAM, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007), no. 6, 1052–1067, ISSN 0162-8828, DOI 10.1109/TPAMI.2007.1049.

[GIRL03] Natasha Gelfand, Leslie Ikemoto, Szymon Rusinkiewicz, and Marc Levoy, Geometrically Stable Sampling for the ICP Algorithm, Fourth International Conference on 3-D Digital Imaging and Modeling (3DIM), 2003, DOI 10.1109/IM.2003.1240258, pp. 260–267, ISBN 0-7695-1991-1.

[GRV+13] Higinio Gonzalez-Jorge, Belén Riveiro, Esteban Vazquez-Fernandez, Joaquín Martínez-Sánchez, and Pedro Arias, Metrological evaluation of Microsoft Kinect and Asus Xtion sensors, Measurement 46 (2013), no. 6, 1800–1806, ISSN 0263-2241, DOI 10.1016/j.measurement.2013.01.011.

[Har93] Chris Harris, Tracking with rigid models, Active Vision (Andrew Blake and Alan Yuille, eds.), MIT Press, Cambridge, MA, 1993, pp. 59–73, ISBN 0-262-02351-2.

[HF11] Steven Henderson and Steven Feiner, Exploring the benefits of augmented reality documentation for maintenance and repair, IEEE Transactions on Visualization and Computer Graphics 17 (2011), no. 10, 1355–1368, ISSN 1077-2626, DOI 10.1109/TVCG.2010.245.

[HKH12] Daniel Herrera C., Juho Kannala, and Janne Heikkilä, Joint depth and color camera calibration with distortion correction, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012), no. 10, 2058–2064, ISSN 0162-8828, DOI 10.1109/TPAMI.2012.125.

[HLCH12] Miles Hansard, Seungkyu Lee, Ouk Choi, and Radu Horaud, Time of Flight Cameras: Principles, Methods, and Applications, SpringerBriefs in Computer Science, Springer, London, 2012, ISBN 978-1-4471-4658-2, DOI 10.1007/978-1-4471-4658-2.

[HLI+13] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab, Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes, Computer Vision – ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers (Berlin) (Kyoung Mu Lee, Yasuyuki Matsushita, James M. Rehg, and Zhanyi Hu, eds.), Lecture Notes in Computer Science, Vol. 7724, vol. 1, Springer, 2013, DOI 10.1007/978-3-642-37331-2_42, pp. 548–562, ISBN 978-3-642-37330-5.

[KBKF13] Svenja Kahn, Ulrich Bockholt, Arjan Kuijper, and Dieter W. Fellner, Towards precise real-time 3D difference detection for industrial applications, Computers in Industry 64 (2013), no. 9, 1115–1128, ISSN 0166-3615, DOI 10.1016/j.compind.2013.04.004.

[KHW14] Svenja Kahn, Dominik Haumann, and Volker Willert, Hand-eye calibration with a depth camera: 2D or 3D?, 2014 International Conference on Computer Vision Theory and Applications (VISAPP), IEEE, 2014, pp. 481–489.

[KM06] Georg Klein and David W. Murray, Full-3D Edge Tracking with a Particle Filter, Proceedings of the British Machine Vision Conference (Mike Chantler, Bob Fisher, and Manuel Trucco, eds.), BMVA Press, 2006, DOI 10.5244/C.20.114, pp. 114.1–114.10, ISBN 1-901725-32-4.

[KM07] Georg Klein and David Murray, Parallel tracking and mapping for small AR workspaces, 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2007), 2007, DOI 10.1109/ISMAR.2007.4538852, pp. 225–234, ISBN 978-1-4244-1749-0.

[LF05] Vincent Lepetit and Pascal Fua, Monocular model-based 3D tracking of rigid objects, Foundations and Trends in Computer Graphics and Vision 1 (2005), no. 1, 1–89, ISSN 1572-2740, DOI 10.1561/0600000001.

[MF01] Steve Mann and James Fung, VideoOrbits on eye tap devices for deliberately diminished reality or altering the visual perception of rigid planar patches of a real world scene, International Symposium on Mixed Reality (ISMR2001), 2001, pp. 48–55.

[MIK+12] Stephan Meister, Shahram Izadi, Pushmeet Kohli, Martin Hämmerle, Carsten Rother, and Daniel Kondermann, When can we use KinectFusion for ground truth acquisition?, Workshop on Color-Depth Camera Fusion in Robotics, IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.

[NIH+11] Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon, KinectFusion: real-time dense surface mapping and tracking, 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2011, IEEE, 2011, DOI 10.1109/ISMAR.2011.6092378, pp. 127–136, ISBN 978-1-4577-2183-0.

[PLW11] Youngmin Park, Vincent Lepetit, and Woontack Woo, Texture-less object tracking with online training using an RGB-D camera, 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2011, IEEE, 2011, DOI 10.1109/ISMAR.2011.6092377, pp. 121–126, ISBN 978-1-4577-2183-0.

[RC11] Radu B. Rusu and Steve Cousins, 3D is here: point cloud library (PCL), 2011 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2011, DOI 10.1109/ICRA.2011.5980567, pp. 1–4, ISBN 978-1-61284-386-5.

[SX14] Shuran Song and Jianxiong Xiao, Sliding Shapes for 3D Object Detection in Depth Images, Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings (David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, eds.), Lecture Notes in Computer Science, Vol. 8694, vol. 6, Springer, 2014, DOI 10.1007/978-3-319-10599-4_41, pp. 634–651, ISBN 978-3-319-10598-7.

[TAC11] Tommi Tykkälä, Cédric Audras, and Andrew I. Comport, Direct iterative closest point for real-time visual odometry, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), IEEE, 2011, DOI 10.1109/ICCVW.2011.6130500, pp. 2050–2056, ISBN 978-1-4673-0062-9.
[vKP10] Rick van Krevelen and Ronald Poelman, Survey of augmented reality technologies, applications and limitations, The International Journal of Virtual Reality 9 (2010), no. 2, 1–20, ISSN 1081-1451.

[VLF04] Luca Vacchetti, Vincent Lepetit, and Pascal Fua, Combining edge and texture information for real-time accurate 3D camera tracking, Third IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2004), IEEE, 2004, DOI 10.1109/ISMAR.2004.24, pp. 48–56, ISBN 0-7695-2191-6.

[WVS05] Harald Wuest, Florent Vial, and Didier Stricker, Adaptive line tracking with multiple hypotheses for augmented reality, Fourth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'05), IEEE, 2005, DOI 10.1109/ISMAR.2005.8, pp. 62–69, ISBN 0-7695-2459-1.

[WWS07] Harald Wuest, Folker Wientapper, and Didier Stricker, Adaptable model-based tracking using analysis-by-synthesis techniques, Computer Analysis of Images and Patterns: 12th International Conference, CAIP 2007, Vienna, Austria, August 27-29, 2007, Proceedings (Berlin) (Walter G. Kropatsch, Martin Kampel, and Allan Hanbury, eds.), Lecture Notes in Computer Science, Vol. 4673, Springer, 2007, DOI 10.1007/978-3-540-74272-2_3, pp. 20–27, ISBN 978-3-540-74271-5.

[ZDB08] Feng Zhou, Henry Been-Lirn Duh, and Mark Billinghurst, Trends in augmented reality tracking, interaction and display: a review of ten years of ISMAR, ISMAR '08 Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality (Mark A. Livingston, ed.), IEEE, 2008, DOI 10.1109/ISMAR.2008.4637362, pp. 193–202, ISBN 978-1-4244-2840-3.

Citation

Otto Korkalo and Svenja Kahn, Real-time depth camera tracking with CAD models and ICP, Journal of Virtual Reality and Broadcasting, 13(2016), no. 1, August 2016, urn:nbn:de:0009-6-44132, DOI 10.20385/1860-2037/13.2016.1, ISSN 1860-2037.