ing real objects from the images and possibly replacing them with virtual content [MF01]. For example, in AR assisted decoration, existing furniture is removed and replaced with digital furniture to aid in planning a new room layout.

AR, diminished reality and other related applications require that the position and the orientation (pose) of the camera (the user's view) can be estimated and tracked precisely in real time. The most common approach is to analyze the captured 2D images, and various optical tracking methods have been proposed, from easily detectable fiducial markers to natural image features [ZDB08, LF05]. Simultaneous localization and mapping (SLAM) approaches are attractive since they do not require any preparation of the environment in order to operate. Instead, the scene model is reconstructed from the image observations while simultaneously tracking the camera [BBS07, KM07, DRMS07]. However, in most AR applications, the camera pose has to be defined exactly in the reference object's coordinate frame, and model-based tracking solutions are desirable. Model-based tracking methods aim to fit features (typically edges) extracted from the camera image to 2D projections of the 3D model of the reference target to estimate the 6-DoF transformation between them [LF05].

A common requirement of 2D image-based camera pose estimation approaches is that the captured scene needs to provide features which are visible in the 2D camera image and which can be analyzed in order to estimate the camera pose. For example, due to a lack of detectable 2D features, it is very difficult to estimate the camera pose if the captured scene has untextured monochromatic surfaces or the lighting conditions are difficult. Strong shadows are indistinguishable from actual edges, reflections of light disturb the feature detection, and dim illumination increases the noise level.

In recent years, 2D imaging has been complemented by the development of depth cameras. They operate at up to 30 frames per second, and measure each pixel's distance from the camera to the object in the real world [HLCH12, GRV+13]. While initially very expensive and rather inaccurate, technological advancements have led to the development of cheap and more precise depth cameras for the consumer mass market. Depth sensors have become commodity hardware, and their availability, price and size are nowadays close to conventional 2D cameras.

Depth cameras have clear advantages in terms of camera pose estimation and tracking. They are tolerant to common problems that appear in monocular camera tracking, including changes in illumination, repetitive textures and lack of features. Typical depth camera technologies (time-of-flight, structured light) rely on active illumination, so they can also operate in low-light conditions. The appearance of the depth maps depends mainly on the 3D geometry of the scene, and thus depth cameras are attractive devices for camera tracking. Recent research on depth camera based tracking focuses mainly on SLAM and other approaches that create the reference model during operation. Such trackers can perform in unprepared environments, but they still have drawbacks compared to trackers that utilize predefined models.

In this paper, we present and evaluate a model-based tracking method for depth cameras that utilizes predefined CAD models to obtain the camera pose. We take advantage of the precise CAD models commonly available in industrial applications, and apply the iterative closest point (ICP) algorithm to register the latest camera pose with the incoming depth frame. We use a direct method, where all the depth data is used without explicit feature extraction. With a GPGPU implementation of the ICP, the method is fast and runs at real-time frame rates. The main benefits of the proposed approach are:

• In contrast to monocular methods, the approach is robust with both textured and non-textured objects and with monochromatic surfaces. The approach does not require any explicit feature extraction from the (depth) camera frames.

• In contrast to depth-based SLAM methods, measurement and tracking errors are not accumulated, the method is faster, and it always tracks directly in the reference target's coordinate system. The approach is robust to differences between the CAD model and the real target geometry. Thus, it can be used in applications such as difference detection for quality inspection.

• Virtually any 3D CAD model can be used for tracking. The only requirement is that the model needs to be rendered, and that the corresponding depth map has to be retrieved from the depth buffer for the tracking pipeline.

The remainder of this paper is structured as follows: in Section 2, we give an overview of model-based optical tracking methods as well as methods utilizing depth cameras. In Section 3, we detail our CAD model-based depth camera tracking approach.
Section 4 provides an evaluation of the method. We describe the datasets and the evaluation criteria, and compare the results to the ground truth, to a 2D edge-based method, and to a depth-based SLAM method. In Section 5 we present the results, and experiment with the factors that affect the performance of the approach. Finally, in Section 6, the results are discussed and a brief description of future work is presented.

2 Related work

2.1 Real-time model-based tracking of monocular cameras

Edges are relatively invariant to illumination changes, and they are easy to detect from camera images. There are multiple studies that focus on model-based monocular tracking using edges. In the typical approach, the visible edges of the 3D CAD model are projected to the camera image using the camera pose from a previous time step, and aligned with the edges that are extracted from the latest camera frame. The change of the pose between the two consecutive frames is found by minimizing the reprojection error of the edges. One of the first real-time edge-based implementations was presented in [Har93], where a set of control points is sampled from the model edges and projected to the image. The algorithm then searches for strong image gradients in the camera frame along the direction of the control point normals. The maximum gradient is considered to be the correspondence for the current control point projection. Finally, the camera pose is updated by minimizing the sum of squared differences between the point correspondences.
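The search step itself is a short one-dimensional scan. The following sketch (our illustration, not code from [Har93]; grad_mag is an assumed gradient-magnitude image, and the control point p and its edge normal n are given in pixel coordinates) shows the idea:

```python
import numpy as np

def search_edge_correspondence(grad_mag, p, n, search_len=10):
    """Scan the gradient-magnitude image along the control point's
    normal and return the position of the strongest response, which
    is taken as the tentative correspondence for the control point."""
    n = n / np.linalg.norm(n)                  # unit normal in the image plane
    offsets = np.arange(-search_len, search_len + 1)
    samples = p[None, :] + offsets[:, None] * n[None, :]  # scan-line points
    ix = np.clip(np.round(samples[:, 0]).astype(int), 0, grad_mag.shape[1] - 1)
    iy = np.clip(np.round(samples[:, 1]).astype(int), 0, grad_mag.shape[0] - 1)
    responses = grad_mag[iy, ix]
    best = int(np.argmax(responses))           # the maximum gradient wins
    return samples[best], responses[best]
```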
The method presented in [Har93] is sensitive to outliers (e.g. multiple strong edges along the search line, partial occlusions), and a wrong image gradient maximum may be assigned to a control point, leading to a wrong pose estimate. Many papers propose improvements to the method. In [DC02], robust M-estimators were used to lower the importance of outliers in the optimization loop, a RANSAC scheme was applied e.g. in [AZ95, BPS05], and a multiple hypothesis assignment was used in conjunction with a robust estimator e.g. in [WVS05]. In [KM06], a particle filter was used to find the globally optimal pose. The system was implemented using a GPU, which enabled fast rendering of visible edges as well as efficient likelihood evaluation of each particle. Edge-based methods have also been realized with point features. In [VLF04], 3D points lying on the model surface were integrated with the pose estimation loop together with the edges.

2.2 Real-time depth camera tracking

The Kinect sensor was the first low-cost device to capture accurate depth maps at real-time frame rates. After it was released, many researchers used the sensor for real-time depth-based and RGB-D based SLAM. Many of the studies incorporate the iterative closest point (ICP) algorithm in the inter-frame pose update. In ICP based pose update, the 3D point pairing is a time consuming task, and several variants have been proposed to reduce the computational load for real-time performance. In KinectFusion [NIH+11], an efficient GPU implementation of the ICP algorithm was used for the pose update in depth-based SLAM. The ICP variant of KinectFusion utilizes projective data association and a point-to-plane error metric. With a parallelized GPU implementation, all of the depth data can be used efficiently without explicitly selecting the point correspondences for the ICP. In [TAC11], a bi-objective cost function combining the depth and photometric data was used in ICP for visual odometry. As in KinectFusion, the method uses an efficient direct approach where the cost is evaluated for every pixel without explicit feature selection. The SLAM approach presented in [BSK+13] represents the scene geometry with a signed distance function, and finds the change in camera pose parameters by minimizing the error directly between the distance function and the observed depth, leading to a faster and more accurate result compared to KinectFusion.

SLAM and visual odometry typically utilize the entire depth images in tracking, and the reference model is reconstructed from the complete scene. In object tracking, however, the reference model is separated from the background, and the goal is to track a moving target in a possibly cluttered environment, and with less (depth) information and fewer geometrical constraints. In [CC13], a particle filter is used for real-time RGB-D based object tracking. The approach uses both photometric and geometrical features in a parallelized GPU implementation, and uses point coordinates, normals and color for likelihood evaluation. ICP was used in [PLW11] for inter-frame tracking of objects that are reconstructed from the scene on-line. Furthermore, the result from ICP is refined by using the 3D edges of
problem. Assuming small incremental changes in the rotation, the linear approximation of $\mathbf{R}_0$ becomes

$$\tilde{\mathbf{R}}_0 = \begin{pmatrix} 1 & -\gamma & \beta \\ \gamma & 1 & -\alpha \\ -\beta & \alpha & 1 \end{pmatrix}, \qquad (2)$$

where $\alpha$, $\beta$ and $\gamma$ are the rotations around the x, y and z axes respectively. Denoting $\mathbf{r}_0 = (\alpha, \beta, \gamma)^T$, the error can be written as

$$\sum_i \big( (\mathbf{v}_{s,i} - \mathbf{v}_{d,i}) \cdot \mathbf{n}_{d,i} + \mathbf{r}_0 \cdot (\mathbf{v}_{s,i} \times \mathbf{n}_{d,i}) + \mathbf{t}_0 \cdot \mathbf{n}_{d,i} \big)^2. \qquad (3)$$

The minimization problem is solved by calculating the partial derivatives of Equation 3 with respect to the transformation parameters $\mathbf{r}_0$ and $\mathbf{t}_0$ and setting them to zero. The equations are collected into a linear system of the form $\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{x}$ consists of the transformation parameters, $\mathbf{b}$ is the residual and $\mathbf{A}$ is a $6 \times 6$ symmetric matrix. The system is constructed on the GPU, and solved using Cholesky decomposition on the CPU.
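As an illustration of this step, the sketch below builds the normal equations from paired source points, destination points and destination normals, and solves them with a Cholesky factorization. It mirrors the structure of the update but runs entirely on the CPU with NumPy, whereas our implementation assembles the system on the GPU:

```python
import numpy as np

def point_to_plane_update(vs, vd, nd):
    """One linearized point-to-plane update (Equations 2 and 3).
    vs, vd, nd: (N, 3) arrays of paired source points, destination
    points and destination normals. Assumes the correspondences
    constrain all six parameters, so that A is positive definite."""
    J = np.hstack([np.cross(vs, nd), nd])   # rows [v_s x n_d, n_d], shape (N, 6)
    r = -np.sum((vs - vd) * nd, axis=1)     # residuals -(v_s - v_d) . n_d
    A = J.T @ J                             # 6x6 symmetric system matrix
    b = J.T @ r
    L = np.linalg.cholesky(A)               # Cholesky decomposition A = L L^T
    x = np.linalg.solve(L.T, np.linalg.solve(L, b))
    alpha, beta, gamma = x[:3]              # incremental rotation angles r_0
    R0 = np.array([[1.0, -gamma, beta],
                   [gamma, 1.0, -alpha],
                   [-beta, alpha, 1.0]])    # linearized rotation (Equation 2)
    return R0, x[3:]                        # rotation approximation and t_0
```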
To define the point pairs between the source and the destination point clouds, KinFu utilizes projective data association. At each ICP iteration, the points of $V_s^{g,l}$ are transformed to the camera coordinate system of the previous time step, and projected to the image domain:

$$\mathbf{u} = \mathrm{proj}\big( \mathbf{K} \cdot \mathbf{R}_{k-1}^{-1} \cdot (\mathbf{v}_s - \mathbf{t}_{k-1}) \big), \qquad (4)$$

where $\mathrm{proj}(\cdot)$ is the perspective projection including the dehomogenization of the points. The set of tentative point correspondences is then defined between the points of $V_s^{g,l}$ and the points of $V_d^{g,l}$ that correspond to the image pixel coordinates $\mathbf{u}$.

The tentative point correspondences are checked for outliers by calculating the Euclidean distance and the angle between their normal vectors. If the points are too distant from each other, or the angle is too large, the point pair is excluded from the ICP update. In our experiments, we used a 50 mm threshold for the distance and a 20 degree threshold for the angle. The Kinect cannot produce range measurements from some materials, like reflective surfaces, under heavy sunlight, outside its operating range and from occluded surfaces, and such source points are ignored too. Furthermore, we ignore destination points that have an infinite depth value, i.e. the depth map pixels onto which no object points are projected when rendering the depth map.
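The association and gating can be sketched as follows (a plain CPU loop for clarity, while the actual pipeline evaluates all pixels in parallel on the GPU; vert_map and norm_map are assumed per-pixel vertex and normal maps of the destination frame, with units in meters):

```python
import numpy as np

def projective_association(vs, ns, vert_map, norm_map, K, R_prev, t_prev,
                           dist_thresh=0.050, angle_thresh_deg=20.0):
    """Pair each source point with the destination point stored at the
    pixel it projects to (Equation 4), then reject pairs that are too
    far apart or whose normals disagree."""
    H, W = vert_map.shape[:2]
    cam = (vs - t_prev) @ R_prev            # row-wise R_{k-1}^{-1} (v_s - t_{k-1})
    uvw = cam @ K.T
    u = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)  # proj(.) incl. dehomogenization
    in_image = (uvw[:, 2] > 0) & (u[:, 0] >= 0) & (u[:, 0] < W) \
               & (u[:, 1] >= 0) & (u[:, 1] < H)
    pairs = []
    for i in np.flatnonzero(in_image):
        vd = vert_map[u[i, 1], u[i, 0]]
        nd = norm_map[u[i, 1], u[i, 0]]
        if not np.isfinite(vd).all():       # no depth at this pixel
            continue
        if np.linalg.norm(vs[i] - vd) > dist_thresh:
            continue                        # points too distant: reject
        cos_a = np.clip(np.dot(ns[i], nd), -1.0, 1.0)
        if np.degrees(np.arccos(cos_a)) > angle_thresh_deg:
            continue                        # normal angle too large: reject
        pairs.append((i, vd, nd))
    return pairs
```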
The proposed tracking approach simplifies the use of 3D CAD models in visual tracking since there is no need for extracting and matching interest points or other cues or features. The only requirement is that a depth map from the desired camera view can be rendered effectively, and retrieved from the depth buffer. Complex CAD models can be rendered effectively using commonly available tools. In our experiments, we used OpenSG to manipulate and render the model.
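For instance, with a raw OpenGL context the readback and linearization could look like the hedged sketch below (our implementation uses OpenSG, so this PyOpenGL version is only an illustration; it assumes a standard perspective projection with the given near and far planes):

```python
import numpy as np
from OpenGL.GL import glReadPixels, GL_DEPTH_COMPONENT, GL_FLOAT

def read_rendered_depth(width, height, near, far):
    """Read the depth buffer after rendering the CAD model and convert
    the nonlinear [0, 1] window-space values to metric depth along the
    optical axis. Pixels where nothing was rendered stay at infinity."""
    buf = glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT)
    d = np.frombuffer(buf, dtype=np.float32).reshape(height, width)[::-1]
    z_ndc = 2.0 * d - 1.0                       # window depth -> NDC depth
    z = 2.0 * near * far / (far + near - z_ndc * (far - near))
    return np.where(d < 1.0, z, np.inf)         # background -> infinite depth
```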
4 Evaluation methods and data

We evaluated the accuracy, stability and robustness of the proposed approach by comparing the tracking results to the ground truth in three different tracking scenarios and with six datasets. We also compared the results to KinFu and to the edge-based monocular method presented in [WWS07]. Additionally, we compared the computational time required for the sensor pose update between the different tracking methods.

In this section, we describe the data collection procedure, the error metrics that we used to evaluate the results, and the datasets that we collected from the experiments. For simplicity, we refer to the proposed approach as the "model-based method", and to the 2D model-based approach as the "edge-based method".

4.1 Data collection procedure

We conducted the experiments with offline data that we captured from three test objects using the Kinect depth sensor. For each data sequence, we captured 500 depth frames at a resolution of 640 × 480 pixels and a frame rate of 10 FPS. In addition to the depth frames, we captured the RGB frames for evaluating the performance of the edge-based method. To collect the ground truth camera trajectories, we attached the sensor to a Faro measurement arm, and solved the hand-eye calibration of the system as described in [KHW14]. For KinFu, we set the reconstruction volume to the size of each target's bounding box and aligned it accordingly. The model-based method was run without lens distortion compensation and bilateral filtering, and we used L = {10, 5, 4} ICP iterations. We also experimented with other settings, and the results are discussed in Section 5.5. The test targets and the corresponding 3D CAD models are shown in Figure 3.
Figure 3: The reference CAD models used to evaluate the proposed approach. Top and bottom left: Target 1 consists of several convex objects attached to a common plane. The model is partially textured and partially plain white. Middle: Target 2 is a car's dashboard. The model differs from its real counterpart in the steering wheel, the gear stick, and the middle console. Right: Target 3 does not have geometry variation in the vertical dimension, and the ICP based approach is not fully constrained by the target.
For the evaluation runs, we initialized the trackers to the ground truth pose, and let them run as long as the estimated position and orientation remained within predefined limits. Otherwise the tracker was considered to be drifting, and its pose was reset back to the ground truth. The tracker's pose was reset if the absolute error between the estimated position and the ground truth was more than 20 cm, or if the angle difference was more than 10 degrees.

Due to lens and depth distortions as well as noise in the depth measurements, the hand-eye calibration between the Faro measurement arm and the Kinect device is inaccurate. The result depends on the calibration data, and a calibration obtained with close range measurements may give inaccurate results with long range data and vice versa. Thus, we estimated the isometric transformation between the resulting trajectories and the ground truth, and generated a corrected ground truth trajectory for each sequence individually. For the final results, we repeated the tests using the corrected ground truth trajectories as reference.

4.2 Evaluation criteria

4.2.1 Absolute accuracy

We measured the accuracy of the trackers by calculating the mean of absolute differences between the estimated sensor positions and the (corrected) ground truth over the test sequences. Similarly, we measured the error in orientation, and calculated the mean of absolute differences between the angles. We define the angle error as the angle difference between the quaternion representations of the orientations. We calculated the corresponding standard deviations for evaluating the jitter, and used the number of required tracker resets as a measure of robustness.
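The angle error defined above is straightforward to compute from the two unit quaternions; a minimal sketch, with the absolute dot product handling the fact that q and -q encode the same rotation:

```python
import numpy as np

def quaternion_angle_error(q_est, q_gt):
    """Angle (in degrees) of the relative rotation between two unit
    quaternions, used as the orientation error of a pose estimate."""
    dot = min(abs(float(np.dot(q_est, q_gt))), 1.0)  # guard against rounding
    return np.degrees(2.0 * np.arccos(dot))
```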
4.2.2 3D reprojection errors

In AR applications, it is essential that the rendered model is aligned accurately with the view, and the reprojection error is typically used to measure the accuracy of vision-based trackers. In 2D analysis, the reprojection error is calculated by summing up the squared differences between the observed and reprojected model points in the image domain after the camera pose update. We use a similar approach in 3D, and calculate the differences between the observed and rendered depth maps. We define two error metrics using the depth: error metric A and error metric B.

The error metric A is the difference between the depth map rendered using the ground truth pose and the depth map rendered using the estimated pose. This measures the absolute accuracy of the tracker. It takes
have sharp edges and corners. Thus, it constrains both the depth-based as well as the monocular edge-based tracking methods. Furthermore, the object has textured and non-textured parts. The surface material gives a good response to the Kinect, but in some experiments, the camera was moved very close to the target and part of the depth measurements were lost (the minimum distance for range measurements with the Kinect is approximately 40 cm). We captured three sequences from Target 1 as follows:

Sequence 1.1 The sequence starts such that the whole target is in the camera view. The camera is moved from side to side four times so that the optical center is directed to the center of the target. In the last part of the sequence, the camera is moved closer to the target, and the range measurements are partially lost.

Sequence 1.2 The sequence starts on the right side of the target so that approximately half of the target is visible. The camera is moved closer to the target and the range measurements are partially lost. Finally, the camera is moved from side to side twice.

Sequence 1.3 The sequence starts from the left side of the target so that approximately half of the target is visible. The camera is moved closer to the target and is rotated from side to side (yaw angle). Finally, the camera is moved back and forth. During the sequence, the camera is moved close to the target, and the range measurements are partially lost.

4.3.2 Target 2

Target 2 is a car dashboard of regular size and material. Compared to the reference CAD model, the target does not have the steering wheel, and the gear stick and the middle console are different. Similarly to Target 1, Target 2 has variance in shape in every dimension as well as relatively sharp edges. We captured two sequences from Target 2 as follows:

Sequence 2.1 The sequence starts such that the dashboard is completely in the camera view. The camera is moved closer to the left side, and then around the gear stick to the right side of the target. During the sequence, there is no notable change in the roll or pitch angles of the camera orientation.

Sequence 2.2 The sequence starts such that the camera is pointing to the right side of the target and is relatively close in distance. The camera is moved around the gear stick so that the target fills the camera view almost completely. Then, the camera is moved back to the right side and pulled back so that the whole target becomes visible in the camera. During the sequence, there is no notable change in the roll or pitch angles of the camera orientation.

4.3.3 Target 3

Target 3 is a plastic object with a matte, light red surface. The shape of the object is smooth and curved, and it has no vertical changes in geometry. Thus, the ICP is not constrained in every dimension. The target is also challenging for the 2D edge-based tracker, since the object's outer contour is the only edge to be used in the registration process. We captured the following sequence from Target 3:

Sequence 3.1 The sequence starts from the right side such that the target is completely in the camera view and the camera is directed towards the center of the target. The camera is moved to the left side so that the target is kept completely in the camera view, and the distance to the target remains constant. During the sequence, there is no notable change in the roll or pitch angles of the camera orientation.

5 Results

5.1 Sequence 1.1

All trackers perform robustly in Sequence 1.1. Figure 5 shows the absolute errors of the trajectories (positions) given by the different methods. Neither the model-based nor the KinFu tracker is reset during the test, and the monocular edge-based tracker is reset twice. The absolute translation error of the model-based tracker remains mostly under 20 mm. Compared to the model-based method, the edge-based tracker is on average more accurate but suffers more from jitter and occasional drifting. The translation error of KinFu is small in the beginning but increases as the tracker proceeds, and reaches a maximum of approximately 40 mm near frame 250. The mean error of the model-based tracker is 14.4 mm and the standard deviation 5.9 mm (Table 2). The corresponding values for KinFu and the
Figure 5: Absolute error of the estimated camera position using the different tracking methods. Red curves refer to the model-based tracker, green to KinFu and blue to the edge-based method. Vertical lines denote tracker resets. The y-axis indicates the error value at each frame in mm, and the x-axis is the frame number.

Figure 6: The distribution of the errors computed using the error metric A. The coarse outliers (absolute value more than 50 mm) are ignored. The histograms are normalized so that their maximum values are set to one, and the other values are scaled respectively.

Figure 7: The distribution of the errors computed using the error metric B. The coarse outliers (absolute value more than 50 mm) are ignored. The histograms are normalized so that their maximum values are set to one, and the other values are scaled respectively.
edge-based trackers are 20.2 mm (10.7 mm) and 18.7 mm (26.4 mm) respectively. The angle errors behave similarly to the translation errors, and the rest of the results are shown in Table 3.

The distributions of the reprojection errors computed using the error metric A are shown in Figure 6. The error distribution of each tracker is symmetric. The model-based and the edge-based methods slightly overestimate the distance to the target, and the result of KinFu is the opposite: on average it underestimates the distance. The model-based approach has the narrowest and KinFu the broadest distribution of errors. Table 4 shows the ratio of coarse outliers (absolute differences over 50 mm) in the difference images. The ratios of outliers for the model-based tracker and KinFu are similar (4.6 % and 4.2 % respectively), and for the edge-based method the ratio is 7.3 %.

To evaluate how accurately virtual data could be registered with raw depth video, we calculated the reprojection errors for the model-based method and KinFu using the error metric B. The error histograms in Figure 7 show that the errors of the model-based tracker are symmetrically distributed around zero. The ratio of coarse outliers is 1.1 % (Table 5). The error distribution of the KinFu tracker is centered around +6 mm, and its shape is skewed towards positive values. The ratio of outliers is 5.3 %.
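The sketch below shows how such difference images, coarse-outlier ratios and max-normalized histograms (as plotted in Figures 6 and 7) can be computed; it assumes two depth maps in millimetres with invalid pixels marked as NaN. For error metric A, both maps are rendered (ground truth vs. estimated pose); for error metric B, an observed depth frame is compared against a rendered one:

```python
import numpy as np

def depth_error_stats(depth_a, depth_b, outlier_mm=50.0, bins=101):
    """Per-pixel depth difference, ratio of coarse outliers
    (|difference| > 50 mm among valid pixels), and a histogram
    scaled so that its maximum value is one."""
    diff = depth_a - depth_b
    valid = np.isfinite(diff)                  # pixels with depth in both maps
    inlier = valid & (np.abs(diff) <= outlier_mm)
    outlier_ratio = 1.0 - inlier.sum() / valid.sum()
    hist, edges = np.histogram(diff[inlier], bins=bins,
                               range=(-outlier_mm, outlier_mm))
    return diff, outlier_ratio, hist / hist.max(), edges
```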
5.2 Sequences 1.2 and 1.3

Compared to Sequence 1.1, the model-based tracker performs more accurately in Sequences 1.2 and 1.3.
Table 2: Mean absolute errors and standard deviations of estimated sensor position (in mm).

Table 3: Mean absolute errors and standard deviations of estimated sensor orientation (in degrees).

Table 4: The ratio of outliers in difference images calculated using the error metric A.

Table 5: The ratio of outliers in difference images calculated using the error metric B.
In Sequence 1.2, the mean absolute error of the position is 5.3 mm and the standard deviation is 3.8 mm. In Sequence 1.3, the corresponding values are 9.0 mm and 5.0 mm respectively. The tracker is reset three times during Sequence 1.3 and can track Sequence 1.2 completely without resets. In Sequences 1.2 and 1.3, the camera is moved closer to the target and the depth data is partially lost.

Presumably KinFu suffers from the incomplete depth data: the mean absolute error and the standard deviation in Sequence 1.2 are more than doubled compared to Sequence 1.1, and almost tripled in Sequence 1.3. The numbers of resets of KinFu are six and three in Sequences 1.2 and 1.3 respectively. In Sequence 1.2, the resets occur close to frame 400, where the camera is close to the target and approximately half of the depth pixels are lost. The accuracy of the edge-based method decreases slightly too. It is reset seven times during Sequence 1.2 and eleven times in Sequence 1.3. In Sequence 1.3, between frames 150 and 200, all of the trackers are reset multiple times. During that time interval, the camera is moved close to the target and approximately half of the depth pixels are lost. Additionally, the camera is rotated relatively fast around its yaw axis. Tables 2 and 3 show the rest of the results.

The distributions of the reprojection errors in Figures 6 and 7 are similar to Sequence 1.1. Also, the ratios of outliers in Tables 4 and 5 are consistent with the tracking errors. Figure 8 shows example images of the evaluation process in Sequence 1.2. As shown in the images, the depth data is incomplete and partially missing since the sensor is closer to the target than its minimum sensing range. Both model-based approaches are able to maintain the tracks accurately, but the drift of KinFu is clearly visible.

5.3 Sequences 2.1 and 2.2

The CAD model of Target 2 differs from its real counterpart, and there are coarse outliers in the depth data of Sequences 2.1 and 2.2. The translation errors in Figure 5 show that both the model-based tracker and KinFu perform robustly, and the trackers are not reset during the tests. The edge-based method suffers from drift and is reset five times in both experiments. Tables 2 and 3 as well as Figure 5 show that the accuracy of the model-based method is comparable to the first three experiments, and that the approach is the most accurate of the methods.
Figure 8: Tracker performance evaluation examples in different scenarios. Top row images are from frame 150 of Sequence 1.2 and bottom row images are from frame 250 of Sequence 2.1. Top row images 1-2 (from the left): results of the model-based method calculated with the 3D error metrics A and B. Top row images 3-4: corresponding results for KinFu. Top row image 5: the result of the edge-based method calculated with the 3D error metric A. Bottom row images are ordered similarly to the top row. The colorbar units are in mm.
The error histograms based on the error metric A are shown in Figure 6. The results of the model-based tracker are similar to the first three experiments, and the errors are distributed symmetrically with close to zero mean. The error distributions of KinFu and the edge-based method are more widespread, and the drift of the edge-based method is especially visible. For the model-based tracker and KinFu, the ratios of outliers in the reprojection errors are similar to Target 1, and for the edge-based method the ratio clearly increases. The error histograms based on the error metric B show that the model-based tracker performs consistently, and that the reprojected model was aligned to the captured depth frames without bias. The KinFu tracker has a more widespread error distribution. Table 5 shows that there are more coarse outliers in the results of KinFu as well. Note that due to differences between the reference CAD model and its real counterpart, the number of outliers is relatively high for both methods.

The images in Figure 8 show tracking examples from Sequence 2.1. The difference images computed using the error metric B show that the model-based tracker aligns the observed depth maps accurately with the rendered model, and the real differences are clearly distinguishable from the images. With KinFu, the real differences and positioning errors are mixed. The error metric A shows that the model-based approach is close to the ground truth and major errors are present only around the edges of the target.

5.4 Sequence 3.1

Target 3 does not constrain the ICP in the vertical dimension, and the model-based tracker fails to track the camera. Figure 5 shows that the model-based tracker drifts immediately after the initial reset, and that there are only a few sections in the experiment where the tracker is stable (but still off from the ground truth trajectory). Since the model-based tracker was drifting, we did not compensate the bias in the hand-eye calibration for any of the methods (see Section 4.1). The edge-based tracker performs better, and it is able to track the camera for most of the frames, although it was reset seven times during the test. KinFu performs equally well compared to the previous experiments, and it is able to track the camera over the whole sequence without significant drift. The result is unexpected since KinFu's camera pose estimation is based on the ICP. We assume that noisy measurements are accumulated to the 3D reconstruction, and these inaccuracies in the model are constraining the ICP in the vertical dimension.

5.5 Factors affecting the accuracy

In AR applications, it is essential that the tracking system performs without lag and as close to real-time frame rates as possible. When a more computationally intensive method is used for the tracking, a lower frame rate is achieved and a wider baseline between successive frames needs to be matched in the pose update.
date. We evaluated the effect of lens distortions, raw 6 Discussion and conclusion
data filtering and the number of ICP iterations sepa-
rately to the accuracy in Sequences 1.1 and 2.1. Each We proposed a method for real-time CAD model-
of them increases the computational time and are op- based depth camera tracking that uses ICP for pose
tional. Table 6 shows the results. Compared to the update. We evaluated the method with three real life
results shown in Table 2 (lens distortion compensa- reference targets and with six datasets, and compared
tion off, bilateral filtering off, number of ICP itera- the results to depth-based SLAM, to a 2D edge-based
tions set to L = {10, 5, 4}), it can be seen that the method and to the ground truth.
bilateral filtering step does not improve the accuracy, The results show that the method is more robust
and can be ignored for the model-based tracking ap- compared to the 2D edge-based method and suffers
proach. Lens distortion compensation improved the less from jitter. Compared to depth-based SLAM, the
accuracy slightly in Sequence 1.1, but improves the method is more accurate and has less drift. Despite
accuracy by approximately 26 % in Sequence 2.1. Re- incomplete range measurements, noise, and inaccura-
ducing the number of iterations in ICP does not have cies in the Kinect depth measurements, the 3D repro-
notable change in Sequence 1.1 and decreases the ac- jection errors are distributed evenly and are close to
curacy by 7 % in Sequence 2.1. With the laptop PC, zero mean. For applications that require minimal lag
the lens distortion compensation (computed in CPU) and fast frame rates, it seems sufficient to run the IPC
takes approximately 7 ms and the tracker with ICP iterations only once for each pyramid level. This does
iterations L = {1, 1, 1} 50 ms versus 160 ms with not affect to the accuracy or jitter, but speeds up the
L = {10, 5, 4}. Bilateral filtering (computed in GPU) processing time significantly. In our experiments, fil-
does not add notable computational load. tering the raw depth frames did not improve the track-
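The iteration schedule itself is a simple loop over the depth pyramid. A minimal sketch is given below; run_icp_level is a hypothetical stand-in for one projective-association point-to-plane update, and mapping the largest count in L to the finest pyramid level (as in KinFu) is our assumption:

```python
def track_frame(pose, src_pyramid, dst_pyramid, iters=(10, 5, 4)):
    """Coarse-to-fine ICP over a three-level depth pyramid.
    iters[l] is the iteration count for pyramid level l, with level 0
    the full resolution; the loop visits the coarsest level first.
    With iters=(1, 1, 1) the pose update took about 50 ms instead of
    160 ms in the timing experiments above."""
    for level in range(len(iters) - 1, -1, -1):   # coarsest level first
        for _ in range(iters[level]):
            pose = run_icp_level(pose, src_pyramid[level], dst_pyramid[level])
    return pose
```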
In addition to noise and lens distortions, the Kinect suffers from depth distortions that depend on the measured range and that are unevenly distributed in the image domain [HKH12]. We calculated the mean positive and negative residual images over Sequence 1.3 using the error metric B and the model-based tracker. We thresholded the images to ±10 mm to emphasize the sensor depth measurement errors and to deduct the pose estimation errors. Figure 9 shows the error images, which are similar to the observations in [HKH12].

6 Discussion and conclusion

We proposed a method for real-time CAD model-based depth camera tracking that uses ICP for the pose update. We evaluated the method with three real-life reference targets and with six datasets, and compared the results to depth-based SLAM, to a 2D edge-based method and to the ground truth.

The results show that the method is more robust than the 2D edge-based method and suffers less from jitter. Compared to depth-based SLAM, the method is more accurate and has less drift. Despite incomplete range measurements, noise, and inaccuracies in the Kinect depth measurements, the 3D reprojection errors are distributed evenly and are close to zero mean. For applications that require minimal lag and fast frame rates, it seems sufficient to run the ICP iterations only once for each pyramid level. This does not affect the accuracy or jitter, but speeds up the processing time significantly. In our experiments, filtering the raw depth frames did not improve the tracking accuracy, but for applications that require very precise tracking, the lens distortions should be compensated. Additionally, the Kinect sensor suffers from depth measurement errors. The distribution of the errors in the image domain is complex, and a depth camera model that compensates the errors pixel-wise (e.g. [HKH12]) should be considered.

The ICP may not converge to the global optimum if the target object does not have enough geometrical constraints (the problem has been discussed e.g. in
[GIRL03]). This leads to wrong pose estimates and drift, and limits the use of the method to objects that have variance in shape in all three dimensions. However, in our experiments, KinFu was more stable with such an object and did not drift during the tests. The exact reason for this behavior is unclear to us, but we assume that the inaccuracies and noise in the range measurements are accumulated to the reference model, constraining the tracker.

We excluded the tracker initialization from this paper. In practical applications, automated initialization is required, and to initialize the camera pose one may apply methods developed for RGB-D based 3D object detection (e.g. [HLI+13]) or methods that rely on depth information only (e.g. [SX14]). As the ICP aligns the model and the raw depth frames in a common coordinate system, the model-based method (as well as the edge-based method) is forgiving to inaccurate initialization. The maximum acceptable pose error in the initialization stage depends on the reference model geometry. Detailed surfaces with a lot of repetitive geometry may guide the ICP to a local minimum, but smooth and dominant structures allow the tracker to slide towards the correct pose.

Although we did not evaluate the requirements for the size of the reference model's appearance in the camera view, some limitations can be considered. The projections of small or distant objects occupy a relatively small proportion of the depth frames, and the relative noise level of the depth measurements increases. Thus, the geometrical constraints may become insufficient for successful camera pose estimation. Additionally, if the camera is moved fast or rotated quickly between consecutive frames, the initial camera pose from the previous time step may differ significantly from the current pose. Thus, small or distant objects may be treated completely as outliers, and the pose update would fail. The exact requirements for the reference model's visual extent in the camera view depend on the size of the objects and how the camera is moved. Similar methods as suggested for automatic initialization could be used in a background process to reinitialize the pose whenever it has been lost.

With the proposed approach, virtually any CAD model can be used for depth camera tracking. It is required that the model can be efficiently rendered from the desired camera pose and that the corresponding depth map can be retrieved from the depth buffer. Models that do not have variance in shape in every dimension do not completely constrain the ICP, which may lead to drift. We envision that the method could be improved by making partial 3D shape reconstructions online, and appending the results to the CAD model for more constraining geometry. Another suggestion for improvement is to complement the method with an edge-based approach to prevent the tracker from drifting. For example, a 3D cube fully constrains the ICP as long as three faces are seen by the camera. But if the camera is moved so that only one face is visible, only the distance to the model is constrained. However, the edge information would still constrain the camera pose.

7 Acknowledgments

The authors would like to thank Professor Tapio Takala from Aalto University, Finland for valuable comments, and Alain Boyer from VTT Technical Research Centre of Finland for language revision.

Citation

Otto Korkalo and Svenja Kahn, Real-time depth camera tracking with CAD models and ICP, Journal of Virtual Reality and Broadcasting, 13(2016), no. 1, August 2016, urn:nbn:de:0009-6-44132, DOI 10.20385/1860-2037/13.2016.1, ISSN 1860-2037.

References

[AZ95] Martin Armstrong and Andrew Zisserman, Robust object tracking, Asian Conference on Computer Vision, vol. I, 1995, pp. 58–61, ISBN 9810071884.

[Azu97] Ronald T. Azuma, A survey of augmented reality, Presence: Teleoperators and Virtual Environments 6 (1997), no. 4, 355–385, ISSN 1054-7460, DOI 10.1162/pres.1997.6.4.355.

[BBS07] Gabriele Bleser, Mario Becker, and Didier Stricker, Real-time vision-based tracking and reconstruction, Journal of Real-Time Image Processing 2 (2007), no. 2, 161–175, ISSN 1861-8200, DOI 10.1007/s11554-007-0034-0.
[BPS05] Gabriele Bleser, Yulian Pastarmov, and Didier Stricker, Real-time 3D camera tracking for industrial augmented reality applications, WSCG '2005: Full Papers: The 13-th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2005 in co-operation with Eurographics: University of West Bohemia, Plzen, Czech Republic (Václav Skala, ed.), 2005, HDL 11025/10951, pp. 47–54, ISBN 80-903100-7-9.

[BSK+13] Erik Bylow, Jürgen Sturm, Christian Kerl, Fredrik Kahl, and Daniel Cremers, Real-time camera tracking and 3D reconstruction using signed distance functions, Robotics: Science and Systems (RSS) Conference 2013, vol. 9, 2013, ISBN 978-981-07-3937-9.

[CC13] Changhyun Choi and Henrik I. Christensen, RGB-D object tracking: a particle filter approach on GPU, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2013, DOI 10.1109/IROS.2013.6696485, pp. 1084–1091.

[DC02] Tom Drummond and Roberto Cipolla, Real-time visual tracking of complex structures, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002), no. 7, 932–946, ISSN 0162-8828, DOI 10.1109/TPAMI.2002.1017620.

[DRMS07] Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse, MonoSLAM: Real-time single camera SLAM, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007), no. 6, 1052–1067, ISSN 0162-8828, DOI 10.1109/TPAMI.2007.1049.

[GIRL03] Natasha Gelfand, Leslie Ikemoto, Szymon Rusinkiewicz, and Marc Levoy, Geometrically stable sampling for the ICP algorithm, Fourth International Conference on 3-D Digital Imaging and Modeling (3DIM), 2003, DOI 10.1109/IM.2003.1240258, pp. 260–267, ISBN 0-7695-1991-1.

[GRV+13] Higinio Gonzalez-Jorge, Belén Riveiro, Esteban Vazquez-Fernandez, Joaquín Martínez-Sánchez, and Pedro Arias, Metrological evaluation of Microsoft Kinect and Asus Xtion sensors, Measurement 46 (2013), no. 6, 1800–1806, ISSN 0263-2241, DOI 10.1016/j.measurement.2013.01.011.

[Har93] Chris Harris, Tracking with rigid models, Active Vision (Andrew Blake and Alan Yuille, eds.), MIT Press, Cambridge, MA, 1993, pp. 59–73, ISBN 0-262-02351-2.

[HF11] Steven Henderson and Steven Feiner, Exploring the benefits of augmented reality documentation for maintenance and repair, IEEE Transactions on Visualization and Computer Graphics 17 (2011), no. 10, 1355–1368, ISSN 1077-2626, DOI 10.1109/TVCG.2010.245.

[HKH12] Daniel Herrera C., Juho Kannala, and Janne Heikkilä, Joint depth and color camera calibration with distortion correction, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012), no. 10, 2058–2064, ISSN 0162-8828, DOI 10.1109/TPAMI.2012.125.

[HLCH12] Miles Hansard, Seungkyu Lee, Ouk Choi, and Radu Horaud, Time of Flight Cameras: Principles, Methods, and Applications, SpringerBriefs in Computer Science, Springer, London, 2012, ISBN 978-1-4471-4658-2, DOI 10.1007/978-1-4471-4658-2.

[HLI+13] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab, Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes, Computer Vision – ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers (Berlin) (Kyoung Mu Lee, Yasuyuki Matsushita, James M. Rehg, and Zhanyi Hu, eds.), Lecture Notes in Computer Science, vol. 7724, vol. 1, Springer, 2013, DOI