User-Perspective Augmented Reality Magic Lens From Gradients
Figure 1: An example Augmented Reality application showcasing the difference between user-perspective and device-perspective magic lens
interfaces. (a) Real world environment only. (b) Augmented Reality scene with the conventional device-perspective magic lens. (c) AR scene
rendered with our user-perspective magic lens prototype.
…to work outdoors in strong sunlight. Another approach to scene reconstruction is stereo vision, where the depth of the scene is reconstructed by matching two views of a stereo camera pair. The advantage of stereo reconstruction is that it can work with standard cameras, it does not need active illumination, and there are no major restrictions with regard to outdoor scenes.

Some stereo reconstruction algorithms can provide quite accurate depth maps, but this comes at a performance penalty. Fully accurate depth maps cannot yet be achieved at frame rate. Real-time stereo can produce depth maps that are sufficient for many applications, but they are not very good for the purpose of re-rendering a real world scene. Typically, real-time stereo approaches achieve speed by using a small depth range (limiting the number of different depth values), resulting in a scene model composed of distinct front-facing planes. Re-rendering this model from a new point of view can result in a scene composed of obvious distinct layers.

In this paper we present a new approach to solving the problem of creating a user-perspective magic lens. We observe that accurate dense scene reconstruction is a requirement imposed by the traditional rendering methods, and not an inherent requirement of creating a user-perspective view. By taking a different approach to rendering, we lower the requirements for reconstruction while still achieving good results. We do this by using image-based rendering (IBR) [Shum and Kang 2000]. IBR can produce high quality results with only limited scene models by leveraging existing imagery of the scene. This fits very well with the nature of our problem.

The key to our approach is the adoption of a recent gradient domain IBR algorithm [Kopf et al. 2013], which we pair with a novel semi-dense stereo matching algorithm we developed. The IBR algorithm we use renders from the gradients in the image instead of the pixel color values. It achieves good results as long as the depth estimates of the strongest gradients are good, even if the depths of the weak gradients are incorrect. This fits well with the general behavior of stereo reconstruction, but we exploit it further by using a semi-dense stereo algorithm to compute depths only at the strongest gradients.

With this approach we have created a geometrically-correct user-perspective magic lens with better performance and visual quality than previous systems. Furthermore, we use only passive sensing, and support fully dynamic scenes with no prior modeling. Due to the use of face tracking, we do not require instrumenting the user. Although our prototype system is tethered to a workstation and powered by a GPU, we are confident that given the rate of advancement of mobile hardware this will be possible on a self-contained mobile platform in just a few years.

2 Related Work

The "magic lens" metaphor was first introduced by Bier et al. at Xerox PARC [Bier et al. 1993] as a user interface paradigm developed for traditional desktop GUI environments. The basic idea is that of a movable window that alters the display of the on-screen objects underneath it. This window acts like an information filter that can reveal hidden objects, alter the visualization of data, or otherwise modify the view within the region that the window covers.

This concept of an information filtering widget was quickly adopted outside traditional desktops. Viega et al. developed 3D versions of the magic lens interface, both as flat windows and as volumetric regions [Viega et al. 1996]. The Virtual Tricorder [Wloka and Greenfield 1995] was an interaction device for an immersive VR environment that featured a mode in which a hand-held tool revealed altered views of the 3D world. [Rekimoto and Nagao 1995] introduced hand-held Augmented Reality with the NaviCam system. The NaviCam was a video-see-through AR system consisting of a palmtop TV with a mounted camera, tethered to a workstation. The video from the camera is captured, augmented, and then displayed on the TV. This hand-held video-see-through approach soon became the norm for Augmented Reality interfaces [Zhou et al. 2008]. Optical see-through AR approaches (e.g. [Bimber et al. 2001; Olwal and Höllerer 2005; Waligora 2008]) can implement perspectively correct AR magic lenses without the need for scene reconstruction, but have to cope with convergence mismatches of augmentations and real objects behind the display unless they use stereoscopic displays.

There have been efforts in the AR community to design and develop video see-through head-worn displays that maintain a seamless parallax-free view of the augmented world [State et al. 2005; Canon 2014]. This problem is slightly simpler than correct perspective representation of the augmented world on hand-held magic lenses, since the relationship between the imaging device and the user's eyes is relatively fixed.

With the proliferation of smartphones and tablets, AR has reached the mainstream consumer market; this has made hand-held video-see-through the most common type of AR, and it is what is often assumed by the term "magic lens" when used in the context of AR [Mohring et al. 2004; Olsson and Salo 2011]. Since the display of the augmented environment from the perspective of the device's camera introduces a potentially unwanted shift of perspective, there is renewed interest in solutions for seamless user-perspective representation of the augmented world on such self-contained mobile AR platforms. User studies conducted using simulated [Baričević et al. 2012] or spatially constrained [Čopič Pucihar et al. 2013; Čopič Pucihar et al. 2014] systems have shown that user-perspective views have benefits over device-perspective views. Several systems have attempted to create a user-perspective view by warping the video of a video-see-through magic lens [Hill et al. 2011; Matsuda et al. 2013; Tomioka et al. 2013]; however, these approaches can only approximate the true user-perspective view, as they are unable to change the point of view and therefore do not achieve the geometrically correct view frustum.

The most directly relevant work to this paper is the geometrically-correct user-perspective hand-held augmented reality magic lens system in [Baričević et al. 2012]. That prototype system was built using a Kinect depth sensor and a Wiimote. The Wiimote is used to track goggles worn by the user in order to obtain the head position. The approach relies on the fairly high quality depth information provided by the Kinect to obtain an accurate 3D model of the real world; the final scene is then rendered using conventional rendering methods (raycasting and scanline rendering). While the approach is fairly straightforward, it has certain constraints. Firstly, the system does not gracefully handle dynamic scenes, as the scene is rendered in two layers with different real-time characteristics. One layer is rendered from the live Kinect stream and updates immediately; the other is rendered from a volumetric scene model that updates more slowly. Secondly, active depth sensors like the Kinect cannot operate well under strong sunlight (or any other strong light source that emits at their frequency).

Stereo reconstruction is one of the most well researched areas of computer vision. A full overview is well beyond the scope of this paper; for an excellent review of the field we refer the reader to [Scharstein and Szeliski 2002]. In recent years, a number of algorithms have been proposed that take advantage of GPU hardware to achieve real-time performance [Wang et al. 2006; Yu et al. 2010; Zhang et al. 2011; Kowalczuk et al. 2013]. While these algorithms can produce fairly accurate dense disparity maps, the real-time speeds are achieved for relatively low resolutions and narrow disparity ranges. Our stereo algorithm is inspired by PatchMatch [Barnes et al. 2009], an iterative probabilistic algorithm for finding dense image correspondences.
PatchMatch is a general algorithm and has been applied to the field of stereo matching before. [Bleyer et al. 2011] proposed a stereo matching algorithm based on PatchMatch primarily designed to support matching slanted surfaces, although it also supports front-facing planes. In [Pradeep et al. 2013] this was adapted for real-time 3D shape reconstruction by using a faster matching cost and relying on a volumetric fusion process to compensate for the noisy per-frame depth maps.

Image-based rendering techniques create novel views of a scene from existing images [Shum and Kang 2000]. These novel views can be rendered either purely from input image data [Levoy and Hanrahan 1996], or by using some form of geometry [Shade et al. 1998; Debevec et al. 1996]. Our approach is based on the gradient-domain image-based rendering work by Kopf et al. [2013]. Their method creates novel views by computing dense depth maps for the input images, reprojecting the gradients of the images to the novel view position, and finally using Poisson integration [Pérez et al. 2003] to generate the novel view.
3 Overview

As mentioned above, our approach is based on the gradient domain image-based rendering algorithm by Kopf et al. [2013]. For a detailed description of the algorithm we refer the reader to the original paper; here we will only give a brief high level overview in order to introduce the idea. We also give a more detailed explanation of how we adapted the method for our system in Section 5 below.

The main idea behind gradient domain methods is that an image can be reconstructed from its gradients by performing an integration. Therefore, if one needed to generate an image corresponding to a new viewpoint of a scene (as in a user-perspective magic lens), one could do so by integrating the gradient images for those viewpoints. These gradient images can be obtained by reprojecting the gradients computed for an existing view of a scene for which there is scene geometry information. Since strong gradients are generally sparse in a scene, and since stereo matching algorithms work best at strong gradients, this approach provides a way to create a high quality image even without a fully dense and accurate depth map, as long as the strongest gradients are correctly reprojected. While there will be errors in the reprojected gradient image, they will be mostly confined to weak gradients that do not have a large effect on the integration of the final solution. In contrast, a standard reprojection method would result in a noisy solution with much more noticeable artifacts.
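The integration referred to here is, in the standard gradient-domain formulation [Pérez et al. 2003; Kopf et al. 2013], a least-squares fit to the reprojected gradient field. The notation below is ours, not the paper's:

```latex
% Gradient-domain reconstruction (our notation): find the novel-view
% image I whose gradient field best matches the reprojected gradients G.
\min_{I} \; \lVert \nabla I - G \rVert^{2}
% The minimizer satisfies the Poisson equation, which is solved
% iteratively (the prototype uses a conjugate gradient solver, cf. Table 2),
% starting from an approximate "data term" solution described in Section 5:
\nabla^{2} I = \operatorname{div} G
```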
Using a rendering method that only requires good depth information at the gradients gives us the opportunity to optimize our stereo reconstruction. Instead of the standard approach of computing a dense depth map across the input image pair, we can compute semi-dense depth maps that only have information at the parts of the image that have strong gradients. The depth of the rest of the image can then be approximated by filling in depth values extrapolated from the computed parts of the depth map. As long as the depth information for the strongest gradients is correct, the final rendered solution for the novel view will not have significant artifacts.

In order to achieve this goal we have developed a novel semi-dense stereo matching algorithm inspired by PatchMatch [Barnes et al. 2009]. The algorithm is simple and fast, but it computes accurate results over the areas of interest. A detailed description of the algorithm is given in Section 4 below.

Figure 2: The steps to rendering a novel view: (a) input image, (b) gradient magnitudes of input, (c) mask of strongest gradients, (d) disparity map for masked area, (e) filled-in disparity map, (f) final solution. (Note: (a)–(e) are for the left camera, (f) is for the final pose.)
3.1 Creating a novel view
The basic steps to generating a novel view with our approach are shown in Figure 2. The input to the pipeline is a stereo pair (Figure 2a shows the left image) and a desired position for the novel view.

The first step (Figure 2b) is to filter the input image pair in order to produce a mask that marks the pixels that are at the strong gradients. We define the gradients as the forward difference between neighbors. The overall strength of the gradient is computed by taking the maximum between the horizontal and vertical strengths, which are defined as the average of the per-channel absolute differences.

We then apply a threshold to this gradient strength image to create a gradient mask. We use a global threshold for the entire image. The threshold can be either a set fixed value or the current average gradient magnitude. In practice, we find a fixed threshold between 5 and 10 to work well. We first clean the mask by removing pixels that have no neighbors above the threshold and then perform a dilation step (Figure 2c).

Next, our stereo matching algorithm is run over the masked pixels. This results in a semi-dense disparity map (Figure 2d) with good depth estimates for the masked areas with strong gradients, and no data for the rest of the image. We then perform a simple extrapolation method to fill in the disparity map across the image (Figure 2e). Then the 3D position of each pixel is computed from the disparity map. The renderer takes the 3D position information, as well as the desired novel view's camera parameters (position, view frustum, etc.) and generates the final image (Figure 2f).
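As a concrete illustration of the masking step, the following CUDA sketch computes the gradient strength from forward differences and thresholds it. Kernel and buffer names are ours, and the mask cleanup and dilation passes are omitted:

```cuda
// Sketch of the gradient-mask step (illustrative names; assumes an RGB
// image stored as uchar3 and a zero-initialized mask buffer).
// Strength = max(horizontal, vertical), each being the average of the
// per-channel absolute forward differences; threshold is global (e.g. 5-10).
__global__ void gradientMask(const uchar3* img, unsigned char* mask,
                             int width, int height, float threshold)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width - 1 || y >= height - 1) return;  // forward differences

    uchar3 c = img[y * width + x];
    uchar3 r = img[y * width + (x + 1)];    // horizontal neighbor
    uchar3 d = img[(y + 1) * width + x];    // vertical neighbor

    float gh = (abs(r.x - c.x) + abs(r.y - c.y) + abs(r.z - c.z)) / 3.0f;
    float gv = (abs(d.x - c.x) + abs(d.y - c.y) + abs(d.z - c.z)) / 3.0f;
    float strength = fmaxf(gh, gv);

    mask[y * width + x] = (strength > threshold) ? 1 : 0;
}
```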
4 Stereo Reconstruction

One of the most important considerations in the development of our algorithm was the need to run as fast as possible. This led to a parallel GPU-based approach, which in turn set additional constraints. One of the principal tenets of GPU computing (or SIMD computing in general) is to avoid code path divergence. That is, each thread in a concurrently running group of threads should execute the same steps in the same order at the same time, just using different data. This demand led to several design decisions regarding our algorithm.
4.1 Mask indexing

The mask computed from the gradient magnitudes determines the pixels for which the stereo algorithm will compute disparities. However, since the algorithm is implemented on the GPU using CUDA, using this mask directly would be inefficient. A naïve approach would be to run a thread per pixel and simply exit the thread if the pixel is not in the mask. However, this is very inefficient, as these threads will not truly exit. The SIMD nature of the GPU hardware requires all the threads that are concurrently running on a core to follow the same code path. If even one thread in that group is in the mask and needs to run the algorithm, then all the threads in the group might as well run, since they would introduce (almost) no overhead. In order to get any performance gain, all the pixels in the image region covered by the group would have to be outside the mask. This is rare in natural images, as there are almost always a few strong gradient pixels in any part of the image. This means that the naïve approach to the gradient-guided semi-dense stereo algorithm degenerates to a dense algorithm.

In order to prevent this waste of computational power we re-group the gradient pixels so that they cluster together. We process the mask image to create an array of pixel indices. Each row of the mask is traversed in parallel, and when a pixel that is inside the mask is encountered, its index is saved in the output array at the same row and in the next available column. Pixels outside the mask are simply ignored. As a result the indices of the masked pixels are densely stored in the output array. The count of masked pixels in a row is saved in the first column of the output array. This process creates a mask whose blocks are mostly completely full or completely empty, with only a few that are partially full. This mask is much more suitable for parallel processing on GPU architectures. A sketch of this compaction is shown below.
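A minimal CUDA sketch of the per-row compaction, assuming one thread per row; all names are ours and the actual implementation may differ:

```cuda
// Per-row compaction of the gradient mask into dense index arrays.
// indexMask is height x (width+1) ints: column 0 holds the count of
// masked pixels in the row, columns 1..count hold their x coordinates.
__global__ void compactMaskRows(const unsigned char* mask,
                                int* indexMask, int width, int height)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (y >= height) return;

    int* row = indexMask + y * (width + 1);
    int count = 0;
    for (int x = 0; x < width; ++x) {
        if (mask[y * width + x]) {
            row[1 + count] = x;    // save index in the next available column
            ++count;
        }
    }
    row[0] = count;                // count stored in the first column
}
```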
4.2 Stereo matching

Now that we have a mask of the strong gradients in the image, we can run stereo matching on them. We implemented a simple, fast, and accurate stereo matching algorithm inspired by PatchMatch. Our algorithm takes the basic ideas of random search and propagation from PatchMatch and applies them to the domain of semi-dense stereo matching at the gradients, and in parallel. Although inspired by PatchMatch, the specific details are somewhat different due to the nature of the problem.

The algorithm consists of two main steps: Random Search and Propagation. The full algorithm is run for a number of iterations, and in each iteration each step is iterated as well. Each iteration of each step is fully parallel at the individual pixel level. Only the steps themselves and their iterations are serialized.

Data and Initialization The algorithm takes as its input the stereo image pair and the arrays with the mask indices. It outputs the disparity values and matching costs for each camera of the stereo pair. Before the algorithm is run, the disparities are initialized to zero, while the costs are initialized to the maximum possible value. In our implementation we use unsigned 8-bit values to store the disparities, giving a disparity range of [0, 255]. The costs are stored as unsigned 16-bit values, giving a range of [0, 2^16 − 1]. The upper limit is above the maximum possible value that can be returned as a matching cost, so initializing the cost to 0xffff simplifies the search for the minimum cost disparity, since there is no need to treat the first candidate disparity differently from the rest.
Random Search The random search step consists of generating a random disparity value, computing the matching cost given that disparity, and keeping it if the cost is lower than the current cost. This can then be repeated a number of times before continuing to the propagation step.

The way the random disparity is generated requires some discussion. Regular PatchMatch [Barnes et al. 2009] initializes fully randomly from all possible correspondences, and the random search is done by randomly searching among all possible correspondences within a shrinking window centered on the current solution. Our approach is different. Firstly, the initialization and random search form a single unified step. Secondly, the random disparity is not generated from the disparity range but from the valid indices for that epipolar line, since we are matching only the strong gradients that are within our masks.

In general, if a part of the scene is labeled as a strong gradient in the left image it will also be labeled as a strong gradient in the right image (and vice versa). This is not the case for parts that are occluded in one image of the pair, but those do not have a correct match anyway. It follows that a pixel within the gradient mask of one image will have its corresponding pixel within the gradient mask of the other image. Since the gradients are generally sparse, this significantly reduces the possible valid disparities. This reduction in search space means that each random guess has a higher probability of being correct, which improves convergence.

Therefore, when generating a random disparity we sample from the space of valid indices, not from the full disparity range. As mentioned above, the first column of each row in the index masks stores the number of valid pixels. This value is used as the range of a uniform random distribution. We generate a random integer from this distribution; this number gives us the column in the index mask row to sample, and the index stored in that column gives us our random match candidate. We then compute the matching cost for this candidate correspondence; if the cost is lower than the current cost, we save the disparity and the cost as the current best match. For the matching cost we use the standard sum of absolute differences over a 7 × 7 support window. This process can be iterated; in our current implementation we run two iterations. A sketch of this step is given below.
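A simplified CUDA sketch of the random search step for the left view. All names are illustrative, sad7x7 stands in for the paper's 7 × 7 sum-of-absolute-differences cost, and the RNG states are assumed to be pre-seeded with curand_init:

```cuda
#include <curand_kernel.h>

// Illustrative 7x7 SAD cost between pixel (xl, y) in the left image and
// (xr, y) in the right image. Note the result stays below the 0xffff
// initialization value, as the paper's initialization scheme requires.
__device__ unsigned short sad7x7(const uchar3* L, const uchar3* R,
                                 int xl, int xr, int y, int w, int h)
{
    unsigned int sum = 0;
    for (int dy = -3; dy <= 3; ++dy)
        for (int dx = -3; dx <= 3; ++dx) {
            int yy = min(max(y + dy, 0), h - 1);       // clamp at borders
            int xa = min(max(xl + dx, 0), w - 1);
            int xb = min(max(xr + dx, 0), w - 1);
            uchar3 a = L[yy * w + xa], b = R[yy * w + xb];
            sum += abs(a.x - b.x) + abs(a.y - b.y) + abs(a.z - b.z);
        }
    return (unsigned short)sum;    // max 7*7*3*255 = 37485 < 0xffff
}

// One unified initialization / random-search iteration. Index arrays use
// the layout of the compaction sketch above: row[0] = count, row[1..count]
// = x coordinates. One thread per masked pixel; blockIdx.y selects the row.
__global__ void randomSearch(const uchar3* left, const uchar3* right,
                             const int* leftIndex, const int* rightIndex,
                             unsigned char* disparity, unsigned short* cost,
                             int width, int height, curandState* rng)
{
    int y = blockIdx.y;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int* lrow = leftIndex + y * (width + 1);
    const int* rrow = rightIndex + y * (width + 1);
    if (i >= lrow[0] || rrow[0] == 0) return;

    int x = lrow[1 + i];                   // our masked pixel
    int tid = y * width + x;
    curandState local = rng[tid];

    // Sample among the *valid* masked pixels of the same epipolar line in
    // the other image, not from the full disparity range.
    int j  = curand(&local) % rrow[0];
    int xr = rrow[1 + j];
    int d  = x - xr;                       // candidate disparity
    if (d >= 0 && d <= 255) {
        unsigned short c = sad7x7(left, right, x, xr, y, width, height);
        if (c < cost[tid]) { cost[tid] = c; disparity[tid] = (unsigned char)d; }
    }
    rng[tid] = local;
}
```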
Propagation The random search step will generate a very noisy disparity map where most of the disparities are wrong, but some are correct. The propagation step serves to propagate the good matches across the image. Here our algorithm also differs significantly from PatchMatch.

Taking the standard PatchMatch approach to propagation would present several problems for our application scenario. Firstly, the computation cost is too high. In the serial version the image is processed linearly from one corner to the next. At each pixel the disparities of the preceding horizontal and vertical neighbors are used as possible new disparities, and new matching costs are computed. If the cost of a candidate disparity is lower than the current one, the new disparity is adopted. Computing the matching cost is expensive in general, and doing it serially is prohibitive. The performance would be far too slow for real-time use.
Parallel versions of PatchMatch have been proposed, but they are still not well suited to our application. Although the computations are done in parallel, many more are needed per pixel; even the parallel versions require too many expensive matching cost computations per frame.

Secondly, PatchMatch is meant for computing dense correspondences, while we only compute disparities within the masked areas. This means there are large gaps in the image. Although it is possible in principle to propagate by skipping those gaps, this would violate the assumption of propagating between neighbors, and it is unlikely that that kind of propagation would be useful. In the case of parallel implementations of PatchMatch, the propagation is limited in radius, so it would not be able to skip gaps anyway.
We take a simpler approach to the propagation step. Instead of propagating serially through the entire image, we have each pixel check its neighborhood in parallel. Instead of computing another matching cost for each of its neighbors' disparities, the pixel uses the neighbor's cost as a proxy for what the cost would be for this pixel if it had the same disparity. The idea behind this is that if our disparity is the same as that of our neighbor, our matching costs will likely be very similar as well. We choose the neighbor with the lowest cost and take its disparity as a candidate solution; only now do we compute a new matching cost. If this new cost is lower than our old cost we accept the new disparity; otherwise we keep the old one. This means that each iteration of the propagation step only performs one matching cost computation. In our current implementation we run three iterations of the propagation step; a sketch follows below.
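A minimal CUDA sketch of one propagation iteration, under the same assumptions and naming as the random-search sketch above (it reuses sad7x7; for clarity it ignores read/write ordering between threads within an iteration):

```cuda
// One propagation iteration: each masked pixel inspects its immediate
// neighbors, picks the one with the lowest *stored* cost (a proxy for
// what that disparity would cost here), and verifies that single
// candidate with exactly one real matching cost computation.
__global__ void propagate(const uchar3* left, const uchar3* right,
                          const int* leftIndex,
                          unsigned char* disparity, unsigned short* cost,
                          int width, int height)
{
    int y = blockIdx.y;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int* lrow = leftIndex + y * (width + 1);
    if (i >= lrow[0]) return;

    int x = lrow[1 + i];
    int tid = y * width + x;

    // Find the 4-neighbor whose stored cost is lowest.
    int best = -1;
    unsigned short bestCost = 0xffff;
    const int nx[4] = { x - 1, x + 1, x, x };
    const int ny[4] = { y, y, y - 1, y + 1 };
    for (int k = 0; k < 4; ++k) {
        if (nx[k] < 0 || nx[k] >= width || ny[k] < 0 || ny[k] >= height)
            continue;
        int n = ny[k] * width + nx[k];
        if (cost[n] < bestCost) { bestCost = cost[n]; best = n; }
    }
    if (best < 0) return;

    // Single verification: compute the real cost of the neighbor's disparity.
    int d  = disparity[best];
    int xr = x - d;
    if (xr < 0) return;
    unsigned short c = sad7x7(left, right, x, xr, y, width, height);
    if (c < cost[tid]) { cost[tid] = c; disparity[tid] = (unsigned char)d; }
}
```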
4.3 Post processing

Most stereo algorithms have a post-processing step that follows the initial computation of the disparity map. The purpose of this step is to further refine the disparities. We take a fairly simple approach to post processing. The goal is to determine which of the computed disparities are likely correct; the rest of the image is then filled in with extrapolated values.

The simplest way to determine good disparities is to run a consistency check. This involves comparing the left and right disparity maps and only keeping the values for those pixels whose target points back at them, i.e., only keeping correspondence pairs. This eliminates the parts of the image that are occluded in the other view, and therefore cannot have a good match. Although this works well for a standard plane sweep algorithm, in our case it could cause errors, because our search is probabilistic and there is no guarantee that pixels that are unoccluded and belong to a correspondence pair will point to each other. It is possible for only one pixel of the pair to point to its match, while the other one points elsewhere. To help with this we run a step prior to the consistency check. For each pixel p in a disparity map we check if its match p′ = q points back at p. If it does not, we compare the matching costs of the two pixels. Since the matching cost is symmetric, it should be the same (and minimal) for a correspondence pair. If q has a higher matching cost than p, we set its match q′ to p and update its cost accordingly. This always creates a better solution. This process is run in both the left-to-right and right-to-left directions. After this step, we run a traditional consistency check. Pixels that are not part of a correspondence pair are labeled as invalid.

Since the invalid pixels are in the masked area, they are important, so we do not want to naïvely fill them in the same way as the unmasked pixels. Instead we attempt to grow the valid disparity values into the invalid ones. This is a parallel process where each invalid pixel checks its direct neighbors and adopts the lowest valid disparity among them; the invalid pixel is then marked valid, but its cost is set to the maximum. Each iteration of this further grows the disparity; we settled on five iterations for our system. The operation contributes somewhat to disparity edge fattening, but it improves the disparity map overall.

Finally, after the previous steps we can fill in the remainder of the disparity map, assigning new values to any pixels that are still invalid or were not in the gradient mask. To extrapolate the disparity map we use a simple linear search along the epipolar lines: from each pixel (again in parallel) we search left and right for the first pixel that is valid. We look at the two disparity values and adopt the lower one (taking the lower value instead of interpolating helps prevent occlusion edges from bleeding into occluded areas). This is perhaps an overly simplistic approach, and it does result in considerable streaks in the disparity map. However, these streaks are mainly over low gradient strength areas and therefore do not cause many artifacts in the final re-rendered image. A sketch of this fill-in pass is shown below.
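A minimal CUDA sketch of the fill-in pass, again with illustrative names; valid[] is assumed to mark the pixels that survived the consistency check:

```cuda
// Fill-in pass: every pixel that is still invalid scans left and right
// along its row for the nearest valid disparity on each side and adopts
// the lower of the two, avoiding foreground bleeding across occlusions.
__global__ void fillDisparity(const unsigned char* valid,
                              const unsigned char* disparity,
                              unsigned char* filled, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int tid = y * width + x;
    if (valid[tid]) { filled[tid] = disparity[tid]; return; }

    unsigned char dl = 255, dr = 255;        // "nothing found" sentinels
    for (int xl = x - 1; xl >= 0; --xl)      // nearest valid pixel to the left
        if (valid[y * width + xl]) { dl = disparity[y * width + xl]; break; }
    for (int xr = x + 1; xr < width; ++xr)   // nearest valid pixel to the right
        if (valid[y * width + xr]) { dr = disparity[y * width + xr]; break; }

    filled[tid] = min(dl, dr);               // adopt the lower disparity
}
```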
4.4 Performance and accuracy

Through experimentation with the number of iterations of our stereo algorithm, we settled on two overall iterations, each doing two iterations of search and three iterations of propagation. This means that in total we only perform ten matching cost computations per frame per masked pixel. Despite this we get accurate disparity results. Table 1 gives the timings and error rates for the Teddy and Cones pairs from the Middlebury dataset [Scharstein and Szeliski 2003]. Figures 3 and 4 show the disparity maps and disparity errors.

Because of the probabilistic nature of the algorithm we have an effective disparity range of 256, even though we only compute the cost for ten disparity levels. To achieve the equivalent precision, a plane sweep algorithm would have to check all 256 disparity levels and perform an order of magnitude more matching cost computations. Even if the plane sweep skipped over unmasked areas, it would not significantly reduce the runtime, because of the GPU code path divergence problem mentioned above.

Table 1: Per-frame timings and error rates for the Teddy and Cones datasets. The resolution of the input images and the disparity maps is 450x375. The error rate is the percentage of pixels within unoccluded masked areas with a disparity error greater than 1 pixel.

                       Teddy       Cones
  Timings
    Computing mask      2.98 ms     3.18 ms
    Stereo matching    12.59 ms    16.82 ms
    Post-processing     1.92 ms     1.57 ms
  Error rate           15.47%       7.52%

Prototype In our prototype system the stereo camera has a native resolution of 1024x768, but in order to improve performance we reduce this to 512x384 for the stereo matching algorithm. We do, however, use the full-color image for the matching, instead of the common grayscale reduction. Although the stereo matching is done at half resolution, the result is upscaled back to full resolution before computing the gradient positions and calling the IBR algorithm.

5 Rendering

As mentioned above, the basic idea of the method is to create a novel view by integrating gradient images that are formed by reprojecting the gradients of the original view. Integrating a solution just from the gradients is a computationally expensive operation, even with a GPU-based parallel implementation. It can take many iterations for the integration to converge to a solution, partly due
Figure 3: Stereo matching for Teddy dataset. (a) Left input image. (b) Raw disparity. (c) Final (filled-in) disparity. (d) Disparity error: white - correct, black - error greater than 1 pixel, gray - not in mask or excluded because of occlusion.

Figure 4: Stereo matching for Cones dataset. (a) Left input image. (b) Raw disparity. (c) Final (filled-in) disparity. (d) Disparity error: white - correct, black - error greater than 1 pixel, gray - not in mask or excluded because of occlusion.
to the unknown constant of integration. The method by Kopf et al. [2013] uses an approximate solution (the data term) as an initial solution in order to significantly reduce the number of iterations.

The key to the approximation step is to consider that when a gradient changes position from the original view to the new view, it should alter the color of the regions that it passes over. To clarify, consider a quadrilateral whose two opposing edges are the gradient's original position and its new position. This quad can be drawn over the original view, and the gradient value can be applied to the pixels that the quad covers. This may add to or subtract from those pixels' values. If this process is done for all gradients, the resulting image will be very similar to what the correct image should be from the new view. For a more in-depth description, please see [Kopf et al. 2013].
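To convey the idea, the following hedged CUDA sketch approximates this sweep with a 1-D splat along each gradient's displacement using atomic adds. The actual system rasterizes full quads with OpenGL; all names here are ours:

```cuda
// Illustrative approximation of the data term: sweep each gradient's
// value over the pixels between its original and reprojected positions.
// The real pipeline draws a quad per gradient; this 1-D splat along the
// displacement conveys the same add-or-subtract accumulation.
__global__ void splatGradients(const float2* oldPos, const float2* newPos,
                               const float* gradValue, int numGradients,
                               float* accum, int width, int height)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= numGradients) return;

    float2 a = oldPos[g], b = newPos[g];
    int steps = (int)fmaxf(fabsf(b.x - a.x), fabsf(b.y - a.y)) + 1;
    for (int s = 0; s <= steps; ++s) {
        float t = (float)s / (float)steps;
        int x = (int)(a.x + t * (b.x - a.x));
        int y = (int)(a.y + t * (b.y - a.y));
        if (x >= 0 && x < width && y >= 0 && y < height)
            atomicAdd(&accum[y * width + x], gradValue[g]); // may add or subtract
    }
}
```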
5.1 Performance considerations

While developing our prototype system we aimed to strike a balance between real-time performance and good image quality. The various bottlenecks were identified through profiling, and adjustments were made to reduce the run-time while minimizing any loss of quality. Here we give some details about those considerations.

The IBR algorithm can be divided into three distinct steps that have different performance behaviors.

The first step is the rendering of the data term, which is surprisingly the most expensive. The performance hit here comes from the number and size of the quads. Each quad corresponds to a gradient, so there are twice as many quads as there are pixels (one horizontal and one vertical). Furthermore, the nature of the shifting gradients means that each quad will typically generate a large number of fragments. The cost of this step changes considerably based on the novel view position.

The second (also fastest) step is rendering the gradient images, i.e., simply reprojecting the lines of the gradients at their new positions.

Finally, the third step is the integration of the final solution from the gradients, initializing with and biasing toward the data term. This step is fairly expensive, but its runtime is mostly constant, depending mainly on the number of iterations.

The original work by Kopf et al. used a super-resolution framebuffer for rendering all the steps in the algorithm, i.e., the framebuffer size is several times larger than the input resolution. They also bias the final solution toward the approximate solution. We take a somewhat different approach. We observe that we can treat the approximate solution as simply the low frequency component of the final solution, while the reprojected gradients can provide the high frequency detail. We then use the approximate solution just as an initial solution, and do not bias towards it during the integration. This allows us to use a much lower resolution image for our data term, since it only needs to capture low frequency information. By using a lower resolution data term we significantly improve performance. We set the data term resolution to a quarter of the regular framebuffer resolution. We also reduce the number of integration steps to five, and use a framebuffer size smaller than the original image. Although our framebuffer size (640x480) is smaller than the raw input resolution, it does not actually lower the quality of the final results. This is because the field of view of the user's frustum is usually narrower than that of the camera. As a result, the input image is effectively scaled up when shown on the magic lens and therefore still oversampled by the framebuffer.

The final augmented image is rendered at 800x600, which is the resolution of our display. The various resolutions in our pipeline were empirically determined to give a good balance of performance and quality for our system.

6 Prototype

Our system consists of a hand-held magic lens rig tethered to a workstation. The rig, shown in Figure 5, was built using common off-the-shelf parts. The central component of the magic lens is a …
6.2 Face tracking
7 Results

Some examples of the type of results we get can be seen in Figures 1a, 2, and 6.

Figure 6 shows a simple example of an AR scene, with both the user's view (top) and the corresponding screen capture from the magic lens display (bottom). The view frustum inside the magic lens is well aligned with the outside, and the perspective of the scene matches that of the outside. The screen capture taken at the same moment shows that the image quality of the magic lens view is quite good, with only minimal rendering artifacts.

Figure 2 shows the main steps of our approach for a somewhat cluttered live scene with various different features: dark areas, bright areas, textured surfaces, homogeneous surfaces, specularities, and thin geometry. The stereo matching is only run on a small percentage of the image, and the filled-in disparity map is very coarse. However, the final rendering has relatively minor artifacts.
Table 2: Average per-frame timings for our prototype implementation. Average framerate is about 16 FPS.

  Timing (ms)
  Frame total                       62.32
    Prepare input pair               3.11
    Stereo matching                  7.92
    Post-processing                  3.73
      Consistency check              0.18
      Grow disparity                 0.81
      Fill disparity                 2.74
    Compute and update positions     7.34
    Image-based rendering           36.33
      Data term                     13.66
      Gradients                      8.04
      Merge left and right           1.08
      Conjugate gradient solver     13.55
    Other                            3.89
The performance of our final system across the various steps in our pipeline can be seen in Table 2. The system has an overall average framerate of 16 FPS. The largest aggregate cost, about half of the total, is the image-based rendering. The stereo matching is very fast at less than 8 ms. However, post-processing adds another 3.7 ms, most of which is spent on filling in the disparity. This is a very simple step, but it is not yet optimized and performs poorly if the masked regions are too sparse. Another unexpectedly high cost, at over 7 ms, is the computing and updating of the 3D positions of the gradients. This is likely because this step makes OpenGL and CUDA synchronize, which forces all GPU operations to complete; it also imposes a synchronization with the CPU.

7.2 Discussion

Overall, our system provides quite satisfactory results, but it does have some remaining challenges. From a user perspective, the challenges are issues with the view frustum and issues with the image quality. From a technical standpoint these are caused by issues with face tracking, stereo reconstruction, rendering, and calibration.

Figure 7: Comparison between full resolution and reduced resolution. Left is data term, right is solution. Top is full resolution, bottom is reduced resolution.

Figure 8: Comparison between result using ground truth versus our stereo matching. Top is with ground truth, bottom is with our stereo algorithm.
View frustum The heart of the user-perspective magic lens problem is providing a correct view frustum for the user. While our system generally accomplishes this goal, it has some constraints. Firstly, since it is a fully live system, it can only show what the stereo cameras currently see. Although we use cameras with a fairly wide field of view, it is still possible for the user to orient the magic lens in such a way that the user's view frustum includes areas that the cameras do not see. This problem is somewhat mitigated by the fact that the best way to use a user-perspective magic lens is to hold it straight in order to get the widest view; this keeps the desired view frustum within the region visible by the cameras. Nevertheless, this issue warrants some discussion. Currently our system simply fills in those areas using information from the known edges. A possible simple solution to this problem could be to use fisheye lenses or additional cameras in order to get a 180° view of the scene behind the display. In [Baričević et al. 2012] the approach was to create a model of the environment and render from the model; this way the out-of-sight areas could still be rendered if they were once visible. This type of compromise approach, where currently visible areas are rendered from live data while out-of-sight areas are rendered from a model, could also be a promising solution here. Since we use image-based rendering, the scene model can simply be a collection of keyframes with depth maps.

…occlusion boundaries. In areas that are visible from the viewer's position but not seen from the cameras, the gap is filled by smooth streaks connecting the edges.

8 Conclusion and Future Work

We have presented a new approach to creating a geometrically-correct user-perspective magic lens, based on leveraging the gradients in the real world scene. The key to our approach is the coupling of a recent image-based rendering algorithm with a novel semi-dense stereo matching algorithm. Our stereo algorithm is fast and accurate in the areas of interest. The use of image-based rendering provides us with good imagery, even with limited scene model detail. Based on this approach we built a prototype device using common off-the-shelf hardware.

In addition to the various possible improvements to the system, we would also like to evaluate the system with a formal user study. Previous user studies on user-perspective magic lenses have either been in simulation [Baričević et al. 2012] or with approximations [Čopič Pucihar et al. 2013; Čopič Pucihar et al. 2014]. We hope to be able to do a fair comparison between device-perspective and user-perspective magic lenses with a full real system.
ČOPIČ PUCIHAR, K., COULTON, P., AND ALEXANDER, J. 2014. The Use of Surrounding Visual Context in Handheld AR: Device vs. User Perspective Rendering. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, CHI '14, 197–206.

DEBEVEC, P. E., TAYLOR, C. J., AND MALIK, J. 1996. Modeling and Rendering Architecture from Photographs: A Hybrid Geometry- and Image-based Approach. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, ACM, New York, NY, USA, SIGGRAPH '96, 11–20.

HILL, A., SCHIEFER, J., WILSON, J., DAVIDSON, B., GANDY, M., AND MACINTYRE, B. 2011. Virtual transparency: introducing parallax view into video see-through AR. In Proceedings of the 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2011, 239–240.

KOPF, J., LANGGUTH, F., SCHARSTEIN, D., SZELISKI, R., AND GOESELE, M. 2013. Image-based Rendering in the Gradient Domain. ACM Trans. Graph. 32, 6 (Nov.), 199:1–199:9.

KOWALCZUK, J., PSOTA, E., AND PEREZ, L. 2013. Real-Time Stereo Matching on CUDA Using an Iterative Refinement Method for Adaptive Support-Weight Correspondences. Circuits and Systems for Video Technology, IEEE Transactions on 23, 1 (Jan), 94–104.

LEVOY, M., AND HANRAHAN, P. 1996. Light Field Rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, ACM, New York, NY, USA, SIGGRAPH '96, 31–42.

MATSUDA, Y., SHIBATA, F., KIMURA, A., AND TAMURA, H. 2013. Poster: Creating a user-specific perspective view for mobile mixed reality systems on smartphones. In 3D User Interfaces (3DUI), 2013 IEEE Symposium on, 157–158.

MOHRING, M., LESSIG, C., AND BIMBER, O. 2004. Video see-through AR on consumer cell-phones. In Mixed and Augmented Reality, 2004. ISMAR 2004. Third IEEE and ACM International Symposium on, 252–253.

OLSSON, T., AND SALO, M. 2011. Online user survey on current mobile augmented reality applications. In Proceedings of the 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2011, 75–84.

OLWAL, A., AND HÖLLERER, T. 2005. POLAR: Portable, Optical See-through, Low-cost Augmented Reality. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology, ACM, New York, NY, USA, VRST '05, 227–230.

PÉREZ, P., GANGNET, M., AND BLAKE, A. 2003. Poisson Image Editing. ACM Trans. Graph. 22, 3 (July), 313–318.

PRADEEP, V., RHEMANN, C., IZADI, S., ZACH, C., BLEYER, M., AND BATHICHE, S. 2013. MonoFusion: Real-time 3D reconstruction of small scenes with a single web camera. In Mixed and Augmented Reality (ISMAR), 2013 IEEE International Symposium on, 83–88.

REKIMOTO, J., AND NAGAO, K. 1995. The World Through the Computer: Computer Augmented Interaction with Real World Environments. In Proceedings of the 8th Annual ACM Symposium on User Interface and Software Technology, ACM, New York, NY, USA, UIST '95, 29–36.

SARAGIH, J., AND MCDONALD, K., 2014. FaceTracker. facetracker.net. Accessed 1 June 2014.

SARAGIH, J. M., LUCEY, S., AND COHN, J. 2009. Face Alignment through Subspace Constrained Mean-Shifts. In International Conference on Computer Vision (ICCV).

SCHARSTEIN, D., AND SZELISKI, R. 2002. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. International Journal of Computer Vision 47, 1-3, 7–42.

SCHARSTEIN, D., AND SZELISKI, R. 2003. High-accuracy stereo depth maps using structured light. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, vol. 1, I-195–I-202.

SHADE, J., GORTLER, S., HE, L.-W., AND SZELISKI, R. 1998. Layered Depth Images. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, ACM, New York, NY, USA, SIGGRAPH '98, 231–242.

SHUM, H., AND KANG, S. B., 2000. Review of image-based rendering techniques.

STATE, A., KELLER, K. P., AND FUCHS, H. 2005. Simulation-Based Design and Rapid Prototyping of a Parallax-Free, Orthoscopic Video See-Through Head-Mounted Display. In Proceedings of the 4th IEEE/ACM International Symposium on Mixed and Augmented Reality, IEEE Computer Society, Washington, DC, USA, ISMAR '05, 28–31.

TOMIOKA, M., IKEDA, S., AND SATO, K. 2013. Approximated user-perspective rendering in tablet-based augmented reality. In Mixed and Augmented Reality (ISMAR), 2013 IEEE International Symposium on, 21–28.

VIEGA, J., CONWAY, M. J., WILLIAMS, G., AND PAUSCH, R. 1996. 3D magic lenses. In Proceedings of the 9th Annual ACM Symposium on User Interface Software and Technology, ACM, New York, NY, USA, UIST '96, 51–58.

WALIGORA, M. 2008. Virtual Windows: Designing and Implementing a System for Ad-hoc, Positional Based Rendering. Master's thesis, University of New Mexico, Department of Computer Science.

WANG, L., LIAO, M., GONG, M., YANG, R., AND NISTER, D. 2006. High-Quality Real-Time Stereo Using Adaptive Cost Aggregation and Dynamic Programming. In 3D Data Processing, Visualization, and Transmission, Third International Symposium on, 798–805.

WLOKA, M. M., AND GREENFIELD, E. 1995. The Virtual Tricorder: A Uniform Interface for Virtual Reality. In Proceedings of the 8th Annual ACM Symposium on User Interface and Software Technology, ACM, New York, NY, USA, UIST '95, 39–40.

YU, W., CHEN, T., FRANCHETTI, F., AND HOE, J. 2010. High Performance Stereo Vision Designed for Massively Data Parallel Platforms. Circuits and Systems for Video Technology, IEEE Transactions on 20, 11 (Nov), 1509–1519.

ZHANG, K., LU, J., YANG, Q., LAFRUIT, G., LAUWEREINS, R., AND VAN GOOL, L. 2011. Real-Time and Accurate Stereo: A Scalable Approach With Bitwise Fast Voting on CUDA. Circuits and Systems for Video Technology, IEEE Transactions on 21, 7 (July), 867–878.

ZHOU, F., DUH, H.-L., AND BILLINGHURST, M. 2008. Trends in augmented reality tracking, interaction and display: A review of ten years of ISMAR. In Mixed and Augmented Reality, 2008. ISMAR 2008. 7th IEEE/ACM International Symposium on, 193–202.