Ray Tracing
Jesse D. Hall and John C. Hart
University of Illinois
Abstract
Assisted by recent advances in programmable graphics hardware, fast rasterization-based techniques have made
significant progress in photorealistic rendering, but still only render a subset of the effects possible with ray
tracing. We are closing this gap with the implementation of ray-triangle intersection as a pixel shader on existing
hardware. This GPU ray-intersection implementation reconfigures the geometry engine into a ray engine that
efficiently intersects caches of rays for a wide variety of host-based rendering tasks, including ray tracing, path
tracing, form factor computation, photon mapping, subsurface scattering and general visibility processing.
Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Three-Dimensional
Graphics and Realism
Keywords: Hardware acceleration, ray caching, ray classification, ray coherence, ray tracing, pixel shaders.
1. Introduction
Hardware-accelerated rasterization has made great strides in simulating global illumination effects, such as shadows35,25,7, reflection3, multiple-bounce reflection5, refraction9, caustics29 and even radiosity13. Nonetheless
some global illumination effects have eluded rasterization
solutions, and may continue to do so indefinitely. The
environment map provides polygon rasterization with
limited global illumination capabilities by approximating
the irradiance of all points on an object surface with the
irradiance at a single point3 . This single-point irradiance
approximation can result in some visually obvious errors,
such as the boat in wavy water shown in Figure 1.
Ray tracing of course simulates all of these effects and
more. It can provide true reflection and refraction, complete
with local and multiple bounces. Complex camera models with compound lenses are easier to simulate using ray
tracing15 . Numerous global illumination methods are based
on ray tracing including path tracing12 , Monte-Carlo ray
tracing33 and photon mapping10 .
Ray tracing is classically one of the most time-consuming operations on the CPU, and the graphics community has been eager to accelerate it using whatever methods possible. Hardware-based accelerations have included CPU-specific tuning, distribution across parallel processors and even special-purpose hardware.
Figure 1: What is wrong with this environment-mapped picture? (1) The boat does not meet its reflection, (2) the boat
is reflected in the water behind it, and (3) some aliasing can
be seen in the reflection.
2. Previous Work

A recent, highly tuned software ray tracer capitalized on a variety of spatial, ray and memory coherencies to best utilize CPU optimizations such as caching, branch prediction, instruction reordering, speculative execution and SSE instructions. It ran at an average of 30 million intersections per second on an 800 MHz Pentium III, and traced between
200K and 1.5M rays per second, which was over ten times
faster than POV-Ray and Rayshade.
There have been a large number of implementations of
ray tracers on MIMD computers26 . These implementations
focus on issues of load balancing and memory utilization.
One recent implementation on 60 processors of an SGI Origin 2000 was able to render scenes of 20 to 2K patches at 512² resolution at rates ranging from two to 20 Hz19.
Special-purpose hardware has also been designed for ray tracing. The AR350 is a production graphics accelerator designed for the off-line (non-real-time) rendering of scenes with sophisticated RenderMan shaders8. A ray tracing system designed around multiprocessors with smart memory is also in progress23.
Our ray engine is similar in spirit to another GPU-based
ray tracing implementation that simulates a state machine24 .
This state-based approach breaks ray tracing down into several states, including grid traversal, ray-triangle intersection
and shading. This approach performs the entire ray tracing
algorithm on the GPU, avoiding the slow readback process
required for GPU-CPU communication that our approach
must deal with. The state-based method however is not particularly efficient on present and near-future GPUs due to
the lack of control flow in the fragment program, resulting
in a large portion of pixels (from 90% to 99%) remaining
idle if they are in a different state than the one currently being executed. Our approach has been designed to organize
ray tracing to achieve full utilization of the GPU.
3. Ray Tracing with the GPU
3.1. Ray Casting
The core of any ray tracer is the intersection of rays with geometry. Rays are represented parametrically as r(t) = o + t d, where o is the ray origin, d is the ray direction and t ≥ 0 is a real parameter corresponding to points along the ray. The classic implementation of recursive ray tracing casts each ray individually and intersects it against the scene geometry. This process generates a list of parameters t_i corresponding to points of intersection with the scene's geometric primitives. The least positive element of this list is returned as the first intersection, the one closest to the ray origin.
Figure 2(a) illustrates ray casting as a crossbar. This illustration represents the rays with horizontal lines and the
(unorganized) geometric primitives (e.g. triangles) with vertical lines. The crossing points of the horizontal and vertical
lines represent intersection tests between rays and triangles.
This crossbar represents an all-pairs check of every ray against every primitive. Since their inception, ray tracers have avoided checking every ray against every primitive through the use of spatially coherent data structures on both the rays and the geometry. These data structures reorganize the crossbar into a sparse overlapping block structure, as shown in Figure 2(b). Nevertheless, the individual blocks are themselves full crossbars that perform an all-pairs comparison on their subset of the rays and geometry.

The result of ray casting is the identification of the geometry (if any) intersected first by each ray. This result is a series of points in the crossbar, no greater than one per horizontal line (ray). These first intersections are shown as black disks in Figure 2(c). The other ray-triangle intersections are indicated with open circles and are ignored in simple ray casting.

3.2. Programmable Shading Hardware

Graphics accelerators have been designed to implement a pipeline that converts polygons from model coordinates to viewport coordinates. Once in viewport coordinates, rasterization fills the polygon with pixels, interpolating the depth, color and texture coordinates in a perspective-correct fashion. During rasterization, the interpolated texture coordinates index memory to map an image of the texture onto the polygon. Recent accelerators augment this pipeline with programmable elements that support advanced shading [Lindholm et al. 2001]. These programmable elements can be separated into two components, the vertex shader and the pixel shader, as shown in Figure 3(b).

The vertex shader replaces the graphics pipeline with a user-programmable stream processor. This stream processor cannot change the number of vertices passing through it, but it can change the vertex attributes, including color, position and texture coordinates.

The pixel shader generalizes the per-pixel application and access of texture. The pixel shader can perform arithmetic operations on the texture coordinates before they index into the texture, and can then perform additional arithmetic operations with the fetched texture result. In a single pass, the pixel shader computes each pixel in isolation, and cannot access data stored at other pixels in the framebuffer.

Figure 3: Programmable pixel shading is a crossbar.

The speed of modern graphics accelerators is indicated by the vertex rate, which measures the vertical bandwidth of Figure 3, and the pixel rate, which measures the horizontal bandwidth. The pixel rate is an order of magnitude faster than the vertex rate on modern graphics cards.

3.3. Mapping Ray Casting to Programmable Shading Hardware

We map the ray casting crossbar in Figure 2 to the rasterization crossbar in Figure 3 by distributing the rays across the pixels and broadcasting a stream of triangles to each, sending the triangle geometry as the vertex attribute data (e.g. color, texture coordinates) of screen-filling quadrilaterals. The rays are stored in two screen-resolution textures. The color of each pixel of the ray-origins texture stores the origin of the ray, and the color of each pixel of the ray-directions texture stores the direction of the ray.
3.4. Discussion

The decision to store rays in texture and triangles as vertex attributes was based partially on precision. The geometry pipeline supports full-precision 32-bit floating-point values whereas the texture pipeline is restricted to 8-bit clamped fixed-point values. Since rays can be specified with five real values whereas triangles require nine, we found it easier and more accurate to store the rays at the lower texture precision.

We also chose to implement ray-triangle intersection as a pixel shader instead of a vertex shader. The pixel shader implementation of ray-triangle intersection treats the GPU as a SIMD parallel processor [Peercy et al. 2000]. In this model, the framebuffer is treated as an accumulator data array of 5-vectors (r, g, b, α, z), and texture maps are used as data arrays for input variables. Operations on this data array are performed using image-processing combinations of the textures and the framebuffer. Pixel shaders are sequences of these image-processing combination operators. While compilers exist for multipass programming [Peercy et al. 2000; Proudfoot et al. 2001], the current limitations of pixel shaders required complete knowledge and control of the available instructions and registers to implement ray intersection.

4. Ray-Triangle Intersection on the GPU

4.1. Input

Each intersection pass sends the triangle data from the host, and computes the additional redundant values in the vertex shader.
The texture coordinates for texture zero (s₀, t₀) are special and are not constant across the quadrilateral. They are instead set to (0, 0), (1, 0), (1, 1), (0, 1) at the four vertices, and rasterization interpolates these values linearly across the quad's pixels. These texture coordinates are required by the pixel shader to access each pixel's corresponding ray in the screen-sized ray-origins and ray-directions textures.
4.2. Output
The output of the ray-triangle intersection needs to be
queried by the CPU, which can be an expensive operation due to the asymmetric AGP bus on personal computers
(which sends data to the graphics card much faster than it can
receive it). The following output format is designed to return
as little data as necessary, limiting itself to the index of the
triangle that intersects the ray closest to its origin, using the
z-buffer to manage the ray parameter t of the intersection.
Color. The color channel contains the color of the first triangle the ray intersects (if any). For typical ray tracing applications, this color will be a unique triangle id. These triangle
ids can index into an appearance model for the subsequent
shading of the ray-intersection results.
Alpha. Our pixel shader intersection routine conditionally sets the fragment's alpha value to indicate ray intersection.
The alpha channel can then be used as a mask by other applications if the rays are coherent (e.g. like eye rays through
the pixels in the frame buffer).
The t-Buffer. The t-value of each intersection is computed and replaces the pixel's z-value. The built-in z-test is used so the new t-value will overwrite the existing t-value stored in the z-buffer if the new value is smaller. This allows the z-buffer to maintain the least positive solution t for each ray.
Since the returned t value is always non-negative, the t-value
maintained by the z-buffer always corresponds to the first
triangle the ray intersects.
4.3. Intersection
We examined a number of efficient ray-triangle intersection
tests6, 2, 18 , and managed to reorganize one18 to fit in a pixel
shader.
Our revised ray-triangle intersection is evaluated as

$$\begin{aligned}
\mathbf{ao} &= \mathbf{a} - \mathbf{o}, &\quad(1)\\
\mathbf{bo} &= \mathbf{b} - \mathbf{o}, &\quad(2)\\
\mathbf{aod} &= \mathbf{ao} \times \mathbf{d}, &\quad(3)\\
t &= (\mathbf{n} \cdot \mathbf{ao})\,/\,(\mathbf{n} \cdot \mathbf{d}), &\quad(4)\\
\mathbf{bod} &= \mathbf{bo} \times \mathbf{d}, &\quad(5)\\
u &= \mathbf{ac} \cdot \mathbf{aod}, &\quad(6)\\
v &= \mathbf{ab} \cdot \mathbf{aod}, &\quad(7)\\
w &= \mathbf{bc} \cdot \mathbf{bod}, &\quad(8)
\end{aligned}$$

where a, b and c are the triangle vertices, ab, ac and bc are the corresponding edge vectors, n is the triangle normal, and ao and bo run from the ray origin o to the vertices a and b.
Figure 4: Leaky teapot, due to the low precision implementation on PS1.4 pixel shaders used to test the performance
of ray-triangle intersection. Our simulations using the precision available on upcoming hardware are indistinguishable
from software renderings.
[Ray engine block diagram: Front End; Geometry Cache; Triangle Data as Flat Quad Attributes; Ray Data as Texture Maps; Relevant Intersection Data as Pixels; Results.]
similar rays intersect a collection of spatially coherent triangles. In order to maintain full buckets of coherent rays, we utilize a ray cache21. Ray caching allows some rays to be cast in arbitrary order such that intersection computations can be performed as batch processes.
Once the search has chosen a bucket, rays are stolen from that node's siblings to fill the bucket, to avoid wasting intersection computations. Due to the greedy search and the node merging described next, this ensures that buckets sent to the ray engine are always as full as possible, even though in the ray tree they are typically only 50-80% full.
Once a bucket has been removed from the tree and traced,
it can often leave neighboring buckets containing only a few
rays. Our algorithm walks back up the tree from the removed
bucket leaf node, collecting subtrees into a single bucket
leaf node if the number of rays in the subtree has fallen below a threshold. Our tests showed that this process typically
merged only a single level of the tree.
The CPU performs a ray bucket intersection test1 against the octree cells to determine which should be sent to the GPU. We also used the vertex shader to cull back-facing triangles, as well as triangles outside the ray bucket, from intersection consideration. The vertex shader cannot change the number of vertices passing through it, but it can transform the screen-filling quad containing the triangle data to an off-screen location, which causes it to be clipped.

Since the GPU performs an all-pairs test over the rays and triangles passed to it, its performance is independent of scene structure:

$$\mathrm{GPU}(R, T) = O(RT).$$
5.4. Results
We implemented the ray engine on a simulator for an upcoming GPU based on the expected precision and capabilities needed to support the Direct3D 9 specification. These
capabilities allow us to produce full precision images that
lack the artifacts shown earlier in Figure 4.
We used the ray engine to classically ray trace a teapot
room and an office scene, shown in Figure 6(a) and (c).
We applied the ray engine to a Monte-Carlo ray tracer that
implemented distributed ray tracing and photon mapping,
which resulted in Figure 6(b). The ray engine was also used
to ray trace two views of one floor from the Soda Hall
dataset, shown in Figures 6(d) and (e).
The performance is shown in Table 1. Since our implementation is on a non-realtime simulator, we have estimated our performance using the execution rates measured on the GeForce 4. We measured the performance in rays per second, which measures the number of rays intersected with the entire scene per second. This figure includes the expensive traversal of the ray-tree and triangle octree as well as the ray-triangle intersections.

Table 1: Measured ray engine performance.

Scene                      Polygons   Rays/sec.
Teapot Room Classical         2,650     206,905
Teapot Room Monte-Carlo       2,650     149,233
Office                       34,000     114,499
Soda Hall Top View           11,052     129,485
Soda Hall Side View          11,052     131,302
Figure 6: Images tested by the ray engine: teapot Cornell box ray traced classically (a) and Monte Carlo (b), office (c), and
Soda Hall side (d) and top (e) views.
[Figure 7 plot: millions of ray-triangle intersection tests per second (vertical axis, 100-400) versus triangles (horizontal axis, 10-50), with curves for GeForce3 perf., GeForce4 perf., and GeForce4 w/ full AGP 4x bandwidth.]

Figure 7: Theoretical performance in millions of ray-triangle intersection tests per second on the GPU with = 4.
Scene                      % GPU Rays
Teapot Room Classical             89%
Teapot Room Monte-Carlo           71%
Office                            65%
Soda Hall Top View                70%
Soda Hall Side View               89%

These percentages indicate how many of the rays and triangles are sent to the GPU. The remaining rays and triangles are traced by the CPU. The best performers resulted from classical ray tracing of the teapot room and the ray casting of the Soda Hall side view. The numerous bounces from Monte Carlo ray tracing likely reduce the coherence on all but the eye rays. Coherence was reduced in the office scene due to the numerous small triangles that filled the triangle cache before the ray cache could be optimally filled. The Soda Hall top view contains many disjoint small silhouette wall polygons that likely failed to fill the triangle cache for a given optimally filled ray cache.
System                 Rays/sec.   Speedup
CPU only                 135,812        -
plus GPU                 165,098       22%
Asynch. Readback         183,273       34%
Infinitely Fast GPU      234,102       73%

T      GPU Rays   Rays/sec.   Speedup
CPU        -       135,812        -
416       78%      147,630        8%
512       81%      157,839       16%
515       89%      165,098       22%

Table 4: Tuning the ray engine by varying the range of triangles T sent to the GPU, measured on the teapot room.
R      Rays/sec.   Speedup
CPU      135,812        -
64       165,098       22%
128      177,647       31%
256      180,558       33%
512      175,904       29%
For example, for the teapot room classical ray tracing, we were able to achieve a 52% speedup over the CPU by setting R to 256 and hand-tuning the octree resolution.
7. Conclusions
References

1. Arvo, J., and Kirk, D. B. Fast ray tracing by ray classification. Proc. SIGGRAPH 87 (July 1987), 55-64.

6. Erickson, J. Pluecker coordinates. Ray Tracing News 10, 3 (1997), 11. www.acm.org/tog/resources/RTNews/html/rtnv10n3.html#art11.

7. Fernando, R., Fernandez, S., Bala, K., and Greenberg, D. P. Adaptive shadow maps. Proc. SIGGRAPH 2001 (Aug. 2001), 387-390.

8. Hall, D. The AR350: Today's ray trace rendering processor. In Hot 3D Presentations, P. N. Glagowski, Ed. Graphics Hardware 2001, Aug. 2001, pp. 13-19.

14. Kipfer, P., and Slusallek, P. Transparent distributed processing for rendering. Proc. Parallel Visualization and Graphics Symposium (1999), 39-46.

15. Kolb, C., Hanrahan, P. M., and Mitchell, D. A realistic camera model for computer graphics. Proc. SIGGRAPH 95 (Aug. 1995), 317-324.

17. Weghorst, H., Hooper, G., and Greenberg, D. Improved computational methods for ray tracing. ACM Trans. on Graphics 3, 1 (Jan. 1984), 52-69.

25. Reeves, W. T., Salesin, D. H., and Cook, R. L. Rendering antialiased shadows with depth maps. Proc. SIGGRAPH 87 (Jul. 1987), 283-291.

28. Szirmay-Kalos, L., and Purgathofer, W. Global ray-bundle tracing with hardware acceleration. Proc. Eurographics Rendering Workshop (June 1998), 247-258.
GPU Algorithms for Radiosity and Subsurface Scattering
Nathan A. Carr, Jesse D. Hall, John C. Hart
University of Illinois

Abstract
We capitalize on recent advances in modern programmable graphics hardware, originally designed to support advanced local illumination models for shading, to instead compute two different kinds of global illumination models for light transport. We first use the new floating-point texture map formats to find matrix radiosity solutions for light transport in a diffuse environment, and use this example to investigate the differences between GPU and CPU performance on matrix operations. We then examine
multiple-scattering subsurface light transport, which can be modeled to resemble a single radiosity gathering step. We use a multiresolution meshed atlas to organize a hierarchy of precomputed subsurface
links, and devise a three-pass GPU algorithm to render in real time the subsurface-scattered illumination of an object, with dynamic lighting and viewing.
Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Subsurface
Scattering
1. Introduction
The programmable shading units of graphics processors designed for z-buffered textured triangle rasterization have transformed the GPU into a general-purpose streaming processor, capable of performing a wider variety of graphics, scientific and general-purpose processing.
A key challenge in the generalization of the GPU to
non-rasterization applications is the mapping of wellknown algorithms to the streaming execution model
and limited resources of the GPU. Some popular
graphics and scientific algorithms and data structures,
such as ray tracing 19, 3 as well as preconditioned conjugate gradient and multigrid solvers 2 , have nevertheless been implemented on, or at least accelerated by,
the GPU. This paper expands the horizon of photorealistic rendering algorithms that the GPU can accelerate to include matrix radiosity and subsurface scattering, and describes how the techniques could eventually lead to a GPU implementation of hierarchical
radiosity.
Matrix radiosity is a classic technique for simulating light transport in diffuse scenes 7 . It is capable
of synthesizing and depicting the lighting, soft shadowing and color bleeding found in scenes of diffuse
materials. Our mapping of radiosity to the GPU uses
the floating-point texture format to hold the radiosity matrix, but also pays attention to cache coherence
and the order of computation to efficiently perform
a Jacobi iteration which gathers radiosities as it iterates toward a solution. Given precomputed form factors, we are thus able to both compute and display
a radiosity solution entirely on the GPU. While the
geometry is fixed, the emittance is not, and our GPU
algorithm can support dynamic relighting as well as
dynamic alteration of patch reflectances.
Lensch et al.15 shows how the BSSRDF formulation
of subsurface (multiple diffuse) scattering12 resembles
a single radiosity gathering step. Light transport in
the object interior is computed by gathering, for each
patch, the diffused image of the Fresnel transmitted
irradiances from the other patches. The BSSRDF can be integrated to form a throughput factor15 that resembles a form factor.
2. Previous Work
2.1. Radiosity
Early radiosity techniques beginning with Goral et al.7
required computationally intensive solutions, which
led to numerous acceleration strategies including
shooting5 and overrelaxation. Though form factor
computation is generally considered the bottleneck of
radiosity techniques, hardware accelerated methods
such as the hemicube6 have been available for a long
time, though recent improvements exist17 .
Graphics researchers have looked to hardware acceleration of radiosity long before commodity graphics processors became programmable1 . For example,
Keller14 used hardware-accelerated OpenGL to accelerate radiosity computations. He performed a quasi-Monte Carlo particle simulation for light transport on the CPU. He then placed OpenGL point light sources at the particle positions, setting their power to the radiosity represented by the particle's power along the random walk path. He then used OpenGL to render the direct illumination due to these particles, integrating their contribution with an accumulation buffer.
Martin et al.16 computed a coarse-level hierarchical radiosity solution on the CPU, and used graphics
hardware to refine the solution by texture mapping
the residual.
In each of these cases, graphics hardware is used to
accelerate elements of the radiosity solution, but the
bulk of the processing occurs on the CPU. Our goal
is to port the bulk of the radiosity solution process to
the GPU, using the CPU for preprocessing.
2.2. Subsurface Scattering

Early work modeled light transport in translucent surfaces, and used a path tracing simulation to render skin and leaves. Pharr et al.18 extended these techniques into a general Monte-Carlo ray tracing framework. Jensen and Buhler13 used a dipole diffusion approximation for multiple scattering. These methods made clear the importance of subsurface scattering to the graphics community, and led some to consider additional approximations and accelerations.
Jensen et al.12 accelerate subsurface scattering using a hierarchical approach consisting of several
passes. Their first pass finds surface irradiances from
external light whereas the second pass transfers these
irradiances to the other surface patches. Their hierarchical approach uses an octree to volumetrically represent the scattered irradiances in the interior of the
translucent substance.
Hao et al.11 approximated subsurface scattering using only local illumination, by bleeding illumination
from neighboring vertices to approximate local back
scattering. They precompute a scattering term for
source lighting expressed in a piecewise linear basis.
They reconstructed scattering per-vertex from directional light by linearly interpolating these terms computed from the nearest surrounding samples. Their
technique was implemented on the CPU but achieved
real-time rendering speeds.
Sloan et al.21 uses a similar precomputation strategy, though using a spherical harmonic basis for
source lighting. They precomputed per-vertex a transfer matrix of spherical harmonic coefficients from
environment-mapped incoming radiance to per-vertex
exit radiance that includes the effects of intra-object
shadowing and interreflection in addition to subsurface scattering. They were able to compress these large
transfer matrices in order to implement the evaluation
of precomputed radiance transfer entirely on the GPU
to achieve real-time frame rates.
Lensch et al.15 approximated back scattering by filtering the incident illumination stored in a texture atlas. The shape of these kernels is surface dependent
and precomputed before lighting is applied. They also
approximated forward scattering by precomputing a
vertex-to-vertex throughput factor, which resembles a
form factor. Forward scattering is rendered by performing a step similar to radiosity gathering, by collecting for a given vertex the irradiance from the other
vertices scaled by the throughput factor.
While Lensch et al.15 benefited from some hardware acceleration, for example using a vertex shader to accumulate external irradiances, the application of the vertex-to-vertex throughput factors to estimate forward scattering and the atlas kernel filtering to estimate backward scattering are performed on the CPU.
3. Matrix Radiosity

The recent support for floating-point texture formats provides an obvious application to matrix problems. Matrix radiosity7 expresses the illumination of a closed diffuse environment as a linear system

$$B_i = E_i + \rho_i \sum_{j=1}^{N} F_{ij} B_j, \qquad (1)$$

where B_i, E_i and ρ_i are the radiosity, emittance and reflectance of patch i, and F_ij is the form factor between patches i and j. Collecting terms yields the system

$$\sum_{j=1}^{N} M_{ij} B_j = E_i, \qquad M_{ij} = \delta_{ij} - \rho_i F_{ij}, \qquad (2)$$

which we solve with Jacobi iterations

$$B_i^{(k+1)} = E_i - \sum_{j \neq i} \frac{M_{ij}}{M_{ii}}\, B_j^{(k)}. \qquad (3)$$
Figure 2: A simple radiosity scene solved and displayed completely on the GPU.
In order to smoothly interpolate the displayed radiosities, the 1-D radiosity texture B would need to be
resampled into a 2-D displayed radiosity texture. An
easy way to perform this resampling is to create an
atlas of the scene, such as the meshed atlas described
in Section 2.3, and render the texture image of the
atlas using 1-D texture coordinates corresponding to
each vertex's index in the radiosity vector.
As Figure 3 shows, our GPU Jacobi radiosity implementation takes about twice as many iterations to converge as does a Gauss-Seidel solution on the CPU. Moreover, each Jacobi iteration of our GPU solver (29.7 iterations/sec. on an NVIDIA GeForce FX 5900 Ultra) takes longer to run than a Gauss-Seidel iteration on the CPU (40 iterations/sec. on an AMD Athlon 2800+). This corresponds to 141 MFLOPS and 190 MFLOPS for the GPU and CPU, respectively (the floating-point operations for indexing on the GPU are not included in this number).
We found, however, that the CPU implementation
is limited by memory bandwidth, while the GPU implementation is only using a fraction (perhaps as little as 10%) of its available bandwidth. This difference
can be observed by comparing the super-linear CPU
curve to the linear GPU curve in Figure 4. We believe
the GPU curve overtakes the CPU curve for matrices larger than 2000 elements, but we were limited by a maximum single texture resolution of 2048. As computation speeds have historically increased faster than memory bandwidth, we also expect the GPU will readily outperform the CPU in the near future.

Figure 4: Though the CPU currently outperforms the GPU on this matrix radiosity example, the CPU is memory bandwidth bound and its performance degrades on larger matrices, whereas the GPU is compute bound and its performance scales linearly.

4. Subsurface Scattering

We base our subsurface scattering scheme on the method derived by Jensen et al.13. For translucent materials, the standard rendering equation using a BRDF generalizes to an integral over the surface using the BSSRDF S:

$$L_o(x_o, \vec\omega_o) = \int_A \int_\Omega S(x_i, \vec\omega_i; x_o, \vec\omega_o)\, L_i(x_i, \vec\omega_i)\, (\vec n_i \cdot \vec\omega_i)\, d\vec\omega_i\, dA(x_i). \qquad (5)$$

Integration is performed over the surface A at points x_i and all incident light directions. The term S is the BSSRDF, which relates the amount of outgoing radiance at point x_o in direction ω_o, given that there is incoming radiance at some other surface point x_i in direction ω_i. The dipole diffusion approximation of multiple scattering reduces this to a gather of transmitted irradiance E weighted by the diffuse reflectance kernel R_d:

$$B(x_o) = \int_A E(x_i)\, R_d(x_i, x_o)\, dA(x_i), \qquad (6)$$

$$L_o(x_o, \vec\omega_o) = \frac{F_t(\eta, \vec\omega_o)}{\pi}\, B(x_o), \qquad (7)$$

$$E(x_i) = \int_\Omega L_i(x_i, \vec\omega_i)\, F_t(\eta, \vec\omega_i)\, |\vec n_i \cdot \vec\omega_i|\, d\vec\omega_i. \qquad (8)$$
$$B_i = \sum_{j=1}^{N} F_{ij} E_j, \qquad (9)$$

$$F_{ij} = \frac{1}{A_i} \int_{x_i \in P_i} \int_{x_j \in P_j} R_d(x_j, x_i)\, dx_j\, dx_i. \qquad (10)$$

[Figure 5 diagram: Pass 1: Illuminate and Fresnel; Pass 2: Follow Links and Accumulate; Pass 3: Fresnel and Display.]

For a static model, we can precompute the F_ij factors between all pairs of patches. Using (9), the radiosity due to diffuse multiple scattering now reduces to a simple inner product for each patch, resulting in O(N²) operations to compute the incident scattered irradiance for all patches.
A simple way to reduce the number of interactions
is to turn to a clustering strategy like that used to
solve the N-body problem and hierarchical radiosity10 .
This is particularly applicable to subsurface scattering
since the amount of scattered radiance drops rapidly
as it travels further through the media. This implies
that patches that are far away from the patch whose
scattered radiosity we are computing may be clustered
together and be replaced by a single term to approximate the interaction.
4.2. A Three Pass GPU Method

We now formalize a solution to the diffuse subsurface scattering equation (9) as a three-pass GPU scheme (as shown in Fig. 5), preceded by a precomputation phase. By assuming that our geometry is not deforming, we are able to precompute all of our throughput factors F_ij between interacting surfaces.
Our first pass on the GPU computes the amount of light incident on every patch of our model, and scales this by the Fresnel term, storing the radiosity that each patch emits internal to the object. This forms our radiosity map.

Our second pass acts as a gathering phase evaluating equation (9). For every patch/texel, the transmitted radiosity is gathered, scaled by the precomputed throughput factors, and stored into a scattered irradiance map.
In our third and final pass we render our geometry to the screen using the standard OpenGL lighting model. The contribution from subsurface scattered
light is added in by applying the scattered irradiance
texture map to the surface of the object scaled by the
Fresnel term.
Figure 6: Pass one plots triangles using their texture coordinates (left), interpolating vertex colors set to their direct illumination scaled by a Fresnel transmission term. Pass two transfers these radiances (center) via precomputed scattering links implemented as dependent texture lookups into a MIP-map of the previous pass result (left). Pass three scales the resulting radiances by a Fresnel transmission factor and texture maps the result onto the original model.
over the surface hierarchy. Lastly, by using a parameterization scheme as a domain to store and compute our surface samples, the number of surface samples is independent of both the tessellation of our geometry and the resolution of the screen we are rendering to. This marks a difference from earlier interactive subsurface scattering approaches, in which vertex-to-vertex interactions were used to discretize the scattering equation.
4.2.1. Links

Full evaluation of equation (9) over patch-to-patch throughput factors would require P² interactions, where P is the number of patches (and also texels in our texture atlas). Each interaction forms a link. Since all of our patches are in texture space, we need only store a (u, v) location and a throughput factor for each link. By adding an LOD term to the link structure we can access any level in the MIP-map surface hierarchy. Based on our computation and space restrictions, we assign some maximum number of links L that we store for each patch P_b in the base level of our texture map.
For a non-adaptive approach we can choose some
level l in the hierarchy. We then build all links from
patches Pb to Pl where Pb are the patches at the lowest
level in the hierarchy, and Pl are patches at level l in
the hierarchy.
An adaptive top-down approach to constructing links can be done similarly to hierarchical radiosity. For every patch P_b in the lowest level of our hierarchy, we start by placing the root patch P_r of our hierarchy onto a priority queue (with highest throughput factor at the top). We then recursively remove the patch at the top of the queue, compute the throughput factors from P_b to its four children, and insert the four children into the priority queue. This process is repeated until the queue size grows to the threshold number of links desired.
Once L adaptive links for each patch/texel in our atlas have been created, we store the result into L texture maps that are √P_b × √P_b in size. We use an fp16 (16-bit floating point) texture format supported on the GeForceFX to pack the link information (u, v, F_ij, LOD) into the four color channels to be used during pass two of our rendering stage. To reduce storage in favor of more links, we opted to reduce our form factor to a single scalar. Form factors per color channel may be supported at a cost of space and bandwidth.
In the case of the non-adaptive approach, the link locations and LOD are the same for every patch P_b, so we store this information in constant registers on the graphics card during the second pass. The throughput factors, however, vary per link. We store them into L/4 texture maps where each texel holds four throughput factors (corresponding to 4 links) in its four color channels.
4.2.2. Pass 1: Radiosity Map
Given our face cluster hierarchy and MIP-mappable parameterization, we must first compute the E_j's for the patches in our scene. To do this, we start by computing a single incident irradiance for every texel in our texture atlas, thereby evaluating lighting incident on all sides of the model. We scale the result of the incident illumination by the Fresnel term, storing the result in the texture atlas. Each texel now holds the amount of irradiance that is transferred through the interface to the inside of the model. This step is similar to the illumination map used in Lensch et al.15.

To accomplish this efficiently on the GPU, we use the standard OpenGL lighting model in a vertex program. Using the OpenGL render-to-texture facility, we send our geometry down the graphics pipeline. The vertex program computes the lighting on the vertices scaled by the Fresnel term, placing it in the color channel, and swaps the texture coordinates for the vertices on output. The radiosity stored in the color channel is then linearly interpolated as the triangle is rendered into the texture atlas. Our method does not prevent the use of more advanced lighting models and techniques for evaluating the incident irradiance on the surface of the object.
As an alternative to computing our transmitted radiosity in a vertex program, we could perform the computation entirely in a fragment program, per texel. In addition to higher accuracy, this approach may improve performance for high-triangle-count models. A geometry image (e.g. every texel stores surface position) and a normal map may be precomputed for our object and stored as textures on the GPU. Rendering the radiosity map would then only entail sending a single quadrilateral down the graphics pipeline, texture mapped with the geometry image and normal map. The lighting model and Fresnel computation can take place directly in the fragment shader.
We use the automatic MIP-mapping feature available in recent OpenGL versions to compute the average radiosity at all levels of our surface hierarchy. The radiosity map is then used in the next pass of this process.
4.2.3. Pass 2: Scattered Irradiance Map
This pass involves evaluating equation (9) for every texel in our texture atlas. We render directly to the texture atlas by sending a single quadrilateral to the GPU. In a fragment program, for each texel, we traverse the L links stored during the precomputation phase.
Model     res.     fps     Pass 1   Pass 2   Pass 3
Head      512²     61.10    11%      82%       7%
Dolphin   512²     30.35    43%      43%      14%
Bunny     512²     30.33    37%      34%      28%
Head      1024²    15.40    13%      85%       2%
Dolphin   1024²    15.09     8%      85%       7%
Bunny     1024²    12.05    18%      68%      14%
These links are similar to the links used for hierarchical radiosity. Hierarchical radiosity would follow
References

1. D. R. Baum and J. M. Winget. Real time radiosity through parallel processing and hardware acceleration. Computer Graphics, 24(2):67-75, 1990. (Proc. Interactive 3D Graphics 1990).

2. J. Bolz, I. Farmer, E. Grinspun, and P. Schröder. Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. ACM Trans. on Graphics, 22(3):to appear, July 2003. (Proc. SIGGRAPH 2003).

6. M. F. Cohen and D. P. Greenberg. The hemicube: A radiosity solution for complex environments. Computer Graphics, 19(3):31-40, 1985. (Proc. SIGGRAPH 85).
Clustered Principal Components for Precomputed Radiance Transfer
Peter-Pike Sloan (Microsoft Corporation), Jesse Hall (University of Illinois), John Hart (University of Illinois), John Snyder (Microsoft Research)
Abstract
We compress storage and accelerate performance of precomputed radiance transfer (PRT), which captures the way an object shadows, scatters, and reflects light. PRT records a transfer matrix over many surface points. At run-time, this matrix transforms a vector of spherical harmonic coefficients representing distant, low-frequency source lighting into exiting radiance. Per-point transfer matrices form a high-dimensional surface signal that we compress using clustered principal component analysis (CPCA), which partitions many samples into fewer clusters, each approximating the signal as an affine subspace. CPCA thus reduces the high-dimensional transfer signal to a low-dimensional set of per-point weights on a per-cluster set of representative matrices. Rather than computing a weighted sum of representatives and applying this result to the lighting, we apply the representatives to the lighting per cluster (on the CPU) and weight these results per point (on the GPU). Since the output of the matrix is lower-dimensional than the matrix itself, this reduces computation. We
also increase the accuracy of encoded radiance functions with a
new least-squares optimal projection of spherical harmonics onto
the hemisphere. We describe an implementation on graphics
hardware that performs real-time rendering of glossy objects with
dynamic self-shadowing and interreflection without fixing the
view or light as in previous work. Our approach also allows
significantly increased lighting frequency when rendering diffuse
objects and includes subsurface scattering.
Keywords: Graphics Hardware, Illumination, Monte Carlo Techniques,
Rendering, Shadow Algorithms.
1. Introduction
Global illumination effects challenge real-time graphics, especially in area lighting environments that require integration over
many light source samples. We seek to illuminate an object from
a dynamic, low-frequency lighting environment in real time,
including shadowing, interreflection, subsurface scattering, and
complex (anisotropic) reflectance.
These effects can be measured as radiance passing through spherical shells about the surface point p in Figure 1. Source radiance
originates from an infinite sphere (environment map).
Transferred incident radiance passes through an infinitesimal hemisphere, and equals the source radiance decreased by self-shadowing and increased by interreflection. Exiting radiance passes outward through an infinitesimal hemisphere, and results from the BRDF times the transferred incident radiance, plus subsurface scattering.
The spherical harmonic (SH) basis provides a compact, alias-avoiding representation for functions of radiance over a sphere or hemisphere [Cabral et al. 1987][Sillion et al. 1991][Westin et al. 1992][Ramamoorthi and Hanrahan 2001]. Low-frequency source illumination can be represented accurately by small vectors (e.g. N=25) of SH coefficients.
Each point's signal is then encoded by its projection onto the cluster's PCA vectors. CPCA reduces not only signal storage (n′ rather than N² scalars per point) but also the run-time computation. Instead of multiplying an N×N transfer matrix by an N-dimensional light vector at each p, we precompute this multiplication in each cluster for each of its PCA vectors and accumulate weighted versions of the n′ resulting N-vectors. CPCA on diffuse transfer provides a similar savings in storage and computation.
We describe two technical contributions which may have wider
applicability. The first is a very general signal approximation
method using CPCA. Though used before in machine learning
applications [Kambhatla and Leen 1994][Kambhatla and Leen
1997][Tipping and Bishop 1999], it is new to computer graphics.
To increase spatial coherence, we augment the method by redistributing points to clusters according to an overdraw metric.
The second contribution is the use of the optimal least-squares
projection of the SH basis onto the hemisphere, which significantly reduces error compared to approaches used in the past
[Sloan et al. 2002][Westin et al. 1992].
2. Related Work
Various representations encapsulate precomputed or acquired
global illumination. Light fields [Gortler et al. 1996][Levoy and
Hanrahan 1996] record radiance samples as they pass through a
pair of viewing planes whereas surface light fields [Chen et al.
2002][Miller et al. 1998][Nishino et al. 1999][Wood et al. 2000]
record 4D exiting radiance sampled over an object's surface.
Both techniques support arbitrary views but fix lighting relative to
the object.
Precomputed radiance transfer (PRT) [Sloan et al. 2002] parameterizes transferred incident radiance in terms of low-frequency
source lighting, allowing changes to lighting as well as viewpoint. We build on PRT and its generalization to anisotropic
BRDFs [Kautz et al. 2002], but speed up performance and reduce
error in three ways: we record exiting radiance instead of transferred incident, use least-squares optimal projection of
hemispherical functions, and compress using CPCA. We also
extend PRT to include subsurface scattering. In parallel work,
Lehtinen and Kautz [2003] approximate PRT using PCA. Our
CPCA decoding reduces approximation error and maps well to the
GPU, resulting in 2-3 times better performance.
Other illumination precomputation methods also support dynamic
lighting. Matusik et al. [2002] handle limited, non-real-time
lighting change with a surface reflectance field measured over a
sparsely sampled directional light basis, stored on the visual hull
of an acquired object. Hakura et al. [2000] support real-time
lighting change with parameterized textures, but constrain viewing and lighting changes to a 2D subspace (e.g. a 1D circle of viewpoints × a 1D swing angle of a hanging light source). [Sloan et
al. 2002] compares PRT to many other precomputed approaches
for global illumination.
Precomputed illumination datasets are huge, motivating compression. Light fields were compressed using vector quantization
(VQ) and entropy coding [Levoy and Hanrahan 1996], and reflectance fields using block-based PCA [Matusik et al. 2002].
Surface light fields have been compressed with the DCT [Miller
et al. 1998], an eigenbasis (PCA) [Nishino et al. 1999], and
generalizations of VQ or PCA to irregular sampling patterns
[Wood et al. 2000]. Our CPCA compression strategy improves
[Wood et al. 2000] by hybridizing VQ and PCA in a way that
reduces error better than either by itself. Unlike [Chen et al.
2002] which compresses a 4D surface light field over each 1-ring
PRT in [Sloan et al. 2002] represents transferred incident radiance (Figure 1). It is derived from a Monte Carlo simulation
illuminating geometry by the SH basis functions. This decouples
the way an object shadows itself from its reflectance properties,
and allows different BRDFs to be substituted at run-time. Here
we seek to approximate the transfer signal to reduce computation.
To measure approximation error properly, we must know the
BRDF. For example, a smooth BRDF weights low-frequency
components of transferred radiance more than high-frequency
components.
To measure signal errors properly, we include BRDF scaling by encoding the exiting radiance transfer matrix at p, M_p. Its component M_{p,ij} represents the linear influence of source lighting basis function j on exiting radiance basis function i. It can be numerically integrated over light directions s and view directions v over the hemisphere H = {(x, y, z) | z ≥ 0 and x² + y² + z² = 1} via

$$M_{p,ij} = \int_{v \in H} \int_{s \in H} y_i(v)\, T_p\big(s, y_j(s)\big)\, B(v, s)\, s_z\, ds\, dv.$$
[Figure 2 plot: squared error (log scale) versus uncompressed storage cost (log scale), one curve per n′ = 0, 1, 2, 4, 8, 16.]

Figure 2: CPCA error analysis using static PCA. Each curve represents how squared error varies with various numbers of clusters (1, 2, 4, ..., 16k) using a given number of principal components in each cluster (n′ = 0, 1, 2, 4, 8, and 16). The signal approximated was a 25D shadowed diffuse transfer vector over a bird statue model from [Sloan et al. 2002] having 48668 sample points. 20 VQ iterations were used, followed by PCA in each cluster.
The corresponding matrix for the BRDF B is

$$B_{ij} = \int_{v \in H} \int_{s \in H} y_i(v)\, y_j(s)\, B(v, s)\, s_z\, ds\, dv.$$

Exit and transferred radiance at a surface point are actually functions over a hemisphere, not a sphere. For the SH basis, there is complete freedom in evaluating the function on the opposite hemisphere when projecting it to the SH basis. Transfer in [Sloan et al. 2002] and the formulas above in Section 3.1 implicitly zero the opposite hemisphere by integrating only over the hemisphere. Westin et al. [1992] used a reflection technique. It is also possible to use other bases such as Zernike polynomials lifted to the hemisphere [Koenderink et al. 1996].

Our approach uses the least-squares optimal projection of the SH basis onto the hemisphere described in the Appendix. The technique represents any SH-bandlimited spherical function restricted to the hemisphere without error. In contrast, zero-hemisphere projection incurs 35% worst-case and 20% average-case RMS error integrated over the hemisphere for all unit-power spherical signals formed by linear combinations of the 5th order SH basis. The odd reflection technique [Westin et al. 1992] is even worse. Beyond theoretical results, we also see visibly better accuracy on our experimental objects using optimal projection (see Figure 7).

Given a vector b which projects a hemispherical function into the SH basis by zeroing out the opposite hemisphere, the optimal hemispherical projection is simply A⁻¹b, where A is defined in the Appendix. Therefore, the optimally projected exiting radiance transfer matrix is given by

$$M_p = A^{-1} B\, A^{-1} R_p\, T_p. \qquad (1)$$
We approximate the signal at each point p as

$$x_p \approx x_0 + w_p^1 x_1 + w_p^2 x_2 + \cdots + w_p^{n'} x_{n'},$$

where the n′+1 n-vectors x_0, x_1, ..., x_{n′} are constant over the cluster and the n′ scalar weights w_p^1, w_p^2, ..., w_p^{n′} vary for each point p on the surface. To reduce signal dimensionality, n′ ≪ n. The vector x_0 is called the cluster mean, and the vectors x_i, i ≥ 1 are called the cluster PCA vectors. Together, the cluster's mean and PCA vectors are called its representative vectors.

CPCA (called VQPCA in [Kambhatla and Leen 1994][Kambhatla and Leen 1997] and "local PCA" or "piecewise PCA" in the machine learning literature under the general title of mixtures of linear subspaces) generalizes PCA (single cluster, n′ > 0) and VQ (many clusters, n′ = 0). VQ approximates a signal as piecewise constant while PCA assumes it is globally linear. CPCA exploits the local linearity of our radiance transfer signal by breaking it down into clusters, approximating each with a separate affine subspace.

The simplest CPCA method is to first cluster the points using VQ, and then compute a PCA fit in each of the resulting clusters [Kambhatla and Leen 1994].
VQ Clustering. The LBG algorithm [Linde et al. 1980] performs the initial clustering. Given a desired number of clusters, the algorithm starts with clusters generated by random points from the signal and then classifies each point into the cluster having minimum distance to its representative. Each cluster representative is then updated to the mean of all its points, and the algorithm is iterated until no points are switched or an iteration count is reached.
Per-Cluster PCA. We first compute the cluster mean, x_0. We then compute an m_k × n matrix of residuals after subtracting the mean, C = [x_{p_1} - x_0, x_{p_2} - x_0, ..., x_{p_{m_k}} - x_0]^T, where m_k is the number of points in cluster k. Computing an SVD yields C = U D V^T, where U and V^T are rotation matrices and D is a diagonal matrix whose elements are sorted in decreasing order. The first n′ rows of V^T (columns of V) are the cluster PCA vectors. A point p_j's projection weights (the n′-vector w_{p_j}) are given by the first n′ columns of row j of UD (they are also given simply by the dot product of x_{p_j} - x_0 with each of the PCA vectors). This provides a least-squares optimal linear approximation of C from combinations of n′ fixed vectors. Total squared error over all cluster points is given by

$$\sum_{j=1}^{m_k} \|x_{p_j} - \tilde{x}_{p_j}\|^2 = \sum_{i=n'+1}^{n} D_i^2 = \sum_{j=1}^{m_k} \|x_{p_j} - x_0\|^2 - \sum_{i=1}^{n'} D_i^2.$$
The squared error at an individual point is

$$\|x_p - \tilde{x}_p\|^2 = \|x_p - x_0\|^2 - \sum_{i=1}^{n'} \big((x_p - x_0) \cdot x_i\big)^2.$$

[Figure plot: squared error (log scale), comparing iterative PCA against iterative/adaptive PCA.]
Figure 5: Overdraw reduction using cluster coherence optimization on 256 clusters of a 625D glossy transfer signal with n′=8 and a reclassification error budget of 10% of the original error. Mean overdraw: (a) original, 2.03; (b) reclassification, 1.79; (c) reclassification + superclustering, 1.60. Triangle color indicates overdraw: red = 3, yellow = 2, and green = 1.
Since n′ < N, this reduces computation and makes the vertex data small enough for vertex shaders on current graphics cards. For example, for N=25 and n′=5, we save more than a factor of 4. Finally, we evaluate the exiting radiance function at v_p by dotting the vector y(v_p) with e_p. We evaluate y at v_p using a texture map in the same way as [Kautz et al. 2002] evaluated y^T(v_p) B, but we can now perform this evaluation and dot product in a pixel shader.
Diffuse surfaces simplify the computation, but CPCA achieves a similar reduction. In this case, t_p · l computes shading, where t_p is an N-dimensional transfer vector and l is the lighting's SH projection as before. Using CPCA, we encode t_p as an affine combination of per-cluster representatives and precompute in each cluster the dot product of the light with these vectors. The final shade is a weighted combination of n′+1 scalar values t_i · l which are constant over a cluster, via

$$t_p \cdot l \approx (t_0 \cdot l) + w_p^1 (t_1 \cdot l) + \cdots + w_p^{n'} (t_{n'} \cdot l).$$
DrawCluster sends the cluster's geometry to the GPU and runs a vertex shader computing the linear combination of the w_p^i with the per-cluster constants. If p lies outside the cluster, the w_p^i are set to zero so that blending vertices from other clusters does not affect the result. In other words, we blend using a linear partition of unity.

7. Results

Figure 10 compares rendering quality of various transfer encodings on an example bird model with a glossy anisotropic BRDF. We experimentally picked a number of clusters for VQ (n′=0) and a number of representative vectors for pure PCA (mc=1) such that rendering performance matched that from CPCA with n′=8, mc=256. For CPCA, we used iterative PCA encoding from Section 4.2. We applied superclustering (Section 4.3) to both VQ and CPCA to the extent permitted by hardware register limits (it is unnecessary for pure PCA since there is only one cluster). Example images, encoding error, and rendering rates appear in the figure for all three methods as well as the uncompressed original. Methods used before in computer graphics [Nishino et al. 1999][Wood et al. 2000] perform poorly: pure PCA is smooth but has high error; VQ reduces error but has obvious cluster artifacts. Our CPCA result (third column) is very faithful to the uncompressed image on the far right.

Figure 11 shows the effect on encoding accuracy of varying the per-cluster number of representative vectors (n′). The two rows show results on two models, one smooth (bird, bottom) and one bumpier (Buddha, top). Each column corresponds to a different n′. The signal encoded represents glossy transfer for an anisotropic BRDF.

Figure 6: Higher-order lighting for diffuse transfer (simple two-polygon scene). Panels, left to right: Order 10, static, n′=1 (SE=15.293); Order 10, iter, n′=1 (SE=8.83); Order 10, iter, n′=2 (SE=2.23); Order 10, iter, n′=4 (SE=0.432); Order 5, uncompressed. The left four columns show CPCA-encoded results for 10th-order lighting (N=100) using various numbers of representatives (n′) and mc=64. The rightmost column shows uncompressed 5th-order lighting (N=25) used in [Sloan et al. 2002]. Note how shadows are sharpened at higher order and how iterative PCA adapts cluster shapes to the transfer signal better than static PCA (leftmost two columns). CPCA with n′=4 provides an accurate approximation that can be rendered in real time.
9.
9.1
Let f(s) be a function over the hemisphere s = (x, y, z), s ∈ H. We approximate f as a linear combination of SH basis functions y_i(s) restricted to H, where these basis functions are no longer orthogonal. So we seek

$$f(s) \approx \sum_i c_i\, y_i(s)$$

such that this approximation has minimum squared error over H. We call this vector c the least-squares optimal hemispherical projection of f.

To derive the coefficients c_i of this projection, we minimize the squared error

$$E = \int_H \Big(f(s) - \sum_i c_i\, y_i(s)\Big)^2 ds.$$

This is an unconstrained minimization problem with the c_i forming the degrees of freedom. So we take ∂E/∂c_k and set it to 0:

$$\partial E/\partial c_k = \int_H 2\Big(f(s) - \sum_i c_i\, y_i(s)\Big)\, y_k(s)\, ds = 0,$$

$$\sum_i c_i \int_H y_i(s)\, y_k(s)\, ds = \int_H f(s)\, y_k(s)\, ds.$$

This reduces to Ac = b, or c = A⁻¹b, where A is the symmetric matrix

$$A_{ik} = \int_H y_i(s)\, y_k(s)\, ds$$

and b is the vector of integrals over the hemisphere of f(s) multiplied by the SH basis functions

$$b_k = \int_H f(s)\, y_k(s)\, ds.$$

Alternatively, b can be thought of as the standard SH projection of a spherical extension of f which returns 0 when evaluated on the other half of the sphere, called the zero-hemisphere hemispherical projection. Note that A can be inverted once regardless of the function f(s). Note also that A is the identity matrix when integrating over the entire sphere.

Readers familiar with biorthogonal bases used for wavelets will find this familiar; y(s) is the primal basis and A⁻¹y(s) forms its dual basis.
For 5th order SH projection (25 basis functions), the matrix A is nearly singular: its smallest singular value is 6.59×10⁻⁶ whereas its largest singular value is 1 (for comparison, the second smallest singular value is 3.10×10⁻⁴). We can therefore discard one of the SH basis functions, since at least one is very well approximated as a linear combination of the others when restricted to a single hemisphere. A simple analysis shows that discarding the l=1, m=0 SH basis function (i.e., the SH basis function that is linear in z) has the smallest squared error, 1.48×10⁻⁵, when approximated as a linear combination of the other basis functions.
9.2
We first compare the difference between the zero-hemisphere and least-squares optimal projections. The integral ∫_H (Σ_i c_i y_i(s))² ds of the squared value of an approximated function specified by its least-squares optimal coefficient vector c is given by c^T A c. If, as before, b is the zero-hemisphere hemispherical projection of f, then c = A⁻¹b is the optimal least-squares hemispherical projection of f. The squared difference between these two projections integrated over H is

$$E_1 = (c - b)^T A\, (c - b) = c^T \big[(A - I)^T A\, (A - I)\big]\, c = c^T Q_1\, c,$$

where I is the identity matrix. E_1 attains a maximum value of 0.125 and an average value of 0.0402 over all signals formed by linear combinations of up to 5th order SH basis functions having unit squared integral over the sphere; i.e., over all unit-length 25D vectors c. Worst- and average-case errors are derived as the largest and average singular value of the symmetric matrix Q_1. These are large differences as a fraction of the original unit-length signal; using the RMS norm enlarges them still more via a square root. Optimal projection represents any element of this function space without error.
[Figure 7 panels: (a) zero-hemisphere, 16×25; (b) optimal least-squares, 16×25; (c) original signal, 25×25 (zero-hemisphere from [Sloan et al. 2002]).]

Figure 7: Projection comparison for glossy transfer matrices. Note the increased fidelity of the optimal least-squares projection (b) compared to zero-hemisphere (a), especially at the silhouettes (blue and red from colored light sources) where the Fresnel factor in the BRDF has high energy. Essentially, using optimal least-squares matches the accuracy of a 25×25 matrix from [Sloan et al. 2002] via a 16×25 matrix (compare b and c).

Another way of restricting the SH basis to the hemisphere ([Westin et al. 1992]) is to reflect f's value about z to form a function defined over the whole sphere, via
$$f_{odd}(x, y, z) = \begin{cases} f(x, y, z), & \text{if } z \ge 0 \\ -f(x, y, -z), & \text{otherwise.} \end{cases}$$
For 7th order SH basis functions, 28 are even and thus produce nonzero coefficients. An error measure for even reflection is identical to E_2 except that its diagonal matrix D* scales the even basis functions by 2 and zeroes the odd. This projection provides worst-case error over all unit-length signals c of 0.022 and average-case error of 0.0036; still significant but far less than either the zero-hemisphere or odd reflection projections. Interestingly, even reflection using a smaller 5th order basis with only 15 relevant basis functions provides 0.193 worst-case and 0.030 average-case error, a better average-case error than zero-hemisphere projection with many fewer coefficients.
8 - 28
Uncompressed
45.7Hz, SE=101304
41.8Hz, SE=14799
45.5Hz, SE=294.4
3.7Hz, SE=0
Figure 10: VQ vs. PCA vs. CPCA quality results for matched rendering performance. The transfer signal encoded was a 2425 (600D) glossy transfer
matrix for an anisotropic BRDF. CPCA achieves much better visual and quantitative accuracy than VQ and pure PCA. Rendering frame rates and error
measurements are listed below each of the four columns. CPCA was encoded using the iterative method of Section 4.2.
[Figure 11 panels, columns CPCA n′=2, n′=4, n′=8, n′=12, uncompressed. Bird row: 58.9 Hz, SE=9510.75; 57.1 Hz, SE=2353.09; 45.5 Hz, SE=294.421; 31.9 Hz, SE=66.7495; 3.7 Hz, SE=0. Buddha row (partial): 36.4 Hz, SE=21077.5; 24.2 Hz, SE=8524.1; 3.3 Hz, SE=0.]

Figure 11: Varying the number of representatives per cluster (n′). Signal is glossy 24×25 transfer with anisotropic BRDF, mc=256.