Computer Game Development / Design: Edited by Wolfgang Engel
Exploring recent developments in the rapidly evolving field of real-time rendering, GPU Pro 6: Advanced
Rendering Techniques assembles a high-quality collection of cutting-edge techniques for advanced graphics
processing unit (GPU) programming. It incorporates contributions from more than 45 experts who cover the latest
developments in graphics programming for games and movies.
The book covers advanced rendering techniques that run on the DirectX or OpenGL runtimes, as well as on any
other runtime with any language available. It details the specific challenges involved in creating games across
the most common consumer software platforms such as PCs, video consoles, and mobile devices.
The book includes coverage of geometry manipulation, rendering techniques, handheld devices programming,
effects in image space, shadows, 3D engine design, graphics-related tools, and environmental effects. It also
includes a dedicated section on general purpose GPU programming that covers CUDA, DirectCompute, and
OpenCL examples.
In color throughout, GPU Pro 6 presents ready-to-use ideas and procedures that can help solve many of your daily
graphics programming challenges. Example programs with downloadable source code are also provided on the
book’s CRC Press web page.
ISBN: 978-1-4822-6461-6
GPU Pro 6
Advanced Rendering Techniques
Edited by Wolfgang Engel
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but
the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to
trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained.
If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical,
or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without
written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright
Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a
variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to
infringe.
Contents
Acknowledgments xv
I Geometry Manipulation 1
Wolfgang Engel
II Rendering 63
Christopher Oat
IV Shadows 295
Wolfgang Engel
VI Compute 433
Carsten Dachsbacher
Index 555
Acknowledgments
The GPU Pro: Advanced Rendering Techniques book series covers ready-to-use
ideas and procedures that can help to solve many of your daily graphics program-
ming challenges.
The sixth book in the series wouldn’t have been possible without the help of
many people. First, I would like to thank the section editors for the fantastic job
they did. The work of Wessam Bahnassi, Marius Bjørge, Carsten Dachsbacher,
Michal Valient, and Christopher Oat ensured that the quality of the series meets
the expectations of our readers.
The great cover screenshots were contributed by Ubisoft. They show the game
Assassin’s Creed IV: Black Flag.
The team at CRC Press made the whole project happen. I want to thank
Rick Adams, Charlotte Byrnes, Kari Budyk, and the entire production team,
who took the articles and made them into a book.
Special thanks goes out to our families and friends, who spent many evenings
and weekends without us during the long book production cycle.
I hope you have as much fun reading the book as we had creating it.
—Wolfgang Engel
P.S. Plans for an upcoming GPU Pro 7 are already in progress. Any comments,
proposals, or suggestions are highly welcome (wolfgang.engel@gmail.com).
Web Materials
Example programs and source code to accompany some of the chapters are avail-
able on the CRC Press website: go to http://www.crcpress.com/product/isbn/
9781482264616 and click on the “Downloads” tab.
The directory structure closely follows the book structure by using the chapter
numbers as the name of the subdirectory.
Updates
Updates of the example programs will be posted on the website.
I
Geometry Manipulation
The “Geometry Manipulation” section of the book focuses on the ability of graph-
ics processing units (GPUs) to process and generate geometry in exciting ways.
The first article in this section, “Dynamic GPU Terrain” by David Pangerl,
presents a GPU-based algorithm to dynamically modify terrain topology and
synchronize the changes with a physics simulation.
The next article, “Bandwidth-Efficient Procedural Meshes in the GPU via
Tessellation” by Gustavo Bastos Nunes and João Lucas Guberman Raza, covers
the procedural generation of highly detailed meshes with the help of the hardware
tessellator while integrating a geomorphic-enabled level-of-detail (LOD) scheme.
The third article in this section is “Real-Time Deformation of Subdivision
Surfaces on Object Collisions” by Henry Schäfer, Matthias Nießner, Benjamin
Keinert, and Marc Stamminger. It shows how to mimic residuals such as scratches
or impacts with soft materials like snow or sand by enabling automated fine-scale
surface deformations resulting from object collisions. This is achieved by using
dynamic displacement maps on the GPU.
The fourth and last article in this section, “Realistic Volumetric Explosions
in Games” by Alex Dunn, covers a single-pass volumetric explosion effect with
the help of ray marching, sphere tracing, and the hardware tessellation pipeline
to generate a volumetric sphere.
—Wolfgang Engel
1
I
Dynamic GPU Terrain
David Pangerl
1.1 Introduction
Rendering terrain is crucial for any outdoor scene. However, it can be a hard task
to efficiently render a highly detailed terrain in real time owing to huge amounts
of data and the complex data segmentation it requires. Another universe of com-
plexity arises if we need to dynamically modify terrain topology and synchronize
it with physics simulation. (See Figure 1.1.)
This article presents a new high-performance algorithm for real-time terrain
rendering. Additionally, it presents a novel idea for GPU-based terrain modifica-
tion and dynamics synchronization.
Figure 1.1. Dynamic terrain simulation in action with max (0.1 m) resolution rendered
with 81,000 tris in two batches.
1.2 Overview
The basic goal behind the rendering technique is to create a render-friendly mesh
whose topology can smoothly handle lowering the resolution with distance while
requiring minimal render calls.
1.4 Rendering
Rendering terrain was one of the most important parts of the algorithm develop-
ment. We needed a technique that would require as few batches as possible with
as little offscreen mesh draw as possible.
We ended up with a novel technique that would render the whole terrain in
three or fewer batches for a field of view less than 180 degrees and in five batches
for a 360-degree field of view.
This technique is also very flexible and adjustable for various fields of view
and game scenarios.
1.4.1 Algorithm
It all starts with the render mesh topology and vertex attributes. The render mesh
is designed to move discretely on a per-level resolution grid with the camera field
of view, so that most of the mesh detail is right in front of the camera. The GPU
then transforms the render mesh with the terrain height data.
Figure 1.2. The two neighboring levels showing the intersection and geomorphing attributes.
Render mesh topology. As mentioned before, the terrain mesh topology is the
most important part of the algorithm.
Terrain render mesh topology is defined by quad resolution R, level size S,
level count L, and center mesh level count Lc :
• R, the quad resolution, is the edge width of the lowest level (Level 0) and
defines the tessellation when close to the terrain.
• S, the level size, defines the number of edge quads. Level 0 is a square made
of S × S quads, each of size R × R.
• L, the level count, is the number of resolution levels.
• Lc, the center mesh level count, is the number of levels (from 0 to Lc) used
for the center mesh.
Each resolution level R is doubled, which quadruples the level area size. Levels
above 0 have cut out the part of the intersection with lower levels except the
innermost quad edge, where level quads overlap by one tile to enable smooth
geomorphing transition and per-level snap movement. (See Figure 1.2.)
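To make these parameters concrete, here is a minimal sketch (illustrative helper functions, not the author's code) of how the quad edge length and level width follow from R and S:

// Hypothetical helpers illustrating the topology parameters above.
float QuadEdgeAtLevel(float R, int level)
{
    // The quad edge length doubles with every level.
    return R * (1 << level);
}

float LevelWidth(float R, int S, int level)
{
    // A level spans S quads of that level's quad size.
    return QuadEdgeAtLevel(R, level) * S;
}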
Figure 1.3. A blue center mesh (Mesh 0); green top side mesh (Mesh 1); and white
left, bottom, and right side meshes (Mesh 2, Mesh 3, and Mesh 4, respectively). On
the intersection of Mesh 0 and Mesh 1, the mesh tri overlap is visible. It is also very
important that all mesh rectangles are cut into triangles in the same way (which is why
we cannot use the same mesh for Mesh 0 and Mesh 1).
All vertices have a level index encoded in the vertex color G channel. The
vertex color channels R and B are used to flag geomorphing X and Z blending
factors.
With this method, we get a large tri-count mesh that would, if rendered,
have most of the triangles out of the rendering view. To minimize the number
of offscreen triangles, we split the render mesh into five parts: the center mesh
with Lc levels (Mesh 0) and four-sided meshes with levels from Lc + 1 to L
(Mesh 1–Mesh 4).
The center mesh is always visible, whereas side meshes are tested for visibility
before rendering.
With this optimization we gain two additional render batches; however, the
number of rendered triangles is reduced by 76% when the field of view is less
than 180 degrees. (See Figure 1.3.)
For low field of view angles (60 degrees or less), we could optimize it further
by creating more side meshes. For example, if we set Lc to 2 and create eight side
meshes, we would reduce the render load by an additional 55% (we would render
Choosing terrain parameters. Render mesh topology parameters play a very im-
portant role in performance, so they should be chosen according to each project’s
requirements.
Consider a project where we need a landfill with a rather detailed modification
resolution, but neither a big rendering size (∼200 × 200 m) nor a long view
distance (∼500 m).
And now a bit of mathematics to get the render mesh numbers:

• View extent (how far will the terrain be visible?): V = (R × S × 2^(L−1)) / 2.
Because we had lots of scenarios where the camera was looking down on the
terrain from above, we used a reasonably high center mesh level count (Lc = 4),
which allowed us to render the terrain in many cases in a single batch (when we
were rendering the center mesh only).
We ended up with a quad resolution R = 0.1 m, a level size S = 100, a level
count L = 8, and a center mesh level count Lc = 4. We used a 2048 × 2048 texture
for the terrain data. With these settings we got a 10 cm resolution, a render view
extent of ∼1 km, and a full tri count of 127,744 triangles. Because we used a field
of view of 65 degrees, we only rendered ∼81,000 triangles in three batches.
As mentioned previously, these parameters must be correctly chosen to suit
the nature of the application. (See Figures 1.4 and 1.5.)
Figure 1.5. Wire frame showing different levels of terrain mesh detail.
The following level shift is used to skip resolution levels that are too small for
the camera at ground height:
int shift = (int)floor(log(1 + cameragroundheight / 5));

float snapvalue = Q;
float snapmax = 2 * snapvalue;

// snap the render mesh position to the coarsest level grid
possnap0.x = floor(camerapos.x / snapmax + 0.01f) * snapmax;
possnap0.z = floor(camerapos.z / snapmax + 0.01f) * snapmax;

float levelsnap = snapvalue;

TTerrainRendererParams[0].z = possnap0.x - camerapos.x;
TTerrainRendererParams[0].w = possnap0.z - camerapos.z;

// per-level snap (inside a loop over the resolution levels, indexed by a;
// l is the snap distance of the current level)
TVector lsnap;
lsnap.x = floor(possnap0.x / l + 0.01f) * l;
lsnap.z = floor(possnap0.z / l + 0.01f) * l;

TTerrainRendererParams[a].x = lsnap.x - possnap0.x;
TTerrainRendererParams[a].y = lsnap.z - possnap0.z;
TTerrainRendererParams[a].z = lsnap.x - camerapos.x;
TTerrainRendererParams[a].w = lsnap.z - camerapos.z;
}
Vertex shader. All other terrain-rendering algorithm calculations are done in the
vertex shader:
float4 pos0 = TTerrainRendererParams[16];
float4 siz0 = TTerrainRendererParams[17];
//
float4 posWS = input.pos;
//
int level = input.tex1.g;
posWS.xz += TTerrainRendererParams[level].xy;
//
int xmid = input.tex1.r;
int zmid = input.tex1.b;
float geomorph = input.tex1.a;
//
float levelsize = input.tex2.x;
float levelsize2 = input.tex2.y;
//
output.color0 = 1;
//
float4 posterrain = posWS;
//
posterrain = (posterrain - pos0) / siz0;
//
output.tex0.xy = posterrain.xz;
//
float4 geo0 = posWS;
float4 geox = posWS;
float4 geo1 = posWS;
//
geox = (geox - pos0) / siz0;

////////////////////////////////////
// output center geo as tex0
////////////////////////////////////
output.tex0.xy = geox.xz;

////////////////////////////////////
// sample center height
////////////////////////////////////
float heix = tex2Dlod(User7SamplerClamp, float4(geox.x, geox.z, 0, 0)).r;
//
heix = heix * siz0.y + pos0.y;

////////////////////////////////////
// geomorphing
////////////////////////////////////
if (geomorph > 0)
{
   float geosnap = levelsize;
   //
   if (xmid)
   {
      geo0.x -= geosnap;
      geo1.x += geosnap;
   }
   //
   if (zmid)
   {
      geo0.z -= geosnap;
      geo1.z += geosnap;
   }
   //
   geo0 = (geo0 - pos0) / siz0;
   geo1 = (geo1 - pos0) / siz0;
   //
   float hei0 = tex2Dlod(User7SamplerClamp, float4(geo0.x, geo0.z, 0, 0)).r;
   float hei1 = tex2Dlod(User7SamplerClamp, float4(geo1.x, geo1.z, 0, 0)).r;
   // geomorph
   float heigeo = (hei0 + hei1) * 0.5 * siz0.y + pos0.y;
   //
   posWS.y = lerp(heix, heigeo, geomorph);
}
else
{
   posWS.y = heix;
}
//
posWS.w = 1;
output.pos = mul(posWS, TFinalMatrix);
Figure 1.6. A sample of a static render mesh for a dynamic terrain on a small area.
Figure 1.7. Modifications (red rectangle) in the large main terrain texture (blue rect-
angle) are done in a very small area.
1.5.2 Plow
A plow modification shader is the most complex terrain modification that we do.
The idea is to displace the volume moved by the plow in front of the plow while
simulating the compression, terrain displacement, and volume preservation.
We use the texture query to measure how much volume the plow would remove
(the volume displaced from the last plow location). Then we use the special plow
distribution mask and add the displaced volume in front of the plow.
Finally, the erosion simulation creates a nice terrain shape.
1.5.3 Erosion
Erosion is the most important terrain modification. It is performed for a few
seconds everywhere a modification is done to smooth the terrain and apply a
more natural look.
Erosion is a simple function that sums the height differences between the target
pixel and its neighboring pixels, performs a height adjustment according to the
pixel's flowability parameter, and adds a bit of randomization for a natural look.
Unfortunately, we have not yet found a way to link the erosion simulation
with the volume preservation.
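As a rough illustration of that description, a hedged sketch of a single erosion step might look as follows (the resource names, neighbor weighting, and randomization scale are assumptions, not the author's modification shader):

Texture2D<float> g_height      : register(t0);  // terrain height data
Texture2D<float> g_flowability : register(t1);  // per-texel flowability
SamplerState     g_pointClamp  : register(s0);

float ErodeTexel(float2 uv, float2 texelSize, float rnd)
{
   float h = g_height.SampleLevel(g_pointClamp, uv, 0);

   // sum the height differences to the four direct neighbors
   float sum = 0;
   sum += g_height.SampleLevel(g_pointClamp, uv + float2( texelSize.x, 0), 0) - h;
   sum += g_height.SampleLevel(g_pointClamp, uv - float2( texelSize.x, 0), 0) - h;
   sum += g_height.SampleLevel(g_pointClamp, uv + float2(0,  texelSize.y), 0) - h;
   sum += g_height.SampleLevel(g_pointClamp, uv - float2(0,  texelSize.y), 0) - h;

   // adjust according to flowability and add a small randomization
   float flow = g_flowability.SampleLevel(g_pointClamp, uv, 0);
   return h + sum * 0.25 * flow + (rnd - 0.5) * 0.001 * flow;
}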
1.5.4 Wheels
Wheel modification is a simulation of a cylindrical shape moving over a terrain.
It uses a terrain data compression factor to prevent oversinking and to create a
wheel side supplant.
We tried to link this parameter with the terrain data flowability parameter
(to reduce the texture data), but it led to many problems related to the erosion
effect because it also changes the flowability value.
1.7 Problems
1.7.1 Normals on Cliffs
Normals are calculated per pixel with the original data and with a fixed offset
(position offset to calculate slope). This gives a very detailed visual terrain shape
even from a distance, where vertex detail is very low. (See Figure 1.8.)
The problem occurs where flowability is very low and the terrain forms cliffs.
What happens is that the triangle topology is very different between high and low
details, and normals, which are calculated from the high-detailed mesh, appear
detached. (See Figure 1.9.)
One way of mitigating this would be to adjust normal calculation offset with
the edge size, where flowability is low, but with this we could lose other normal
details.
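For reference, a minimal sketch of a per-pixel normal computed from the height data with a fixed offset, as described at the start of this section (resource names and parameters are assumptions):

Texture2D<float> g_heightData   : register(t0);
SamplerState     g_linearClamp  : register(s0);

float3 TerrainNormal(float2 uv, float2 texelSize, float offsetTexels, float worldStep)
{
   float2 o = texelSize * offsetTexels;  // fixed offset used to estimate the slope
   float hL = g_heightData.SampleLevel(g_linearClamp, uv - float2(o.x, 0), 0);
   float hR = g_heightData.SampleLevel(g_linearClamp, uv + float2(o.x, 0), 0);
   float hD = g_heightData.SampleLevel(g_linearClamp, uv - float2(0, o.y), 0);
   float hU = g_heightData.SampleLevel(g_linearClamp, uv + float2(0, o.y), 0);

   // central differences over a distance of 2 * worldStep in each direction;
   // a larger offset would smooth cliffs but lose other normal detail
   return normalize(float3(hL - hR, 2.0 * worldStep, hD - hU));
}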
Figure 1.8. Normal on cliffs problem from up close. High-detail topology and normals
are the same, and this result is a perfect match.
Figure 1.9. Normal on cliffs problem from a distance. Low-detail topology (clearly
visible in the bottom wire frame image) and per-pixel normals are not the same.
1.8 Conclusion
1.8.1 Future Work
At the moment, the algorithm described here uses a single texture for the whole
terrain and as such is limited in either extent or resolution. By adding a
texture pyramid for coarser terrain detail levels, we could efficiently increase the
render extent without sacrificing detail.
Mesh 0 and Mesh 2 (as well as Mesh 1 and Mesh 2) are theoretically the same,
so we could reuse them to optimize their memory requirements.
Only one level quad edge makes a noticeable transition to a higher level (a
lower level of detail) at a close distance. By adding more overlapping quad edges
on lower levels, we would be able to reduce the effect and make smoother geo-
morphing.
Currently, we have not yet found a way to maintain the terrain volume, so the
simulation can go into very strange places (e.g., magically increasing volume).
Because we already download the changed parts for collision synchronization,
we could also use this data to calculate the volume change and adjust the
simulation accordingly.
1.8.2 Summary
This paper presents a novel algorithm for terrain rendering and manipulation on
a GPU.
In Section 1.4, “Rendering,” we showed in detail how to create and efficiently
render a very detailed terrain in two or three render batches.
In Section 1.5, “Dynamic Modification,” we demonstrated how the terrain can
be modified in real time and be synchronized with the CPU base collision.
Figure 1.10 provides an example of the algorithm at work.
2
I
Bandwidth-Efficient Procedural Meshes in the GPU via Tessellation
Gustavo Bastos Nunes and João Lucas Guberman Raza
2.1 Introduction
Memory bandwidth is still a major bottleneck in current off-the-shelf graphics
pipelines. To address that, one of the common mechanisms is to replace bus con-
sumption for arithmetic logic unit (ALU) instructions in the GPU. For example,
procedural textures on the GPU mitigate this limitation because there is little
overhead in the communication between CPU and GPU. With the inception of
DirectX 11 and OpenGL 4 tessellator stage, we are now capable of expanding pro-
cedural scenarios into a new one: procedural meshes in the GPU via parametric
equations, whose analysis and implementation is the aim of this article.
By leveraging the tessellator stage for generating procedural meshes, one is
capable of constructing a highly detailed set of meshes with almost no overhead
in the CPU to GPU bus. As a consequence, this allows numerous scenarios such
as constructing planets, particles, terrain, and any other object one is capable of
parameterizing. As a side effect of the topology of how the tessellator works with
dynamic meshes, one can also integrate the procedural mesh with a geomorphic-
enabled level-of-detail (LOD) schema, further optimizing their shader instruction
set.
a function that may take one or more parameters. For 3D space, the mathemati-
cal function in this article shall be referenced as a parametric equation of g(u, v),
where u and v are in the [0, 1] range. There are mechanisms other than paramet-
ric surface equations, such as implicit functions, that may be used to generate
procedural meshes. However, implicit functions don't map well to tessellator use,
because their results only indicate whether a point is inside or outside a surface's
mesh; they are better suited to the geometry shader stage via the marching cubes algorithm [Tatarchuk
et al. 07]. Performance-wise, the geometry shader, unlike the tessellator, was not
designed to have a massive throughput of primitives.
Although the tessellator stage is performant for generating triangle primitives,
it contains a limit on the maximum number of triangle primitives it can generate.
As of D3D11, that number is 8192 per patch. For some scenarios, such as simple
procedural meshes like spheres, that number may be sufficient. However, to
circumvent this restriction so one may be able to have an arbitrary number of
triangles in the procedural mesh, the GPU must construct a patch grid. This is
for scenarios such as terrains and planets, which require a high poly count. Each
patch in the grid refers to a range of values within the [0, 1] domain, used as a
source for u and v function parameters. Those ranges dissect the surface area of
values into adjacent subareas. Hence, each one of those subareas that the patches
define serve as a set of triangles that the tessellator produces, which themselves
are a subset of geometry from the whole procedural mesh.
To calculate the patch range p, we utilize the following equation:

p = 1 / √α,
where α is the number of patches leveraged by the GPU. Because each patch
comprises a square area range, p may then serve as both the u and the v
range for each produced patch. The CPU must then send to the GPU, for each
patch, a collection of metadata, which is the patches u range, referenced in this
article as [pumin , pumax ], and the patches v range, referenced in this article as
[pvmin , pvmax ]. Because the tessellator will construct the entire geometry of the
mesh procedurally, there’s no need to send geometry data to the GPU other than
the patch metadata previously described. Hence, this article proposes to leverage
the point primitive topology as the mechanism to send metadata to the GPU,
because it is the most bandwidth-efficient primitive topology due to its small
memory footprint. Once the metadata is sent to the GPU, the next step is to set
the tessellation factors in the hull shader.
then set the tessellation factor per domain edge as well as the primitive’s interior.
The tessellation factor determines the number of triangle primitives that are
generated per patch. The higher the tessellation factor set in the hull shader for
each patch, the higher the number of triangle primitives constructed. The hull
shader’s requirement for this article is to produce a pool of triangle primitives,
which the tessellator shader then leverages to construct the mesh’s geometry
procedurally. Hence, the required tessellation factor must be set uniformly to
each patch edges and interior factors, as exemplified in the code below:
HS_CONSTANT_DATA_OUTPUT BezierConstantHS(
    InputPatch<VS_CONTROL_POINT_OUTPUT, INPUT_PATCH_SIZE> ip,
    uint PatchID : SV_PrimitiveID)
{
    HS_CONSTANT_DATA_OUTPUT Output;
    Output.Edges[0] = g_fTessellationFactor;
    Output.Edges[1] = g_fTessellationFactor;
    Output.Edges[2] = g_fTessellationFactor;
    Output.Edges[3] = g_fTessellationFactor;
    Output.Inside[0] = Output.Inside[1] = g_fTessellationFactor;
    return Output;
}
Because the patch grid will have primitives that must end up adjacent to each
other, the edges of each patch must have the same tessellation factor, otherwise
a patch with a higher order set of tessellation might leave cracks in the geometry.
However, the interior of the primitive might have different tessellation factors per
patch because those primitives are not meant to connect with primitives from
other patches. A scenario where altering the tessellation factor may be leveraged
is for geomorphic LOD, where the interior tessellation factor is based from the
distance of the camera to the procedural mesh. The hull shader informs the
tessellator how to constructs triangle primitives, which the domain shader then
leverages. This LOD technique is exemplified in the high poly count procedural
mesh shown in Figures 2.1 and 2.2, with its subsequent low poly count procedural
mesh in Figures 2.3 and 2.4.
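As an illustration, a hedged sketch of computing such a distance-based interior factor follows (the distance range and factor bounds here are arbitrary assumptions, not values from the article):

float InteriorTessFactor(float3 patchCenterWS, float3 cameraPosWS)
{
    float dist = distance(patchCenterWS, cameraPosWS);
    // blend from full tessellation up close to a coarse factor far away
    float t = saturate((dist - 10.0) / 490.0);
    return lerp(64.0, 2.0, t);
}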
pu = pumin + du ∗ pumax ,
pv = pvmin + dv ∗ pvmax ,
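These equations map the tessellator's normalized domain location (du, dv) into the patch's parameter range. A hedged sketch of a domain shader that performs this mapping, interpreting the range as [pumin, pumax] × [pvmin, pvmax], and then evaluates an example parametric surface g(u, v) (all names, the constant buffer, and the sphere equation are assumptions, not the article's code):

cbuffer PerFrame : register(b0)
{
    float4x4 g_worldViewProjection;
};

struct PatchMetadata
{
    float2 uvMin : TEXCOORD0;  // (pu_min, pv_min)
    float2 uvMax : TEXCOORD1;  // (pu_max, pv_max)
};

struct HS_CONSTANT_DATA_OUTPUT
{
    float Edges[4]  : SV_TessFactor;
    float Inside[2] : SV_InsideTessFactor;
};

struct DS_OUTPUT
{
    float4 position : SV_POSITION;
};

// example parametric equation g(u, v): a unit sphere
float3 Sphere(float u, float v)
{
    const float pi = 3.14159265;
    return float3(cos(u * 2 * pi) * sin(v * pi),
                  cos(v * pi),
                  sin(u * 2 * pi) * sin(v * pi));
}

[domain("quad")]
DS_OUTPUT DS(HS_CONSTANT_DATA_OUTPUT input,
             float2 domain : SV_DomainLocation,
             const OutputPatch<PatchMetadata, 1> patch)
{
    DS_OUTPUT output;
    // remap the tessellator's [0,1] domain into the patch's parameter range
    float2 uv = patch[0].uvMin + domain * (patch[0].uvMax - patch[0].uvMin);
    float3 posWS = Sphere(uv.x, uv.y);
    output.position = mul(float4(posWS, 1.0), g_worldViewProjection);
    return output;
}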
Figure 2.1. A high poly count mesh with noise.
Figure 2.2. The same mesh in Figure 2.1, but shaded.
Figure 2.3. The same mesh in Figure 2.1, but with a lower tessellation factor.
Figure 2.4. The same mesh in Figure 2.3, but shaded.
Figure 2.5. Parametric heart generated in the GPU.
Figure 2.6. Deformed cylinder generated in the GPU.
float t = v;
x = cos(s * pi) * sin(t * 2 * pi) -
    pow(abs(sin(s * pi) * sin(t * 2 * pi)), 0.5f) * 0.5f;
y = cos(t * 2 * pi) * 0.5f;
z = sin(s * pi) * sin(t * 2 * pi);
float3 heart = float3(x, y, z);
return heart;
}
to further states in the GPU. Depending on the optimization, the former can be
done on the client or in the GPU hull shader via setting the tessellation factor
to 0.
Another set of optimizations relates to normal calculations when using noise
functions. Calculating the normal of the produced vertices might be one of the
main bottlenecks, because one needs to obtain the nearby positions for each pixel
(on per-pixel lighting) or per vertex (on per-vertex lighting). This circumstance
becomes even more problematic when leveraging a computationally demanding
noise implementation. Take the example in the proposal by [Perlin 04]. It
calculates the new normal Nn by doing four evaluations of the noise function
while leveraging the original noiseless normal No (ε denotes a small offset):

F0 = F(x, y, z),
Fx = F(x + ε, y, z),
Fy = F(x, y + ε, z),
Fz = F(x, y, z + ε),
dF = (Fx − F0, Fy − F0, Fz − F0),
Nn = normalize(No + dF).
However, given that the domain shader for each vertex passes its coordinates
(u, v) in tangent space, in relation to the primitive that each vertex belongs to,
one might be able to optimize calculating the normal vector N by the cross
product of the tangent T and binormal B vectors (which themselves will
also be in tangent space) produced by the vertices in the primitive:

F0 = g(u, v) + normalize(No) × F(g(u, v)),
Fx = g(u + ε, v) + normalize(No) × F(g(u + ε, v)),
Fy = g(u, v + ε) + normalize(No) × F(g(u, v + ε)),
T = Fx − F0,
B = Fy − F0,
N = T × B,

where the parametric function is g(u, v), the noise function that leverages the
original point is F(g(u, v)), and the original normal is No. This way, one only
does three fetches, as opposed to four, which is an optimization in itself because
noise fetches are computationally more expensive than doing the cross product.
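A hedged HLSL sketch of this three-fetch variant follows (g and F below are trivial placeholders for the parametric surface and noise function, and ε is the small offset; none of this is the article's exact code):

// placeholder parametric surface and noise; real implementations would be
// supplied by the application
float3 g(float2 uv) { return float3(uv, 0); }
float  F(float3 p)  { return sin(p.x * 12.9898 + p.y * 78.233 + p.z); }

float3 DisplacedNormal(float2 uv, float3 No, float eps)
{
    float3 F0 = g(uv)                  + normalize(No) * F(g(uv));
    float3 Fx = g(uv + float2(eps, 0)) + normalize(No) * F(g(uv + float2(eps, 0)));
    float3 Fy = g(uv + float2(0, eps)) + normalize(No) * F(g(uv + float2(0, eps)));

    float3 T = Fx - F0;   // tangent
    float3 B = Fy - F0;   // binormal
    return normalize(cross(T, B));
}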
Lastly, in another realm of optimization mechanisms, the proposed algorithm
produces a high quantity of triangles, of which the application might not be able
2.7 Conclusion
For the proposed algorithm, the number of calculations linearly increases with
the number of vertices and patches, thus making it scalable into a wide range
of scenarios, such as procedural terrains and planets. An example of such a
case would be in an algorithm that also leverages the tessellation stages, such
as in [Dunn 15], which focuses on producing volumetric explosions. Other do-
mains of research might also be used to extend the concepts discussed herein, due
to their procedural mathematical nature, such as dynamic texture and sounds.
Lastly, as memory access continues to be a performance bottleneck, especially
in hardware-constrained environments such as mobile devices, inherently mathe-
matical processes that result in satisfactory visual outputs could be leveraged to
overcome such limitations.
2.8 Acknowledgments
João Raza would like to thank his family and wife for all the support they’ve
provided him. Gustavo Nunes would like to thank his wife and family for all
their help. A special thanks goes to their friend F. F. Marmot.
Bibliography
[Green 05] Simon Green. “Implementing Improved Perlin Noise.” In GPU Gems
2, edited by Matt Pharr, pp. 409–416. Reading, MA: Addison-Wesley Profes-
sional, 2005.
[Owens et al. 08] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J.
Phillips. “GPU Computing.” Proceedings of the IEEE 96:5 (2008), 96.
[Perlin 04] Ken Perlin. “Implementing Improved Perlin Noise.” In GPU Gems,
edited by Randima Fernando, pp. 73–85. Reading, MA: Addison-Wesley Pro-
fessional, 2004.
[Tatarchuk et al. 07] N. Tatarchuk, J. Shopf, and C. Decoro. “Real-Time Iso-
surface Extraction Using the GPU Programmable Geometry Pipeline.” In
Proceedings of SIGGRAPH 2007, p. 137. New York: ACM, 2007.
[Dunn 15] Alex Dunn. “Realistic Volumetric Explosions in Games.” In GPU Pro
6: Advanced Rendering Techniques, edited by Wolfgang Engel, pp. 51–62.
Boca Raton, FL: CRC Press, 2015.
3
I
Real-Time Deformation of Subdivision Surfaces on Object Collisions
Henry Schäfer, Matthias Nießner, Benjamin Keinert, and Marc Stamminger
3.1 Introduction
Scene environments in modern games include a wealth of moving and animated
objects, which are key to creating vivid virtual worlds. An essential aspect in
dynamic scenes is the interaction between scene objects. Unfortunately, many
real-time applications only support rigid body collisions due to tight time budgets.
In order to facilitate visual feedback of collisions, residuals such as scratches
or impacts with soft materials like snow or sand are realized by dynamic decal
texture placements. However, decals are not able to modify the underlying surface
geometry, which would be highly desirable to improve upon realism. In this
chapter, we present a novel real-time technique to overcome this limitation by
enabling fully automated fine-scale surface deformations resulting from object
collisions. That is, we propose an efficient method to incorporate high-frequency
deformations upon physical contact into dynamic displacement maps directly on
the GPU. Overall, we can handle large dynamic scene environments with many
objects (see Figure 3.1) at minimal runtime overhead.
An immersive gaming experience requires animated and dynamic objects.
Such dynamics are computed by a physics engine, which typically only considers
a simplified version of the scene in order to facilitate immediate visual feedback.
Less attention is usually paid to interactions of dynamic objects with deformable
scene geometry—for example, footprints, skidmarks on sandy grounds, and bullet
impacts. These high-detail deformations require a much higher mesh resolution,
their generation is very expensive, and they involve significant memory I/O. In
Figure 3.1. Our method allows computation and application of fine-scale surface defor-
mations on object collisions in real time. In this example, tracks of the car and barrels
are generated on the fly as the user controls the car.
Figure 3.2. The tile-based texture format for analytic displacements: each tile stores a
one-texel overlap to avoid the requirement of adjacency pointers. In addition, a mipmap
pyramid, computed at a tile level, allows for continuous level-of-detail rendering. All
tiles are efficiently packed in a large texture array.
aligned in parameter space; that is, the parametric domain of the tiles matches
with the Catmull-Clark patches, thus providing consistent (u, v)-parameters for
s(u, v) and D(u, v). In order to evaluate the biquadratic function D(u, v), 3 × 3
scalar control points (i.e., subpatch; see Figure 3.3) need to be accessed (see Sec-
tion 3.2.1). At base patch boundaries, this requires access to neighboring tiles.
struct TileDescriptor
{
   int  page;     // texture slice
   int  uOffset;  // tile start u
   int  vOffset;  // tile start v
   uint size;     // tile width, height
   uint nMipmap;  // number of mipmaps
};

TileDescriptor GetTile(Buffer<uint> descSRV, uint patchID)
{
   TileDescriptor desc;
   uint offset = patchID * 4;
   desc.page    = descSRV[offset];
   desc.uOffset = descSRV[offset + 1];
   desc.vOffset = descSRV[offset + 2];
   uint sizeMip = descSRV[offset + 3];
   desc.size    = 1 << (sizeMip >> 8);
   desc.nMipmap = (sizeMip & 0xff);
   return desc;
}
Listing 3.1. Tile descriptor: each tile corresponds to a Catmull-Clark base face and is
indexed by the face ID.
This access could be done using adjacency pointers, yet pointer traversal is ineffi-
cient on modern GPUs. So we store for each tile a one-texel overlap, making tiles
self-contained and such pointers unnecessary. While this involves a slightly larger
memory footprint, it is very beneficial from a rendering perspective because all
texture access is coherent. In addition, a mipmap pyramid is stored for every tile,
allowing for continuous level of detail. Note that boundary overlap is included at
all levels.
All tiles—we assume a fixed tile size—are efficiently packed into a large texture
array (see Figure 3.2). We need to split up tiles into multiple pages because the
texture resolution is limited to 16,000 × 16,000 on current hardware. Each page
corresponds to a slice of the global texture array. In order to access a tile, we
maintain a buffer, which stores a page ID and the (u, v) offset (within the page)
for every tile (see Listing 3.1). Entries of this buffer are indexed by corresponding
face IDs of base patches.
function D(u, v) is then evaluated using the B-spline basis functions B_i^2:

D(u, v) = Σ_{i=0}^{2} Σ_{j=0}^{2} B_i^2(T(u)) B_j^2(T(v)) d_{i,j},

where the subpatch domain parameters û, v̂ are given by the linear transformation T,

û = T(u) = u − round(u) + 1/2   and   v̂ = T(v) = v − round(v) + 1/2.
In order to obtain the displaced surface normal Nf(u, v), the partial derivatives of f(u, v) are required:

∂f/∂u (u, v) = ∂s/∂u (u, v) + ∂Ns/∂u (u, v) D(u, v) + Ns(u, v) ∂D/∂u (u, v).

In this case, ∂Ns/∂u (u, v) would involve the computation of the Weingarten equation, which is costly. Therefore, we approximate the partial derivatives of f(u, v) (assuming small displacements) by

∂f/∂u (u, v) ≈ ∂s/∂u (u, v) + Ns(u, v) ∂D/∂u (u, v),

which is much faster to compute. The computation of ∂f/∂v (u, v) is analogous.
Texture2DArray<float> g_displacementData : register(t6);
Buffer<uint>          g_tileDescriptors  : register(t7);

float AnalyticDisplacement(in uint patchID, in float2 uv,
                           inout float du, inout float dv)
{
   TileDescriptor tile = GetTile(g_tileDescriptors, patchID);

   // map the patch-local (u,v) into the tile's texel space
   // (reconstructed; the extracted listing omits this line)
   float2 coords = uv * tile.size + float2(tile.uOffset, tile.vOffset);
   coords -= float2(0.5, 0.5);
   int2 c = int2(round(coords));

   // fetch the 3x3 subpatch of displacement control points
   float d[9];
   d[0] = g_displacementData[int3(c.x - 1, c.y - 1, tile.page)].x;
   d[1] = g_displacementData[int3(c.x - 1, c.y - 0, tile.page)].x;
   d[2] = g_displacementData[int3(c.x - 1, c.y + 1, tile.page)].x;
   d[3] = g_displacementData[int3(c.x - 0, c.y - 1, tile.page)].x;
   d[4] = g_displacementData[int3(c.x - 0, c.y - 0, tile.page)].x;
   d[5] = g_displacementData[int3(c.x - 0, c.y + 1, tile.page)].x;
   d[6] = g_displacementData[int3(c.x + 1, c.y - 1, tile.page)].x;
   d[7] = g_displacementData[int3(c.x + 1, c.y - 0, tile.page)].x;
   d[8] = g_displacementData[int3(c.x + 1, c.y + 1, tile.page)].x;

   // biquadratic B-spline evaluation of D(u,v) and its derivatives
   // (reconstructed from the equations above, not verbatim from the book)
   float2 t   = coords - float2(c) + 0.5;
   float3 bu  = 0.5 * float3((1 - t.x) * (1 - t.x), -2 * t.x * t.x + 2 * t.x + 1, t.x * t.x);
   float3 bv  = 0.5 * float3((1 - t.y) * (1 - t.y), -2 * t.y * t.y + 2 * t.y + 1, t.y * t.y);
   float3 dbu = float3(t.x - 1, 1 - 2 * t.x, t.x);
   float3 dbv = float3(t.y - 1, 1 - 2 * t.y, t.y);
   float3 r0  = float3(d[0], d[1], d[2]);
   float3 r1  = float3(d[3], d[4], d[5]);
   float3 r2  = float3(d[6], d[7], d[8]);

   float displacement = dot(bu,  float3(dot(bv,  r0), dot(bv,  r1), dot(bv,  r2)));
   du = dot(dbu, float3(dot(bv,  r0), dot(bv,  r1), dot(bv,  r2))) * tile.size;
   dv = dot(bu,  float3(dot(dbv, r0), dot(dbv, r1), dot(dbv, r2))) * tile.size;

   return displacement;
}
pixel shader computation in Listing 3.4. Note that shading normals are obtained
on a per-pixel basis, leading to high-quality rendering even when the tessellation
budget is low.
Evaluating f (u, v) and Nf (u, v) for regular patches of the Catmull-Clark patch
is trivial because tiles correspond to surface patches. Regular patches generated
by feature-adaptive subdivision, however, only correspond to a subdomain of a
specific tile. Fortunately, the feature-adaptive subdivision framework [Nießner
et al. 12] provides local parameter offsets in the domain shader to remap the
subdomain accordingly.
Irregular patches only remain at the finest adaptive subdivision level and
cover only a few pixels. They require a separate rendering pass because they
are not processed by the tessellation stage; patch filling quads are rendered in-
stead. To overcome the singularity of irregular patches, we enforce the partial
derivatives of the displacement function, ∂D/∂u (u, v) and ∂D/∂v (u, v), to be 0
at extraordinary vertices; i.e., all adjacent displacement texels at tile corners
corresponding to a non–valence-four vertex are restricted to be equal. Thus,
void ds_main_patches(in HS_CONSTANT_FUNC_OUT input,
                     in OutputPatch<HullVertex, 16> patch,
                     in float2 domainCoord : SV_DomainLocation,
                     out OutputVertex output)
{
   // eval the base surface s(u,v)
   float3 worldPos = 0, tangent = 0, bitangent = 0;
   EvalSurface(patch, domainCoord, worldPos, tangent, bitangent);
   float3 normal = normalize(cross(tangent, bitangent));

   float du = 0, dv = 0;
   float displacement = AnalyticDisplacement(patch[0].patchID,
                                             domainCoord, du, dv);
   worldPos += displacement * normal;
float4 ps_main(in OutputVertex input) : SV_TARGET
{
   // compute partial derivatives of D(u,v)
   float du = 0, dv = 0;
   float displacement = AnalyticDisplacement(input.patchID,
                                             input.patchCoord, du, dv);

   // compute base surface normal Ns(u,v)
   float3 surfNormal = normalize(cross(input.tangent, input.bitangent));
   float3 tangent   = input.tangent + surfNormal * du;
   float3 bitangent = input.bitangent + surfNormal * dv;

   // compute analytic displacement shading normal Nf(u,v)
   float3 normal = normalize(cross(tangent, bitangent));

   // shading
   ...
}
Nf(u, v) = Ns(u, v) ∀ (u, v) extraordinary. A linear blend between this special
treatment at extraordinary vertices and the regular Nf(u, v) ensures a consistent
C¹ surface everywhere.
Figure 3.4. Algorithm overview: (a) Subdivision surfaces with quadratic B-spline dis-
placements are used as deformable object representation. (b) The voxelization of the
overlapping region is generated for an object penetrating the deformable surface. (c) The
displacement control points are pushed out of the voxelization, (d) creating a surface
capturing the impact.
In the case that both objects are deformable, we form two collision pairs, with
each deformable acting as a rigid penetrating object for the other deformable and
only applying a fraction of the computed deformations in the first pass.
3.4 Pipeline
In this section, we describe the implementation of the core algorithm and highlight
important details on achieving high-performance deformation updates.
36 I Geometry Manipulation
3.4.2 Voxelization
Once we have identified all potential deformable object collisions (see above),
we approximate the shape of penetrating objects using a variant of the binary
solid voxelization of Schwarz [Schwarz 12]. The voxelization is generated by a
rasterization pass where an orthogonal camera is set up corresponding to the
overlap region of the objects’ bounding volumes. In our implementation, we
use a budget of 2^24 voxels, requiring about 2 MB of GPU memory. Note that
it is essential that the voxelization matches the shape as closely as possible to
achieve accurate deformations. We thus determine tight bounds of the overlap
regions and scale the voxelization anisotropically to maximize the effective voxel
resolution.
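A minimal sketch of how such anisotropic scaling might be derived from the overlap extents and the voxel budget (an assumption-laden illustration, not the chapter's code):

uint3 ChooseVoxelGridSize(float3 overlapExtents, uint voxelBudget)
{
    // voxels per world unit such that the total count stays within the budget
    float volume = overlapExtents.x * overlapExtents.y * overlapExtents.z;
    float voxelsPerUnit = pow(voxelBudget / max(volume, 1e-6), 1.0 / 3.0);
    return max(uint3(overlapExtents * voxelsPerUnit), uint3(1, 1, 1));
}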
Figure 3.5. Generation of the OBB for voxelization: a new OBB is derived from the
intersecting OBBs of the deformable and the penetrating object.
at least one of the faces is completely outside of the penetrating object OBB. The
voxelization is then performed toward the opposite direction of the face, which is
on the outside. We use either of two kernels to perform the voxelization and fill
the adaptively scaled voxel grid forward or backward, respectively, as shown in
Listing 3.5.
RWByteAddressBuffer g_voxels : register(u1);

float4 PS_VoxelizeSolid(in OutputVertex input) : SV_TARGET
{
   // transform fragment position to voxel grid
   float3 fGridPos = input.posOut.xyz / input.posOut.w;
   fGridPos.z *= g_gridSize.z;
   int3 p = int3(fGridPos.x, fGridPos.y, fGridPos.z + 0.5);

   // apply adaptive voxel grid scale
   uint address = p.x * g_gridStride.x
                + p.y * g_gridStride.y
                + (p.z >> 5) * 4;

#ifdef VOXELIZE_BACKWARD
   g_voxels.InterlockedXor(address,
       ~(0xffffffffu << (p.z & 31)));
   // flip all voxels below
   for (p.z = (p.z & (~31)); p.z > 0; p.z -= 32) {
      address -= 4;
      g_voxels.InterlockedXor(address, 0xffffffffu);
   }
#else
   g_voxels.InterlockedXor(address, 0xffffffffu << (p.z & 31));
   // flip all voxels below
   for (p.z = (p.z | 31) + 1; p.z < g_gridSize.z; p.z += 32) {
      address += 4;
      g_voxels.InterlockedXor(address, 0xffffffffu);
   }
#endif

   return 0;   // the color target is unused; voxels are written via the UAV
}
// threadIdx to tile coord
float2 tileUV = ComputeTileCoord(patchID, tile,
                                 blockIdx, threadIdx);
// threadIdx to (sub-)patch coord
float2 patchUV = ComputePatchCoord(patchID, patchLevel,
                                   blockIdx, threadIdx);

// eval surface and apply displacement
float3 worldPos = 0;
float3 normal = 0;

// traverse ray until leaving solid
float3 rayOrigin = mul((g_matWorldToVoxel),
                       float4(worldPos, 1.0)).xyz;
float3 rayDir = normalize(mul((float3x3)g_matWorldToVoxel,
                              -normal.xyz));
float distOut = 0;
if (!VoxelDDA(rayOrigin, rayDir, distOut))
   return;

// inside the VoxelDDA grid traversal:
float tEnter, tExit;
if (!intersectRayVoxelGrid(origin, dir, tEnter, tExit))
   return false;

// check if ray is starting in voxel volume
if (IsOutsideVolume(gridPos))
   return false;

int3 step = 1;
if (dir.x <= 0.0) step.x = -1;
if (dir.y <= 0.0) step.y = -1;
if (dir.z <= 0.0) step.z = -1;

// per-step traversal loop body (loop header omitted in this excerpt):
   if (tEnter + t >= tExit) break;

   if (IsVoxelSet(gridPos))
      dist = t;

   if (tMin.x <= t) { tMin.x += dt.x; gridPos.x += step.x; }
   if (tMin.y <= t) { tMin.y += dt.y; gridPos.y += step.y; }
   if (tMin.z <= t) { tMin.z += dt.z; gridPos.z += step.z; }

   if (IsOutsideVolume(gridPos)) break;
}

return (dist > 0);
}
Listing 3.7. Implementation of the voxel digital differential analyzer (DDA) algorithm
for ray casting.
Figure 3.6. (a) Illustration of the ray casting behavior when tracing from the surface
of the deformable through the voxelized volume. (b) The incorrect deformations that
occur when using the distance of the first exit of the ray. (c) Tracing the ray throughout
the complete volume yields correct deformations.
adj_patches[14] = [17, 34, 11, 8]      adj_edges[14] = [0, 1, 1, 1]
Figure 3.7. Example of the adjacency information storage scheme: for the green patch,
we store all neighboring patch indices and indices of the shared edges in the neighboring
patches in counterclockwise order.
shared edges’ indices oriented with respect to the respective neighboring patch
as depicted in Figure 3.7.
In our implementation, we handle the edge overlap separately from the corner
overlap region, as the corners require special treatment depending on whether
the patch is regular or connected to an irregular vertex.
Edge overlap. Using the precomputed adjacency information, we first update the
edge overlap by scattering the boundary displacements coefficients to the adjacent
neighbors overlap region. This process is depicted in Figure 3.8 for a single patch.
Corner overlap. Finally, we have to update the corner values of the overlap re-
gion to provide a consistent evaluation of the analytic displacement maps during
rendering. The treatment of the corner values depends on the patch type. For
Figure 3.8. Edge overlap update for a single patch: (a) The direction of the overlap
updates originates from the blue patch. (b) The two adjacent patches (red and yellow)
receive their resulting overlap data by gathering the information from the blue patch.
Figure 3.9. Corner overlap update at regular vertices. (a) The direction of the corner
overlap update originates from the blue patch. The required information is also stored
in the direct neighbors of the green patch after the edge overlap update pass. (b) The
resulting corner overlap update is gathered from the overlap of the adjacent yellow
patch.
Figure 3.10. Corner overlap update at irregular vertices: (a) texels to be gathered and
(b) the result of scattering the resulting average value to the adjacent tiles.
coefficient from the adjacent patch’s edge to the boundary corner as depicted in
Figure 3.9(b).
In order to provide a watertight and consistent evaluation in the irregular
case, all four corner coefficients must contain the same value. Therefore, we run
a kernel per irregular vertex and average the interior corner coefficients of the
connected patches (see Figure 3.10(a)). Then, we scatter the average to the four
corner texels of the analytic displacement map in the same kernel.
In the end, the overlap is updated and the deformed mesh is prepared for
rendering using displacement mapping.
3.5 Optimizations
3.5.1 Penetrated Patch Detection
The approach described in the previous sections casts rays for each texel of each
patch of a deformed object: a compute shader thread is dispatched for each texel
to perform the ray casting in parallel. This strategy is obviously inefficient since
only a fraction of the patches of a deformed object will be affected. This can be
prevented by culling patches that are outside the overlap regions. To this end, we
compute whether the OBB of the penetrating object and the OBB of each patch
of the object to be deformed do overlap. For this test, we extend the OBBs of the
patches by the maximum encountered displacement to handle already displaced
patches’ surfaces properly. In case an overlap is detected, the patch (likely to be
intersected by the penetrating object) is marked, and its patch index is enqueued
for further processing. Also, the update of tile overlaps is only necessary for these
marked patches.
penetrating object’s OBB with all patches of the scene. For each patch, a thread
is dispatched. In the compute shader the OBB of the patch is computed on the
fly from the patches’ control points and overlap tested against the OBB of the
penetrating object. If an overlap is found, the patch index is appended to the list
to be handled for further processing.
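The following is a hedged sketch of such a culling pass (not the chapter's code: for brevity, bounding spheres stand in for the OBB-OBB test, and the resource layout is an assumption):

struct PatchBounds { float3 center; float radius; };

StructuredBuffer<PatchBounds> g_patchBounds        : register(t0);
AppendStructuredBuffer<uint>  g_intersectedPatches : register(u2);

cbuffer ColliderCB : register(b0)
{
    float3 g_colliderCenter;
    float  g_colliderRadius;
    uint   g_numPatches;
    float  g_maxDisplacement;   // extend patch bounds by the maximum displacement
};

[numthreads(64, 1, 1)]
void IntersectPatchesCS(uint3 DTid : SV_DispatchThreadID)
{
    uint patchID = DTid.x;
    if (patchID >= g_numPatches)
        return;

    PatchBounds b = g_patchBounds[patchID];
    float r = b.radius + g_maxDisplacement + g_colliderRadius;
    float3 d = b.center - g_colliderCenter;
    if (dot(d, d) <= r * r)
        g_intersectedPatches.Append(patchID);   // enqueue for ray casting and overlap update
}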
3.6 Results
In this section, we provide several screenshots showing the qualitative results of
our real-time deformations pipeline. The screenshots in Figure 3.12 are taken
from our example scene (see Figure 3.1) consisting of a snowy deformable terrain
[numthreads(ALLOCATOR_BLOCKSIZE, 1, 1)]
void AllocateTilesCS(uint3 DTid : SV_DispatchThreadID) {
   uint tileID = DTid.x;
   if (tileID >= g_NumTiles) return;
   if (IsTileIntersected(tileID) && IsNotAllocated(tileID))
      allocTile(tileID);
}
Listing 3.8. Compute shader for tile memory allocation with atomic operations.
[Figure: overview of the deformation pipeline — the CPU physics update produces collision pairs and collider OBBs; on the GPU, patches are intersected with the collider OBBs, tile memory management allocates tile descriptors, the collider is voxelized, ray casting updates the displaced surface, and the tile overlap is updated using the patch adjacency.]
subdivision surface, dynamic objects like the car and barrels, and static objects
such as trees and houses.
We start by presenting the parameters that impact the overall quality of the
deformations.
Figure 3.12. Results of the proposed deformation pipeline on snowy surface: (a) an
example of animated character deforming the surface; (b) high-quality geometric detail
including shadows and occlusion; and (c) wireframe visualization of a deformed surface.
(d) Even at low tessellation densities, the deformation stored in the displacement map
can provide visual feedback in shading.
Figure 3.13. Comparison of deformation quality using different tile resolutions per
patch: (a) The higher resolution (128 × 128) captures high-frequency detail, while (b) the
lower (32 × 32) does not.
Figure 3.14. Choosing too coarse a voxel grid cannot capture the shape of the pene-
trating object (wheel) well and results in a low-quality deformation.
Figure 3.15. Timings in milliseconds on an NVIDIA GTX 780 for the different optimiza-
tions. The first three bars (from top to bottom) show the effects of our optimizations
for a tile resolution of 128 × 128 texels, while the last bar shows the timings for a 32 × 32
tile resolution with all optimizations enabled.
3.6.3 Performance
In this section we provide detail timings of our deformation pipeline, including
the benefits of the optimizations presented in Section 3.5. While we use the stan-
dard graphics pipeline for rendering and the voxelization of the models, including
hardware tessellation for the subdivision surfaces, we employ compute shaders for
patch-OBB intersection, memory management, ray casting (DDA), and updating
the tile overlap regions.
Figure 3.15 summarizes the performance of the different pipeline stages and
the overall overhead per deformable-penetrator collision pair measured on an
NVIDIA GTX 780 using a default per-patch tile size of 128 × 128.
The measurements in Figure 3.15 show that ray casting is the most expensive
stage of our algorithm. With a simple patch–voxel volume intersection test we
can greatly improve the overall performance by starting ray casting and overlap
updates only for the affected patches. This comes at the cost of spending addi-
tional time on the intersection test, which requires reading the control points of
each patch. Because fetches from global memory are expensive, we optimize the
intersection stage by computing the intersection with multiple penetrating objects
after reading the control points, which further improves overall performance.
The chosen displacement tile size—as expected—only influences the ray cast-
ing and overlap stage. Because the computational overhead for the higher tile
resolution is marginal, the benefits in deformation quality easily pay off.
3.7 Conclusion
In this chapter, we described a method for real-time visual feedback of surface
deformations on collisions with dynamic and animated objects. To the best of
our knowledge, our system is the first to employ a real-time voxelization of the
penetrating object to update a displacement map for real-time deformation. Our
GPU deformation pipeline achieves deformations in far below a millisecond for a
single collision and scales with the number of deforming objects since only objects
close to each other need to be tested. We believe that this approach is ideally
suited for complex scene environments with many dynamic objects, such as in
future video game generations. However, we emphasize that the deformations
aim at a more detailed and dynamic visual appearance in real-time applications
but cannot be considered as a physical simulation. Therefore, we do not support
elasticity, volume preservation, or topological changes such as fractures.
3.8 Acknowledgments
This work is co-funded by the German Research Foundation (DFG), grant GRK-
1773 Heterogeneous Image Systems.
Bibliography
[Burley and Lacewell 08] Brent Burley and Dylan Lacewell. “Ptex: Per-Face
Texture Mapping for Production Rendering.” In Proceedings of the Nine-
teenth Eurographics Conference on Rendering, pp. 1155–1164. Aire-la-Ville,
Switzerland: Eurographics Association, 2008.
[Coumans et al. 06] Erwin Coumans et al. “Bullet Physics Library: Real-Time
Physics Simulation.” http://bulletphysics.org/, 2006.
[Nießner and Loop 13] Matthias Nießner and Charles Loop. “Analytic Displace-
ment Mapping Using Hardware Tessellation.” ACM Transactions on Graph-
ics 32:3 (2013), article 26.
[Nießner et al. 12] Matthias Nießner, Charles Loop, Mark Meyer, and Tony
DeRose. “Feature-Adaptive GPU Rendering of Catmull-Clark Subdivision
Surfaces.” ACM Transactions on Graphics 31:1 (2012), article 6.
[Schäfer et al. 14] Henry Schäfer, Benjamin Keinert, Matthias Nießner,
Christoph Buchenau, Michael Guthe, and Marc Stamminger. “Real-Time
Deformation of Subdivision Surfaces from Object Collisions.” In Proceed-
ings of HPG’14, pp. 89–96. New York: ACM, 2014.
[Schwarz 12] Michael Schwarz. “Practical Binary Surface and Solid Voxelization
with Direct3D 11.” In GPU Pro 3: Advanced Rendering Techniques, edited
by Wolfgang Engel, pp. 337–352. Boca Raton, FL: A K Peters/CRC Press,
2012.
4
I
Realistic Volumetric Explosions in Games
Alex Dunn
4.1 Introduction
In games, explosions can provide some of the most visually astounding effects.
This article presents an extension of well-known ray-marching [Green 05] tech-
niques for volume rendering fit for modern GPUs, in an attempt to modernize
the emulation of explosions in games.
Realism massively affects the user's level of immersion within a game, and
previous methods for rendering explosions have always lagged behind production
quality [Wrennige and Zafar 11]. Traditionally, explosions in games
are rendered using mass amounts of particles, and while this method can look
good from a static perspective, the effect starts to break down in dynamic scenes
with free-roaming cameras. Particles are camera-facing billboards and, by nature,
always face the screen; there is no real concept of rotation or multiple view angles,
just the same texture projected onto the screen with no regard for view direction.
By switching to a volumetric system, explosions look good from all view angles
as they no longer depend on camera-facing billboards. Furthermore, a single
volumetric explosion can have the same visual quality as thousands of individual
particles, thus, removing the strain of updating, sorting, and rendering them
all—as is the case with particle systems.
By harnessing the power of the GPU and the DirectX 11 tessellation pipeline,
I will show you that single-pass, fully volumetric, production-quality explosions
are now possible in the current generation of video games. We will be exploring
volumetric rendering techniques such as ray marching and sphere tracing, as well
as utilizing the tessellation pipeline to optimize these techniques.
There are certain drawbacks to the technique: it is not as generic a system as
particles, being more of a bespoke explosion system, and, like particle systems,
the effect is generally quite pixel heavy from a computational perspective.
[Figure: noise volume, color gradient, pass through.]
4.3 Offline/Preprocessing
First, we must create a 3D volume of noise, which we can use later to create some
nice noise patterns. We can do this offline to save precious cycles later in the pixel
shader. This noise is what gives the explosions their recognizable cloud-like look.
In the implementation described here, simplex noise was used; however, simplex
noise isn't a requirement. In fact, in your own implementation you are free to use
whatever type of noise you want, so long as it tiles correctly within the volume.
In order to conserve bandwidth and fully utilize the cache, the size and format of
this texture are critical to the performance of the technique. The implementation
demonstrated here uses a 32 × 32 × 32 volume with a 16-bit floating-point format,
DXGI_FORMAT_R16_FLOAT. The noise is calculated for each voxel of the volume using
its UVW coordinate as the position parameter for the noise function.
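As a rough sketch of this precomputation (the resource names, thread-group size, and the cheap hash noise below are placeholders standing in for a proper tileable simplex noise), a compute shader could fill the volume as follows:

RWTexture3D<float> g_NoiseVolumeRW : register(u0);

static const uint kVolumeSize = 32;

// Cheap hash-based noise standing in for a tileable simplex noise.
float HashNoise(float3 p)
{
    return frac(sin(dot(p, float3(12.9898f, 78.233f, 37.719f))) * 43758.5453f);
}

[numthreads(4, 4, 4)]
void GenerateNoiseVolumeCS(uint3 voxel : SV_DispatchThreadID)
{
    // The normalized voxel coordinate (UVW) is the position parameter.
    float3 uvw = (voxel + 0.5f) / kVolumeSize;
    g_NoiseVolumeRW[voxel] = HashNoise(uvw);
}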
Figure 4.2. The life of a vertex. In an actual implementation, the level of subdivision
should be much higher than shown in the diagram.
4.4 Runtime
As the effect utilizes the tessellation pipeline of the graphics card for rendering,
we must submit a draw call using one of the various patch primitive types available
in DirectX. For this technique, the D3D11_PRIMITIVE_TOPOLOGY_1_CONTROL_POINT_PATCHLIST
primitive type should be used, as we only need to submit a draw call that emits a
single vertex. This is because the GPU will be doing the work of expanding this
vertex into a semi-hull primitive. The life of a vertex emitted from this draw call
throughout this technique is shown in Figure 4.2.
from case to case though, so I’d suggest profiling to find the best fit for your own
implementations.
Once the patch has been subdivided, the next stage of the tessellation pipeline
takes over. With the domain shader, we first transform the vertices into a screen-
aligned hemisphere shape, with the inside of the sphere facing the camera and
the radius set to that of the explosion. Then we perform a technique called
sphere tracing [Hart 94] to shrink wrap the hemisphere around the explosion to
form a tight-fitting hull. Sphere tracing is a technique not unlike ray marching,
where starting at an originating point (a vertex on the hemisphere hull), we move
along a ray toward the center of the hemisphere. Normally, when ray marching,
we traverse the ray at fixed size intervals, but when sphere tracing, we traverse
the ray at irregular intervals, where the size of each interval is determined by a
distance field function evaluated at each step. A distance field function represents
the signed distance to the closest point on an implicit surface from any point in
space. (You can see an example of a signed distance function for an explosion in
Listing 4.1).
// Returns the distance to the surface of a sphere.
float SphereDistance(float3 pos, float3 spherePos, float radius)
{
    float3 relPos = pos - spherePos;
    return length(relPos) - radius;
}

// Returns the distance to the surface of an explosion.
float DrawExplosion
(
    float3 posWS,
    float3 spherePosWS,
    float radiusWS,
    float displacementWS,
    out float displacementOut
)
{
    displacementOut = FractalNoise(posWS);
    ...
}

// Returns a noise value by texture lookup.
float Noise(const float3 uvw)
{
    return _NoiseTexRO.Sample(g_noiseSam, uvw, 0);
}

// Calculates a fractal noise value from a world-space position.
float FractalNoise(const float3 posWS)
{
    const float3 animation = g_AnimationSpeed * g_time;
    ...
    [unroll]
    for (uint i = 0; i < kNumberOfNoiseOctaves; i++)
    {
        noiseValue += amplitude * Noise(uvw);
        amplitude *= g_NoiseAmplitudeFactor;
        uvw *= g_NoiseFrequencyFactor;
    }
    return noiseValue;
}
float Sphere(float3 pos, float3 spherePos, float radius)
{
    float3 relPos = pos - spherePos;
    return length(relPos) - radius;
}

float Cone(float3 pos, float3 conePos, float radius)
{
    float3 relPos = pos - conePos;
    float d = length(relPos.xz);
    d -= lerp(radius * 0.5f, 0, 1 + relPos.y / radius);
    d = max(d, -relPos.y - radius);
    d = max(d, relPos.y - radius);
    return d;
}

...
float2 h = radius.xx * float2(1.0f, 1.5f);   // Width, Radius
float2 d = abs(float2(length(relPos.xz), relPos.y)) - h;
...

float Box(float3 pos, float3 boxPos, float3 b)
{
    float3 relPos = pos - boxPos;
    float3 d = abs(relPos) - b;
    ...
}

float Torus(float3 pos, float3 torusPos, float radius)
{
    float3 relPos = pos - torusPos;
    float2 t = radius.xx * float2(1, 0.01f);
    float2 q = float2(length(relPos.xz) - t.x, relPos.y);
    return length(q) - t.y;
}

// Rendering a collection of primitives can be achieved by
// using multiple primitive distance functions, combined
// with the min function.
float Cluster(float3 pos)
{
    float3 spherePosA = float3(-1, 0, 0);
    float3 spherePosB = float3(1, 0, 0);
    float sphereRadius = 0.75f;
    ...
}
Listing 4.3. HLSL source: a collection of distance functions for various primitives.
Where the ray-marched volume intersects scene geometry, an ugly banding artifact
can appear in which the slices of the volume become completely visible.
The extra step trick [Crane et al. 07] attempts to minimize this artifact by
adding one final step at the end of the ray marching, passing in the world-space
position of the pixel instead of the next step position. Calculating the world-
space position of the pixel can be done any way you see fit; in the approach
demonstrated here, we have reconstructed world-space position from the depth
buffer. (See Figure 4.4.)
4.5.3 Lighting
Lighting the explosion puffs from a directional light is possible by performing a
similar ray-marching technique to the one seen earlier while rendering the explo-
sion [Ikits et al. 03]. Let’s go back to when we were rendering via ray marching.
In order to accurately calculate lighting, we need to evaluate how much light
has reached each step along the ray. This is done at each rendering step by ray
marching from the world-space position of the step toward the light source, ac-
cumulating the density (in the case of the explosion this could be the opacity of
the step) until either the edge of the volume has been reached or the density has
reached some maximum value (i.e., the pixel is fully shadowed).
Because this is rather expensive, as an optimization you do not need to calculate
the lighting at every step. Depending on the number of steps through the volume
and the density of those steps, you can calculate the lighting value for only one
in every x steps and reuse it for the following steps. Use your best judgment and
check for visual artifacts while adjusting the x variable.
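As a hedged sketch of the idea (the helper functions and parameter names below are assumptions, not the chapter's actual code), the per-step light march might look like this:

// Sketch: estimate how much light reaches a ray-march step by marching
// from the step position toward the light and accumulating opacity.
// SampleExplosionOpacity() and IsInsideVolume() are assumed helpers
// standing in for the explosion's distance-field/opacity evaluation.
float ComputeLightVisibility(float3 stepPosWS, float3 dirToLightWS,
                             float lightStepSize, uint numLightSteps,
                             float maxDensity)
{
    float density = 0.0f;
    float3 pos = stepPosWS;

    for (uint i = 0; i < numLightSteps; ++i)
    {
        pos += dirToLightWS * lightStepSize;

        // Stop at the edge of the volume or once fully shadowed.
        if (!IsInsideVolume(pos) || density >= maxDensity)
            break;

        density += SampleExplosionOpacity(pos);
    }

    // More accumulated density means less light arrives at the step.
    return saturate(1.0f - density / maxDensity);
}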
4.6 Results
The screenshots in Figures 4.5–4.7 were rendered using 100 steps (but only a few
rays will actually use this much) with the shrink wrapping optimization enabled.
Figure 4.5. A shot of a clustered volumetric explosion. Here, a collection of spheres has
been used to break up the obvious shape of a singular sphere.
Figure 4.6. Varying the displacement to color gradient over time can provide a powerful
fourth dimension to the effect.
4.7 Performance
The performance of this explosion technique is certainly comparable to that of
a particle-based explosion. With the shrink wrapping optimization, rendering
times can be significantly reduced under the right circumstances. In Figure 4.8,
you'll see a visual comparison of the shrink wrapping technique and the effect it
has on the number of ray-marching steps required per pixel.
Figure 4.7. As the life of an explosion comes to an end, the entire effect turns more
smoke than flame.
Figure 4.8. Here you can see the amount of rays required to step through the explosion
(more red means more steps are required): with no shrink wrapping optimizations
(left), with shrink wrapping and early out for fully opaque pixels (middle), and the final
rendering (right).
Figure 4.9. See how the shrink wrapping optimization improves the render time. All
numbers were captured using a GTX980 (driver v344.11). Timings were averaged over
200 frames.
A further optimization is to switch to a low-resolution off-screen render target in
place of the full-resolution back buffer just before rendering an explosion; then,
once the explosions have been rendered, the texture associated with the low-resolution
render target is up-sampled by rendering it to the full-resolution back buffer.
There are several corner cases to be aware of—depth testing and edge inter-
sections to name a couple—that are out of the scope of this article. I recommend
reading [Cantlay 07], in which these are thoroughly explained.
4.8 Conclusion
Volumetric explosions undoubtedly provide a much richer visual experience over
particle-based techniques, and as I’ve shown, it’s possible to use them now in the
current generation of games. This article has demonstrated how to best utilize the
modern graphics pipeline and DirectX, taking full advantage of the tessellation
pipeline. The optimization methods described allow for implementing this effect
with a minimal impact on frame times.
4.9 Acknowledgments
The techniques described in this article are an extension of the previous works of
Simon Green in the area of real-time volume rendering.
Bibliography
[Cantlay 07] Iain Cantlay. “High-Speed, Off-Screen Particles.” In GPU Gems
3, edited by Hubert Nguyen, pp. 535–549. Reading, MA: Addison-Wesley
Professional, 2007.
[Crane et al. 07] Keenan Crane, Ignacio Llamas, and Sarah Tariq. “Real-Time
Simulation and Rendering of 3D Fluids.” In GPU Gems 3, edited by Hubert
Nguyen, pp. 653–694. Reading, MA: Addison-Wesley Professional, 2007.
[Green 05] Simon Green. “Volume Rendering For Games.” Presented at Game
Developer Conference, San Francisco, CA, March, 2005.
[Hart 94] John C. Hart. “Sphere Tracing: A Geometric Method for the An-
tialiased Ray Tracing of Implicit Surfaces.” The Visual Computer 12 (1994),
527–545.
[Ikits et al. 03] Milan Ikits, Joe Kniss, Aaron Lefohn, and Charles Hansen. “Vol-
ume Rendering Techniques.” In GPU Gems, edited by Randima Fernando,
pp. 667–690. Reading, MA: Addison-Wesley Professional, 2003.
[Wrennige and Zafar 11] Magnus Wrennige and Nafees Bin Zafar. “Production
Volume Rendering.” SIGGRAPH Course, Vancouver, Canada, August 7–11,
2011.
II
Rendering
This is an exciting time in the field of real-time rendering. With the release of new
gaming consoles comes new opportunities for technological advancement in real-
time rendering and simulation. The following articles introduce both beginners and
expert graphics programmers to some of the latest trends and technologies
in the field of real-time rendering.
Our first article is “Next-Generation Rendering in Thief” by Peter Sikachev,
Samuel Delmont, Uriel Doyon, and Jean-Normand Bucci in which a number of
advanced rendering techniques, specifically designed for the new generation of
gaming consoles, are presented. The authors discuss real-time reflections, contact
shadows, and compute-shader-based postprocessing techniques.
Next is “Grass Rendering and Simulation with LOD” by Dongsoo Han and
Hongwei Li. In this article, the authors present a GPU-based system for grass
simulation and rendering. This system is capable of simulating and rendering
more than 100,000 blades of grass, entirely on the GPU, and is based on earlier
work related to character hair simulation.
“Hybrid Reconstruction Antialiasing” by Michal Drobot provides the reader
with a full framework of antialiasing techniques specially designed to work ef-
ficiently with AMD’s GCN hardware architecture. The author presents both
spatial and temporal antialiasing techniques and weighs the pros and cons of
many different implementation strategies.
Egor Yusov’s “Real-Time Rendering of Physically Based Clouds Using Pre-
computed Scattering” provides a physically based method for rendering highly
realistic and efficient clouds. Cloud rendering is typically very expensive, but here
the author makes clever use of lookup tables and other optimizations to simulate
scattered light within a cloud in real time.
Finally, we have “Sparse Procedural Volume Rendering” by Doug McNabb in
which a powerful technique for volumetric rendering is presented. Hierarchical
data structures are used to efficiently light and render complex volumetric effects
in real time. The author also discusses methods in which artists can control volu-
metric forms and thus provide strong direction on the ultimate look of volumetric
effects.
The new ideas and techniques discussed in this section represent some of the
latest developments in the realm of real-time computer graphics. I would like
to thank our authors for generously sharing their exciting new work and I hope
that these ideas inspire readers to further extend the state of the art in real-time
rendering.
—Christopher Oat
1
II
Next-Generation Rendering
in Thief
Peter Sikachev, Samuel Delmont, Uriel Doyon,
and Jean-Normand Bucci
1.1 Introduction
In this chapter we present the rendering techniques used in Thief, which was
developed by Eidos Montreal for PC, Playstation 3, Playstation 4, Xbox 360,
and Xbox One. Furthermore, we concentrate solely on techniques developed
exclusively for the next-generation platforms, i.e., PC, Playstation 4, and Xbox
One.
We provide the reader with implementation details and our experience on a
range of rendering methods. In Section 1.2, we discuss our reflection rendering
system. We describe each tier of our render strategy as well as final blending and
postprocessing.
In Section 1.3, we present a novel contact-hardening shadow (CHS) approach
based on the AMD CHS sample. Our method is optimized for Shader Model 5.0
and is capable of rendering high-quality large shadow penumbras at a relatively
low cost. Section 1.4 describes our approach toward lit transparent particles
rendering.
Compute shaders (CSs) are a relatively new feature in graphics APIs, intro-
duced first in the DirectX 11 API. We have been able to gain substantial benefits
for postprocessing using CSs. We expound upon our experience with CSs in
Section 1.5.
Performance results are presented in the end of each section. Finally, we
conclude and indicate further research directions in Section 1.6.
1.2 Reflections
Reflections rendering has always been a tricky subject for game engines. As long
as the majority of games are rasterization based, there is no cheap way to get
correct reflections rendered in the most general case. That being said, several
methods for real-time reflection rendering produce plausible results in special
cases.
Figure 1.1. From left to right: cube map reflections only, SSR + cube maps, IBR +
cube maps, and SSR + IBR + cube maps. [Image courtesy Square Enix Ltd.]
Numerous variations of the methods discussed above have been proposed and
used in real-time rendering. For instance, localized, or parallax-corrected cube
maps [Lagarde and Zanuttini 13] are arguably becoming an industry standard.
In the next sections, we will describe the reflection system we used in Thief. It
consists of the following tiers:
• screen-space reflections (SSR) for opaque objects, dynamic and static, within
a human height of a reflecting surface;
• image-based reflections (IBR) from textured proxy planes placed in the level;
• localized cube map reflections to fill the gaps between IBR proxies;
• global cube map reflections, which are mostly for view-independent skyboxes.
Each tier serves as a fallback solution to the previous one. First, SSR ray-marches
the depth buffer. If it does not have sufficient information to shade a fragment
(i.e., the reflected ray is obscured by some foreground object), it falls back to
image-based reflection. If none of the IBR proxies are intersected by the reflection
ray, the localized cube map reflection system comes into play. Finally, if no
appropriate localized cube map is in proximity, the global cube map is fetched.
Transition between different tiers is done via smooth blending, as described in
Section 1.2.6.
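In shader terms, the cascade could be sketched roughly as follows; the helper names are illustrative, and the actual blending described in Section 1.2.6 is smoother than this simple weighting:

// Sketch of the tier cascade. TraceSSR, TraceIBRProxies, and
// SampleCubeMaps are assumed helpers returning a color in .rgb and a
// validity/blend weight in .a (or just a color for the cube map tier).
float3 ResolveReflection(float3 posWS, float3 reflDirWS, float2 screenUV)
{
    float4 ssr  = TraceSSR(screenUV, reflDirWS);      // tier 1
    float4 ibr  = TraceIBRProxies(posWS, reflDirWS);  // tier 2
    float3 cube = SampleCubeMaps(posWS, reflDirWS);   // tiers 3 and 4

    // Each tier only fills in what the previous tiers could not resolve.
    float3 color = lerp(cube, ibr.rgb, ibr.a);
    return lerp(color, ssr.rgb, ssr.a);
}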
Figure 1.2. Screen-space reflections linear steps (green) and binary search steps (orange
and then red).
where depth is denoted with d, k1 is the linear factor, and k2 is the exponential
factor.
Additionally, we use a bunch of early-outs for the whole shader. We check
if the surface has a reflective component and if a reflection vector points to the
camera. The latter optimization does not significantly deteriorate visual quality,
as in these situations SSR rarely yields high-quality results anyway and the re-
flection factor due to the Fresnel equation is already low. Moreover, this reduces
the SSR GPU time in the case when IBR GPU time is high, thus balancing the
total.
However, one should be very careful when implementing such an optimization.
All fetches inside the if-clause should be done with a forced mipmap level; all
variables used after should be initialized with a meaningful default value, and
the if-clause should be preceded with a [branch] directive. The reason is that a
shader compiler might otherwise try to generate a gradient-requiring instruction
(i.e., tex2D) and, therefore, flatten a branch, making the optimizations useless.
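A minimal sketch of this pattern, with placeholder names, might look like the following:

// Sketch: initialize a meaningful default, keep the [branch] directive,
// and force the mip level inside the branch.
float4 FetchReflection(Texture2D colorTex, SamplerState linearClamp,
                       float2 reflectedUV, float reflectivity,
                       bool reflPointsToCamera)
{
    float4 reflection = float4(0, 0, 0, 0);   // meaningful default value

    [branch]
    if (reflectivity > 0.0f && !reflPointsToCamera)
    {
        // SampleLevel forces the mip level, so no gradient-requiring
        // instruction is generated and the branch is not flattened.
        reflection = colorTex.SampleLevel(linearClamp, reflectedUV, 0);
    }

    return reflection;
}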
Figure 1.3. IBR tile-based rendering optimization. IBR proxy is shown in orange. Tiles
are shown in dotted blue lines, and vertical sides extensions are shown in orange dotted
lines. The affected tiles are shaded with thin red diagonal lines.
1: xmin = 1
2: xmax = −1
3: for all IBR proxies in front of player and facing player do
4: find AABB of the current proxy
5: for all vertices of AABB do
6: calculate vertex coordinate in homogeneous clip space
7: w := |w|
8: calculate vertex coordinate in screen clip space
9: xmin := min(x, xmin )
10: xmax := max(x, xmax )
11: end for
12: for all vertical edges of AABB in screen space do
13: calculate intersections x1 and x2 with top and bottom of the screen
14: xmin := min(x1 , x2 , xmin )
15: xmax := max(x1 , x2 , xmax )
16: end for
17: for all IBR tiles do
18: if the tile overlaps with [xmin , xmax ] then
19: add the proxy to the tile
20: end if
21: end for
22: end for
Algorithm 1.1. Algorithm for finding affected tiles for an IBR proxy.
Additionally, in order to limit the number of active IBR proxies in the frame,
we introduced the notion of IBR rooms. Essentially, an IBR room defines an
AABB so that a player can see IBR reflections only from the IBR proxies in
Figure 1.4. Non-glossy reflection rendering (left) and CHGR (right). [Image courtesy
Square Enix Ltd.]
the same room. Moreover, the lower plane of an IBR room’s AABB defines the
maximum reflection extension of each of the proxies inside it. This allowed us to
drastically limit the number of reflections when a player is looking down.
As a side note, Thief has a very dynamic lighting environment. In order to
keep the IBR reflection in sync with the dynamic lights, IBR had to be scaled
down based on the light intensity. This makes the IBR planes disappear from
reflection when lights are turned off. Although this is inaccurate, it was not
possible to know which parts of the plane were actually affected by dynamic
lighting, since the IBR textures are generated from the default lighting setup.
Also, IBRs were captured with particles and fog disabled. Important particles,
like fire effects, were simulated with their own IBRs. Fog was added according
to the fog settings and the reflection distance after blending SSR and IBR.
// World-space unit is 1 centimeter.
int distanceLo = int(worldSpaceDistance) % 256;
int distanceHi = int(worldSpaceDistance) / 256;

float3 reflectedCameraToWorld =
    reflect(cameraToWorld, worldSpaceNormal);
float reflectionVectorLength =
    max(length(reflectedCameraToWorld), FP_EPSILON);
float worldSpaceDistance = 255.0f * (packedDistance.x +
    256.0f * packedDistance.y) / reflectionVectorLength;

...
// Reflection sorting and blending
...

float4 screenSpaceReflectedPosition =
    mul(float4(reflectedPosition, 1), worldToScreen);
screenSpaceReflectedPosition /= screenSpaceReflectedPosition.w;
ReflectionDistance = length(screenSpaceReflectedPosition.xy -
    screenSpaceFragmentPosition.xy);
We utilize R8G8B8A8 textures for color and depth information. As 8 bits do not
provide enough precision for distance, we pack the distance into two 8-bit channels
during the SSR pass, as shown in Listing 1.1.
The IBR pass unpacks the depth, performs blending, and then converts this
world-space distance into screen-space distance as shown in Listing 1.2. The
reason for this is twofold. First, the screen-space distance fits naturally into the
[0, 1] domain. As we do not need much precision for the blurring itself, we can
re-pack it into a single 8-bit value, ensuring a natural blending. Second, the
screen-space distance provides a better cue for blur ratio: the fragments farther
away from the viewer should be blurred less than closer ones, if both have the
same reflection distance.
The second step is to dilate the distance information. For each region, we
select the maximum distance of all the pixels covered by the area of our blur
kernel. The reason for this is that the distance value can change suddenly from one
pixel to the next (e.g., when a close reflection proxy meets a distant background
pixel). We wish to blur these areas with the maximum blur coefficient from
the corresponding area. This helps avoid sharp silhouettes of otherwise blurry
objects. This problem is very similar to issues encountered with common depth-of-field
techniques.
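A minimal sketch of such a dilation step, under assumed names and with border handling omitted, is shown below:

// Sketch: dilate the packed reflection distance by taking the maximum
// over the blur kernel footprint.
float DilateReflectionDistance(Texture2D<float> distanceTex, int2 pixel,
                               int kernelRadius)
{
    float maxDist = 0.0f;

    for (int y = -kernelRadius; y <= kernelRadius; ++y)
        for (int x = -kernelRadius; x <= kernelRadius; ++x)
            maxDist = max(maxDist, distanceTex[pixel + int2(x, y)]);

    return maxDist;
}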
Figure 1.5. SSR only (left) and SSR blended with IBR (right). [Image courtesy Square
Enix Ltd.]
Figure 1.6. Reflection blending without sorting (left) and with sorting (right). [Image
courtesy Square Enix Ltd.]
Figure 1.7. Reflection without bump (left) and with bump as a postprocess (right).
[Image courtesy Square Enix Ltd.]
When a reflected ray enters into one of these cracks and hits the skybox,
it results in a bright contrast pixel because most Thief scenes typically use a
skybox that is much brighter than the rest of the environment. To fix this, we
used localized cube maps taken along the playable path. Any primitive within
reach of a localized cube map would then use it in the main render pass as the
reflected environment color.
Technically, the cube map could be mapped and applied in screen space using
cube map render volumes, but we chose to simply output the cube map sample
into a dedicated render target. This made the cube map material-bound and
removed its dependency with the localized cube map mapping system.
The main render pass in Thief would output the following data for reflective
primitives:
After generating the IBR and SSR half-resolution reflection texture, the fi-
nal color is computed by adding SSR, IBR, and finally the environment reflec-
tion color (i.e., cube map color). If the material or platform does not support
IBR/SSR, the color would simply be added to the material-lit color and the extra
render targets are not needed.
Figure 1.8. Creation of the multiple volumes for covering the whole environment. [Image
courtesy Square Enix Ltd.]
Note here that we needed the diffuse lighting intensity to additionally scale
down the IBR and cube map color because they were captured with the default
lighting setup, which could be very different from the current in-game lighting
setup. This scale was not required for the SSR because it is real-time and accu-
rate, while the IBR proxies and cube maps are precomputed offline.
Figure 1.9. Build process generating the individual cube map captures. [Image courtesy
Square Enix Ltd.]
using only one, given the rendering stresses of an environment. For example,
certain environment conditions might force us to use only one technique. Poor
lighting conditions are a good example where you do not want to pay the
extra cost of an expensive cube map or IBR planes.
We found that screen-space reflections were very easy for our art team to
integrate. For this reason, we used SSR as our base tool for most of our reflection
needs. This would then dictate where some of our IBR planes should go; it was
a fallback solution when SSR failed.
1.2.10 Results
We came up with a robust and fast reflection system that is ready for the next-
generation consoles. Both SSR and IBR steps take around 1–1.5 ms on Playsta-
tion 4 (1080p) and Xbox One (900p). However, these are worst case results, i.e.,
taken on a synthetic scene with an SSR surface taking up the whole screen and 50
IBR proxies visible. For a typical game scene, the numbers are usually lower than
that. Reflection postprocessing is fairly expensive (around 2 ms). However, we
did not have time to implement it using compute shaders, which could potentially
save a lot of bandwidth.
Figure 1.10. Shadow with ordinary filtering (left) and with contact-hardening shadows
(right). [Image courtesy Square Enix Ltd.]
Our reflection system does not support rough reflections. Taking into account
the emerging interest in physically based rendering solutions, we are looking into
removing this limitation. Reprojection techniques also look appealing both for
quality enhancement and bandwidth reduction.
[Figure 1.11. Blocker search: light source, caster, search region, shadow map.]
Shader Model 5.0’s new intrinsic GatherRed() accelerates this step by sampling
four values at once. In Thief, we decided to use an 8 × 8 kernel size, which
actually performs 16 samples instead of the 64 needed by a Shader Model 4.0 implementation
(see Listing 1.4). Increasing the size of the kernel will allow a larger penumbra,
since points that are farther from the shaded one can be tested, but it obviously
increases the cost as the number of texture fetches grows.
Because the penumbra width (or blurriness) is tightly related to the size of
the kernel, which depends on the shadow map resolution and its projection in
world space, this leads to inconsistent and variable penumbra width when the
shadow map resolution or the shadow frustum’s FOV changes for the same light
caster/receiver setup. Figure 1.12 shows the issue.
To fix this issue in Thief, we extended the CHS by generating mips for the
shadow map in a prepass before the CHS application by downsizing it iteratively.
Those downsizing operations are accelerated with the use of the GatherRed()
intrinsic as well. Then, in the CHS step, we dynamically chose the mip that gives
Figure 1.12. For the same 8 × 8 search grid, a smaller search region due to higher
resolution shadow map (left) and a bigger search region due to wider shadow frustum
(right).
#define KERNEL_SIZE 8

float wantedTexelSizeAt1UnitDist =
    wantedPenumbraWidthAt1UnitDist / KERNEL_SIZE;
float texelSizeAt1UnitDist =
    2 * TanFOVSemiAngle / shadowMapResolution;
float MaxShadowMip =
    -log(texelSizeAt1UnitDist / wantedTexelSizeAt1UnitDist) / log(2);
MaxShadowMip = min(float(MIPS_COUNT - 1), max(MaxShadowMip, 0.0));

// Both BlkSearchShadowMipIndex and MaxShadowMip are passed
// to the shader as parameters.
int BlkSearchShadowMipIndex = ceil(MaxShadowMip);
Listing 1.3. Algorithm for choosing a mip from a user-defined penumbra width, the
shadow map resolution, and the FOV angle of the shadow frustum.
Figure 1.13. Shadow-map mips layout. [Image courtesy Square Enix Ltd.]
a kernel size in world space that is closer to a user-defined parameter. Listing 1.3
shows how the mip index is computed from this user-defined parameter, the
shadow map resolution, and the FOV angle of the shadow frustum. This process
can be done on the CPU and the result is passed to a shader as a parameter.
Unfortunately, the GatherRed() intrinsic does not allow mip selection. There-
fore, the mips are stored in an atlas, as shown in Figure 1.13, and we offset the
texture coordinates to sample the desired mip. This is achieved by applying a
simple offset scale to the coordinates in texture space (see Listing 1.4).
In order to save on fragment instructions, the function returns, as an early
out, a value of 0.0 (fully shadowed) if a blocker was found for every sample in the
search region, or 1.0 (fully lit) if no blocker was found at all. Listing 1.4 shows
the details of the average-blocker-depth compute.
#define KERNEL_SIZE 8
#define BFS2 ((KERNEL_SIZE - 1) / 2)

float3 blkTc = float3(inTc.xy, inDepth);

// TcBiasScale is a static array holding the offset-scale
// in the shadow map for every mip.
float4 blkTcBS = TcBiasScale[BlkSearchShadowMipIndex];
blkTc.xy = blkTcBS.xy + blkTc.xy * blkTcBS.zw;

// g_vShadowMapDims.xy is the shadow map resolution.
// g_vShadowMapDims.zw is the shadow map texel size.
float2 blkAbsTc = (g_vShadowMapDims.xy * blkTc.xy);
float2 fc = blkAbsTc - floor(blkAbsTc);
blkTc.xy = blkTc.xy - (fc * g_vShadowMapDims.zw);

float blkCount = 0;
float avgBlockerDepth = 0;

[loop] for (int row = -BFS2; row <= BFS2; row += 2)
{
    float2 tc = blkTc.xy + float2(-BFS2 * g_vShadowMapDims.z,
                                  row * g_vShadowMapDims.w);

    [unroll] for (int col = -BFS2; col <= BFS2; col += 2)
    {
        float4 depth4 = shadowTex.GatherRed(pointSampler, tc.xy);
        float4 blk4 = (blkTc.zzzz <= depth4) ? (0).xxxx : (1).xxxx;

        float4 fcVec = 0;
        if (row == -BFS2)
        {
            if (col == -BFS2)
                fcVec = float4((1.0 - fc.y) * (1.0 - fc.x),
                               (1.0 - fc.y), 1, (1.0 - fc.x));
            else if (col == BFS2)
                fcVec = float4((1.0 - fc.y), (1.0 - fc.y) * fc.x, fc.x, 1);
            else
                fcVec = float4((1.0 - fc.y), (1.0 - fc.y), 1, 1);
        }
        else if (row == BFS2)
        {
            if (col == -BFS2)
                fcVec = float4((1.0 - fc.x), 1, fc.y, (1.0 - fc.x) * fc.y);
            else if (col == BFS2)
                fcVec = float4(1, fc.x, fc.x * fc.y, fc.y);
            else
                fcVec = float4(1, 1, fc.y, fc.y);
        }
        else
        {
            if (col == -BFS2)
                fcVec = float4((1.0 - fc.x), 1, 1, (1.0 - fc.x));
            else if (col == BFS2)
                fcVec = float4(1, fc.x, fc.x, 1);
            else
                fcVec = float4(1, 1, 1, 1);
        }

        blkCount += dot(blk4, fcVec.xyzw);
        avgBlockerDepth += dot(depth4, fcVec.xyzw * blk4);

        tc.x += 2.0 * g_vShadowMapDims.z;
    }
}

if (blkCount == 0.0)                             // Early out: fully lit
    return 1.0f;
else if (blkCount == KERNEL_SIZE * KERNEL_SIZE)  // Fully shadowed
    return 0.0f;

avgBlockerDepth /= blkCount;
1.3.3 Filtering
The final CHS step consists of applying a dynamic filter to the shadow map to
obtain the light attenuation term. In this step, we also take advantage of the
shadow-map mips. The main idea is to use higher-resolution mips for the sharp
area of the shadow and lower-resolution mips for the blurry area. In order to
have a continuous and unnoticeable transition between the different mips, we use
two mips selected from the penumbra estimation and perform one filter operation
for each mip before linearly blending the two results (see Figure 1.14). Doing so
Figure 1.14. Mips used for the filtering, depending on the user-defined region search
width and the penumbra estimation. [Image courtesy Square Enix Ltd.]
#define KERNEL_SIZE 8
#define FS2 ((KERNEL_SIZE - 1) / 2)

float Ratio = penumbraWidth;
float clampedTexRatio = max(MaxShadowMip - 0.001, 0.0);
float texRatio = min(MaxShadowMip * Ratio, clampedTexRatio);
float texRatioFc = texRatio - floor(texRatio);
uint textureIndex = min(uint(texRatio), MIPS_COUNT - 2);

float4 highMipTcBS = TcBiasScale[textureIndex];       // higher res
float4 lowMipTcBS = TcBiasScale[textureIndex + 1];    // lower res

// Pack mips Tc into a float4, xy for high mip, zw for low mip.
float4 MipsTc = float4(highMipTcBS.xy + inTc.xy * highMipTcBS.zw,
                       lowMipTcBS.xy + inTc.xy * lowMipTcBS.zw);
float4 MipsAbsTc = (g_vShadowMapDims.xyxy * MipsTc);
float4 MipsFc = MipsAbsTc - floor(MipsAbsTc);
MipsTc = MipsTc - (MipsFc * g_vShadowMapDims.zwzw);

...
// Apply the same dynamic weight matrix to both mips
// using ratio along with the corresponding MipsTc and MipsFc.
...

return lerp(highMipTerm, lowMipTerm, texRatioFc);
gives a realistic effect with variable levels of blurriness, using the same kernel
size (8 × 8 in Thief ) through the whole filtering. The highest mip index possible
(which corresponds to a penumbra estimation of 1.0) is the same one used in the
blocker search step.
As described above, we need to get the attenuation terms for both selected
mips before blending them. A dynamic weight matrix is computed by feeding
four matrices into a cubic Bézier function, depending only on the penumbra
estimation, and used to filter each mip (not covered here; see [Gruen 10] for the
details). Like the previous steps, this is accelerated using the GatherCmpRed()
intrinsic [Gruen and Story 09]. Listing 1.5 shows how to blend the filtered mips
to obtain the final shadow attenuation term.
The number of shadow map accesses for the blocker search is 16 (8 × 8 kernel
with the use of GatherCmpRed()) and 2 × 16 for the filter step (8 × 8 kernel for each
mip with the use of GatherCmpRed()), for a total of 48 texture fetches, producing
very large penumbras that are independent from the shadow resolution (though
the sharp areas still are dependent). A classic implementation in Shader Model
4.0 using a 8 × 8 kernel with no shadow mipmapping would perform 128 accesses
for smaller penumbras, depending on the shadow resolution.
Performance-wise, on an NVIDIA 770 GTX and for a 1080p resolution, the
CHS takes 1–2 ms depending on the shadow exposure on the screen and the
shadow map resolution. The worst case corresponds to a shadow covering the
whole screen.
Figure 1.15. Old-generation Thief particles rendering (left) and next-generation version
(right). Notice the color variation of the fog due to different lighting. [Image courtesy
Square Enix Ltd.]
Figure 1.16. Gaussian-based DoF with circular bokeh (top) and DoF with hexagonal
bokeh (bottom). [Image courtesy Square Enix Ltd.]
This functionality can be used for numerous applications. One obvious use
case is decreasing bandwidth for postprocesses, which computes a convolution
of a fairly large radius. In Thief, we used this feature for depth-of-field (DoF)
computations, as will be described below.
Figure 1.17. Fetching of texels with a filter kernel using local data storage.
1.5.3 Results
To understand the LDS win, we tested different implementations of the DoF ker-
nel filters. For a DoF pass using a kernel with radius = 15 on an FP16 render
target, we got 0.15 ms without LDS, 0.26 ms with a vectorized LDS structure, and
0.1 ms for de-vectorized LDS on an AMD HD 7970. Both next-generation consoles
showed a similar speedup factor. In contrast, using LDS on NVIDIA
GPUs (GeForce 660 GTX) resulted in no speedup at all in the best case. As a
result, on AMD GPUs (which include next-generation consoles), using compute
shaders with LDS can result in a significant (33%) speedup if low-level perfor-
mance considerations (e.g., banked memory) are taken into account.
Texture2D inputTexture : register(t0);
RWTexture2D<float4> outputTexture : register(u0);

// LDS cache, one float4 per cached texel (declaration assumed).
groupshared float4 fCache[NR_THREADS + 2 * KERNEL_RADIUS];

[numthreads(NR_THREADS, 1, 1)]
void main(uint3 groupThreadID : SV_GroupThreadID,
          uint3 dispatchThreadID : SV_DispatchThreadID)
{
    // Read texture to LDS.
    int counter = 0;
    for (int t = groupThreadID.x;
         t < NR_THREADS + 2 * KERNEL_RADIUS;
         t += NR_THREADS, counter += NR_THREADS)
    {
        int x = clamp(
            dispatchThreadID.x + counter - KERNEL_RADIUS,
            0, inputTexture.Length.x - 1);
        fCache[t] = inputTexture[int2(x, dispatchThreadID.y)];
    }
    GroupMemoryBarrierWithGroupSync();

    ...
    // Do the actual blur
    ...

    outputTexture[dispatchThreadID.xy] = vOutColor;
}
1.6 Conclusion
In this chapter, we gave a comprehensive walkthrough for the rendering tech-
niques we implemented for the next-generation versions of Thief. We presented
our reflection system, the contact-hardening shadow algorithm, particles lighting
approach, and compute shader postprocesses. Most of these techniques were inte-
grated during the later stages of Thief production, therefore they were used less
extensively in the game than we wished. However, we hope that this postmortem
will help game developers to start using the techniques, which were not practical
on the previous console generation.
1.7 Acknowledgments
We would like to thank Robbert-Jan Brems, David Gallardo, Nicolas Longchamps,
Francis Maheux, and the entire Thief team.
Texture2D inputTexture : register(t0);
RWTexture2D<float4> outputTexture : register(u0);

// Separate LDS allocation for each channel (declarations assumed).
groupshared float fCacheR[NR_THREADS + 2 * KERNEL_RADIUS];
groupshared float fCacheG[NR_THREADS + 2 * KERNEL_RADIUS];
groupshared float fCacheB[NR_THREADS + 2 * KERNEL_RADIUS];
groupshared float fCacheA[NR_THREADS + 2 * KERNEL_RADIUS];

[numthreads(NR_THREADS, 1, 1)]
void main(uint3 groupThreadID : SV_GroupThreadID,
          uint3 dispatchThreadID : SV_DispatchThreadID)
{
    // Read texture to LDS.
    int counter = 0;
    for (int t = groupThreadID.x;
         t < NR_THREADS + 2 * KERNEL_RADIUS;
         t += NR_THREADS, counter += NR_THREADS)
    {
        int x = clamp(
            dispatchThreadID.x + counter - KERNEL_RADIUS,
            0, inputTexture.Length.x - 1);
        float4 tex = inputTexture[int2(x, dispatchThreadID.y)];
        fCacheR[t] = tex.r;
        fCacheG[t] = tex.g;
        fCacheB[t] = tex.b;
        fCacheA[t] = tex.a;
    }
    GroupMemoryBarrierWithGroupSync();

    ...
    // Do the actual blur
    ...

    outputTexture[dispatchThreadID.xy] = vOutColor;
}
Listing 1.7. Final kernel implementation. Notice that we make a separate LDS
allocation for each channel.
Bibliography
[Andersson 13] Zap Andersson. “Everything You Always Wanted to Know About
mia material.” Presented in Physically Based Shading in Theory and Prac-
tice, SIGGRAPH Course, Anaheim, CA, July 21–25, 2013.
[Andreev 13] Dmitry Andreev. “Rendering Tricks in Dead Space 3.” Game
Developers Conference course, San Francisco, CA, March 25–29, 2013.
[Gruen and Story 09] Holger Gruen and Jon Story. "Taking Advantage of Di-
rect3D 10.1 Features to Accelerate Performance and Enhance Quality." Pre-
sentation, 2009.
[White and Barré-Brisebois 11] John White and Colin Barré-Brisebois. “More
Performance! Five Rendering Ideas from Battlefield 3 and Need For Speed:
The Run.” Presented in Advances in Real-Time Rendering in Games, SIG-
GRAPH Course, Vancouver, August 7–11, 2011.
[Wright 11] Daniel Wright. “Image Based Reflections.” http://udn.epicgames.
com/Three/ImageBasedReflections.html, 2011.
2
II
Grass Rendering and Simulation with LOD
Dongsoo Han and Hongwei Li
2.1 Introduction
Grass rendering and simulation are challenging topics for video games because
grass can cover large open areas and require heavy computation for simulation.
As an extension of our previous hair technology, TressFX [Tre 13], we chose grass
because it has unique challenges. (See Figure 2.1.) Our initial plan was to support
rendering many individual grass blades covering a wide terrain and simulating
their interactions using rigid bodies and wind.
To satisfy our requirements, we developed an efficient and scalable level-of-
detail (LOD) system for grass using DirectX 11. In addition to LOD, a master-
and-slave system reduces simulation computation dramatically but still preserves
the quality of the simulation.
Figure 2.3. Expand the stem to grass blade in vertex shader. The two overlapping
vertices from two triangles are dragged to opposite positions in the vertex shader.
expansion direction follows the binormal of the grass blade, and the expansion
width is chosen by the user. Moreover, we reduce the expansion width gradually
from bottom knots to top in order to make a sharp blade tip. This geometry
expansion was firstly implemented in a geometry shader, but the performance
was not satisfying. We then adopted the approach presented in TressFX, where
the geometry expansion is done in the vertex shader by expanding two degenerate
triangles into normal, nondegenerate triangles. Figure 2.3 illustrates the process of
expanding the degenerate triangles. We also modified our in-house blade mod-
eling tool so that its output became a triangle strip: each knot in the stem is
duplicated into two vertices at the same coordinate as the knot, and one line
segment becomes two triangles. At runtime, we expand the degenerate triangles
by translating the two overlapping vertices at each knot position in opposite
directions determined by the modulo-2 result of the vertex ID, e.g., SV_VertexID.
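A rough vertex-shader sketch of this expansion, with assumed structure and parameter names, could look like the following:

// Sketch: expand two overlapping knot vertices into a blade cross section.
// Even vertex IDs move along +binormal, odd IDs along -binormal, and the
// width shrinks toward the tip to form a sharp point.
float3 ExpandBladeVertex(float3 knotPosWS, float3 binormalWS,
                         uint vertexID, float knotV /* 0 at root, 1 at tip */,
                         float bladeHalfWidth)
{
    // Vertices come in pairs at the same knot position; the parity of
    // SV_VertexID decides which side of the blade this vertex belongs to.
    float side = (vertexID % 2 == 0) ? 1.0f : -1.0f;

    // Gradually reduce the width from bottom knots to the top.
    float width = bladeHalfWidth * (1.0f - knotV);

    return knotPosWS + binormalWS * (side * width);
}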
Regarding the rendering, we do not have much freedom in choosing a shading
model for the grass blade for there are thousands of blades to be rendered in one
frame. It must be a lightweight shading model that still can create a promising,
natural appearance for the grass blade. Under this constraint, we adopt the con-
ventional Phong model and replace its ambient component with Hemispherical
Ambient Light [Wilhelmsen 09]. Hemispherical ambient light is a good approx-
imation of the true ambient color of grass in lieu of the more precise ambient
occlusion and color, which can be very expensive to generate. It is computed as
the sum of sky light and earth light, as shown in the following equation:

ambient = ratio × skyLight + (1 − ratio) × earthLight,

where ratio is defined by the dot product between the hemisphere's "up" direction
and the vertex normal.
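A small sketch of this ambient term follows; the remapping of the dot product into [0, 1] is a common choice and an assumption here:

// Sketch of hemispherical ambient light: blend sky and earth colors by
// how much the normal points toward the hemisphere's "up" direction.
float3 HemisphericalAmbient(float3 normalWS, float3 hemisphereUpWS,
                            float3 skyLight, float3 earthLight)
{
    float ratio = saturate(0.5f * dot(normalWS, hemisphereUpWS) + 0.5f);
    return ratio * skyLight + (1.0f - ratio) * earthLight;
}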
We also investigated screen-space translucency [Jimenez and Gutierrez 10],
but the increase in the visual quality is minor, and it added an extra 50 shader
instructions, so we did not use it.
Besides the lighting model, a grass blade has both a front and a back face, and
thus we cannot use back-face culling on the GPU. We rely on DirectX's
SV_IsFrontFace semantic to detect back-facing pixels and flip the normal accordingly.
2.3 Simulation
Like hair simulation in TressFX, each grass blade is represented as vertices and
edges. We usually use 16 vertices for each blade and 64 for the thread group size
for compute shaders, but it is possible to change the thread group size to 32 or
128.
For hair simulation in TressFX, three constraints (edge length, global shape,
and local shape) are applied after integrating gravity. The TressFX hair simula-
tion includes a “head” transform. This is required in a hair simulation since the
character’s head can change its position and orientation, but we do not require
this transform for grass. We can also skip applying the global shape constraint
because grass gets much less force due to the absence of head movement. For the
edge length constraint and the local shape constraint, two to three iterations are
usually good enough.
The last step of simulation before going to LOD is to run a kernel to prevent
grass blades from going under the ground. This can be done simply by moving
each vertex position so that it stays above the height of the blade's root vertex.
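A minimal sketch of such a kernel, under assumed buffer names, is shown below:

// Sketch: keep every simulated vertex at or above its blade's root height.
// g_VertexPositions and g_RootPositions are assumed structured buffers.
RWStructuredBuffer<float4> g_VertexPositions : register(u0);
StructuredBuffer<float4>   g_RootPositions   : register(t0);

[numthreads(64, 1, 1)]
void PreventUnderGroundCS(uint vertexIndex : SV_DispatchThreadID)
{
    uint bladeIndex = vertexIndex / 16;   // 16 vertices per blade (Section 2.3)

    float4 pos  = g_VertexPositions[vertexIndex];
    float4 root = g_RootPositions[bladeIndex];

    // Clamp the vertex height so it never goes below the root vertex.
    pos.y = max(pos.y, root.y);

    g_VertexPositions[vertexIndex] = pos;
}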
[Figure 2.4. Radix sort passes: histogram table and rearrange.]
The sorting algorithm we choose must run on the GPU efficiently and support
key and value pairs. Also, we need to count how many keys are less than a given
distance threshold so that we can determine the work item size for dispatch.
Choosing radix sort gives us an extra benefit: if we quantize the distance
value to 8 bits, we need only one pass. Normally, radix sort needs four passes to
sort 32-bit keys with an 8-bit radix; see Figure 2.4.
After simulating master blades and updating slave vertices, ComputeCamera
Distance in Listing 2.1 calculates the distance from the camera position to each
blade. Also, frustum culling is performed here, and a negative distance value
will be assigned if the blade is outside of the camera’s frustum. We quantize the
distance values to 8 bits using the maximum distance given as user input.
Listings 2.2, 2.3, and 2.4 show the full code of radix sort. The inputs of radix
sort are QuantizedLODDistance and LODSortedStrandIndex . PrefixScan performs a
prefix scan of all the elements of QuantizedLODDistance . Before running the next
kernels of radix sort, we read the prefix scan data on the CPU and compute the
LOD ratio, which is a ratio of the number of valid blades to the number of total
blades. We use this LOD ratio to compute the thread group size for simulation
during the next frame.
Listing 2.5 shows how we can use a prefix scan to get the LOD ratio. We
first calculate the quantized distance threshold and simply read the value of the
prefix-scan array using the quantized distance threshold as an index; the prefix
scan stores counts of values.
RWStructuredBuffer<uint> QuantizedLODDistance : register(u6);
RWStructuredBuffer<uint> LODSortedStrandIndex : register(u7);

[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void ComputeCameraDistance(uint GIndex : SV_GroupIndex,
                           uint3 GId : SV_GroupID,
                           uint3 DTid : SV_DispatchThreadID)
{
    uint globalBladedIndex, globalRootVertexIndex;

    // Calculate indices above here.
    ...

    // Quantize distance into 8 bits (0 ~ 2^8 - 1)
    // so that radix sort can sort it in one pass.
    if (dist < 0 || dist > maxDist)
        dist = maxDist;
    ...
}

#define RADIX 8                   // 8 bit
#define RADICES (1 << RADIX)      // 256 or 0x100
#define RADIX_MASK (RADICES - 1)  // 255 or 0xFF
#define THREAD_GROUP_SIZE RADICES

cbuffer CBRadixSort : register(b0)
{
    int numElement;
    int bits;
    float dummy[2];
}

// UAVs
RWStructuredBuffer<uint> QuantizedLODDistance : register(u0);
RWStructuredBuffer<uint> histogramTable : register(u1);
RWStructuredBuffer<uint> particiallySortedData : register(u2);
RWStructuredBuffer<uint> prefixScan : register(u3);
RWStructuredBuffer<uint> LODSortedStrandIndex : register(u4);
RWStructuredBuffer<uint> particiallySortedValue : register(u5);

...
// Initialize shared memory.
sharedMem[localId] = 0;
GroupMemoryBarrierWithGroupSync();

particiallySortedData[globalId]
    = QuantizedLODDistance[globalId];
particiallySortedValue[globalId]
    = LODSortedStrandIndex[globalId];
...
Listing 2.2. Constant buffer, UAVs, and histogram table kernels in radix sort.
...
for (uint i = 0; i < numHistograms; i++)
{
    sum += histogramTable[RADICES * i + localId];
    histogramTable[RADICES * i + localId] = sum;
}
}

// There is only one thread group.
[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void PrefixScan(uint GIndex : SV_GroupIndex,
                uint3 GId : SV_GroupID,
                ...)
{
    ...
    for (uint i = 0; i < iter; i++)
    {
        if (k >= pow2(i))
            sharedMem[k] = sharedMemPrefixScan[k]
                         + sharedMemPrefixScan[k - pow2(i)];
        GroupMemoryBarrierWithGroupSync();
        sharedMemPrefixScan[k] = sharedMem[k];
        GroupMemoryBarrierWithGroupSync();
    }

    if (localId > 0)
        prefixScan[localId] = sharedMemPrefixScan[localId - 1];
    else
        prefixScan[localId] = 0;
}
Listing 2.3. Column scan histogram table and prefix scan kernels in radix sort.
...
sharedMem[localId] += prefixScan[localId];
histogramTable[index] = sharedMem[localId];
}

...
if (localId == 0)
{
    for (int i = 0; i < RADICES; i++)
    {
        uint element = particiallySortedData[groupId * RADICES + i];
        uint value = (element >> bits) & RADIX_MASK;
        uint index;
        if (groupId == 0)
        {
            index = prefixScan[value];
            prefixScan[value]++;
        }
        else
        {
            index = histogramTable[RADICES * (groupId - 1) + value];
            histogramTable[RADICES * (groupId - 1) + value]++;
        }
        QuantizedLODDistance[index] =
            particiallySortedData[groupId * RADICES + i];
        LODSortedStrandIndex[index] =
            particiallySortedValue[groupId * RADICES + i];
    }
}
Listing 2.4. Prefix scan table and rearrange kernels in radix sort.
// distThresholdLOD is a distance threshold for LOD
// and maxDistanceLOD is the maximum distance for quantization.
unsigned int quantizedDistThresholdLod =
    (unsigned int)((distThresholdLOD / maxDistanceLOD) * 255.f);
2.3.3 Wind
There are two kinds of wind motions: local ambient and global tidal motions.
Local ambient motion is small scale and is independent of neighboring blades.
In TressFX, wind was applied to each vertex by calculating the force from the
wind and edge vectors. In grass, we simplified this by grabbing the tip vertex and
moving it along the wind vector. This simple method works as well as the force-
based approach. The amount of displacement is controlled by the magnitude of
the wind. To prevent a visible directional pattern, perturbations are added into
the wind directions and magnitudes.
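A small sketch of this tip displacement, with an assumed perturbation term, could look like the following:

// Sketch: local ambient wind applied by displacing the tip vertex along a
// perturbed wind vector. Parameter names are illustrative.
float3 ApplyAmbientWind(float3 tipPosWS, float3 windDirWS,
                        float windMagnitude, float bladePhase, float time)
{
    // Per-blade perturbation avoids a visible directional pattern.
    float perturbation = 0.5f + 0.5f * sin(time * 2.0f + bladePhase * 6.2831853f);

    return tipPosWS + windDirWS * (windMagnitude * perturbation);
}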
Global tidal motion is also simple. This is a wavy, large-scale motion in which
neighboring blades move together. In our grass, we simply sweep the grass field
with large cylindrical bars, and the collision handling system generates a nice
wave motion.
2.4 Conclusion
With 32,768 master blades and 131,072 slave blades, simulating an entire grass
field takes around 2.3 ms without LODs. Because radix sort takes around 0.3 ms,
we see that simulation time can easily drop by more than 50% with LODs using
reasonable distance thresholds.
In our test, we applied only one distance threshold. However, it is also possible
to use multiple distance thresholds. This would allow us to smoothly change
between LOD regions and reduce popping problems during camera movement.
Bibliography
[Bouatouch et al. 06] Kadi Bouatouch, Kévin Boulanger, and Sumanta Pat-
tanaik. “Rendering Grass in Real Time with Dynamic Light Sources.” Rap-
port de recherche RR-5960, INRIA, 2006.
[Jimenez and Gutierrez 10] Jorge Jimenez and Diego Gutierrez. “Screen-Space
Subsurface Scattering.” In GPU Pro: Advanced Rendering Techniques,
edited by Wolfgang Engel, pp. 335–351. Natick, MA: A K Peters, Ltd., 2010.
Hybrid Reconstruction
Antialiasing
Michal Drobot
3.1 Introduction
In this article, we present the antialiasing (AA) solution used in the Xbox One
and Playstation 4 versions of Far Cry 4, developed by Ubisoft Montreal: hybrid
reconstruction antialiasing (HRAA). We present a novel framework that utilizes
multiple approaches to mitigate aliasing issues with a tight performance budget
in mind.
The Xbox One, Playstation 4, and most AMD graphics cards based on the
GCN architecture share a similar subset of rasterizer and data interpolation fea-
tures. We propose several new algorithms, or modern implementations of known
ones, making use of the aforementioned hardware features. Each solution is tack-
ling a single aliasing issue: efficient spatial super-sampling, high-quality edge
antialiasing, and temporal stability. All are based around the principle of data
reconstruction. We discuss each one separately, identifying potential problems,
benefits, and performance considerations. Finally, we present a combined solu-
tion used in an actual production environment. The framework we demonstrate
was fully integrated into the Dunia engine’s deferred renderer. Our goal was
to render a temporarily stable image, with quality surpassing 4× rotated-grid
super-sampling, at a cost of 1 ms at a resolution of 1080p on the Xbox One and
Playstation 4 (see Figure 3.1).
3.2 Overview
Antialiasing is a crucial element in high-quality rendering. We can divide most
aliasing artifacts in rasterization-based rendering into two main categories: tem-
poral and spatial. Temporal artifacts occur as flickering under motion when de-
tails fail to get properly rendered due to missing the rasterization grid on certain
Figure 3.1. The crops on the right show no AA (top), SMAA (middle), and the presented
HRAA (bottom) results. Only HRAA is capable of reconstructing additional details
while providing high-quality antialiasing.
frames. Spatial artifacts result from signal under-sampling when dealing with a
single, static image. Details that we try to render are just too fine to be properly
resolved at the desired resolution, which mostly manifests itself as jagged edges.
Both sources of aliasing are directly connected with errors of signal under-
sampling and occur together. However, there are multiple approaches targeting
different aliasing artifacts that vary in both performance and quality. We can
divide these solutions into analytical, temporal, and super-sampling–based ap-
proaches.
In this article, we present a novel algorithm that builds upon all these ap-
proaches. By exploring the new hardware capabilities of modern GPUs (we will
base our findings on AMD’s GCN architecture), we optimize each approach and
provide a robust framework that shares the benefits of each algorithm while min-
imizing their shortcomings.
• temporal super-sampling,
• temporal antialiasing.
Figure 3.2. In analytical distance-to-edge techniques, every triangle writes out the
distance to the closest edge used to antialias pixels in a postprocessing pass.
Figure 3.3 panels: Morphological Frame A, Morphological Frame B, Analytical Frame A, Analytical Frame B.
Figure 3.3. Antialiased edge changes in motion when using analytical data. Note that
every morphological solution will fail as no gradient change will be detected due to the
same results of rasterization. This gets more problematic with shorter feature search
distance.
Such methods provide temporally stable edge antialiasing, as the blend factor
relies on continuous triangle information rather than discrete rasterization results
(see Figure 3.3).
Gradient length is limited only by storage. In practice, it is enough to store
additional data in 8 bits: 1 bit for the major axis and 7 bits for signed distance,
providing 64 effective gradient steps.
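To illustrate the storage scheme, here is a minimal CPU-side sketch of one possible 8-bit packing. The exact bit layout, the quantization, and the [−0.5, 0.5] pixel range are assumptions for illustration, and the function names are ours:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical 8-bit GBAA encoding: 1 bit for the major axis,
// 7 bits for a signed distance in the [-0.5, 0.5] pixel range
// (roughly 64 steps on each side of the edge).
uint8_t PackEdgeDistance(bool majorAxisIsX, float signedDist)
{
    // Clamp to the representable half-pixel range.
    signedDist = std::max(-0.5f, std::min(0.5f, signedDist));
    // Map [-0.5, 0.5] to [0, 127].
    int q = static_cast<int>(std::lround((signedDist + 0.5f) * 127.0f));
    return static_cast<uint8_t>((majorAxisIsX ? 0x80 : 0x00) | q);
}

void UnpackEdgeDistance(uint8_t packed, bool& majorAxisIsX, float& signedDist)
{
    majorAxisIsX = (packed & 0x80) != 0;
    signedDist   = (packed & 0x7F) / 127.0f - 0.5f;
}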
This algorithm also deals efficiently with alpha-tested silhouettes, if a mean-
ingful distance to an edge can be estimated. This proves to be relatively easy
with nonbinary alpha channels. Alpha test derivatives can be used to estimate
the distance to a cutout edge. A better solution would be to use signed distance
fields for alpha testing and directly output the real distance to the edge.
Both methods are fast and easy to implement in practice. It is worth noting
that the final distance to the edge should be the minimum of the geometric
distance to the triangle edge and the edge derived from the alpha channel.
// Calculate closest axis distance between point X
// and line AB. Check against known distance and direction.
float ComputeAxisClosestDist( float2 inX,
                              float2 inA,
                              float2 inB,
                              inout uint  ioMajorDir,
                              inout float ioAxisDist )
{
    float2 AB       = normalize( inB - inA );
    float2 normalAB = float2( -AB.y, AB.x );
    float  dist     = dot( inA, normalAB ) - dot( inX, normalAB );

    bool  majorDir = ( abs( normalAB.x ) > abs( normalAB.y ) );
    float axisDist = dist * rcp( majorDir ? normalAB.x : normalAB.y );

    // Keep the closest distance (and its major axis) seen so far.
    if ( abs( axisDist ) < abs( ioAxisDist ) )
    {
        ioAxisDist = axisDist;
        ioMajorDir = majorDir;
    }
    return axisDist;
}

// Usage: sc is the pixel's screen-space position, sc0-sc2 are the
// screen-space positions of the rasterized triangle's vertices.
ComputeAxisClosestDist( sc, sc0, sc1, oMajorDir, oDistance );
ComputeAxisClosestDist( sc, sc1, sc2, oMajorDir, oDistance );
ComputeAxisClosestDist( sc, sc2, sc0, oMajorDir, oDistance );
// inAlpha is result of AlphaTest,
// i.e., Alpha - AlphaRef.
// We assume alpha is a distance field.
void GetSignedDistanceFromAlpha( float inAlpha,
                                 out float oDistance,
                                 out bool  oGradientDir )
{
    // Find alpha test gradient
    float xGradient = ddx_fine( inAlpha );
    float yGradient = ddy_fine( inAlpha );
    oGradientDir = abs( xGradient ) > abs( yGradient );
    // Compute signed distance to where alpha reaches zero
    oDistance = -inAlpha * rcp( oGradientDir ? xGradient : yGradient );
}
Listing 3.1. Optimized GBAA distance to edge shader. This uses direct access to vertex
data from within the pixel shader.
In terms of quality, the analytical methods beat any morphological approach.
Unfortunately, they prove to be very problematic in many real-world
scenarios. Malan developed a very similar antialiasing solution and researched
further into the practical issues [Malan 10].
The main problem stems from subpixel triangles, which are unavoidable in
a real game production environment. If an actual silhouette edge is composed
of multiple small or thin triangles, then only one of them will get rasterized per
pixel. Therefore, its distance to the edge might not be the actual distance to the
silhouette that we want to antialias. In this case, the resulting artifact will show
up as several improperly smoothed pixels on an otherwise antialiased edge, which
tends to be very visually distracting (see Figure 3.4 and Figure 3.5).
Malan proposed several ways of dealing with this problem [Malan 10]. How-
ever, none of these solutions are very practical if not introduced at the very
beginning of the project, due to complex mesh processing and manual tweaking.
Another issue comes again from the actual data source. Hints for antialiasing
come from a single triangle, therefore it is impossible to correctly detect and pro-
cess intersections between triangles. Many assets in a real production scenario
have intersecting triangles (i.e., a statue put into the ground will have side trian-
gles intersecting with the terrain mesh). GPU rasterization solves intersections
by depth testing before and after rendering a triangle’s pixels. Therefore, there
is no analytical information about the edge created due to intersection. In effect,
the distance to the closest edge does not represent the distance to the intersection
edge, which results in a lack of antialiasing.
Figure 3.4. False distance to a silhouette edge due to subpixel triangles. Taking a single
triangle into account would result in rerasterization of a false edge (blue) instead of the
real silhouette edge (red).
intersections and edges within a pixel. With this information, we could partially
address most of the aforementioned issues. Fortunately, AMD's hardware EQAA
provides exactly the information we are interested in.
CRAA setup. Our goal is to use part of the EQAA pipeline to acquire coverage
information at a high resolution (8 samples per pixel) without paying the com-
putational and memory overhead of full MSAA rendering. We would like to use
Figure 3.6. The steps here illustrate an updated FMask as new triangles are rasterized.
Important note: in the last step, the red triangle does not need to evict Sample 3 if
it would fail a Z-test against the sample. (This, however, depends on the particular
hardware setup and is beyond the scope of this article.)
information recovered from the coverage data to derive blending hints in a similar
fashion to AEAA.
In our simplified case of 1F8S we know that FMask will be an 8-bit value,
where the nth bit being set to 0 represents the nth sample being associated with
the rasterized fragment (therefore it belongs to the current pixel’s triangle and
would pass depth testing), while 1 informs us that this sample is unknown—i.e.,
it was occluded by another triangle.
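As a small illustration of this simplified convention (one bit per sample; the real hardware FMask stores a fragment index per sample), the following sketch counts the known and unknown samples of a 1F8S mask. The names are ours:

#include <cstdint>

// Simplified 1F8S FMask interpretation as described above:
// bit n == 0 -> sample n belongs to the rasterized fragment (known),
// bit n == 1 -> sample n was covered by another triangle (unknown).
struct CoverageCounts { int known; int unknown; };

CoverageCounts CountCoverage(uint8_t fmask, int numSamples = 8)
{
    CoverageCounts c{0, 0};
    for (int i = 0; i < numSamples; ++i)
    {
        if (fmask & (1u << i)) ++c.unknown;
        else                   ++c.known;
    }
    return c;
}
// Example: fmask == 0x0F -> 4 known and 4 unknown samples, i.e., a
// 50/50 blend between the current and the inferred fragment.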
We can think of FMask as a subset of points that share the same color. If
we were to rasterize the current pixel with this newly acquired information, we
would need to blend the current pixel’s fragment weighted by the number of
its known coverage samples, with the other fragment represented by “unknown”
coverage samples. Without adding any additional rendering costs, we could infer
the unknown color fragments from neighboring pixels. We assume that the depth
buffer is working in a compressed mode and that EQAA is using analytical depth
testing, thus providing perfectly accurate coverage information.
Figure 3.8. Here we illustrate the process of finding an edge that divides a set of
samples into “known” and “unknown” samples. Later, this half plane is used to find an
appropriate neighboring pixel for deriving the unknown color value.
Single edge scenario. We can apply the same strategy behind AEAA to a simple
case in which only a single edge has crossed the pixel. In this case, the pixel’s
FMask provides a clear division of coverage samples: those that passed will be on
one side of the edge, while failed samples will be on the other side. Using a simple
line-fitting algorithm, we can find an edge that splits our set of samples into two
subsets—passed and failed. This edge approximates the real geometric edge of the
triangle that crossed the pixel. In the same spirit of the GBAA algorithm, we find
the major axis of the edge as well as its distance from the pixel’s center. Then we
just need to blend the nearest neighboring pixel color with the current fragment
using the edge distance as a weight. Thus, this technique infers the unknown
samples from the pixel closest to the derived half plane (see Figure 3.8).
float4 CRAA( Texture2DMS<float4> inColor,
             Texture2D<uint2>    inFMask,
             uint2               inTexcoord )
{
    // Read FMask / HW dependent
    uint   iFMask     = inFMask.Load( uint3( inTexcoord, 0 ) );
    uint   unknownCov = 0;
    float2 hP         = 0.0;
    // Average all directions to unknown samples
    // to approximate edge half plane
    for ( uint iSample = 0; iSample < NUM_SAMPLES; ++iSample )
        if ( getFMaskValueForSample( iFMask, iSample ) == UNKNOWN_CODE )
        {
            hP += inColor.GetSamplePosition( iSample );
            unknownCov++;
        }
    // Find fragment offset to pixel on the other side of edge
    int2 fOff = int2( 1, 0 );
    if ( abs( hP.x ) >  abs( hP.y ) && hP.x <= 0.0 ) fOff = int2( -1,  0 );
    if ( abs( hP.x ) <= abs( hP.y ) && hP.y >  0.0 ) fOff = int2(  0,  1 );
    if ( abs( hP.x ) <= abs( hP.y ) && hP.y <= 0.0 ) fOff = int2(  0, -1 );
    // Blend in inferred sample
    float  knownCov = NUM_SAMPLES - unknownCov;
    float4 color    = inColor.Load( inTexcoord, 0 )        * knownCov;
    color          += inColor.Load( inTexcoord + fOff, 0 ) * unknownCov;
    return color / NUM_SAMPLES;
}
Listing 3.2. A simple shader for finding the half plane that approximates the orientation
of the “unknown” subset of samples. The half plane is then used to find the closest pixel
on the other side of the edge in order to infer the unknown sample’s color.
Complex scenario. Following what we learned about resolving simple edges using
FMask, we would now like to apply similar ideas to resolving more complex
situations in which multiple edges cross a given pixel. In order to achieve this, we
would like to be able to group together “failed” samples from different triangles
Figure 3.9. A challenging rendering scenario for antialiasing (top left). Rasterization
grid and edge layout (top right). Simple 8×CRAA resulting in edge antialiasing compa-
rable to 8×MSAA apart from pixels that are intersected by multiple triangles (bottom
left). The results of 8×MSAA (bottom right).
into multiple disconnected sets. For every disconnected set, we find edges (up
to two edges in our implementation) that split it off from other sets. Then we
use the acquired edges to find major directions that should be used for subset
blending. For every subset of unknown fragments, we blend in a color from the
neighboring fragment associated with that subset and weighted by the subset’s
area coverage within the pixel. Finally, we sum all the color values for each
subset and blend this with the current fragment’s known color weighted by the
percentage of passing coverage samples. This way, we can partially reconstruct
the subpixel data using the current pixel’s surrounding neighborhood (see Figure
3.10).
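As an illustration of how such per-FMask blend weights could be precomputed offline, the sketch below assigns every unknown sample to the up/down/left/right neighbor its sample position points toward and accumulates normalized weights. This is a deliberately simplified stand-in for the per-subset edge fitting described above; the sample positions, the packing, and the names are assumptions:

#include <array>
#include <cmath>
#include <cstdint>

// Hypothetical 8x MSAA sample positions in pixel-centered coordinates
// (placeholders; a real builder would query the hardware pattern).
static const std::array<std::array<float, 2>, 8> kSamplePos = {{
    { 0.0625f, -0.1875f}, {-0.0625f,  0.1875f}, { 0.3125f,  0.0625f},
    {-0.1875f, -0.3125f}, {-0.3125f,  0.3125f}, {-0.4375f, -0.0625f},
    { 0.1875f,  0.4375f}, { 0.4375f, -0.4375f}
}};

// Weights for: current pixel, north, south, west, east neighbor.
struct BlendWeights { float c, n, s, w, e; };

BlendWeights BuildWeightsForFMask(uint8_t fmask)
{
    BlendWeights wgt{0, 0, 0, 0, 0};
    for (int i = 0; i < 8; ++i)
    {
        const float x = kSamplePos[i][0], y = kSamplePos[i][1];
        if ((fmask & (1u << i)) == 0) { wgt.c += 1.0f; continue; } // known sample
        // Unknown sample: charge it to the neighbor it points toward.
        if (std::fabs(x) > std::fabs(y)) { (x > 0 ? wgt.e : wgt.w) += 1.0f; }
        else                             { (y > 0 ? wgt.n : wgt.s) += 1.0f; }
    }
    // Normalize so the weights sum to one for the final resolve blend.
    const float inv = 1.0f / 8.0f;
    wgt.c *= inv; wgt.n *= inv; wgt.s *= inv; wgt.w *= inv; wgt.e *= inv;
    return wgt;
}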
Figure 3.10. One of the possible methods for finding blend weights for sample subsets.
The bottom image illustrates a blend weight resolve using a lookup table.
FMask analysis. Using 1F16S would also provide significantly better precision
and subpixel handling.
It is worth noting that even the simple logic presented in this section allows for
significant aliasing artifact reduction on thin triangles. Figure 3.11 illustrates a
Figure 3.11. Top to bottom: edge layout, rasterized edge, simple 8×CRAA resolve, and
8×CRAA LUT correctly resolving subpixel artifacts.
problematic case for AEAA, where our simple CRAA resolve correctly antialiased
the edge. Unfortunately, when there are too many subpixel triangles that don’t
pass rasterization, CRAA may also fail due to incorrect coverage information. In
practice, this heavily depends on the exact rendering situation, and still CRAA
has much more relaxed restrictions than AEAA.
The code snippet in Listing 3.3 illustrates the CRAA LUT resolve properly
resolving minor subpixel details (see Figure 3.10).
float4 CRAA_LUT( Texture2DMS<float4> inColor,
                 Texture2D<uint2>    inFMask,
                 Texture1D<uint>     inCRAALUT,
                 uint2               inTexcoord )
{
    // Read FMask / HW dependent
    uint iFMask    = inFMask.Load( uint3( inTexcoord, 0 ) );
    uint LUTResult = inCRAALUT[ iFMask ];
    float wC, wN, wE, wS, wW;
    // LUT is packed as 8-bit integer weights:
    // North 8b | West 8b | South 8b | East 8b
    // Can also pack whole neighborhood weights in 4 bits
    ExtractLUTWeights( LUTResult, wC, wN, wE, wS, wW );
    ...
Taking all pros and cons into account, we decided to pursue the highest pos-
sible quality with k = 1. This means that we are only dealing with one frame of
history, and for every single frame we have two unique samples at our disposal
(assuming that our history sample was accepted as valid); we would like to get
as much data from them as possible.
Figure 3.14. Quincunx sampling and resolve pattern guarantees higher-quality results
than 2×MSAA while still keeping the sample count at 2.
Figure 3.15. The 4× rotated-grid super-sampling pattern maximizes row and column
coverage.
3.6.3 FLIPQUAD
[Akenine-Möller 03] proposed several other low-sample-cost patterns such as FLIP-
TRI and FLIPQUAD. We will focus on FLIPQUAD as it perfectly matches our
goal of using just two unique samples. This sampling pattern is similar to quin-
cunx in its reuse of samples between pixels. However, a massive quality improve-
ment comes from putting sampling points on pixel edges in a fashion similar to
the rotated-grid sampling patterns. This provides unique rows and columns for
each sample, therefore guaranteeing the maximum possible quality.
The FLIPQUAD pattern requires a custom per-pixel resolve kernel as well as
custom per-pixel sampling positions (see Figure 3.16). An important observation
is that the pattern is mirrored, therefore every single pixel quad is actually the
same.
The article [Laine and Aila 06] introduced a unified metric for sampling pat-
tern evaluation and proved FLIPQUAD to be superior to quincunx, even sur-
passing the 4× rotated-grid pattern when dealing with geometric edges (see Fig-
ure 3.17 and Table 3.1).
We can clearly see that the resolve kernel can be implemented in a typical pixel shader.
However, the per-pixel sampling offsets within a quad were not supported in hard-
ware until modern AMD graphic cards exposed the EQAA rasterization pipeline
extensions. This feature is exposed on Xbox One and Playstation 4, as well as
through an OpenGL extension on PC [Alnasser 11].
Figure 3.16. FLIPQUAD provides optimal usage of two samples matching quality of
4× rotated-grid resolve.
Figure 3.17. Left to right: single sample, FLIPQUAD, and quincunx. [Image courtesy
of [Akenine-Möller 03].]
Pattern                 E
1× Centroid             > 1.0
2 × 2 Uniform Grid      0.698
2 × 2 Rotated Grid      0.439
Quincunx                0.518
FLIPQUAD                0.364
Table 3.1. Error metric (E) comparison against a 1024-sample reference image as
reported by [Laine and Aila 06] (lower is better).
return 0.25 * ( s0 + s1 + s2 + s3 ); }
Figure 3.18. Temporal FLIPQUAD pattern. Red samples are rendered on even frames.
Blue samples are rendered on odd frames.
s0 = CurrentFrame.Sample( PointSampler, UV );
s1 = CurrentFrame.Sample( PointSampler, UV, offset0 );
s2 = PreviousFrame.Sample( LinearSampler, previousUV );
s3 = PreviousFrame.Sample( LinearSampler, previousUV, offset1 );
return 0.25 * ( s0 + s1 + s2 + s3 ); }
Figure 3.19. Top to bottom: edge rasterized on an even frame and then an odd frame
and the final edge after temporal FLIPQUAD reconstruction kernel.
// Quad defined as (sample positions within quad):
//   s00 s10
//   s01 s11
DDX[ f(s00) ] = [ f(s00) - f(s10) ] / dx,  dx = | s00 - s10 |
DDY[ f(s00) ] = [ f(s00) - f(s01) ] / dy,  dy = | s00 - s01 |
// Hardware assumes dx = dy = 1.
// In case of the sampling pattern from Listing 6, dx != dy.
// Footprint-based sampling picks the base mip level
// based on max( ddx, ddy ).
// Frame A max( ddx, ddy ) != Frame B max( ddx, ddy )
// implies non temporally coherent mip selection.

// Calculated in 1/16th of a pixel:
// Frame A (BLUE)
dx = | -8 - ( 16 + (-8) ) | = 16
dy = | -2 - ( 16 + ( 2) ) | = 20
baseMip ~ max( dx, dy ) = 20
// Frame B (RED)
dx = |  2 - ( 16 + (-2) ) | = 12
dy = | -8 - ( 16 + (-8) ) | = 16
baseMip ~ max( dx, dy ) = 16
Figure 3.20. The top row shows even and odd frames of the reordered Temporal
FLIPQUAD pattern. The bottom row shows the default temporal FLIPQUAD pat-
tern clearly suffering from mipmap level mismatches. (The bottom right represents an
oversharpened odd frame).
Listing 3.9. Reordered temporal FLIPQUAD with additional projection matrix offsets.
s0 = CurrentFrame.Sample( PointSampler, UV );
s1 = CurrentFrame.Sample( PointSampler, UV, offset0 );
s2 = PreviousFrame.Sample( LinearSampler, previousUV );
s3 = PreviousFrame.Sample( LinearSampler, previousUV, offset1 );
return 0.25 * ( s0 + s1 + s2 + s3 ); }
3.6.7 Resampling
Any reprojection method is prone to numerical diffusion errors. When a frame
is reprojected using motion vectors and newly acquired sampling coordinates do
not land exactly on a pixel, a resampling scheme must be used. Typically, most
methods resort to simple bilinear sampling. However, bilinear sampling will result
in over-smoothing. If we would like to use a history buffer in order to accumulate
multiple samples, we will also accumulate resampling errors, which can lead to
serious image quality degradation (see Figure 3.22). Fortunately, this problem is
very similar to well-researched fluid simulation advection optimization problems.
In fluid simulation, the advection step is very similar to our problem of image
reprojection. A data field of certain quantities (i.e., pressure and temperature)
has to be advected forward in time by a motion field. In practice, both fields
are stored in discretized forms; thus, the advection step needs to use resampling.
Assuming that the operation is a linear transform, this situation is equal to the
problem of reprojection.
Under these circumstances, a typical semi-Lagrangian advection step would be
equal to reprojection using bilinear resampling. A well-known method to prevent
over-smoothing is to use second order methods for advection. There are several
known methods to optimize this process, assuming that the advection operator is
reversible. One of them is the MacCormack scheme and its derivation: back and
forth error compensation and correction (BFECC). This method enables one to
closely approximate the second order accuracy using only two semi-Lagrangian
steps [Dupont and Liu 03].
BFECC is very intuitive. In short, we advect the solution forward and back-
ward in time using advection operator A and its reverse, AR . Operator error is
estimated by comparing the original value ϕ^n against the newly acquired one ϕ̂^n. The
original value is corrected by the error term (ϕ^n − ϕ̂^n)/2 and finally advected forward into the
next step of the solution (see Algorithm 3.1 and Figure 3.21 for an illustration).
In the context of reprojection, our advection operator is simply a bilinear
sample using a motion vector offset. It is worth noting that the function described
by the motion vector texture is not reversible (i.e., multiple pixels might move to
the same discretized position).
A correct way to acquire a reversible motion vector offset would be through a
depth-buffer–based reprojection using an inverse camera matrix. Unfortunately,
this would limit the operator to pixels subject to camera motion only. Also, the
operator would be invalid on pixels that were occluded during the previous time
step.
Algorithm 3.1 (BFECC). A() is the advection operator and AR() its reverse; ϕ̂ denotes an intermediate (resampled) value:

1: ϕ̂^(n+1) = A(ϕ^n)
2: ϕ̂^n = AR(ϕ̂^(n+1))
3: ϕ̄^n = ϕ^n + (ϕ^n − ϕ̂^n)/2
4: ϕ^(n+1) = A(ϕ̄^n)

A variant that applies the correction after the advection rather than before it:

1: ϕ̂^(n+1) = A(ϕ^n)
2: ϕ̂^n = AR(ϕ̂^(n+1))
3: ϕ^(n+1) = ϕ̂^(n+1) + (ϕ^n − ϕ̂^n)/2

The shader-optimized variant used for texture resampling (see Figure 3.21, right), where ϕ̂̂^(n+1) is the twice-resampled value (phiHatHat in the listing below):

1: ϕ̂^(n+1) = A(ϕ^n)
2: ϕ̂^n = AR(ϕ̂^(n+1))
3: ϕ̂̂^(n+1) = A(ϕ̂^n)
4: ϕ^(n+1) = ϕ̂^(n+1) + (ϕ̂^(n+1) − ϕ̂̂^(n+1))/2
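To make the scheme concrete, the following is a minimal, self-contained 1D sketch of the shader-optimized variant, using clamped linear resampling as the advection operator A() and the negated offset as AR(). The field, the constant motion offset, and the helper names are illustrative assumptions, not the production implementation:

#include <algorithm>
#include <cmath>
#include <vector>

// Linear resampling of a 1D field at a fractional coordinate (clamped).
static float SampleLinear(const std::vector<float>& f, float x)
{
    x = std::fmax(0.0f, std::fmin(x, float(f.size() - 1)));
    const int   i0 = int(x);
    const int   i1 = std::min(i0 + 1, int(f.size()) - 1);
    const float t  = x - float(i0);
    return f[i0] * (1.0f - t) + f[i1] * t;
}

// One BFECC-style resampling step for a constant motion offset (in texels):
//   phiHat^{n+1}    = A(phi^n)          (forward advection)
//   phiHat^{n}      = AR(phiHat^{n+1})  (backward advection)
//   phiHatHat^{n+1} = A(phiHat^{n})     (forward again)
//   phi^{n+1}       = 1.5*phiHat^{n+1} - 0.5*phiHatHat^{n+1}
std::vector<float> ResampleBFECC(const std::vector<float>& phiN, float motion)
{
    const int n = int(phiN.size());
    std::vector<float> phiHatN1(n), phiHatN(n), phiHatHatN1(n), phiN1(n);
    for (int i = 0; i < n; ++i)
        phiHatN1[i] = SampleLinear(phiN, float(i) - motion);
    for (int i = 0; i < n; ++i)
        phiHatN[i] = SampleLinear(phiHatN1, float(i) + motion);
    for (int i = 0; i < n; ++i)
        phiHatHatN1[i] = SampleLinear(phiHatN, float(i) - motion);
    for (int i = 0; i < n; ++i)
        phiN1[i] = 1.5f * phiHatN1[i] - 0.5f * phiHatHatN1[i];
    return phiN1;
}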
// Pass outputs phiHatN1Texture
// A() operator uses motion vector texture
float3 GetPhiHatN1( float2 inUV, int2 inVPOS )
{
    float2 motionVector = MotionVectorsT.Load( int3( inVPOS, 0 ) ).xy;
    float2 forwardProj  = inUV + motionVector;
    // Perform advection by operator A()
    // (sample the previous-frame phi^n texture at forwardProj)
    ...
}

// Pass outputs phiHatTexture
// AR() operator uses negative value from motion vector texture
// phiHatN1 texture is generated by previous pass GetPhiHatN1()
float3 GetPhiHatN( float2 inUV, int2 inVPOS )
{
    float2 motionVector = MotionVectorsT.Load( int3( inVPOS, 0 ) ).xy;
    float2 backwardProj = inUV - motionVector;
    // Perform reverse advection by operator AR()
    return phiHatN1T.SampleLevel( Linear, backwardProj, 0 ).rgb;
}

// Final operation to get correctly resampled phiN1
// A() operator uses motion vector texture
// phiHatN1 and phiHatN textures are generated by previous passes
float3 GetResampledValueBFECC( float2 inUV, int2 inVPOS )
{
    float3 phiHatN1     = phiHatN1T.Load( int3( inVPOS, 0 ) ).rgb;
    float2 motionVector = MotionVectors.Load( int3( inVPOS, 0 ) ).xy;
    float2 A            = inUV + motionVector;
    // Perform advection by operator A()
    float3 phiHatHatN1 = phiHatT.SampleLevel( Linear, A, 0 ).rgb;
    // Perform BFECC:
    // phiN1 = phiHatN1 + (phiHatN1 - phiHatHatN1)/2 = 1.5*phiHatN1 - 0.5*phiHatHatN1
    float3 phiN1 = 1.5 * phiHatN1 - 0.5 * phiHatHatN1;
    return phiN1;
}
Figure 3.21. Conceptual scheme of the original BFECC method (left) and of the shader-optimized BFECC used for texture resampling (right).
Figure 3.22. Continuous resampling of 30 frames using a history buffer. The camera is
in motion, panning from left to right. Using bilinear sampling shows numerical diffusion
errors resulting in a blurry image (left). Using the optimized linear BFECC helps to minimize
blurring (right).
artifacts may occur. Ideally we would like to accumulate more frames over time
to further improve image quality. Unfortunately, as described in Sections 3.6.1
and 3.6.6, it is very hard to provide a robust method that will work in real-world
situations, while also using multiple history samples, without other artifacts.
Therefore, several methods rely on super-sampling only in certain local contrast
regions of an image [Malan 12, Sousa 13, Valient 14]. These approaches rely on
visually plausible temporal stabilization (rather than super-sampling). We would
like to build upon these approaches to further improve our results.
// Motion coherency weight
float motionDelta     = length( inCurMotionVec - inPrevMotionVec );
float motionCoherence = saturate( c_motionSens * motionDelta );

// Calculate color window range
float3 range = inCurMax - inCurMin;
// Offset the window bounds by delta percentage
float3 extOffset = c_deltaColorWindowOffset * range;
float3 extBoxMin = max( inCurMin - extOffset.rgb, 0.0 );
float3 extBoxMax = inCurMax + extOffset;

// Calculate deltas for current pixel against previous
// (valDiff and clampedPrevVal are computed in the earlier part of this listing)
float3 meanWeight = abs( inCurValue - inPrevValue );
float  loContrast = length( meanWeight ) * c_loWeight;
float  hiContrast = length( valDiff ) * c_hiWeight;

// Calculate final weights
float denom       = max( ( loContrast - hiContrast ), 0.0 );
float finalWeight = saturate( rcp( denom + epsilon ) );

// Correct previous samples according to motion coherency weights
finalWeight = saturate( finalWeight - motionCoherence );

// Final value blend
return lerp( inCurValue, clampedPrevVal, finalWeight );
}
◦ SMAA,
Figure 3.24. Data flow graph in our implementation of the HRAA pipeline.
◦ CRAA,
◦ AEAA (GBAA);
• Temporal FLIPQUAD reconstruction combined with temporal antialiasing
(TAA) (see Listing 3.13).
Figure 3.24 illustrates the data flow inside the framework.
During production, we implemented and optimized all three approaches to
temporally stable edge antialiasing.
SMAA was implemented with geometric edge detection based on depth and
normal buffers. Edges were refined by a predicated threshold based on the luminance
contrast. Our edge-detection algorithm choice was dictated by making
the resolve as temporally stable as possible.
CRAA and AEAA used the implementations described in Sections 3.5.1 and
3.5.2. Our EQAA setup used a 1F8S configuration, while our AEAA offset buffer
was compressed down to 5 bits (utilizing the last remaining space in our tightly
packed G-buffer).
The results of either edge antialiasing pass were used as N , N − 1, and N − 2
frame sources in the last pass. The history buffer used by TAA at frame N was
the output buffer of TAA from frame N − 1.
// Unoptimized pseudocode for final
// Temporal FLIPQUAD reconstruction & TAA.
// Frames N & N-2 are assumed
// to have the same jitter offsets.
float3 getFLIPQUADTaa()
{
    float3 curMin, curMax, curMean;
    GetLimits( curValueTexture, curMin, curMax, curMean );

    // Get sums of absolute difference
    float3 curSAD      = GetSAD( curValueTexture );
    float3 prevPrevSAD = GetSAD( prevPrevValueTexture );

    // Motion coherency weight
    float moCoherence = GetMotionCoherency( curMotionTexture,
                                            prevMotionTexture );
    // Color coherency weight
    float colCoherence = GetColorCoherency( curSAD, prevPrevSAD );

    // FLIPQUAD parts
    float3 FQCurPart   = GetCurFLIPQUAD( curValueTexture );
    float3 FQPrevPart  = GetPrevFLIPQUAD( prevValueTexture );
    float  FQCoherency = moCoherence + colCoherence;
    float3 clampFQPrev = clamp( FQPrevPart, curMin, curMax );

    // This lerp allows full convergence
    // if color flow (N-2 to N) is coherent
    FQPrevPart = lerp( FQPrevPart, clampFQPrev, colCoherence );

    // Final reconstruction blend
    float3 FLIPQUAD = lerp( FQCurPart, FQPrevPart, 0.5 * moCoherence );
    ...
Listing 3.13. Pseudocode for the combined temporal FLIPQUAD reconstruction and
temporal antialiasing.
While temporal FLIPQUAD and TAA remained stable and reliable compo-
nents of the framework, the choice of the edge antialiasing solution proved to be
problematic.
SMAA provided the most visually plausible results on static pixels under any
circumstances. The gradients were always smooth and no edge was left without
antialiasing. Unfortunately, it sometimes produced distracting gradient wobble
while in motion. The wobble was partially mitigated by the FLIPQUAD and
TAA resolves. Unfortunately, SMAA had the highest runtime cost out of the
whole framework.
AEAA provided excellent stability and quality, even in close-ups where tri-
angles are very large on screen. Unfortunately, objects with very high levels
of tessellation resulted in very objectionable visual noise or even a total loss of
antialiasing on some edges. Even though this was the fastest method for edge
antialiasing, it proved too unreliable for our open world game. It is worth noting
that our AEAA implementation required us to modify every single shader that
writes out to the G-buffer. This might be prohibitively expensive in terms of
developer maintainability and runtime performance.
CRAA mitigated most of the issues seen with AEAA and was also the easiest
technique to implement. Unfortunately, on the current generation of hardware,
there is a measurable cost for using even a simple EQAA setup and the cost scales
with the number of rendered triangles and their shader complexity. However,
in our scenario, it was still faster than SMAA alone. Even though we were
able to solve multiple issues, we still found some finely tessellated content that
was problematic with this technique and resulted in noisy artifacts on edges.
These artifacts could be effectively filtered by temporal FLIPQUAD and TAA.
Unfortunately the cost of outputting coverage data from pixel shaders was too
high for our vegetation-heavy scenarios. We did not experiment with manual
coverage output (i.e., not hardware based).
At the time of writing, we have decided to focus on two main approaches for
our game: SMAA with AEAA used for alpha-tested geometry or CRAA with
AEAA used for alpha-tested geometry. SMAA with AEAA is the most expensive
and most reliable while also providing the lowest temporal stability. CRAA with
AEAA provides excellent stability and performance with medium quality and
medium reliability. The use of AEAA for alpha-tested objects seems to provide
the highest quality, performance, and stability in both use cases; therefore, we
integrated its resolve filter into the SMAA and CRAA resolves. See the perfor-
mance and image quality comparisons of the full HRAA framework in Figure 3.25
and Table 3.2.
3.10 Conclusion
We provided a production proven hybrid reconstruction antialiasing framework
along with several new algorithms, as well as modern implementations of well-
known algorithms. We believe that the temporal FLIPQUAD super-sampling
as well as temporal antialiasing will gain wider adoption due to their low cost,
simplicity, and quality. Our improvements to distance-to-edge–based methods
might prove useful for some projects. Meanwhile, CRAA is another addition to
the temporally stable antialiasing toolbox. Considering its simplicity of imple-
mentation and its good performance, we believe that with additional research it
might prove to be a viable, widely adopted edge antialiasing solution. We hope
that the ideas presented here will inspire other researchers and developers and
provide readers with valuable tools for achieving greater image quality in their
projects.
Figure 3.25. Comparison of different HRAA setups showing different scenarios based on
actual game content. From left to right: centroid sampling (no antialiasing), temporal
FLIPQUAD (TFQ), AEAA + TFQ, CRAA + TFQ, and SMAA + TFQ.
Table 3.2. Different HRAA passes and timings measured on an AMD Radeon HD 7950
at 1080p resolution, operating on 32-bit image buffers. “C” means content dependent
and “HW” means hardware type or setup dependent.
Bibliography
[Akenine-Möller 03] T. Akenine-Möller. “An Extremely Inexpensive Multisam-
pling Scheme.” Technical Report No. 03-14, Ericsson Mobile Platforms AB,
2003.
[AMD 11] AMD Developer Relations. “EQAA Modes for AMD 6900 Se-
ries Graphics Cards.” http://developer.amd.com/wordpress/media/2012/
10/EQAAModesforAMDHD6900SeriesCards.pdf, 2011.
[Alnasser 11] M. Alnasser, G. Sellers, and N. Haemel. “AMD Sample Positions.”
OpenGL Extension Registry, https://www.opengl.org/registry/specs/AMD/
sample positions.txt, 2011.
[Bavoil and Andersson 12] L. Bavoil and J. Andersson. “Stable SSAO in Bat-
tlefield 3 with Selective Temporal Filtering.” Game Developer Conference
Course, San Francisco, CA, March 5–9, 2012.
[Burley 07] B. Burley. “Filtering in PRMan.” Renderman Repository,
https://web.archive.org/web/20130915064937/http:/www.renderman.
org/RMR/st/PRMan Filtering/Filtering In PRMan.html, 2007. (Original
URL no longer available.)
[Drobot 11] M. Drobot. “A Spatial and Temporal Coherence Framework for Real-
Time Graphics.” In Game Engine Gems 2, edited by Eric Lengyel, pp. 97–
118. Boca Raton, FL: CRC Press, 2011.
[Drobot 14] M. Drobot. “Low Level Optimizations for AMD GCN Architecture.”
Presented at Digital Dragons Conference, Kraków, Poland, May 8–9, 2014.
[Dupont and Liu 03] T. Dupont and Y. Liu. “Back and Forth Error Compen-
sation and Correction Methods for Removing Errors Induced by Uneven
Gradients of the Level Set Function.” J. Comput. Phys. 190:1 (2003), 311–
324.
[Jimenez et al. 11] J. Jimenez, B. Masia, J. Echevarria, F. Navarro, and D.
Gutierrez. “Practical Morphological Antialiasing.” In GPU Pro 2: Advanced
Rendering Techniques, edited by Wolfgang Engel, pp. 95–114. Natick, MA:
A K Peters, 2011.
[Jimenez et al. 12] J. Jimenez, J. Echevarria, D. Gutierrez, and T. Sousa.
“SMAA: Enhanced Subpixel Morphological Antialiasing.” Computer Graph-
ics Forum: Proc. EUROGRAPHICS 2012 31:2 (2012), 355–364.
[Kirkland et al. 99] Dale Kirkland, Bill Armstrong, Michael Gold, Jon Leech, and
Paula Womack. “ARB Multisample.” OpenGL Extension Registry, https://
www.opengl.org/registry/specs/ARB/multisample.txt, 1999.
[Laine and Aila 06] S. Laine and T. Aila. “A Weighted Error Metric and Opti-
mization Method for Antialiasing Patterns.” Computer Graphics Forum 25:1
(2006), 83–94.
[Malan 12] H. Malan. “Realtime global illumination and reflections in Dust 514.”
Advances in Real-Time Rendering in Games: Part 1, SIGGRAPH Course,
Los Angeles, CA, August 5–9, 2012.
[Selle et al. 08] A. Selle, R. Fedkiw, B. Kim, Y. Liu, and J. Rossignac. “An Un-
conditionally Stable MacCormack Method.” J. Scientific Computing 35:2–3
(2008), 350–371.
[Valient 14] M. Valient. “Taking Killzone Shadow Fall Image Quality into the
Next Generation.” Presented at Game Developers Conference, San Fran-
cisco, CA, March 17–21, 2014.
Real-Time Rendering of
Physically Based Clouds Using
Precomputed Scattering
Egor Yusov
4.1 Introduction
Rendering realistic clouds has always been a desired feature for a variety of appli-
cations, from computer games to flight simulators. Clouds consist of innumerable
tiny water droplets that scatter light. Rendering clouds is challenging because
photons are typically scattered multiple times before they leave the cloud. De-
spite the impressive performance of today’s GPUs, accurately modeling multiple
scattering effects is prohibitively expensive, even for offline renderers. Thus, real-
time methods rely on greatly simplified models.
Using camera-facing billboards is probably the most common real-time method
[Dobashi et al. 00, Wang 04, Harris and Lastra 01, Harris 03]. However, bill-
boards are flat, which breaks the volumetric experience under certain conditions.
These methods have other limitations: lighting is precomputed resulting in static
clouds [Harris and Lastra 01], multiple scattering is ignored [Dobashi et al. 00],
or lighting is not physically based and requires tweaking by artists [Wang 04].
Volume rendering techniques are another approach to render clouds [Schpok
et al. 03, Miyazaki et al. 04, Riley et al. 04]. To avoid aliasing artifacts, many
slices usually need to be rendered, which can create a bottleneck, especially
on high-resolution displays. More physically accurate methods exist [Bouthors
et al. 06, Bouthors et al. 08], which generate plausible visual results, but are
difficult to reproduce and computationally expensive.
We present a new physically based method to efficiently render realistic ani-
mated clouds. The clouds are comprised of scaled and rotated copies of a single
particle called the reference particle. During the preprocessing stage, we precom-
pute optical depth as well as single and multiple scattering integrals describing
the light transport in the reference particle for all possible camera positions and
view directions and store the results in lookup tables. At runtime, we load the
data from the lookup tables to approximate the light transport in the cloud in
order to avoid costly ray marching or slicing. In this chapter, we elaborate upon
our previous work [Yusov 14b]. In particular, the following improvements have
been implemented:
We briefly review the main concepts of this method, but we will concentrate on
implementation details and improvements. Additional information can be found
in the original paper [Yusov 14b].
where P = A + s · (B − A)/‖B − A‖ is the current integration point.
To determine the intensity of single scattered light, we need to step along the
view ray and accumulate all the differential amounts of sunlight scattered toward the camera at every point along the ray:

L_In^(1)(C, v) = ∫_{P0}^{P1} L_Sun · β(P) · P(θ) · e^(−τ(Q,P)) · e^(−τ(P,P0)) ds.    (4.2)
In this equation, C is the camera position and v is the view direction. P0 and
P1 are the points where the view ray enters and leaves the cloud body, LSun is
the intensity of sunlight outside the cloud, and Q is the point through which the
sunlight reaches the current integration point P (Figure 4.1). P (θ) is the phase
function that defines how much energy is scattered from the incident direction
to the outgoing direction, with θ being the angle between the two. Note that
the sunlight is attenuated twice before it reaches the camera: by the factor of
e−τ (Q,P) on the way from the entry point Q to the scattering point P, and by
the factor of e−τ (P,P0 ) on the way from the scattering point to the camera.
The phase function for cloud droplets is very complex [Bohren and Huff-
man 98]. In real-time methods, it is common to approximate it using the Cornette-
Shanks function [Cornette and Shanks 92]:
P(θ) ≈ (1 / (4π)) · (3(1 − g²) / (2(2 + g²))) · (1 + cos²(θ)) / (1 + g² − 2g·cos(θ))^(3/2).    (4.3)
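For reference, a direct transcription of Equation (4.3) into C++ (the function name is ours):

#include <cmath>

// Cornette-Shanks phase function, Equation (4.3).
// cosTheta is the cosine of the angle between the incident and outgoing
// directions, g is the anisotropy factor.
double CornetteShanksPhase(double cosTheta, double g)
{
    const double pi    = 3.14159265358979323846;
    const double g2    = g * g;
    const double num   = 3.0 * (1.0 - g2) * (1.0 + cosTheta * cosTheta);
    const double denom = 2.0 * (2.0 + g2) *
                         std::pow(1.0 + g2 - 2.0 * g * cosTheta, 1.5);
    return (1.0 / (4.0 * pi)) * num / denom;
}

The runtime listings later in this chapter use a Henyey–Greenstein helper (HGPhaseFunc); the Cornette–Shanks form above differs from it by the additional 3(1 + cos²(θ))/(2(2 + g²)) factor.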
Using the single scattering intensity L_In^(1), we can compute secondary scattering L_In^(2), then third-order scattering L_In^(3), and so on. The nth-order scattering intensity measured at point C when viewing in direction v is given by the following integral:

L_In^(n)(C, v) = ∫_{P0}^{P1} J^(n)(P, v) · e^(−τ(P,P0)) ds.    (4.4)
In Equation (4.4), J^(n)(P, v) is the net intensity of order n − 1 light L_In^(n−1) that is scattered in the view direction:

J^(n)(P, v) = β(P) · ∫_Ω L_In^(n−1)(P, ω) · P(θ) dω,    (4.5)

where integration is performed over the whole sphere of directions Ω, and θ is the angle between ω and v (see Figure 4.2).
The total in-scattering intensity is found by calculating the sum of all scattering orders:

L_In(C, v) = Σ_{n=1}^{∞} L_In^(n)(C, v).    (4.6)
The final radiance measured at the camera is the sum of the in-scattered intensity
and the background radiance L_B (see Figure 4.1) attenuated in the cloud:

L(C, v) = L_In(C, v) + e^(−τ(P0,P1)) · L_B.
Figure 4.3. Volumetric particle (left) and 4D parameterization (middle and right).
Equation (4.1) through the particle for every camera position and view direction.
To describe every ray piercing the particle, we need a 4D parameterization. The
first two parameters are the azimuth ϕS ∈ [0, 2π] and zenith θS ∈ [0, π] angles
of the point S where the view ray enters the particle’s bounding sphere (Fig-
ure 4.3 (middle)). The other two parameters are the azimuth ϕv ∈ [0, 2π] and
zenith θv ∈ [0, π/2] angles of the view ray in the tangent frame constructed at
the entry point S (Figure 4.3 (right)). The z-axis of this frame is pointing toward
the sphere center. Note that we only need to consider the rays going inside the
sphere; thus, the maximum value for θv is π/2.
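For illustration, here is a minimal sketch of turning the four angles back into a ray against the unit reference sphere, roughly what a helper such as OpticalDepthLUTCoordsToWorldParams would do after unpacking the lookup coordinates. The tangent-frame convention and all names are assumptions:

#include <cmath>

struct Float3 { float x, y, z; };

// Build the ray entry point S on the unit sphere from (phiS, thetaS) and
// the ray direction from (phiV, thetaV), given in the tangent frame at S
// whose z-axis points toward the sphere center (as described above).
void AnglesToRay(float phiS, float thetaS, float phiV, float thetaV,
                 Float3& outStart, Float3& outDir)
{
    // Entry point on the unit sphere.
    outStart = { std::sin(thetaS) * std::cos(phiS),
                 std::sin(thetaS) * std::sin(phiS),
                 std::cos(thetaS) };

    // Tangent frame at S: z points toward the center, x/y span the tangent plane.
    Float3 zAxis = { -outStart.x, -outStart.y, -outStart.z };
    Float3 xAxis = { -std::sin(phiS), std::cos(phiS), 0.0f };   // tangent along phiS
    Float3 yAxis = { zAxis.y * xAxis.z - zAxis.z * xAxis.y,     // z cross x
                     zAxis.z * xAxis.x - zAxis.x * xAxis.z,
                     zAxis.x * xAxis.y - zAxis.y * xAxis.x };

    // Direction with zenith thetaV (<= pi/2, pointing inward) and azimuth phiV.
    const float sz = std::sin(thetaV), cz = std::cos(thetaV);
    const float ca = std::cos(phiV),   sa = std::sin(phiV);
    outDir = { sz * (ca * xAxis.x + sa * yAxis.x) + cz * zAxis.x,
               sz * (ca * xAxis.y + sa * yAxis.y) + cz * zAxis.y,
               sz * (ca * xAxis.z + sa * yAxis.z) + cz * zAxis.z };
}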
To precompute the optical depth integral, we go through all possible values
of ϕS , θS , ϕv , and θv and numerically evaluate the integral in Equation (4.1).
Section 4.5.1 provides additional details.
Figure 4.4. Volume-aware blending when the new particle does not intersect the closest
element.
Figure 4.5. Volume-aware blending when the new particle intersects the closest element.
(Figure 4.5). Next, the color of the intersection is computed using the density-
weighted average:
4.5 Implementation
We implemented our method in C++ using the Direct3D 11 API. The full source
code can be found in the supplemental materials to this book. It is also available
at https://github.com/GameTechDev/CloudsGPUPro6.
 1  float2 PrecomputeOpticalDepthPS( SQuadVSOutput In ) : SV_Target
 2  {
 3      float3 f3StartPos, f3RayDir;
 4      // Convert lookup table 4D coordinates into the start
 5      // position and view direction
 6      OpticalDepthLUTCoordsToWorldParams(
 7          float4( ProjToUV( In.m_f2PosPS ), g_Attribs.f4Param.xy ),
 8          f3StartPos, f3RayDir );
 9
10      // Intersect the view ray with the unit sphere
11      float2 f2RayIsecs;
12      // f3StartPos is located exactly on the surface; slightly
13      // move it inside the sphere to avoid precision issues
14      GetRaySphereIntersection( f3StartPos + f3RayDir * 1e-4, f3RayDir,
15                                0, 1.f, f2RayIsecs );
16
17      float3 f3EndPos = f3StartPos + f3RayDir * f2RayIsecs.y;
18      float fNumSteps = NUM_INTEGRATION_STEPS;
19      float3 f3Step = ( f3EndPos - f3StartPos ) / fNumSteps;
20      float fTotalDensity = 0;
21      for ( float fStepNum = 0.5; fStepNum < fNumSteps; ++fStepNum )
22      {
23          float3 f3CurrPos = f3StartPos + f3Step * fStepNum;
24          float fDensity = ComputeDensity( f3CurrPos );
25          fTotalDensity += fDensity;
26      }
27
28      return fTotalDensity / fNumSteps;
29  }
(lines 3–8). The first two components come from the pixel position, the other
two are stored in the g_Attribs.f4Param.xy uniform variable. The shader then
intersects the ray with the unit sphere (lines 11–15) and finds the ray exit point
(line 17). The GetRaySphereIntersection() function takes the ray start position
and direction, sphere center (which is 0), and radius (which is 1) as inputs and
returns the distances from the start point to the intersections in its fifth
argument (the smaller value always goes first). Finally, the shader performs numerical
integration of Equation (4.1). Instead of storing the integral itself, we store the
normalized average density along the ray, which always lies in the range [0, 1]
and can be sufficiently represented with an 8-bit UNorm value. Optical depth
is reconstructed by multiplying that value by the ray length and extinction co-
efficient. The ComputeDensity() function combines several 3D noises to evaluate
density at the current point.
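In other words, at runtime the stored value is expanded back into an optical depth along the lines of the following sketch (names are ours):

// Reconstruct optical depth from the normalized average density stored
// in the lookup table (an 8-bit UNorm value in [0, 1]).
float ReconstructOpticalDepth(float normalizedAvgDensity,  // LUT sample
                              float rayLengthInsideParticle,
                              float extinctionCoeff)
{
    return normalizedAvgDensity * rayLengthInsideParticle * extinctionCoeff;
}
// Transparency along the ray is then exp(-tau), with
// tau = ReconstructOpticalDepth(...).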
 1  float PrecomputeSingleSctrPS( SQuadVSOutput In ) : SV_Target
 2  {
 3      float3 f3EntryPoint, f3ViewRay, f3LightDir;
 4      ScatteringLUTToWorldParams(
 5          float4( ProjToUV( In.m_f2PosPS ), g_Attribs.f4Param.xy ),
 6          g_Attribs.f4Param.z, f3EntryPoint, f3ViewRay, f3LightDir );
 7
 8      // Intersect the view ray with the unit sphere
 9      float2 f2RayIsecs;
10      GetRaySphereIntersection( f3EntryPoint, f3ViewRay,
11                                0, 1.f, f2RayIsecs );
12      float3 f3EndPos = f3EntryPoint + f3ViewRay * f2RayIsecs.y;
13
14      float fNumSteps = NUM_INTEGRATION_STEPS;
15      float3 f3Step = ( f3EndPos - f3EntryPoint ) / fNumSteps;
16      float fStepLen = length( f3Step );
17      float fCloudMassToCamera = 0;
18      float fParticleRadius = g_Attribs.RefParticleRadius;
19      float fInscattering = 0;
20      for ( float fStepNum = 0.5; fStepNum < fNumSteps; ++fStepNum )
21      {
22          float3 f3CurrPos = f3EntryPoint + f3Step * fStepNum;
23          GetRaySphereIntersection( f3CurrPos, f3LightDir,
24                                    0, 1.f, f2RayIsecs );
25          float fCloudMassToLight = f2RayIsecs.x * fParticleRadius;
26          float fAttenuation = exp(
27              -g_Attribs.fAttenuationCoeff *
28              ( fCloudMassToLight + fCloudMassToCamera ) );
29
30          fInscattering += fAttenuation * g_Attribs.fScatteringCoeff;
31          fCloudMassToCamera += fStepLen * fParticleRadius;
32      }
33
34      return fInscattering * fStepLen * fParticleRadius;
35  }
The shader numerically integrates Equation (4.2). Note that the phase func-
tion P (θ) and the sun intensity LSun are omitted. Thus, at every step, the shader
needs to compute the following integrand: β(P)·e−τ (Q,P) ·e−τ (P,P0 ) . The scatter-
ing/extinction coefficient β(P) is assumed to be constant and is provided by the
g_Attribs.fScatteringCoeff variable. We use β = 0.07 as the scattering/extinc-
tion coefficient and a reference particle radius of 200 meters. Extinction e−τ (Q,P)
from the current point to the light entry point is evaluated by intersecting the ray
going from the current point toward the light with the sphere (lines 23–25). Ex-
tinction e−τ (P,P0 ) toward the camera is computed by maintaining the total cloud
mass from the camera to the current point in the fCloudMassToCamera variable
(line 31).
one to store the J^(n) term, the other to store the current order scattering L_In^(n),
and the third to accumulate higher-order scattering. Note that these intermediate
tables cover the entire volume.
Computing every scattering order consists of three steps, as discussed in Section
4.3.3. The first step is evaluating the J^(n) term according to Equation (4.5).
This step is implemented by the shader shown in Listing 4.3.
The first step in this shader, like the prior shaders, retrieves the world-
space parameters from the 4D texture coordinates (lines 3–6). In the next
step, the shader constructs a local frame for the ray starting point by calling the
ConstructLocalFrameXYZ() function (lines 8–10). The function gets two directions
as inputs and constructs an orthonormal basis. The first direction is used as
the z-axis. Note that the resulting z-axis points toward the sphere center (which
is 0).
The shader then runs two loops going through the series of zenith θ and
azimuth ϕ angles (lines 18–19), which sample the entire sphere of directions. On
every step, the shader constructs a sample direction using the (θ, ϕ) angles (lines
23–25), computes lookup coordinates for this direction (lines 26–28), and loads
the order n − 1 scattering using these coordinates (lines 29–31). Remember that
the precomputed single scattering does not comprise the phase function and we
need to apply it now, if necessary (lines 32–34). g_Attribs.f4Param.w equals 1
if we are processing the second-order scattering and 0 otherwise. After that, we
need to account for the phase function P (θ) in Equation (4.5) (line 35). For single
scattering, we use anisotropy factor g = 0.9, and for multiple scattering we use
g = 0.7 to account for light diffusion in the cloud. Finally, we need to compute
the dω = dθ · dϕ · sin(θ) term (lines 37–40).
After the J (n) term is evaluated, we can compute nth scattering order ac-
cording to Equation (4.4). The corresponding shader performing this task is very
similar to the shader computing single scattering (Listing 4.4). The difference
is that in the integration loop we load J (n) from the lookup table (lines 19–23)
instead of computing sunlight attenuation in the particle. We also use trapezoidal
integration to improve accuracy.
In the third stage, the simple shader accumulates the current scattering order
in the net multiple scattering lookup table by rendering every slice with additive
blending.
 1  float GatherScatteringPS( SQuadVSOutput In ) : SV_Target
 2  {
 3      float3 f3StartPos, f3ViewRay, f3LightDir;
 4      ScatteringLUTToWorldParams(
 5          float4( ProjToUV( In.m_f2PosPS ), g_Attribs.f4Param.xy ),
 6          f3StartPos, f3ViewRay, f3LightDir );
 7
 8      float3 f3LocalX, f3LocalY, f3LocalZ;
 9      ConstructLocalFrameXYZ( -normalize( f3StartPos ), f3LightDir,
10                              f3LocalX, f3LocalY, f3LocalZ );
11
12      float fJ = 0;
13      float fTotalSolidAngle = 0;
14      const float fNumZenithAngles  = SCTR_LUT_DIM.z;
15      const float fNumAzimuthAngles = SCTR_LUT_DIM.y;
16      const float fZenithSpan  = PI;
17      const float fAzimuthSpan = 2 * PI;
18      for ( float Zen = 0.5; Zen < fNumZenithAngles; ++Zen )
19          for ( float Az = 0.5; Az < fNumAzimuthAngles; ++Az )
20          {
21              float fZenith  = Zen / fNumZenithAngles * fZenithSpan;
22              float fAzimuth = ( Az / fNumAzimuthAngles - 0.5 ) * fAzimuthSpan;
23              float3 f3CurrDir =
24                  GetDirectionInLocalFrameXYZ( f3LocalX, f3LocalY, f3LocalZ,
25                                               fZenith, fAzimuth );
26              float4 f4CurrDirLUTCoords =
27                  WorldParamsToScatteringLUT( f3StartPos, f3CurrDir,
28                                              f3LightDir );
29              float fCurrDirSctr = 0;
30              SAMPLE_4D( g_tex3DPrevSctrOrder, SCTR_LUT_DIM,
31                         f4CurrDirLUTCoords, 0, fCurrDirSctr );
32              if ( g_Attribs.f4Param.w == 1 )
33                  fCurrDirSctr *= HGPhaseFunc( dot( -f3CurrDir, f3LightDir ),
34                                               0.9 );
35              fCurrDirSctr *= HGPhaseFunc( dot( f3CurrDir, f3ViewRay ), 0.7 );
36
37              float fdZenithAngle  = fZenithSpan / fNumZenithAngles;
38              float fdAzimuthAngle = fAzimuthSpan / fNumAzimuthAngles *
39                                     sin( fZenith );
40              float fDiffSolidAngle = fdZenithAngle * fdAzimuthAngle;
41              fTotalSolidAngle += fDiffSolidAngle;
42              fJ += fCurrDirSctr * fDiffSolidAngle;
43          }
44
45      // Total solid angle should be 4*PI. Renormalize to fix
46      // discretization issues
47      fJ *= 4 * PI / fTotalSolidAngle;
48
49      return fJ;
50  }
 1  float ComputeScatteringOrderPS( SQuadVSOutput In ) : SV_Target
 2  {
 3      // Transform lookup coordinates into the world parameters.
 4      // Intersect the ray with the sphere, compute
 5      // start and end points
 6      ...
 7
 8      float fPrevJ = 0;
 9      SAMPLE_4D( g_tex3DGatheredScattering, SCTR_LUT_DIM,
10                 f4StartPointLUTCoords, 0, fPrevJ );
11      for ( float fStepNum = 1; fStepNum <= fNumSteps; ++fStepNum )
12      {
13          float3 f3CurrPos = f3StartPos + f3Step * fStepNum;
14
15          fCloudMassToCamera += fStepLen * fParticleRadius;
16          float fAttenuationToCamera = exp( -g_Attribs.fAttenuationCoeff *
17                                            fCloudMassToCamera );
18
19          float4 f4CurrDirLUTCoords =
20              WorldParamsToScatteringLUT( f3CurrPos, f3ViewRay, f3LightDir );
21          float fJ = 0;
22          SAMPLE_4D( g_tex3DGatheredScattering, SCTR_LUT_DIM,
23                     f4CurrDirLUTCoords, 0, fJ );
24          fJ *= fAttenuationToCamera;
25
26          fInscattering += ( fJ + fPrevJ ) / 2;
27          fPrevJ = fJ;
28      }
29
30      return fInscattering * fStepLen * fParticleRadius *
31             g_Attribs.fScatteringCoeff;
32  }
1. Process the 2D cell grid to build a list of valid nonempty cells, and compute
the cell attributes.
2. Compute the density for each voxel of the cloud density lattice located in
the nonempty cells.
3. Process the visible voxels of the light attenuation lattice located in the
nonempty cells and compute attenuation for each voxel.
4. Process the particle lattice and generate particles for visible cells whose
density is above the threshold.
Processing cell grid. Processing the cell grid is performed by a compute shader
that executes one thread for every cell. It computes the cell center and size based
on the camera world position and the location of the cell in the grid. Using the
cell center, the shader then computes the base cell density by combining two 2D
noise functions. If the resulting value is above the threshold, the cell is said to be
valid (Figure 4.7). The shader adds indices of all valid cells to the append buffer
(g_ValidCellsAppendBuf ), which at the end of the stage contains an unordered
list of all valid cells. If a cell is also visible in the camera frustum, the shader
also adds the cell to another buffer (g_VisibleCellsAppendBuf ) that collects valid
visible cells.
Processing cloud density lattice. In the next stage, we need to process only those
voxels of the lattice that are located within the valid cells of the cloud grid. To
compute the required number of GPU threads, we execute a simple one-thread
compute shader:
RWBuffer<uint> g_DispatchArgsRW : register( u0 );
[numthreads(1, 1, 1)]
void ComputeDispatchArgsCS()
{
    uint s = g_GlobalCloudAttribs.uiDensityBufferScale;
    g_DispatchArgsRW[0] = ( g_ValidCellsCounter.Load(0) * s * s * s *
        g_GlobalCloudAttribs.uiMaxLayers + THREAD_GROUP_SIZE - 1 ) /
        THREAD_GROUP_SIZE;
}
The number of elements previously written into the append buffer can be copied
into a resource suitable for reading (g_ValidCellsCounter ) with the
CopyStructureCount() function. The buffer previously bound as UAV to g_DispatchArgsRW is
then passed to the DispatchIndirect() function to generate the required number
of threads. Each thread then reads the index of the valid cell it belongs to from
g_ValidCellsUnorderedList , populated at the previous stage, and finds out its
location within that cell. Then the shader combines two 3D noise functions with
the cell base density to create volumetric noise. The noise amplitude decreases
with altitude to create typical cumulus cloud shapes with wider bottoms and
narrower tops.
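On the host side the sequence looks roughly like the sketch below, using standard Direct3D 11 calls. The buffer and UAV names are placeholders, shader binding is elided, and it is assumed that the y and z components of the indirect-argument buffer are pre-initialized to 1 (ComputeDispatchArgsCS only writes the x component):

#include <d3d11.h>

// pValidCellsUAV   : UAV of the append buffer filled by the grid pass
//                    (created with D3D11_BUFFER_UAV_FLAG_APPEND).
// pValidCellsCount : buffer bound as g_ValidCellsCounter for reading.
// pDispatchArgs    : three-uint buffer created with the
//                    D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS misc flag and
//                    bound as g_DispatchArgsRW while ComputeDispatchArgsCS runs.
void DispatchDensityPass(ID3D11DeviceContext* ctx,
                         ID3D11UnorderedAccessView* pValidCellsUAV,
                         ID3D11Buffer* pValidCellsCount,
                         ID3D11Buffer* pDispatchArgs)
{
    // Copy the hidden append-buffer counter into a readable buffer.
    ctx->CopyStructureCount(pValidCellsCount, 0, pValidCellsUAV);

    // ... bind ComputeDispatchArgsCS and its resources, then let it
    // compute the thread-group count:
    ctx->Dispatch(1, 1, 1);

    // ... bind the density-processing compute shader, then launch it
    // with the argument buffer written above:
    ctx->DispatchIndirect(pDispatchArgs, 0);
}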
Light attenuation. Light attenuation is computed for every voxel inside the visible
grid cells. To compute the required number of threads, we use the same simple
compute shader used in the previous stage, but this time provide the number of
valid and visible cells in the g_ValidCellsCounter variable. Light attenuation is
then computed by casting a ray from the voxel center toward the light and ray
marching through the density lattice. We perform a fixed number of 16 steps.
Instead of storing light attenuation, we opt to store the attenuating cloud mass
because it can be properly interpolated.
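A rough CPU-style sketch of that fixed-step march is shown below; the density accessor, the march distance, and all names are placeholders rather than the actual shader:

// Accumulate cloud mass from a voxel center toward the light using a
// fixed number of steps, mirroring the description above. DensityAt()
// is a placeholder for sampling the cloud density lattice.
float ComputeCloudMassTowardLight(const float voxelCenter[3],
                                  const float dirToLight[3],
                                  float marchDistance,
                                  float (*DensityAt)(const float pos[3]))
{
    const int   kNumSteps = 16;
    const float stepLen   = marchDistance / kNumSteps;
    float mass = 0.0f;
    for (int i = 0; i < kNumSteps; ++i)
    {
        const float t = (i + 0.5f) * stepLen;
        const float p[3] = { voxelCenter[0] + dirToLight[0] * t,
                             voxelCenter[1] + dirToLight[1] * t,
                             voxelCenter[2] + dirToLight[2] * t };
        mass += DensityAt(p) * stepLen;
    }
    return mass; // store mass; attenuation = exp(-extinction * mass)
}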
Particle generation. The next stage consists of processing valid and visible voxels
of the cloud lattice and generating particles for some of them. To generate the
required number of threads, we again use the simple one-thread compute shader.
The particle generation shader loads the cloud density from the density lattice
and, if it is not zero, it creates a particle. The shader randomly displaces the
particle from the voxel center and adds a random rotation and scale to eliminate
repeating patterns. The shader writes the particle attributes, such as position,
density, and size, into the particle info buffer and adds the particle index into
another append buffer (g_VisibleParticlesAppendBuf ).
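A sketch of the particle-generation shader body follows; SParticleAttribs, the random inputs, and the buffer layouts are assumptions, while g_VisibleParticlesAppendBuf and fRndAzimuthBias match names used elsewhere in the chapter.

struct SParticleAttribs
{
    float3 f3Pos;           // randomly displaced position
    float  fSize;           // randomly scaled size
    float  fRndAzimuthBias; // random rotation, used later during shading
    float  fDensity;
};

Texture3D<float>                     g_tex3DDensityLattice       : register(t0);
RWStructuredBuffer<SParticleAttribs> g_ParticleInfoBuf           : register(u0);
AppendStructuredBuffer<uint>         g_VisibleParticlesAppendBuf : register(u1);

// f3Rnd holds three precomputed random values in [0, 1].
void GenerateParticle(uint uiParticleIndex, uint3 ui3Voxel,
                      float3 f3VoxelCenter, float fVoxelSize, float3 f3Rnd)
{
    float fDensity = g_tex3DDensityLattice[ui3Voxel];
    if (fDensity <= 0)
        return; // empty voxel: no particle is generated

    SParticleAttribs Attribs;
    Attribs.f3Pos           = f3VoxelCenter + (f3Rnd - 0.5) * fVoxelSize; // random offset
    Attribs.fSize           = fVoxelSize * lerp(0.7, 1.3, f3Rnd.x);       // random scale
    Attribs.fRndAzimuthBias = f3Rnd.y * 6.283185;                         // random rotation
    Attribs.fDensity        = fDensity;

    g_ParticleInfoBuf[uiParticleIndex] = Attribs;
    g_VisibleParticlesAppendBuf.Append(uiParticleIndex);
}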
Sorting. Sorting particles back to front is the final stage before they can be ren-
dered and is necessary for correct blending. In our original work, we sorted all
the voxels of the particle lattice on the CPU and then streamed out only valid
visible voxels on the GPU. This approach had a number of drawbacks. First, it
required active CPU–GPU communication. Second, due to random offsets, par-
ticle order could slightly differ from voxel order. But the main problem was that
all voxels were always sorted even though many of them were actually empty,
which resulted in significant CPU overhead.
We now sort particles entirely on the GPU using the merge sort algorithm by
Satish et al. [Satish et al. 09] with a simplified merge procedure. We begin by
subdividing the visible particle list into subsequences of 128 particles and sorting
each subsequence with a bitonic sort implemented in a compute shader. Then we
perform a number of merge stages to get the single sorted list. When executing
the binary search of an element to find its rank, we directly access global memory.
Because the number of particles that need to be sorted is relatively small (usually
not greater than 50,000), the entire list can fit into the cache and merging is still
very efficient even though we do not use shared memory.
An important aspect is that we do not know how many particles were gen-
erated on the GPU and how many merge passes we need to execute. Thus, we
perform enough passes to sort the maximum possible number of particles. The
compute shader performs an early exit, with very little performance cost, when
no more work needs to be done.
4.5.3 Rendering
After visible particles are generated, processed, and sorted, they are ready for
rendering. Since only the GPU knows how many particles were generated, we
use the DrawInstancedIndirect() function. It is similar to DrawInstanced(), but
reads its arguments from a GPU buffer. We render one point primitive per
visible particle. The geometry shader reads the particle attributes and generates
the particle bounding box, which is then sent to the rasterizer.
In the pixel shader, we reconstruct the view ray and intersect it with the
ellipsoid enclosed in the particle’s bounding box. If the ray misses the ellip-
soid, we discard the pixel. Otherwise, we apply our shading model based on the
precomputed lookup tables, as shown in Listing 4.5.
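The ray–ellipsoid test reduces to a ray versus unit sphere test once the view ray is transformed into the particle's unit-sphere space; a minimal sketch with assumed names is shown below. A pixel shader would call such a test on the transformed ray and discard the pixel on a miss; the returned entry distance then gives the entry point (f3EntryPointUSSpace in Listing 4.5) as origin plus direction times distance.

// Intersect a ray with the unit sphere centered at the origin; in unit-sphere
// (particle) space the particle's ellipsoid becomes this sphere. Returns false
// when the ray misses, in which case the pixel is discarded.
bool IntersectRayUnitSphere(float3 f3RayOriginUS, float3 f3RayDirUS, // normalized
                            out float2 f2EntryExit)
{
    f2EntryExit = 0;
    float b = dot(f3RayOriginUS, f3RayDirUS);
    float c = dot(f3RayOriginUS, f3RayOriginUS) - 1.0;
    float d = b * b - c;          // quarter discriminant for a normalized direction
    if (d < 0)
        return false;
    d = sqrt(d);
    f2EntryExit = float2(-b - d, -b + d);
    return f2EntryExit.y > 0;     // the sphere must be in front of the ray origin
}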
Our first step is to compute the normalized density along the view ray using
the optical depth lookup table (lines 2–10). We randomly rotate the particle
around the vertical axis to eliminate repetitive patterns (line 6). f3EntryPointUSSpace
and f3ViewRayUSSpace are the coordinates of the entry point and the view
ray direction transformed into the particle space (which is unit sphere space, thus
the US suffix). Next, we compute the transparency (lines 14–17).
Our real-time model consists of three components: single scattering, multiple
scattering, and ambient light. We compute single scattering in lines 20–27. It
is a product of a phase function, sunlight attenuation (computed as discussed
in Section 4.5.2), and the sunlight intensity. Because single scattering is most
noticeable where cloud density is low, we multiply the value by the transparency.
Next, we evaluate multiple scattering by performing a lookup into the precom-
puted table (lines 30–39). We multiply the intensity with the light attenuation.
Since multiple scattering happens in dense parts of the cloud, we also multiply
the intensity with the opacity (1-fTransparency ).
Finally, we use an ad hoc approximation for ambient light (lines 42–52). We
use the following observation: ambient light intensity is stronger on the top
boundary of the cloud and decreases toward the bottom. Figure 4.8 shows dif-
ferent components and the final result.
Figure 4.8. From left to right, single scattering, multiple scattering, ambient, and all
components.
 1 // Compute lookup coordinates
 2 float4 f4LUTCoords;
 3 WorldParamsToOpticalDepthLUTCoords( f3EntryPointUSSpace,
 4     f3ViewRayUSSpace, f4LUTCoords );
 5 // Randomly rotate the sphere
 6 f4LUTCoords.y += ParticleAttrs.fRndAzimuthBias;
 7 // Get the normalized density along the view ray
 8 float fNormalizedDensity = 1.f;
 9 SAMPLE_4D_LUT( g_tex3DParticleDensityLUT, OPTICAL_DEPTH_LUT_DIM,
10     f4LUTCoords, 0, fNormalizedDensity );
11
12 // Compute actual cloud mass by multiplying the normalized
13 // density with ray length
14 fCloudMass = fNormalizedDensity * fRayLength;
15 fCloudMass *= ParticleAttrs.fDensity;
16 // Compute transparency
17 fTransparency = exp( -fCloudMass * g_Attribs.fAttenuationCoeff );
18
19 // Evaluate phase function for single scattering
20 float fCosTheta = dot( -f3ViewRayUSSpace, f3LightDirUSSpace );
21 float PhaseFunc = HGPhaseFunc( fCosTheta, 0.8 );
22
23 float2 f2Attenuation = ParticleLighting.f2SunLightAttenuation;
24 // Compute intensity of single scattering
25 float3 f3SingleScattering =
26     fTransparency * ParticleLighting.f4SunLight.rgb *
27     f2Attenuation.x * PhaseFunc;
28
29 // Compute lookup coordinates for multiple scattering
30 float4 f4MultSctrLUTCoords =
31     WorldParamsToScatteringLUT( f3EntryPointUSSpace,
32         f3ViewRayUSSpace, f3LightDirUSSpace );
33 // Load multiple scattering from the lookup table
34 float fMultipleScattering =
35     g_tex3DScatteringLUT.SampleLevel( samLinearWrap,
36         f4MultSctrLUTCoords.xyz, 0 );
37 float3 f3MultipleScattering =
38     (1 - fTransparency) * fMultipleScattering *
39     f2Attenuation.y * ParticleLighting.f4SunLight.rgb;
40
41 // Compute ambient light
42 float3 f3EarthCentre = float3( 0, -g_Attribs.fEarthRadius, 0 );
43 float fEntryPointAltitude = length( f3EntryPointWS - f3EarthCentre );
44 float fCloudBottomBoundary =
45     g_Attribs.fEarthRadius + g_Attribs.fCloudAltitude -
46     g_Attribs.fCloudThickness / 2.f;
47 float fAmbientStrength =
48     ( fEntryPointAltitude - fCloudBottomBoundary ) /
49     g_Attribs.fCloudThickness;
50 fAmbientStrength = clamp( fAmbientStrength, 0.3, 1 );
51 float3 f3Ambient = (1 - fTransparency) * fAmbientStrength *
52     ParticleLighting.f4AmbientLight.rgb;
To implement the volume-aware blending described in Section 4.4, we maintain a per-pixel
buffer of cloud layers that is bound as an unordered access view, which enables the pixel
shader to read and write arbitrary memory locations. For each pixel on the screen, we store
the following information about the closest element: minimal/maximal distance along the view
ray, optical mass (which is the cloud mass times the scattering coefficient), and color:
struct SParticleLayer
{
    float2 f2MinMaxDist;
    float  fOpticalMass;
    float3 f3Color;
};
The pixel shader implements the merging scheme described in Section 4.4
and is shown in the code snippet given in Listing 4.6. The shader creates an
array of two layers. The properties of one layer are taken from the attributes of
the current particle (lines 8–10). The other layer is read from the appropriate
position in the buffer (lines 12–17). Then the layers are merged (lines 20–23),
and the merged layer is written back (line 26) while color f4OutColor is passed
to the output merger unit to be blended with the back buffer.
 1 // Init extensions
 2 IntelExt_Init();
 3 ...
 4 // Process current particle and compute its color f3NewColor,
 5 // mass fCloudMass, and extents fNewMinDist/fNewMaxDist
 6
 7 SParticleLayer Layers[2];
 8 Layers[1].f2MinMaxDist = float2( fNewMinDist, fNewMaxDist );
 9 Layers[1].fOpticalMass = fCloudMass * g_Attribs.fAttenuationCoeff;
10 Layers[1].f3Color = f3NewColor;
11
12 uint2 ui2PixelIJ = In.f4Pos.xy;
13 uint uiLayerDataInd =
14     ( ui2PixelIJ.x + ui2PixelIJ.y * g_Attribs.uiBackBufferWidth );
15 // Enable pixel shader ordering
16 IntelExt_BeginPixelShaderOrdering();
17 Layers[0] = g_rwbufParticleLayers[ uiLayerDataInd ];
18
19 // Merge two layers
20 SParticleLayer MergedLayer;
21 float4 f4OutColor;
22 MergeParticleLayers( Layers[0], Layers[1], MergedLayer,
23     f4OutColor.rgb, f4OutColor.a );
24
25 // Store updated layers
26 g_rwbufParticleLayers[ uiLayerDataInd ] = MergedLayer;
The per-pixel layer buffer used above is declared as a read/write structured buffer:

RWStructuredBuffer<SParticleLayer> g_rwbufParticleLayers;
It must be noted that the algorithm described above would not work as ex-
pected on standard DirectX 11–class graphics hardware. The reason is that
we are trying to read from the same memory in parallel from different pixel
shader threads, modify data, and write it back. There is no efficient way on
DirectX 11 to serialize such operations. Intel graphics chips, starting with the
Intel HD Graphics 5000, can solve this problem. They contain a special exten-
sion, called pixel shader ordering. When it is enabled, it guarantees that all
read–modify–write operations from different pixel shader instances, which map
to the same pixel, are performed atomically. Moreover, the pixel shader in-
stances are executed in the same order in which primitives were submitted for
rasterization. The second condition is very important to ensure temporally sta-
ble results. In DirectX 11, the extensions are exposed through two functions.
IntelExt_Init() tells the compiler that the shader is going to use extensions, and
after the call to IntelExt_BeginPixelShaderOrdering(), all instructions that ac-
cess UAVs get appropriately ordered. It is worth mentioning that this capability
will be a standard feature of DirectX 12, where it will be called rasterizer ordered
views.
After all particles are rendered, the closest volume buffer needs to be merged
with the back buffer. We render a screen-size quad and perform the required
operations in the pixel shader.
During rendering, we generate three buffers: cloud color, transparency, and
the distance to the closest cloud. To improve performance, we render the clouds to
quarter-resolution buffers (1/2 × 1/2 of the full resolution) and then upscale to the original resolution
using a bilateral filter.
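As an illustration, a generic depth-aware bilateral upsample (not necessarily the exact filter used here; all names are assumptions) could look like the following sketch. The weight penalizes low-resolution samples whose depth disagrees with the full-resolution pixel, so clouds do not bleed across geometry silhouettes.

Texture2D<float4> g_tex2DCloudColorLowRes : register(t0); // rgb = color, a = transparency
Texture2D<float>  g_tex2DDepthLowRes      : register(t1);
Texture2D<float>  g_tex2DDepthFullRes     : register(t2);
SamplerState      g_samPoint              : register(s0);

float4 BilateralUpsample(float2 f2UV, float2 f2LowResTexelSize)
{
    float  fFullResDepth = g_tex2DDepthFullRes.SampleLevel(g_samPoint, f2UV, 0);
    float4 f4Sum      = 0;
    float  fWeightSum = 1e-5;
    [unroll]
    for (int y = 0; y <= 1; ++y)
    {
        [unroll]
        for (int x = 0; x <= 1; ++x)
        {
            float2 f2Offset  = (float2(x, y) - 0.5) * f2LowResTexelSize;
            float  fLowDepth = g_tex2DDepthLowRes.SampleLevel(g_samPoint, f2UV + f2Offset, 0);
            // Weight falls off quickly when the low-res depth disagrees with this pixel
            float  fWeight   = 1.0 / (1e-3 + abs(fFullResDepth - fLowDepth));
            f4Sum      += fWeight * g_tex2DCloudColorLowRes.SampleLevel(g_samPoint, f2UV + f2Offset, 0);
            fWeightSum += fWeight;
        }
    }
    return f4Sum / fWeightSum;
}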
When computing atmospheric light scattering, we use the screen-space cloud transparency
and the distance to the cloud to attenuate the light samples along the view ray (please
refer to [Yusov 14b] for more details).
One important missing detail is sample refinement (see [Yusov 14a]), which
needs to account for screen-space cloud transparency. When computing coarse
unoccluded in-scattering, we take the screen-space cloud transparency and dis-
tance to attenuate the current sample. This automatically gives the desired effect
(Figure 4.9) with a minimal increase in performance cost.
Table 4.2. Performance of the algorithm on Intel HD Graphics 5200, 1280 × 720 reso-
lution (times in ms).
Figure 4.11. Test scene rendered in different quality profiles: highest (top left), high
(top right), medium (bottom left), and low (bottom right).
Table 4.3. Performance of the algorithm on NVIDIA GeForce GTX 680, 1920 × 1080
resolution (times in ms).
Table 4.2 shows the performance on our low-end test platform in different quality profiles; reducing the profile lowers the rendering time at the cost of lower quality. The processing stage includes all the steps discussed in
Section 4.5.2 except sorting, which is shown in a separate column. The clearing
column shows the amount of time required to clear the cloud density and light
attenuation 3D textures to initial values. This step takes almost the same time
as processing itself. This is because of the low memory bandwidth of the GPU.
Rendering light scattering effects takes an additional 5.8 ms. In the medium-
quality profile, the total required time is less than 20 ms, which guarantees real-
time frame rates.
Performance results on our high-end test platform are given in Table 4.3.
Because our second GPU has much higher memory bandwidth, the performance
of the algorithm is significantly better. It takes less than 2.3 ms to render the
clouds in low profile and less than 4.5 ms to render in medium profile at full
HD resolution. Since clearing the 3D textures takes much less time, we do not
separate this step in Table 4.3. Computing atmospheric light scattering takes an
additional 3.0 ms of processing time. Also note that the GTX 680 is a relatively
old GPU. Recent graphics hardware provides higher memory bandwidth, which
will improve the performance of our method.
4.6.1 Limitations
Our method is physically based, not physically accurate. We make two main
simplifications when approximating shading: scattering is precomputed in a ho-
mogeneous spherical particle, and energy exchange between particles is ignored.
Precomputing the scattering inside an inhomogeneous particle would require a
5D table. It is possible that some dimensions of that table would tolerate point sampling,
which would reduce the lookup into the table to two fetches from a 3D texture.
This is an interesting direction for future research.
The other limitation of our method is that our volume-aware blending can
precisely handle the intersection of only two particles. When three or more
particles intersect, the method can fail. However, visual results are acceptable in
most cases. We also believe that our method gives a good use-case example for
the capabilities of upcoming GPUs.
4.7 Conclusion
In this chapter we presented a new method for rendering realistic clouds. The
key idea of our approach is to precompute optical depth and single and multiple
scattering for a reference particle at preprocess time and to store the resulting
information in lookup tables. The data is then used at runtime to compute cloud
shading without the need for ray marching or slicing. We also presented a new
technique for controlling the level of detail as well as a method to blend the
particles accounting for their volumetric intersection. We believe that our idea
of precomputing scattering is promising and can be further improved in future
research. The idea of precomputing transparency can also be used for rendering
different kinds of objects such as distant trees in forests.
Bibliography
[Bohren and Huffman 98] C. Bohren and D. R. Huffman. Absorption and Scat-
tering of Light by Small Particles. Berlin: Wiley-VCH, 1998.
[Bouthors et al. 06] Antoine Bouthors, Fabrice Neyret, and Sylvain Lefebvre.
“Real-Time Realistic Illumination and Shading of Stratiform Clouds.” In
Proceedings of the Second Eurographics Conference on Natural Phenomena,
pp. 41–50. Aire-la-Ville, Switzerland: Eurographics Association, 2006.
[Bouthors et al. 08] Antoine Bouthors, Fabrice Neyret, Nelson Max, Eric Brune-
ton, and Cyril Crassin. “Interactive Multiple Anisotropic Scattering in
Clouds.” In SI3D, edited by Eric Haines and Morgan McGuire, pp. 173–
182. New York: ACM, 2008.
[Cornette and Shanks 92] W. M. Cornette and J. G. Shanks. “Physically Reasonable
Analytic Expression for the Single-Scattering Phase Function.” Applied
Optics 31:16 (1992), 3152–3160.
[Harris and Lastra 01] Mark J. Harris and Anselmo Lastra. “Real-Time Cloud
Rendering.” Comput. Graph. Forum 20:3 (2001), 76–85.
[Harris 03] Mark Jason Harris. “Real-Time Cloud Simulation and Rendering.”
Ph.D. thesis, The University of North Carolina at Chapel Hill, 2003.
[Losasso and Hoppe 04] Frank Losasso and Hugues Hoppe. “Geometry
Clipmaps: Terrain Rendering Using Nested Regular Grids.” ACM Trans.
Graph. 23:3 (2004), 769–776.
[Miyazaki et al. 04] Ryo Miyazaki, Yoshinori Dobashi, and Tomoyuki Nishita. “A
Fast Rendering Method of Clouds Using Shadow-View Slices.” In Proceedings
of Computer Graphics and Imaging 2004, August 17–19, 2004, Kauai,
Hawaii, USA, pp. 93–98. Calgary: ACTA Press, 2004.
[Riley et al. 04] Kirk Riley, David S. Ebert, Martin Kraus, Jerry Tessendorf, and
Charles Hansen. “Efficient Rendering of Atmospheric Phenomena.” In Pro-
ceedings of the Fifteenth Eurographics Conference on Rendering Techniques,
pp. 375–386. Aire-la-Ville, Switzerland: Eurographics Association, 2004.
[Satish et al. 09] Nadathur Satish, Mark Harris, and Michael Garland. “Design-
ing Efficient Sorting Algorithms for Manycore GPUs.” In Proceedings of
the 2009 IEEE International Symposium on Parallel&Distributed Process-
ing, IPDPS ’09, pp. 1–10. Washington, DC: IEEE Computer Society, 2009.
[Schpok et al. 03] Joshua Schpok, Joseph Simons, David S. Ebert, and Charles
Hansen. “A Real-time Cloud Modeling, Rendering, and Animation System.”
In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on
Computer Animation. Aire-la-Ville, Switzerland: Eurographics Association, 2003.
Sparse Procedural
Volume Rendering
Doug McNabb
5.1 Introduction
The capabilities and visual quality of real-time rendered volumetric effects dis-
proportionately lag those of film. Many other real-time rendering categories have
seen recent dramatic improvements. Lighting, shadowing, and postprocessing
have come a long way in just the past few years. Now, volumetric rendering
is ripe for a transformation. We now have enough compute to build practical
implementations that approximate film-style effects in real time. This chapter
presents one such approach.
Light scattering in homogeneous participating media can already be rendered in
real time with amazing visual quality [Yusov 14]. Light scattering in heteroge-
neous participating media is the more-general problem, and correspondingly is
more expensive. Our technique approximates single scattering in heterogeneous
media and can look very good. It is worth noting that our scattering model is
simpler than the typical homogeneous counterparts to accommodate the added
complexity from heterogeneous media.
Fluid simulation is another mechanism for generating volumetric effects. The
results are often stunning, particularly where accuracy and realism are required.
But, the costs can be high in both performance and memory. Developers typically
use these simulations to fill a volume with “stuff” (e.g., smoke, fire, water, etc.),
and then render that volume by marching rays originating from the eye’s point of
view. They periodically (e.g., every frame) update a 3D voxel array of properties.
Each voxel has properties like pressure, mass, velocity, color, temperature, etc.
Our technique fills the volume differently, avoiding most of the traditional sim-
ulation’s computation and memory costs. We can use less memory than typical
fluid simulations by directly populating the volume from a small set of data. We
can further reduce the memory requirements by filling the volume on demand,
processing only the parts of the volume that are covered by volume primitives.
This volume-primitive approach is also attractive to some artists as it gives good
control over sculpting the final effect.
5.3 Overview
Our goal for rendering the volume is to approximate efficiently how much light
propagates through the volume and reaches the eye. We perform a three-step
process: fill the volume with particles, propagate light through the volume, and
ray-march the volume from the eye.
Figure 5.2. A large volume composed of metavoxels, which are composed of voxels.
The volume may also occlude the background; the amount of light from the
background that reaches the eye can be absorbed and scattered by the volume.
Our approach separates these two eye-view contributions. We determine the
lit volume’s contribution with a pixel shader and attenuate the background’s
contribution with alpha blending.
5.4 Metavoxels
The key point of our approach is that we can gain efficiency by avoiding un-
occupied parts of the volume. Each of our tasks can be made significantly less
expensive: we can fill fewer voxels, propagate light through fewer voxels, and
ray-march fewer voxels. We accomplish this by logically subdividing the volume
into a uniform grid of smaller volumes. Each of these smaller volumes is in turn
a collection of voxels, which we call a metavoxel. (See Figure 5.2.)
The metavoxel enables us to efficiently fill and light the volume. Most im-
portantly, it allows us to avoid working on empty metavoxels. It also allows pro-
cessing multiple metavoxels in parallel (filling can be parallel; lighting has some
dependencies). It allows us to switch back and forth between filling metavox-
els and ray-marching them, choosing our working set size to balance performance
against memory size and bandwidth. Using a small set improves locality. Reusing
the same memory over many metavoxels can reduce the total memory required
and may reduce bandwidth (depending on the hardware). It also improves ray-
marching efficiency, as many rays encounter the same voxels.
Figure 5.3 shows a few variations of a simple scene and the related metavoxels.
The first pane shows a few stand-in spheres, a camera, and a light. The second
Figure 5.3. A simple scene (left), with all metavoxels (middle) and with only interest-
ing/occupied metavoxels (right).
pane shows a complete volume containing the spheres. The third pane shows the
scene with only those metavoxels covered by one or more spheres. This simplified
example shows a total volume of 512 (8 × 8 × 8) metavoxels. It requires processing only
64 of them, culling 7/8 of the volume.
Figure 5.4 shows a stream of simple spheres and a visualization of the metavox-
els they cover. Note how the metavoxels are tilted toward the light. Orienting
the volume this way allows for independently propagating light along each voxel
column. The lighting for any individual voxel depends only on the voxel above it
in the column (i.e., the next voxel closer to the light) and is unrelated to voxels
in neighboring columns.
Figure 5.5. Algorithm overview: bin particles into per-metavoxel bins; then, for each
nonempty metavoxel, fill the metavoxel, propagate light through it, and ray-march it
from the eye.
Computers get more efficient every year, but memory bandwidth isn't progressing
as rapidly as compute efficiency. Operating on cache-friendly metavoxels
may be more useful in the coming years as compute efficiency will almost certainly
continue to outpace bandwidth efficiency. Ray-marching multiple metavoxels one
at a time can be more efficient than ray-marching a larger volume. The metavoxel
localizes the sample points to a relatively small volume, potentially improving
cache hit rates and minimizing expensive off-chip bandwidth.
We fill a metavoxel by testing its voxels against the set of particles that cover
the metavoxel. For each of the voxels covered by a particle, we compute the
particle’s color and density at the covered location. Limiting this test to the
metavoxel’s set of voxels is more efficient than filling a much larger volume;
choosing a metavoxel size such that it fits in the cache(s) can reduce expen-
sive off-chip bandwidth. Processing a single voxel multiple times, e.g., once for
each particle, can also be more efficient if the voxel’s intermediate values are in
the cache. Populating the metavoxel with one particle type at a time allows us
to maintain separate shaders, which each process different particle types. Note
that we currently populate the volume with only a single particle type (displaced
sphere). But, composing an effect from multiple particle types is a desirable fea-
ture and may be simplified through sharing intermediate results versus a system
that requires that a single shader support every particle type.
5.5 Algorithm
Our goal is to render the visible, nonempty metavoxels. Figure 5.5 shows that
we loop over each of these interesting metavoxels, filling them with particles (i.e.,
our displaced sphere volume primitive), and then ray-marching them from the
eye. It’s worth noting that “visible” here means visible either from the eye’s view
or the light’s view. We consider the light’s view when culling because even if
a metavoxel lies outside the eye's view, it may still lie between the light and
metavoxels that are within the eye's view, so that those visible metavoxels receive
its shadows. We need to propagate lighting through all parts of the volume that
contribute to the final scene.
5.5.1 Binning
We determine the interesting metavoxels using a binning process. Binning adds a
small amount of extra work but it reduces the overall workload. We can quickly
generate a list for each metavoxel containing the indices for the particles that
cover the metavoxel, and only those particles. It also allows us to completely
avoid metavoxels that aren’t covered by any particles.
Each bin holds a list of particle indices. We populate the bin with an index
for every particle that covers the metavoxel. We maintain an array of bins—one
bin for every metavoxel. (For example, were we to subdivide our total volume
into 32 × 32 × 32 metavoxels, then we would have a 32 × 32 × 32 array of bins.)
A typical sparsely populated volume will involve a small fraction of these, though
the algorithm does not inherently impose a limit.
We bin a particle by looping over the metavoxels covered by the particle’s
bounding box. (See Figure 5.6.) We refine the approximation and improve overall
efficiency by also testing the particle's bounding sphere against each of those metavoxels,
appending the particle only to the bins of the metavoxels the sphere covers, as in the
following pseudocode:
// Determine the particle's extents
min = particleCenter - particleRadius
max = particleCenter + particleRadius

// Loop over each metavoxel within the extents.
// Append the particle to those bins for the
// metavoxels also covered by the bounding sphere.
for Z in min.Z to max.Z
    for Y in min.Y to max.Y
        for X in min.X to max.X
            if particleBoundingSphere covers metavoxel[Z, Y, X]
                append particle to metavoxelBin[Z, Y, X]
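The "covers" test above can be implemented as a standard sphere versus axis-aligned box overlap test; a minimal sketch with assumed names:

// Returns true when a particle's bounding sphere overlaps a metavoxel's
// axis-aligned box, by measuring the squared distance from the sphere center
// to the closest point on the box.
bool SphereCoversMetavoxel(float3 f3SphereCenter, float fSphereRadius,
                           float3 f3BoxMin, float3 f3BoxMax)
{
    float3 f3Closest = clamp(f3SphereCenter, f3BoxMin, f3BoxMax);
    float3 f3Delta   = f3SphereCenter - f3Closest;
    return dot(f3Delta, f3Delta) <= fSphereRadius * fSphereRadius;
}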
Figure 5.7. Coverage test: a voxel is inside the displaced sphere when its distance from
the center is less than the center-to-displaced-surface distance in that direction.
When multiple particles cover a voxel, their color contributions can be blended
proportionally to particle density (i.e., a dense particle affects the final color more
than a less-dense particle). In practice, simply accepting the maximum between
two colors produces plausible results and is computationally inexpensive. This
won’t work for every effect, but it efficiently produces good results for some.
Different color components may be required for different effects. For example,
fire is emissive with color ranging from white through yellow and orange to red,
then black as the intensity drops. Smoke is often constant color and not emissive.
The diffuse color is modulated by light and shadow, while the emissive color is
not.
We compute the density by performing a coverage test. Figure 5.7 shows our
approach. We determine the particle’s density at each voxel’s position. If a voxel
is inside the displaced sphere, then we continue and compute the particle’s color
and density. Voxels outside the displaced sphere are unmodified. Note that the
displacement has a limited range; there are two potentially interesting radii—
inner and outer. If the voxel is inside the inner radius, then we can be sure
it’s inside the displaced sphere. If the voxel is outside the outer radius, then we
can be sure that it’s outside the displaced sphere. Coverage for voxels that lie
between these two points is defined by the displacement amount.
We radially displace the sphere. The position of each point on the displaced
sphere’s surface is given by the length of the vector from the sphere’s center to
the surface. If the vector from the sphere’s center to the voxel is shorter than
this displacement, then the voxel is inside the sphere; otherwise it’s outside.
Note a couple of optimizations. First, the dot product inexpensively computes
the squared length: A · A = length²(A). Comparing squared distances allows us to avoid
the potentially expensive square-root operations. The second optimization comes from storing
Figure 5.8. Example cube map: 3D noise sampled at sphere surface, projected to cube
map faces.
our displacement values in a cube map. The cube map, like the displacement
is defined over the sphere’s surface. Given a voxel at position (X, Y, Z) and the
sphere’s center at (0, 0, 0), the displacement is given by cubeMap[X, Y, Z].
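Putting the two ideas together, a per-voxel coverage and density function might look like the following sketch; the resource names and the density falloff are assumptions, while the squared-distance comparison and the cube-map displacement lookup come from the text.

// The displacement cube map stores, for each direction, how far the displaced
// surface lies from the sphere center (as a fraction of the outer radius).
TextureCube<float> g_texCubeDisplacement : register(t0);
SamplerState       g_samLinearClamp      : register(s0);

// Returns the particle's density at a voxel position given in the sphere's
// local space (sphere center at the origin); zero means "not covered."
float DisplacedSphereDensity(float3 f3VoxelPosLocal, float fOuterRadius)
{
    float fDistSq = dot(f3VoxelPosLocal, f3VoxelPosLocal);
    if (fDistSq > fOuterRadius * fOuterRadius)
        return 0; // beyond the outer radius: certainly outside

    // Displaced-surface distance in this direction; cube maps are indexed by
    // direction, so no normalization of the lookup vector is required.
    float fSurfaceDist = fOuterRadius *
        g_texCubeDisplacement.SampleLevel(g_samLinearClamp, f3VoxelPosLocal, 0);

    if (fDistSq > fSurfaceDist * fSurfaceDist)
        return 0; // between the displaced surface and the outer radius

    // Covered: a simple density falloff toward the displaced surface (assumption)
    return saturate(1 - sqrt(fDistSq) / max(fSurfaceDist, 1e-6));
}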
We don't currently support dynamically computed noise. We suspect that a
dynamic solution would benefit from using a cube map for intermediate storage
as an optimization; the volume is 3D while the cube map is 2D (cube map lo-
cations are given by three coordinates, but they project to a flat, 2D surface as
seen in Figure 5.8). The number of expensive dynamic-noise calculations can be
reduced this way.
We determine each voxel’s lit color by determining how much light reaches
it and multiplying by the unlit color. We propagate the lighting through the
volume to determine how much light reaches each voxel. (See Figure 5.9.)
There are many possible ways to compute the color: constant, radial gradient,
polynomial, texture gradient, cube map, noise, etc. We leave this choice to the
reader. We note a couple of useful approximations: Figure 5.10 shows the results
of using the displacement map as an ambient occlusion approximation and using
the radial distance as a color ramp (from very bright red-ish at the center to dark
gray further out). The ambient occlusion approximation can help a lot to provide
form to the shadowed side.
Many of the displaced sphere’s properties can be animated over time: position,
orientation, scale, opacity, color, etc. This is a similar paradigm to 2D billboards,
only with 3D volume primitives.
// 100% light propagates to start
propagatedLight = 1

// Loop over all voxels in the column
for Z in 0 to METAVOXEL_HEIGHT
    // Light this voxel
    color[Z] *= propagatedLight
    // Attenuate the light leaving this voxel
    propagatedLight /= (1 + density[Z])
Listing 5.2 shows pseudocode for propagating lighting through the metavoxel.
At each step, we light the current voxel and attenuate the light for subsequent
voxels.
// The ray starts at the eye and goes through the
// near plane at the current pixel
ray = pixelPosition - eyePosition

// Clamp the ray to the eye position
end = max( eyePosition, end )

// Step along the ray, accumulating and attenuating
for step in start to end
    color       = volume[step].rgb
    density     = volume[step].a
    blendFactor = 1 / (1 + density)
    resultColor = lerp( color, resultColor, blendFactor )
    resultTransmittance *= blendFactor
Sorting is complicated by metavoxels that lie below a line through the camera
perpendicular to the light direction. The eye is looking down on those metavoxels. So, the
eye can see through some previously rendered metavoxels. In this case, we need
to render the more-recent metavoxel behind the previously rendered metavoxel.
The solution is to process all of the metavoxels above the perpendicular before
processing those below. We also switch sort order and render those metavoxels
below the line sorted front to back.
The different sort orders require different alpha-blending modes. We render
back to front with over blending. We render front to back with under blending
[Ikits et al. 04].
It is possible to render all metavoxels sorted front to back with under blending.
That requires maintaining at least one column of metavoxels. Light propagation
requires processing from top to bottom. Sorting front to back can require render-
ing a metavoxel before those above it have been processed. In that case, we would
still propagate the lighting through the entire column before ray-marching them.
Consistently sorting front to back like this could potentially allow us to “early
out,” avoiding future work populating and ray-marching fully occluded voxels.
5.6 Conclusion
Computers are now fast enough for games to include true volumetric effects.
One way is to fill a sparse volume with volume primitives and ray-march it from
the eye. Efficiently processing a large volume can be achieved by breaking it
into smaller metavoxels in which we process only the occupied metavoxels that
contribute to the final image. Filling the metavoxels with volume primitives
allows us to efficiently populate the volume with visually interesting contents.
Finally, sampling the metavoxels from a pixel shader as 3D textures delivers an
efficient ray-marching technique.
Bibliography
[Ikits et al. 04] Milan Ikits, Joe Kniss, Aaron Lefohn, and Charles Hansen. “Vol-
ume Rendering Techniques.” In GPU Gems, edited by Randima Fernando,
Chapter 39. Reading, MA: Addison-Wesley Professional, 2004.
[Wrennige and Zafar 11] Magnus Wrennige and Nafees Bin Zafar. “Production
Volume Rendering Fundamentals.” SIGGRAPH Course, Vancouver,
Canada, August 7–11, 2011.
[Yusov 14] Egor Yusov. “High Performance Outdoor Light Scattering Using
Epipolar Sampling.” In GPU Pro 5: Advanced Rendering Techniques, edited
by Wolfgang Engel, pp. 101–126. Boca Raton, FL: CRC Press, 2014.
III
Lighting
Lighting has become one of the most active areas of research and development for
many game teams. The ever-increasing speed of the GPUs in the Playstation 4,
Xbox One, and new PCs finally gives programmers enough power to move beyond
the Phong lighting model and rudimentary shadow algorithms. We’re also seeing
solutions for in-game indirect diffuse or specular lighting, be it prerendered or
real-time generated.
The chapter “Real-Time Lighting via Light Linked List” by Abdul Bezrati
discusses an extension to the deferred lighting approach used at Insomniac Games.
The algorithm allows us to properly shade both opaque and translucent surfaces
of a scene in a uniform way. The algorithm manages linked lists of the lights affecting
each pixel on screen. Each shaded pixel can then read this list and compute the
appropriate lighting and shadows.
This section also includes two chapters about techniques used in Assassin’s
Creed IV: Black Flag from Ubisoft. “Deferred Normalized Irradiance Probes” by
John Huelin, Benjamin Rouveyrol, and Bartlomiej Wroński describes the global
illumination with day–night cycle support. The authors take time to talk about
various tools and runtime optimizations that allowed them to achieve very quick
turnaround time during the development.
“Volumetric Fog and Lighting” by Bartlomiej Wroński focuses on volumetric
fog and scattering rendering. The chapter goes beyond screen-space ray marching
and describes a fully volumetric solution running on compute shaders and offers
various practical quality and performance optimizations.
The next chapter, “Physically Based Light Probe Generation on GPU” by
Ivan Spogreev, shows several performance optimizations that allowed the gener-
ation of specular light probes in FIFA 15. The algorithm relies on importance
sampling in order to minimize the amount of image samples required to correctly
approximate the specular reflection probes.
The last chapter in this section is “Real-Time Global Illumination Using
Slices” by Hugh Malan. Malan describes a novel way of computing single-bounce
indirect lighting. The technique uses slices, a set of 2D images aligned to scene
surfaces, that store the scene radiance to compute and propagate the indirect
lighting in real time.
I would like to thank all authors for sharing their ideas and for all the hard
work they put into their chapters.
—Michal Valient
1
III
Real-Time Lighting via Light Linked List
Abdul Bezrati
1.1 Introduction
Deferred lighting has been a popular technique to handle dynamic lighting in
video games, but due to the fact that it relies on the depth buffer, it doesn’t
work well with translucent geometry and particle effects, which typically don’t
write depth values. This can be seen in Figure 1.1, where the center smoke effect
and the translucent water bottles are not affected by the colorful lights in the
scene.
Common approaches in deferred engines have been to either leave translucent
objects unlit or apply a forward lighting pass specifically for those elements. The
forward lighting pass adds complexity and an extra maintenance burden to the
engine.
At Insomniac Games, we devised a unified solution that makes it possible to
light both opaque and translucent scene elements (Figure 1.2) using a single path.
We have named our solution Light Linked List (LLL), and it requires unordered
access views and atomic shader functions introduced with DirectX 10.1–level
hardware.
The Light Linked List algorithm shares the performance benefits of deferred
engines in that lighting is calculated only on the pixels affected by each light
source. Furthermore, any object not encoded in the depth buffer has full access
to the lights that will affect it. The Light Linked List generation and access is
fully GPU accelerated and requires minimal CPU handholding.
1.2 Algorithm
The Light Linked List algorithm relies on a GPU-accessible list of the lights affecting
each pixel on screen. A GPU linked list has been used in the past to implement
order-independent transparency and indirect illumination [Gruen and Thibieroz 10].
Figure 1.1. Smoke effect and translucent water bottles don’t receive any scene lighting
in a traditional deferred lighting engine.
Figure 1.2. Smoke effects and translucent water bottles receive full-scene lighting via
the LLL.
struct LightFragmentLink
{
    float m_MinDepth;   // Light minimum depth at the current pixel
    float m_MaxDepth;   // Light maximum depth at the current pixel
    uint  m_LightIndex; // Light index into the full information array
    uint  m_Next;       // Next LightFragmentLink index
};

To reduce the memory footprint per fragment, the two depths are packed as 16-bit
float values into a single uint, and the light index shares a uint with the link to
the next fragment:

struct LightFragmentLink
{
    uint m_DepthInfo; // High bits: min depth, low bits: max depth
    uint m_IndexNext; // Light index and link to the next fragment
};
RWByteAddressBuffer g_LightBoundsBuffer;
The third buffer is also a read and write byte address buffer that will be used
to track the index of the last LightFragmentLink placed at any given pixel on
screen:
RWByteAddressBuffer g_LightStartOffsetBuffer;
The final buffer is an optional depth buffer that will be used to perform
software depth testing within a pixel shader. We chose to store the depth as
linear values in an FP32 format instead of the typical hyperbolic post-projection depth.
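For reference, converting a hardware depth value to linear view-space depth for a standard D3D-style projection can be done with a helper like this (a generic sketch, not necessarily the exact conversion used here):

// Converts a post-projection depth value z in [0, 1] to linear view-space
// depth, assuming a standard perspective projection with the given near and
// far clip distances.
float LinearizeDepth(float fDeviceDepth, float fNear, float fFar)
{
    return fNear * fFar / (fFar - fDeviceDepth * (fFar - fNear));
}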
Figure 1.3. The light shell displayed in gray is used to describe a point light in the
scene.
Figure 1.4. Front faces in green pass the hardware depth test, whereas back faces fail.
// Detect front faces
if ( front_face == true )
{
    // Sign will be negative if the light shell is occluded
    float depth_test = sign( g_txDepth[ vpos_i ].x - light_depth );

    // Encode the light index in the upper 16 bits and the linear
    // depth in the lower 16
    uint bounds_info = ( light_index << 16 ) | f32tof16( light_depth *
        depth_test );

    // Store the front face info
    g_LightBoundsBuffer.Store( dst_offset, bounds_info );

    // Only allocate a LightFragmentLink on back faces
    return;
}
Once we have processed the front faces, we immediately rerender the light
geometry but with front-face culling enabled.
We fetch the information previously stored into g_LightBoundsBuffer , and
we decode both the light ID and the linear depth. At this point, we face two
scenarios.
In the first scenario, the ID decoded from the g_LightBoundsBuffer sample
and the incoming light information match. In this case, we know the front faces
were properly processed and we proceed to check the sign of the stored depth:
if it’s negative we early out of the shader since both faces are occluded by the
regular scene geometry.
The second scenario occurs when the decoded ID doesn’t match the light
information provided by the back faces. This scenario can happen when the
frustum near clip intersects the light geometry. In this case, the minimum depth
to be stored in the LightFragmentLink is set to zero.
// Load the content that was written by the front faces
uint bounds_info = g_LightBoundsBuffer.Load( dst_offset );

// Decode the stored light index
uint stored_index = ( bounds_info >> 16 );

// Decode the stored light depth
float front_depth = f16tof32( bounds_info >> 0 );
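Putting the two scenarios together, the back-face shader might continue along these lines (a sketch; variable names follow the snippets above):

// Scenario 1: the front face of this light wrote its info for this pixel.
if ( stored_index == light_index )
{
    // A negative stored depth means the front face was occluded by scene
    // geometry; both faces are then occluded, so no light fragment is needed.
    if ( front_depth < 0 )
    {
        return; // early out
    }
}
else
{
    // Scenario 2: the front face was never rasterized (clipped by the near
    // plane), so the light volume starts at the camera.
    front_depth = 0;
}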
Now that we know both minimum and maximum light depths are available to
us, we can move forward with the allocation of a LightFragmentLink . To al-
locate a LightFragmentLink , we simply increment the internal counter of our
StructuredBuffer containing all the fragments. To make the algorithm more ro-
bust and to avoid driver-related bugs, we must validate our allocation and make
sure that we don’t overflow:
// Allocate
uint new_lll_idx = g_LightFragmentLinkedBuffer.IncrementCounter();

// Don't overflow
if ( new_lll_idx >= g_VP_LLLMaxCount )
{
    return;
}

uint prev_lll_idx;

// Get the index of the last linked element stored and replace
// it in the process
g_LightStartOffsetBuffer.InterlockedExchange( dst_offset, new_lll_idx,
    prev_lll_idx );
At this point, we have all four of the required values to populate and store a
valid LightFragmentLink :
// Encode the light depth values
uint light_depth_max = f32tof16( light_depth ); // Back face depth
uint light_depth_min = f32tof16( front_depth ); // Front face depth

// Final output
LightFragmentLink element;

// Pack the light depth
element.m_DepthInfo = ( light_depth_min << 16 ) | light_depth_max;

// Index / Link
element.m_IndexNext = ( light_index << 24 ) | ( prev_lll_idx & 0xFFFFFF );

// Store the element
g_LightFragmentLinkedBuffer[ new_lll_idx ] = element;
With the LLL index calculated, we fetch our first link from the unordered
access view resource g_LightStartOffsetView and we start our lighting loop; the
loop stops whenever we find an invalid value.
// Decode the first element index
uint element_index = ( first_offset & 0xFFFFFF );

// Iterate over the Light Linked List
while ( element_index != 0xFFFFFF )
{
    // Fetch
    LightFragmentLink element = g_LightFragmentLinkedView[ element_index ];

    // Update the next element index
    element_index = ( element.m_IndexNext & 0xFFFFFF );

    ...
}
// Decode the light bounds
float light_depth_max = f16tof32( element.m_DepthInfo >> 0 );
float light_depth_min = f16tof32( element.m_DepthInfo >> 16 );

// Do depth bounds check
if ( ( l_depth > light_depth_max ) || ( l_depth < light_depth_min ) )
{
    continue;
}
If our pixel lies within the light’s bounds, we decode the global light index
stored in the LightFragmentLink and we use it to read the full light information
from a separate global resource.
// Decode the light index
uint light_idx = ( element.m_IndexNext >> 24 );

// Access the light environment
GPULightEnv light_env = g_LinkedLightsEnvs[ light_idx ];
In practice, generating the Light Linked List at one quarter of the native game
resolution, or even one eighth, is largely sufficient and reduces the required memory.
When rasterizing the light shells at this reduced resolution, we conservatively
down-sample the depth buffer, keeping the maximum depth of the full-resolution
pixels that each low-resolution pixel covers:
float4 d4_max;
{
    float4 d4_00 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( -3, -3 ) );
    float4 d4_01 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( -1, -3 ) );
    float4 d4_10 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( -3, -1 ) );
    float4 d4_11 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( -1, -1 ) );

    d4_max = max( d4_00, max( d4_01, max( d4_10, d4_11 ) ) );
}
{
    float4 d4_00 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( -3, 3 ) );
    float4 d4_01 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( -1, 3 ) );
    float4 d4_10 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( -3, 1 ) );
    float4 d4_11 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( -1, 1 ) );

    d4_max = max( d4_max, max( d4_00, max( d4_01, max( d4_10, d4_11 ) ) ) );
}
{
    float4 d4_00 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( 3, -3 ) );
    float4 d4_01 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( 1, -3 ) );
    float4 d4_10 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( 3, -1 ) );
    float4 d4_11 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( 1, -1 ) );

    d4_max = max( d4_max, max( d4_00, max( d4_01, max( d4_10, d4_11 ) ) ) );
}
{
    float4 d4_00 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( 3, 3 ) );
    float4 d4_01 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( 1, 3 ) );
    float4 d4_10 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( 3, 1 ) );
    float4 d4_11 = g_txDepth.GatherRed( g_samPoint, screen_uvs, int2( 1, 1 ) );

    d4_max = max( d4_max, max( d4_00, max( d4_01, max( d4_10, d4_11 ) ) ) );
}

// Calculate the final max depth
float depth_max = max( d4_max.x, max( d4_max.y, max( d4_max.z, d4_max.w ) ) );
1.6 Conclusion
The Light Linked List algorithm helped us to drastically simplify our lighting
pipeline while allowing us to light translucent geometry and particle effects, which
were highly desirable. With Light Linked List, we were able to match or improve
the performance of our deferred renderer, while reducing memory use. Addition-
ally, the flexibility of Light Linked List allowed us to easily apply custom lighting
for materials like skin, hair, cloth, and car paint.
In the future, we intend to further experiment with a more cache-coherent
layout for the LightFragmentLink buffer, as this seems likely to yield further per-
formance improvements.
Bibliography
[Gruen and Thibieroz 10] Holger Gruen and Nicolas Thibieroz. “Order Indepen-
dent Transparency and Indirect Illumination Using Dx11 Linked Lists.” Pre-
sentation at the Advanced D3D Day Tutorial, Game Developers Conference,
San Francisco, CA, March 9–13, 2010.
2
III
Deferred Normalized
Irradiance Probes
John Huelin, Benjamin Rouveyrol, and
Bartlomiej Wroński
2.1 Introduction
In this chapter we present deferred normalized irradiance probes, a technique
developed at Ubisoft Montreal for Assassin’s Creed 4: Black Flag. It was devel-
oped as a cross-console generation scalable technique and is running on all of our
six target hardware platforms: Microsoft Xbox 360, Microsoft Xbox One, Sony
Playstation 3, Sony Playstation 4, Nintendo WiiU, and PCs. We propose a par-
tially dynamic global illumination algorithm that provides high-quality indirect
lighting for an open world game. It decouples stored irradiance from weather and
lighting conditions and contains information for a whole 24-hour cycle. Data is
stored in a GPU-friendly, highly compressible format and uses only VRAM mem-
ory. We explain why the achieved quality is higher than what was possible with
other partially baked solutions such as precomputed radiance transfer (under typical
open-world game constraints).
We also describe our tools pipeline, including a fully GPU-based irradiance
baking solution. It is able to generate bounced lighting information for a full-
day cycle and big game world in less than 8 minutes on a single PC machine.
We present multiple optimizations to the baking algorithm and tools that helped
achieve such performance and high productivity.
We provide details of both the CPU and GPU runtime that streams and generates
data for a given world position, time of day, and lighting conditions.
Finally, we show how we applied the calculated irradiance information in a
fullscreen pass as part of our global ambient lighting and analyze the performance
of the whole runtime part of the algorithm. We discuss the achieved results and describe
how this technique affected art pipelines.
In the last section of our chapter, we propose potential improvements to the
developed solution: an analysis of the pros and cons of different irradiance data
storage bases and possible next-generation runtime extensions to improve the
quality even more.
2.1.1 Overview
Achieving realistic, runtime lighting is one of the biggest unsolved problems in real-
time rendering applications, especially in games. Simple direct lighting achieved
by analytical lights is quite easy to compute in real time. On the other hand,
indirect lighting and effects of light bouncing around the scene and its shadowing
are very difficult to compute in real time. Full-scene lighting containing both
direct and indirect lighting effects is called global illumination (GI), and full
runtime high-quality GI is the Holy Grail of rendering.
A full and proper solution to the light transport equation is impossible in the
general case—as it is an infinite integral and numerical solutions would require an
infinite number of samples. There are lots of techniques that approximate results,
but proper GI solutions are far from real time (they take seconds, minutes, or
even hours).
In games and real-time rendering, typically used solutions fall into three cat-
egories:
1. static and baked solutions,
2. dynamic crude approximations,
3. partially dynamic, partially static solutions.
The first category includes techniques like light mapping, radiosity normal map-
ping [McTaggart 04], or irradiance environment mapping [Ramamoorthi and Han-
rahan 01]. They can deliver very good final image quality, often indistinguishable
from ground truth for diffuse/Lambertian lighting. Unfortunately, due to their
static nature, they are not usable in games featuring very dynamic lighting con-
ditions (like changing time of day and weather).
The second category of fully dynamic GI approximation is gaining popularity
with next-generation consoles and powerful PCs; however, it still isn’t able to
fully replace static GI. Current dynamic GI algorithms still don't deliver a quality
level comparable to that of static solutions (light propagation volumes [Kaplanyan
09]), rely on screen-space information (deep screen-space G-buffer global illumina-
tion [Mara et al. 14]), or have prohibitive runtime cost (voxel cone tracing [Crassin
11]).
There are some solutions that try to decouple some elements of the light
transport equation—for example, shadowing, as in various screen-space ambient
occlusion techniques—but they capture only a single part of the phenomenon.
The final category containing partially dynamic and partially static solutions
is the most interesting one thanks to a variety of different approaches and so-
lutions working under different constraints. Usually in computer games we can
2. Deferred Normalized Irradiance Probes 197
assume that some of scene information is static (like placements of some objects
and scene geometry) and won’t change, so it is possible to precompute elements
of a light transport integral and apply them in the runtime. In our case, some
constraints were very limiting—very big open world size, previous generations of
consoles as two major target platforms, dynamic weather, and dynamic time of
day. On the other hand, due to the game setting, we didn’t need to think about
too many dynamic lights affecting GI and could focus only on sky and sun/moon
lighting.
An example of partially dynamic solutions is precomputed radiance trans-
fer [Sloan et al. 02]. It assumes that shaded scene is static, and lighting conditions
can be dynamic but are fully external (from faraway light sources). Under such
constraints, it is possible to precompute radiance transfer, store it using some
low-frequency basis, and then in runtime compute a product integral with simi-
lar representation of lighting in the scene. Using orthonormal storage functions
like spherical harmonics, the product integral is trivial and very efficient, as it
simplifies to a single dot product of basis functions coefficients. The biggest prob-
lem of typical partially resident texture (PRT) solutions is a long baking time
and large memory storage requirements (if stored per vertex or in PRT texture
maps). Interesting and practical variations and implementations of this technique
for an open-world game with dynamic weather, sky, and lighting conditions was
presented as deferred radiance transfer volumes by Mickael Gilabert and Nikolay
Stefanov at GDC 2012 [Gilabert and Stefanov 12].
Its advantages are numerous—relatively small required storage, real-time per-
formance on previous generations of consoles, good quality for outdoor rendering
scenarios, and full dynamism. For Assassin's Creed 4, we tried integrating
this technique in our engine. Unfortunately, we found that while it delivered
good quality for uninhabited and rural areas, it wasn’t good enough in case of
dense, colonial towns with complex shadowing. The achieved results were too
low-frequency, both in terms of temporal information (indirect lighting direction
and irradiance didn’t change enough when changing time of day and the main
light direction) as well as spatial density (a probe every 4 meters was definitely
not enough). We realized that simple second-order spherical harmonics are not
able to capture radiance transfer in such complex shadowing of the scene (the
result was always a strong directional function in the upper hemisphere, so light-
ing didn’t change too much with changing time of day). We decided to keep
parts of the solution but to look for a better storage scheme fitting our project
requirements.
1. Game levels are fully static in terms of object placement and diffuse mate-
rials.
4. Weather affects only light color and intensity, not light direction.
5. Worlds are very big, but only parts of them are fully accessible to the player
and need global illumination.
6. We had a good-quality and optimal system for handling direct sky lighting
and its occlusion already [St-Amour 13].
Figure 2.1. Simplified diagram of our algorithm split into two parts; the offline part
stores irradiance at eight different hours and computes 2D VRAM textures (many light
probes).
The whole algorithm is split into two parts: the static, tool-side part and the
final runtime part.
The tool-side part consists of the following steps:
1. Spawn a uniform 2.5D grid of light probes, placing them on the lowest point
accessible to the player (near the ground).
3. For each sector, render a cube map with G-buffer information for each probe
and a single, high-resolution shadow map for every keyframed hour for the
whole sector using calculated light direction.
4. Using pixel shaders, compute normalized irradiance for every light probe
and keyframed hour, and store it in a texture.
Having baked textures storing this information, we are able to use them at
runtime in the following steps:
5. In the deferred ambient pass, combine all computed information with sky
lighting and SSAO into the final per-pixel ambient lighting.
A simplified diagram showing these steps and split between the editor and runtime
parts is shown in Figure 2.1. Both parts will be covered in detail in following
sections.
Figure 2.2. A light probe and the four vectors constructed using the irradiance basis.
We store light probes in a uniform 2.5D grid. The grid density is 1 probe
every 2 meters, and such assumptions helped us to keep the runtime code very
simple. We organized light probes into sectors of 16 × 16 light probes (32 ×
32 meters). Therefore, such a sector takes 24 kB of memory. We store sectors
Figure 2.4. Example probe placement—notice the lack of probes on buildings’ rooftops.
need them. And because the number of probes directly drove the generation and
baking process times, we had to address the problem of game levels over-sampling.
In order to reduce the number of probes, we had to remove as many unneces-
sary probes automatically as possible—for instance, probes inside houses, in the
ocean, or in unreachable areas. We decided to use the player and AI navigation mesh (navmesh) for a few reasons: it gave us a simple, easy-to-query representation of our world, and it provided clues about where the player can and, most importantly, can't go.
We also wanted to avoid placing probes on the roofs. We used the navigation
mesh in conjunction with another representation of our world called the ground-
heights map (usually used for sound occlusion, it stores only the ground height;
no roofs or platforms are included in this data). By computing the difference
between the navmesh z position and the ground height position, we decided,
under a certain threshold, whether to spawn the probe or not—see Figure 2.4
and Figure 2.5.
If included, the probe was spawned on the lowest z position of the navmesh.
The xy position was decided by the regular grid. This gave us a 70% reduction
of the number of probes spawned in our biggest game world, bringing it down to
30,000 probes.
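As a rough sketch of this placement filter (the helper queries and the threshold value are hypothetical, not the shipped tool code), the per-cell test could look like this:

// Hypothetical sketch of the probe placement filter described above.
// NavmeshLowestZ() and GroundHeight() stand for queries into the navmesh
// and the ground-heights map; the 2.5D grid supplies the xy position.
bool ShouldSpawnProbe(float2 gridPositionXY, float heightThreshold)
{
    // Lowest navmesh z accessible to the player at this grid cell
    float navmeshZ = NavmeshLowestZ(gridPositionXY);
    // Ground height from the sound-occlusion height map (no roofs or
    // platforms are included in this data)
    float groundZ = GroundHeight(gridPositionXY);
    // Probes far above the ground (e.g., on rooftops) are rejected
    return (navmeshZ - groundZ) < heightThreshold;
}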
Because of memory constraints, we couldn’t keep all the data for the whole
world loaded at the same time on consoles: we split it by sectors of 32×32 meters,
aligned on the uniform grid. Therefore, the texture owned by a sector is 16 × 16
texels.
2.3.2 Baking
For each probe, we needed to get an irradiance value for each of the four basis vectors. We didn't have any baking solution in our engine, and writing a dedicated ray tracer or renderer was out of the question. We also wanted the lighting artists to be able
to iterate directly on their computers (not necessarily DirectX 11 compatible at
that point), so it had to be completely integrated inside our world editor.
Due to such constraints, we decided to use cube-map captures. It meant
getting one G-buffer cube map and one shadow map for each time of day, lighting
them, and integrating them to get the irradiance values. The normalized lighting
was done at eight different times of day, with a plain white light, no weather
effects enabled (rain, fog) and neutral but still artist-controllable ambient terms
(to be able to still capture some bounces in the shadowed areas). To do the
integration, for each basis and for each time of day, we computed a weighted
integral of normalized irradiance responses against the basis.
The irradiance computation is very similar to that of [Elias 00]: for every single basis vector and for every cube-map texel, we project the incoming radiance to get the diffuse lighting contribution. To do this efficiently, we have a multiplier
map that takes into account both Lambert’s cosine law term and the hemicube’s
shape compensation. This weight map is normalized (the sum of all texel weights
for a single basis is 1). Compensation is necessary because different cube-map
texels corresponding to different positions subtend a different solid angle on a
hemisphere. Once incoming radiance is multiplied by a bidirectional reflectance
distribution function (BRDF) and normalization factors for a given basis, we can
integrate it by simply adding all texel contributions together.
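A minimal sketch of this weighted integration for one basis vector might look as follows; the texture names and the precomputed weight map are assumptions, and in the actual baker the sum is performed as a chain of 2 × 2 down-sampling passes rather than a serial loop:

// Sketch: integrate normalized irradiance for a single basis vector.
// RadianceFace contains the lit cube-map face texels; WeightMapForBasis
// stores Lambert's cosine term times the hemicube shape compensation,
// pre-normalized so that all weights for one basis sum to 1.
float3 IntegrateBasis(Texture2D<float3> RadianceFace,
    Texture2D<float> WeightMapForBasis, uint2 faceSize)
{
    float3 irradiance = 0.0f;
    for (uint y = 0; y < faceSize.y; ++y)
    {
        for (uint x = 0; x < faceSize.x; ++x)
        {
            // The weight already contains the BRDF and solid-angle
            // compensation, so the integral reduces to a weighted sum.
            irradiance += RadianceFace[uint2(x, y)] *
                WeightMapForBasis[uint2(x, y)];
        }
    }
    return irradiance;
}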
We faded the integrated radiance smoothly with distance to avoid the popping and aliasing artifacts that could happen because of the limited cube-map far plane (used for optimization purposes); without the fade, GI could suddenly appear in some cases. Then, for every basis vector, we merged the information from the relevant cube-map faces and downscaled the result to a single pixel that represented our normalized irradiance for a given probe position and basis direction.
All the data was then packed in our sector textures and added to our loading grid
to be streamable at runtime.
Therefore, our first version directly used our renderer to generate each face of
each cube map at each time of day independently, integrating the results on the
CPU by locking the cube-map textures and integrating radiance in serial loops.
Even with efficient probe number reduction like the one mentioned in Section
2.3.1, computing the data for around 30,000 probes for eight different times of
day was a lengthy process: at 60 fps and 48 renders for every probe (6 faces ×
8 times of day), it would take 400 minutes. This “quick and dirty” prototype
generated data for the world in 12 hours. Most of the time was spent on the CPU
and GPU synchronization and on the inefficient, serial irradiance integration. The
synchronization problem was due to the fact that on PCs it is not uncommon
for the GPU to be 2–3 frames behind the CPU and the command buffer being
written due to driver scheduling and resource management. Also, sending and
copying lots of data between GPU and CPU memory (needed for reading) is
much slower than localized, GPU-only operations. Therefore, when we tried to
lock the cube-map textures for CPU read-back in the naïve way (after every single cube-map face was rendered), we spent an order of magnitude more time on synchronization and CPU computations than on the actual rendering.
(See Figure 2.6.)
Therefore, the first step was to remove the CPU irradiance calculations part
by processing all the downscaling and irradiance calculations on the GPU and
reading back only final irradiance values on the CPU. This kind of operation is
also trivially parallelizable (using many simple 2 × 2 down-sample steps) and is
well suited for the GPU, making the whole operation faster than the CPU version.
But even when the whole algorithm was running on the GPU, we were still
losing a lot of time on the CPU when locking the final result (1 lock per probe)
Figure 2.6. Diagram showing the first naïve implementation for the GPU-based baker.
Work is done on a per-probe basis.
Figure 2.7. Overview of batched baking rendering pipeline. “Render Sector N ” means
drawing, lighting, and computing irradiance for each of the 16 × 16 probes in Sector N .
because the CPU was well ahead of the GPU. We decided to use a pool of textures
and lock only when we knew the GPU actually wrote the data and it was ready
to be transferred (we checked it using asynchronous GPU queries). Batching also
helped: instead of locking texture for every probe, we locked once per sector—
each probe was directly writing its data to its own texel inside the sector’s texture.
At that point, our entire baker was running asynchronously between CPU and
GPU and was generating the whole map in around three hours. The GPU cost
was still high, but we were mainly CPU traversal bound at that point. (See
Figure 2.7.)
To cut some of the CPU cost, we wrote a new occlusion culling system that
was much less accurate (it didn’t matter for such short rendering distances), but
simpler and faster. We used a simple custom traversal per sector (radial distance
around the sector) and used also a reduced far-plane distance during the cube-
map generation.
To reduce the GPU workload, we also generated only one shadow map per
sector, instead of per probe. This helped reduce the GPU cost, as well as the
CPU cost of traversing the scene for the shadow geometry pass each time for each
time of day.
For each face of the cube map, we were generating the G-buffer only once. We
could reuse it for each time of day, as material properties like albedo and normals
don't change over time. We could then light the cube maps for every keyframed time using the albedo, normal, and depth information we had, plus the sun and shadow-map information for the requested time of day.
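A sketch of this relighting step for a single G-buffer texel and one keyframed hour could look like the following (the constant names are assumptions, and the shadow term is assumed to come from the sector shadow map for that hour):

// Sketch: relight one G-buffer cube-map texel for a given keyframed hour.
// The G-buffer is rendered once per face; only the sun direction and the
// sector shadow map change per time of day.
float3 RelightGBufferTexel(float3 albedo, float3 normal,
    float3 sunDirection, float sunShadow)
{
    // Plain white, normalized key light plus a small artist-controlled
    // ambient term so that shadowed areas still receive some bounce.
    const float3 normalizedSunColor = float3(1.0f, 1.0f, 1.0f);
    float NdotL = saturate(dot(normal, sunDirection));
    return albedo * (normalizedSunColor * NdotL * sunShadow
        + g_NeutralAmbientTerm);
}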
At the end, generating our biggest world was taking 8 minutes on an artist’s
computer. The baking was so fast that we provided a real-time baking mode: it collected the closest probes and computed lighting for them in the background. This way, artists could see the result of their work with GI almost immediately.
RenderSector
    Draw shadow maps containing sector for the eight times of day
    For each probe in sector
        For each of the six directions
            Render G-buffer centered on probe
            For each time of day
                Use sector shadow map for current time of day
                Perform lighting
                For every basis
                    Compute texels' irradiance BRDF contribution
                    Down-sample irradiance contribution until 1 x 1
Figure 2.8. Implemented GI baking debugging tools. Top left inset, from left to right:
the six faces of the normalized cube map for the current time of day, the associated depth
for those faces, the current sector shadowmap, and the runtime heightmap texture (see
Section 2.4.3).
Because all these textures are relatively lightweight (24 kB of VRAM per sec-
tor), the impact on the game streaming system was negligible and no additional
effort was necessary to improve the loading times.
Figure 2.9. Resolved DNIP textures—the circle shape is a radial attenuation of the
DNIP to hide popping when streaming data in or out.
Figure 2.10. Debug representation for the resolved DNIP textures: yellow square—
single blitted sector; blue area—whole final irradiance texture; dashed squares—squares
only partially blitted into final texture.
Each of these draw calls will interpolate the DNIP data from the two closest
stored times of day, and multiply the result by the current lighting condition.
Based on the distance to the camera, we fade out the DNIP contribution to a
constant color. This allows us to stream in and out the DNIP data that is far
away without any discontinuity. This shader is very cheap to evaluate (works
on a configuration of three 128 × 128 render targets): less than 0.1 ms on Play-
station 3.
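A simplified sketch of one such resolve, ignoring the four-basis layout for brevity and using hypothetical resource names, could be:

// Sketch: resolve one sector's DNIP texel for the current time of day.
// SampleDNIP() reads the normalized irradiance stored for a keyframed
// hour; hourA/hourB are the two closest stored times and hourLerp the
// blend factor between them.
float3 ResolveDNIPTexel(float2 sectorUV, int hourA, int hourB,
    float hourLerp, float3 currentSunColorAndIntensity, float distanceFade)
{
    float3 irradianceA = SampleDNIP(sectorUV, hourA);
    float3 irradianceB = SampleDNIP(sectorUV, hourB);
    // Interpolate the normalized irradiance between the two keyframed hours
    float3 normalizedIrradiance = lerp(irradianceA, irradianceB, hourLerp);
    // De-normalize with the current lighting condition, then fade to a
    // constant color with distance so sectors can stream in and out
    // without any discontinuity.
    float3 result = normalizedIrradiance * currentSunColorAndIntensity;
    return lerp(g_DistantConstantColor, result, distanceFade);
}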
Once these textures are generated, we use them during the ambient lighting
pass.
Figure 2.11. Visual summary of DNIP evaluation: (a) resolved DNIP textures, (b) world
height-map data, (c) world-space normals buffer, and (d) final indirect sunlight GI
contribution.
Figure 2.12. Lighting composition: (a) direct sunlight, (b) direct sky lighting, (c) indi-
rect sunlight (exaggerated), and (d) composed ambient lighting buffer.
Figure 2.13. Final composed image with direct sunlight and albedo.
Figure 2.14. World ambient occlusion: (a) source top-down depth map and (b) blurred
shadow map used for the runtime evaluation of the world ambient occlusion.
DNIP results are added to the sky lighting, giving the final ambient color.
On next-generation consoles and PCs, this ambient term gets multiplied by
SSAO before being added to the direct lighting. On the previous generation of
consoles, because of memory constraints, SSAO was multiplied at the end of the
lighting pass (after sunlight and local lights). This was not strictly correct, but it allowed us to
alias some render targets and save a considerable amount of GPU memory.
Table 2.1 gives a summary of the important data used by the DNIP technique.
The GPU time indicates the total time taken by both the resolved DNIP textures
generation and the ambient lighting pass. The 600 kB of VRAM is for a total of
25 DNIP textures of streamed sectors, which covers an area of 160 × 160 meters
around the camera. The render targets are the resolved DNIP textures, which
are 64 × 64 and cover an area of 128 × 128 meters around the camera.
Because our baking algorithm runs entirely on the GPU and needs no special data structures, generating the indirect lighting at runtime for the closest probes could be another path to explore. This way we could support single- or multi-bounce indirect lighting from various light sources, occluded by dynamic objects, instead of just the key lighting.
Finally, having multiple volumes of GI would allow us to work at multiple
content-dependent frequencies and help solve the potential light leaking problem
that would happen in any game mixing indoors and outdoors. This was not a
problem on Assassin's Creed 4, as the game was based mostly on exteriors; in our case almost no system supported mixed interiors and exteriors, and the issue was solved by game design and in data. All the interiors were separate areas into which players were teleported instead of being real areas embedded in the world.
2.6 Acknowledgments
We would like to thank the whole Assassin’s Creed 4 team of rendering program-
mers, technical art directors, and lighting artists for inspiring ideas and talks
about the algorithm and its optimizations. Special thanks go to Mickael Gilabert,
author of “Deferred Radiance Transfer Volumes,” for lots of valuable feedback
and suggestions and to Sebastien Larrue, Danny Oros, and Virginie Cinq-Mars
for testing and giving feedback and practical applications of our solution.
Bibliography
[Crassin 11] Cyril Crassin. "GigaVoxels: A Voxel-Based Rendering Pipeline for Efficient Exploration of Large and Detailed Scenes." PhD thesis, Grenoble University, 2011.
[Gilabert and Stefanov 12] Mickael Gilabert and Nikolay Stefanov. “Deferred
Radiance Transfer Volumes.” Presented at Game Developers Conference,
San Francisco, CA, March 5–9, 2012.
[McTaggart 04] Gary McTaggart. "Half Life 2/Valve Source Shading." Direct3D Tutorial, http://www2.ati.com/developer/gdc/D3DTutorial10_Half-Life2_Shading.pdf, 2004.
[Ramamoorthi and Hanrahan 01] Ravi Ramamoorthi and Pat Hanrahan. “An
Efficient Representation for Irradiance Environment Maps.” In Proceedings
of the 28th Annual Conference on Computer Graphics and Interactive Tech-
niques, pp. 497–500. New York: ACM, 2001.
[Sloan et al. 02] Peter-Pike Sloan, Jan Kautz, and John Snyder. “Precomputed
Radiance Transfer for Real-Time Rendering in Dynamic, Low-Frequency
Lighting Environments.” Proc. SIGGRAPH ’02: Transaction on Graphics
21:3 (2002), 527–536.
3.1 Introduction
This chapter presents volumetric fog, a technique developed at Ubisoft Montreal
for Microsoft Xbox One, Sony Playstation 4, and PCs and used in Assassin’s
Creed 4: Black Flag. We propose a novel, real-time, analytical model for calcu-
lating various atmospheric phenomena. We address the problem of unifying and
calculating in a coherent and optimal way various atmospheric effects related to
atmospheric scattering, such as
3.2 Overview
Atmospheric scattering is a very important physical phenomenon describing the interaction of light with various particles and aerosols in transporting media (like air, steam, smoke, or water). It is responsible for various visual effects and phenomena, like sky color, clouds, fog, volumetric shadows, light shafts, and "god rays."
Computer graphics research tries to reproduce those effects accurately. They not only increase the realism of rendered scenes and help to establish the visual distinction of distances and relations between objects, but can also be used to create a specific
mood of a scene or even serve as special effects. Computer games and real-time
rendering applications usually have to limit themselves to simplifications and
approximations of the phenomena, including analytical exponential fog [Wenzel
06], image-based solutions [Sousa 08], artist-placed particles and billboards, or,
recently, various modern ray-marching–based solutions [Tóth and Umenhoffer 09,
Vos 14, Yusov 13].
All of those approaches have their limitations and disadvantages—but ray
marching seemed most promising and we decided to base our approach on it.
Still, typical 2D ray marching has a number of disadvantages:
• Solutions like epipolar sampling [Yusov 13] improve the performance but
limit algorithms to uniform participating media density and a single light
source.
• Most algorithm variations are not compatible with forward shading and
multiple layers of transparent affected objects. A notable exception here
is the solution used in Killzone: Shadow Fall [Vos 14], which uses low-
resolution 3D volumes specifically for particle shading. Still, in this approach, scattering effects for shaded solid objects are computed in an image-based manner.
When light travels through a participating medium, three phenomena come into play:
• transmittance,
• scattering,
• absorption.
Figure 3.2. Example polar plot of a phase function for clouds [Bouthors et al. 06]. In
this plot, we see how much scattering happens in which direction—zero being the angle
of the original light path direction.
On the other hand, so-called Mie scattering of bigger particles (like aerosols
or dust) has a very anisotropic shape with a strong forward lobe and much higher
absorption proportion.
In reality, photons may get scattered many times before entering the eye or
camera and contributing to the final image. This is called multi-scattering. Un-
fortunately, such effects are difficult and very costly to compute, so for real-time
graphics we use a single-scattering model. In this model, atmospheric scatter-
ing contributes to the final image in two separate phenomena, in-scattering and
out-scattering.
In-scattering is the effect of additional light entering the paths between shaded
objects and the camera due to scattering. Therefore, we measure larger radi-
ance values than without the scattering. Out-scattering has the opposite effect—
because of scattering, photons exit those paths and radiance gets lost. The phys-
ical term describing how much light gets through the medium without being
out-scattered is transmittance. When in- and out-scattering effects are combined
in a single-scattering model, they result in contrast loss in comparison to the
original scene.
Characteristics of different scattering types can be modeled using three math-
ematical objects: scattering coefficient βs , absorption coefficient βa , and a phase
function. A phase function is a function of the angle between an incoming light
source and all directions on a sphere describing how much energy is scattered in
which direction. We can see an example of a complex phase function (for clouds)
in Figure 3.2.
A very common, simple anisotropic phase function that is used to approximate
Mie scattering is the Henyey–Greenstein phase function. It is described using the
following formula:
$$p(\theta) = \frac{1}{4\pi}\,\frac{1-g^2}{\left(1+g^2-2g\cos\theta\right)^{3/2}},$$
where g is the anisotropy factor and θ is the angle between the light vector and
Figure 3.3. Polar plot of Henyey–Greenstein phase function for different g anisotropy
coefficients (0.0, 0.1, 0.2, 0.3, and 0.4). In this plot, the positive x-axis corresponds to
the original view direction angle.
the view vector (facing the camera). We can see how this phase function looks
for different anisotropy factors in Figure 3.3.
The Henyey–Greenstein phase function has two significant advantages for use
in a real-time rendering scenario. First, it is very efficient to calculate in shaders
(most of it can be precomputed on the CPU and passed as uniforms) for analytical
light sources. Second, the Henyey–Greenstein phase function is also convenient
to use for environmental and ambient lighting. Very often, ambient lighting,
sky lighting, and global illumination are represented using spherical harmon-
ics [Green 03]. To calculate the integral of spherical harmonics lighting with a
phase function, one has to calculate the spherical harmonics representation of
the given function first. This can be difficult and often requires the expensive
step of least-squares fitting [Sloan 08]. Fortunately, the Henyey–Greenstein phase
function has trivial and analytical expansion to zonal spherical harmonics, which
allows efficient product integral calculation of lighting that is stored in the spher-
ical harmonics (SH). The expansion up to the fourth-order zonal SH is simply
$(1, g, g^2, g^3)$.
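As an illustration, the phase-related helper functions referenced later in Listing 3.1 could be implemented along the following lines. This is only a sketch: the direction and sign conventions, the two-band SH layout, and passing the SH coefficients directly (instead of fetching them per world position) are assumptions.

// Henyey-Greenstein phase function for an analytical light source.
// Both directions are assumed normalized; the sign of cosTheta depends
// on the chosen direction conventions.
float GetPhaseFunction(float3 viewDirection, float3 lightDirection,
    float anisotropy)
{
    float g = anisotropy;
    float cosTheta = dot(lightDirection, viewDirection);
    float denom = 1.0f + g * g - 2.0f * g * cosTheta;
    return (1.0f - g * g) / (4.0f * 3.14159265f * denom * sqrt(denom));
}

// Product integral of SH-encoded ambient lighting with the HG phase
// function: because of the zonal expansion (1, g, g^2, g^3), it reduces
// to evaluating the SH lighting in the view direction with the band-l
// coefficients scaled by g^l. Shown here for two SH bands per channel.
float3 GetAmbientConvolvedWithPhaseFunction(float4 shR, float4 shG,
    float4 shB, float3 viewDirection, float anisotropy)
{
    float g = anisotropy;
    float3 d = viewDirection;
    // SH basis for bands 0 and 1, evaluated in the view direction
    float4 Y = float4(0.282095f,
                      0.488603f * d.y,
                      0.488603f * d.z,
                      0.488603f * d.x);
    // Zonal HG factors: 1 for band 0, g for the three band-1 terms
    float4 weights = Y * float4(1.0f, g, g, g);
    return float3(dot(shR, weights), dot(shG, weights), dot(shB, weights));
}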
Finally, the last physical law that is very useful for light scattering calculations
is the Beer–Lambert law that describes the extinction of incoming lighting (due to
the light out-scattering). This law defines the value of transmittance (proportion
of light transported through medium to incoming light from a given direction).
It is usually defined as
$$T(A \rightarrow B) = e^{-\int_A^B \beta_e(x)\,dx},$$
where βe is the extinction coefficient, defined as the sum of scattering and ab-
sorption coefficients. We can see from the Beer–Lambert law that light extinction
is an exponential function of traveled distance by light in a given medium.
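For example, in a homogeneous medium with a constant extinction coefficient $\beta_e$, the transmittance over a path of length $d$ reduces to $T = e^{-\beta_e d}$; doubling either the extinction coefficient or the traveled distance squares the transmittance.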
Figure 3.4. Layout of the volumetric textures: X and Y map to the device X and Y coordinates, while Z follows an exponential depth distribution; 16-bit floating-point RGBA, 160 × 90 × 64 or 160 × 90 × 128.
Depending on performance budgets and target hardware platforms (for example, high-end PCs), the resolutions used could be larger, as the algorithm scales linearly in terms of the number of compute threads and arithmetic logic unit (ALU) operations.
We used two of such textures: one for in-scattered lighting at a given point and
a density-related extinction coefficient and the second one to store final lookups
for integrated in-scattering and transmittance. The used format for those two
textures was a four-channel (RGBA) 16-bit floating point. The volumetric tex-
ture’s layout can be seen in Figure 3.4.
The resolution of volumetric textures may seem very low in the X and Y
dimensions, and it would be true with 2D ray-marching algorithms. To calcu-
late information for low-resolution tiles, classic ray-marching approaches need
to pick a depth value that is representative of the whole tile. Therefore, many
depth values contained by this tile might not be represented at all. Algorithms
like bilateral up-sampling [Shopf 09] try to fix it in the up-sampling process by
checking adjacent tiles for similar values. However, this approach can fail in case
of thin geometric features or complex geometry. Volumetric fog doesn’t suffer
from this problem because, for every 2D tile, we store scattering values for many
depth slices. Even very small, 1-pixel wide objects on screen can get appropriate
depth information. Figure 3.5 shows this comparison of 2D and 3D approaches
in practice.
Still, even with better filtering schemes, small-resolution rendering can cause
artifacts like under-sampling and flickering of higher-frequency signals. Sec-
tions 3.3.7 and 3.4.3 will describe our approach to fix those problems.
A significant disadvantage of such low volume resolution rendering is visual
softness of the achieved effect, but it can be acceptable for many scenarios. In
our case, it did fit our art direction, and in general it can approximate a “soft”
multi-scattering effect that would normally have prohibitive calculation cost.
Figure 3.5. Flat XZ scene slice. (a) A smaller-resolution 2D image (black lines represent
depth) causes lack of representation for a small object (black dot)—no adjacent tiles
contain proper information. (b) All objects, even very small ones, get proper filtered
information (3D bilinear filtering shown as green boxes).
$$d(h) = d_0 \times e^{-hD},$$
where d(h) is the calculated density for height h, d0 is density at the reference
level (literature usually specifies it as ground or sea level), and D is the scaling
coefficient describing how fast the density attenuates. Coefficient D depends on
the type of aerosols and particles, and in typical in-game rendering scenarios, it
probably will be specified by the environment and lighting artists.
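A sketch of such a height-based density term (with assumed constant names, later multiplied by the artist-driven noise described next) could be:

// Sketch: exponential height-based fog density, d(h) = d0 * exp(-h * D).
// g_FogDensityAtReferenceLevel (d0), g_FogReferenceHeight, and
// g_FogHeightFalloff (D) would be exposed to the lighting artists.
float CalculateHeightDensity(float3 worldPosition)
{
    float heightAboveReference = worldPosition.z - g_FogReferenceHeight;
    return g_FogDensityAtReferenceLevel *
        exp(-heightAboveReference * g_FogHeightFalloff);
}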
The second part of density estimation is purely art driven. We wanted to
simulate clouds of dust or water particles, so we decided to use the animated,
volumetric GPU shader implementation of Ken Perlin’s noise function [Perlin 02,
Green 05]. It is widely used in procedural rendering techniques as it has the advantages of smoothness, a lack of bilinear filtering artifacts, derivative continuity, and realistic results. We can see it in Figure 3.6. Perlin's improved noise can be combined in multiple octaves at varying frequencies to produce a fractal turbulence effect.
Figure 3.6. (a) Bilinear textured noise compared to (b) volumetric 3D improved Perlin
noise [Green 05].
// World-space position of volumetric texture texel
float3 worldPosition
    = CalcWorldPositionFromCoords(dispatchThreadID.xyz);

// Thickness of slice -- non-constant due to exponential slice
// distribution
float layerThickness = ComputeLayerThickness(dispatchThreadID.z);

// Estimated density of participating medium at given point
float dustDensity = CalculateDensityFunction(worldPosition);

// Scattering coefficient
float scattering = g_VolumetricFogScatteringCoefficient * dustDensity *
    layerThickness;

// Absorption coefficient
float absorption = g_VolumetricFogAbsorptionCoefficient * dustDensity *
    layerThickness;

// Normalized view direction
float3 viewDirection = normalize(worldPosition - g_WorldEyePos.xyz);

float3 lighting = 0.0f;

// Lighting section BEGIN
// Adding all contributing lights' radiance and multiplying it by
// a phase function -- volumetric fog equivalent of BRDFs
lighting += GetSunLightingRadiance(worldPosition) *
    GetPhaseFunction(viewDirection, g_SunDirection,
        g_VolumetricFogPhaseAnisotropy);
lighting += GetAmbientConvolvedWithPhaseFunction(worldPosition,
    viewDirection, g_VolumetricFogPhaseAnisotropy);

[loop]
for (int lightIndex = 0; lightIndex < g_LightsCount; ++lightIndex)
{
    float3 localLightDirection =
        GetLocalLightDirection(lightIndex, worldPosition);
    lighting += GetLocalLightRadiance(lightIndex, worldPosition) *
        GetPhaseFunction(viewDirection, localLightDirection,
            g_VolumetricFogPhaseAnisotropy);
}
// Lighting section END

// Finally, we apply some potentially non-white fog scattering albedo
lighting *= g_FogAlbedo;

// Final in-scattering is product of outgoing radiance and scattering
// coefficients, while extinction is sum of scattering and absorption
float4 finalOutValue = float4(lighting * scattering, scattering
    + absorption);
Listing 3.1. Pseudocode for calculating in-scattering lighting, scattering, and absorption
coefficients in compute shaders.
The last part of lighting in-scattering that helps to achieve scene realism is
including ambient, sky, or indirect lighting. Ambient lighting can be a dominat-
ing part of scene lighting in many cases, when analytical lights are shadowed.
Without it, the scene would be black in shadowed areas. In a similar way, if am-
bient lighting is not applied to in-scattering, the final scattering effect looks too
dark (due to lighting out-scattering and extinction over the light path). Figure
3.7 shows a comparison of a scene with and without any ambient lighting.
The main difference between direct lighting and ambient lighting is that ambi-
ent lighting contains encoded information about incoming radiance from all pos-
sible directions. Different engines and games have different ambient terms—e.g.,
constant term, integrated cube sky lighting, or environment lighting containing
global illumination. The main problem for calculating the in-scattering of ambi-
ent lighting is that most phase functions have only simple, directional, analytical
forms, while ambient contribution is usually omnidirectional but nonuniform.
Figure 3.7. Effect of adding ambient lighting to volumetric fog in-scattering calculations: (a) fog without sky lighting or GI (note the darkening), and (b) fog with sky lighting and GI.
In our case, ambient lighting was split into two parts. First, the indirect
sunlight was stored and shaded using deferred normalized irradiance probes, de-
scribed in the previous chapter [Huelin et al. 15]. We used a simple irradiance
storage basis constructed from four fixed-direction basis vectors, so it was trivial
to add their contribution to volumetric fog and calculate the appropriate phase
function. The second part was the cube-map–based sky lighting (constructed in
real time from a simple sky model) modulated by the precomputed sky visibility
[St-Amour 13]. It was more difficult to add it properly to the fog due to its
omnidirectional nature. Fortunately, when we calculated the cube-map represen-
tation using CPU and SPU jobs, we computed a second, simpler representation
in spherical-harmonics basis as well. As described in Section 3.2.1, this orthonor-
mal storage basis is very simple and often used to represent environment lighting
[Green 03]. We used the Henyey–Greenstein phase function due to its very simple
expansion to spherical harmonics and calculated its product integral with the sky
lighting term in such form.
The optimization that we used in Assassin’s Creed 4 combines density estima-
tion and lighting calculation passes together. As we can see in Figure 3.8, those
passes are independent and can be run in serial, in parallel, or even combined. By
combining, we were able to write out the values to a single RGBA texture—RGB
contained information about in-scattered lighting, while alpha channel contained
extinction coefficient (sum of scattering and absorption coefficients).
This way we avoided the cost of writing and reading memory and launch-
ing a new compute dispatch between those passes. We also reused many ALU
computations—local texture-space coordinates, slice depth, and the texture voxel
world position. Therefore, all computations related to in-scattering were per-
formed locally and there was no need for an intermediate density buffer. It’s
worth noting though that in some cases it may be beneficial to split those passes—
for example, if density is static, precomputed, or artist authored, or if we simply
can calculate it in lower resolution (which often is the case). Splitting passes can lower the effective register count and increase shader occupancy as well. It
is also impossible to evaluate density in the lighting pass if some dynamic and
nonprocedural density estimation techniques are used.
4. Write out to another volumetric texture at the same position, with RGB as the accumulated in-scattering and alpha as the transmittance value.
Figure 3.9. A 2D group of X × Y compute shader threads progresses along the Z axis of the volumetric texture, one slice at a time (ray-marching compute shader).
The pass progresses with this loop until all Z slices are processed. This process
is illustrated in Figure 3.9.
A single step of this process accumulates both in-scattering color as well as
the scattering extinction coefficients, which are applied in the Beer–Lambert
law. This way, we can calculate transmittance for not only color but also the
in-scattered lighting. Lighting in-scattered farther away from camera gets out-
scattered by decreasing the transmittance function, just like the incoming radi-
ance of shaded objects. Without it, with very long camera rays, in-scattering
would improperly accumulate to infinity—instead, it asymptotically approaches
some constant value. The entire code responsible for this is presented in Listing 3.2.
// One step of numerical solution to the light
// scattering equation
float4 AccumulateScattering(in float4 colorAndDensityFront,
    in float4 colorAndDensityBack)
{
    // rgb = in-scattered light accumulated so far,
    // a = accumulated scattering coefficient
    float3 light = colorAndDensityFront.rgb + saturate(
        exp(-colorAndDensityFront.a)) * colorAndDensityBack.rgb;
    return float4(light.rgb, colorAndDensityFront.a +
        colorAndDensityBack.a);
}

// Writing out final scattering values
void WriteOutput(in uint3 pos, in float4 colorAndDensity)
{
    // final value rgb = in-scattered light accumulated so far,
    // a = scene light transmittance
    float4 finalValue = float4(colorAndDensity.rgb,
        exp(-colorAndDensity.a));
    OutputTexture[pos].rgba = finalValue;
}

WriteOutput(uint3(dispatchThreadID.xy, 0), currentSliceValue);
for (uint z = 1; z < VOLUME_DEPTH; z++)
{
    uint3 volumePosition = uint3(dispatchThreadID.xy, z);
    float4 nextValue = InputTexture[volumePosition];
    currentSliceValue =
        AccumulateScattering(currentSliceValue, nextValue);
    WriteOutput(volumePosition, currentSliceValue);
}
// Read volumetric in-scattering and transmittance
float4 scatteringInformation = tex3D(VolumetricFogSampler,
    positionInVolume);
float3 inScattering = scatteringInformation.rgb;
float transmittance = scatteringInformation.a;

// Apply to lit pixel
float3 finalPixelColor = pixelColorWithoutFog * transmittance.xxx
    + inScattering;
Listing 3.3. Manual blending for applying the volumetric fog effect.
The blending operation is FinalPixelColor = PixelColorWithoutFog × Transmittance + InScattering, where InScattering is the RGB value of a texel read from the volumetric texture and Transmittance is its alpha.
Because we store 3D information for many discrete points along the view ray
(from camera position up to the effect range), it is trivial to apply the effect using
trilinear filtering to any amount of deferred- or forward-shaded objects. In the
case of deferred shading, we can read the value of the Z-buffer, and using it and
the screen position of shaded pixel, we can apply either hardware blending (Dest
× SourceAlpha + Source) or manual blending (Listing 3.3).
The sampler we are using is linear, so this way we get piecewise-linear ap-
proximation and interpolation of the in-scattering and transmittance functions.
It is not exactly correct (piecewise-linear approximation of an exponential decay
function), but the error is small enough, and even with the camera moving it
produces smooth results.
For the deferred-shaded objects, this step can be combined together with a
deferred lighting pass—as lighting gets very ALU heavy with physically based
rendering techniques, this could become a free step due to latency hiding. In-
formation for volumetric fog scattering can be read right at the beginning of the
lighting shader (it doesn’t depend on anything other than screen-space pixel po-
sition and depth value). It is not needed (there is no wait assembly instruction
that could stall the execution) until writing the final color to the lighting buffer,
so the whole texture fetch latency hides behind all the lights calculations.
For forward-lit objects, particles, and transparencies, we can apply scattering
in the same way. The advantage of our algorithm is that we can have any number
of layers of such objects (Figure 3.10) and don’t need to pay any additional cost
other than one sample from a volumetric texture and a fused multiplication–
addition operation.
Figure 3.10. Multiple layers of opaque and transparent objects and trilinear 3D texture
filtering.
Figure 3.11. Under-sampling and aliasing problems without a low-pass filter caused by
small changes in the shadow map (top). Correct low-pass filtering helps to mitigate
such problems (bottom).
Our performance figures on Microsoft Xbox One are shown in Table 3.1. It is
worth noting that we included in this table the cost of a separate fullscreen pass
for effect application in deferred rendering—but in typical rendering scenarios
this pass would be combined with deferred lighting. We also included the costs
of shadow-map down-sampling and blurring—but those passes are not unique to
the volumetric fog. They could be reused for particle shadowing or other low-
frequency shadowing (translucent object shadowing), and this way the cost would
be amortized among multiple parts of the rendering pipeline.
We are satisfied with the achieved results and performance and are already
using it in many other projects. Still, it is possible to extend the algorithm and
improve the quality and controllability, allowing us to achieve a slightly different
visual effect and fit other rendering scenarios. It is also possible to improve
performance for games with tighter frame budgets—like 60 fps first-person or
racing games.
The main area for future improvements is related to the low effect resolution.
While most of the shadow aliasing is gone due to the described shadowing algo-
rithm, aliasing from both density calculation and lighting could still be visible
with extreme fog and scattering settings. Also, staircase bilinear filtering arti-
facts can be visible in some high-contrast areas. They come from piecewise linear
approximation of bilinear filtering, which is only a C0-continuous function.
Such strong scattering settings were never used in Assassin’s Creed 4, so we
didn’t see those artifacts. However, this algorithm is now an important part of
the Anvil game engine and its renderer and we discussed many potential im-
provements that could be relevant for other projects. We will propose them in
the following subsections.
A hierarchical Z-buffer could be used to cull parts of the algorithm passes and skip updating the fog texture volumes behind solid objects, as this information won't be read and used for the shading of any currently visible
object. It doesn’t help in the worst case (when viewing distance is very large and
the whole screen covers the full fog range), but in an average case (half of the
screen is the ground plane or near objects), it could cut the algorithm cost by
30–50% by providing a significant reduction of both used bandwidth and ALU
operations. It could also be used for better 3D light culling like in [Olsson et
al. 12]. We didn’t have hierarchical Z-buffer information available in our engine,
and computing it would add some fixed cost, so we didn’t try this optimization.
On the other hand, relying on the depth buffer would mean that asynchronous
compute optimization could not be applied (unless one has a depth prepass).
Therefore, it is a tradeoff and its practical usage depends on the used engine,
target platforms, and whole rendering pipeline.
With 2D reprojection, only one layer of objects is stored (the depth buffer acts like a height field; we have
no information for objects behind it). Therefore, when reprojecting a dynamic
2D scene, occlusion artifacts are inevitable and there is a need to reconstruct
information for pixels that were not present in the previous frame (Figure 3.14).
In the case of volumetric reprojection, it is much easier, as we store informa-
tion for whole 3D viewing frustum in volumetric textures, as well as for the space
behind the shaded objects. Therefore, there are only two cases of improper data
after volumetric reprojection:
1. data for space that was occupied by objects that moved away as shading
changes,
2. data outside of the volume range.
We can see how much easier the reprojection is in a 3D case in Figure 3.15.
Reprojection itself stabilizes some motion flickering artifacts but isn’t the
solution for increasing image quality for a static scene or camera. A common approach is temporal super-sampling: jittering the sampling positions every frame and accumulating the results over time (Figure 3.16).
Figure 3.15. Volumetric reprojection (top view of the whole view volume).
Figure 3.16. Fixing under-sampling and staircase artifacts in volumetric fog without
(left) and with (right) temporal jittering and super-sampling.
3.5 Acknowledgments
I would like to thank the whole Assassin's Creed 4 team of rendering program-
technical art directors, and lighting artists for inspiring ideas and talks about
the algorithm and its optimizations. Special thanks go to colleagues at Ubisoft
Montreal who were working on similar topics for other games and shared their
code and great ideas—Ulrich Haar, Stephen Hill, Lionel Berenguier, Typhaine Le
Gallo, and Alexandre Lahaise.
Bibliography
[Annen et al. 07] Thomas Annen, Tom Mertens, Philippe Bekaert, Hans-Peter
Seidel, and Jan Kautz. “Convolution Shadow Maps.” In Proceedings of the
18th Eurographics conference on Rendering Techniques, pp. 51–60. Aire-la-
Ville, Switzerland: Eurographics Association, 2007.
[Bouthors et al. 06] Antoine Bouthors, Fabrice Neyret, and Sylvain Lefebvre.
“Real-Time Realistic Illumination and Shading of Stratiform Clouds.” Pre-
sented at Eurographics, Vienna, Austria, September 4–8, 2006.
[Bunnell and Pellacini 04] Michael Bunnell and Fabio Pellacini. “Shadow Map
Antialiasing.” In GPU Gems, edited by Randima Fernando, Chapter 11.
Reading, MA: Addison-Wesley Professional, 2004.
[Delalandre et al. 11] Cyril Delalandre, Pascal Gautron, Jean-Eudes Marvie, and
Guillaume François. “Transmittance Function Mapping.” Presented at Sym-
posium on Interactive 3D Graphics and Games, San Francisco, CA, February
18–20, 2011.
[Green 03] Robin Green. “Spherical Harmonic Lighting: The Gritty Details.”
Presented at Game Developers Conference, San Jose, CA, March 4–8, 2003.
[Green 05] Simon Green. “Implementing Improved Perlin Noise.” In GPU Gems
2, edited by Matt Farr, pp. 409–416. Reading, MA: Addison-Wesley Profes-
sional, 2005.
[Harada 12] Takahiro Harada, Jay McKee, and Jason C. Yang. "Forward+:
Bringing Deferred Lighting to the Next Level.” Presented at Eurographics,
Cagliari, Italy, May 13–18, 2012.
[Hill and Collin 11] Stephen Hill and Daniel Collin. “Practical, Dynamic Visibil-
ity for Games." In GPU Pro 2: Advanced Rendering Techniques, edited by
Wolfgang Engel, pp. 329–347. Natick, MA: A K Peters, 2011.
[Huelin et al. 15] John Huelin, Benjamin Rouveyrol, and Bartlomiej Wroński,
“Deferred Normalized Irradiance Probes.” In GPU Pro 6: Advanced Ren-
dering Techniques, edited by Wolfgang Engel, pp. 195–215. Boca Raton, FL:
CRC Press, 2015.
[Jansen and Bavoil 10] Jon Jansen and Louis Bavoil. “Fourier Opacity Map-
ping.” Presented at Symposium on Interactive 3D Graphics and Games,
Bethesda, MD, February 19–20, 2010.
[Olsson et al. 12] Ola Olsson, Markus Billeter, and Ulf Assarsson. “Clustered De-
ferred and Forward Shading.” In Proceedings of the Nineteenth Eurographics
Conference on Rendering, pp. 87-96. Aire-la-Ville, Switzerland: Eurograph-
ics Association, 2012.
[Perlin 02] Ken Perlin. “Improving Noise.” ACM Trans. Graphics 21:3 (2002),
681–682.
[Pharr and Humphreys 10] Matt Pharr and Greg Humphreys. Physically Based
Rendering: From Theory to Implementation, Second Edition. San Francisco:
Morgan Kaufmann, 2010.
[Sloan 08] Peter-Pike Sloan. “Stupid Spherical Harmonics (SH) Tricks.” Pre-
sented at Game Developers Conference, San Francisco, CA, February 18–22,
2008.
[Sousa 08] Tiago Sousa. “Crysis Next Gen Effects.” Presented at Game Devel-
opers Conference, San Francisco, CA, February 18–22, 2008.
[St-Amour 13] Jean-Francois St-Amour. “Rendering of Assassin’s Creed 3.” Pre-
sented at Game Developers Conference, San Francisco, CA, March 5–9, 2012.
[Tóth and Umenhoffer 09] Balázs Tóth and Tamás Umenhoffer. "Real-Time Volu-
metric Lighting in Participating Media.” Presented at Eurographics, Munich,
Germany, March 30–April 3, 2009.
[Valient 14] Michal Valient. “Taking Killzone Shadow Fall Image Quality into
the Next Generation.” Presented at Game Developers Conference, San Fran-
cisco, CA, March 17–21, 2014.
[Vos 14] Nathan Vos. “Volumetric Light Effects in Killzone: Shadow Fall.” In
GPU Pro 5: Advanced Rendering Techniques, edited by Wolfgang Engel,
pp. 127–148. Boca Raton, FL: CRC Press, 2014.
[Wrenninge et al. 10] Magnus Wrenninge, Nafees Bin Zafar, Jeff Clifford, Gavin
Graham, Devon Penney, Janne Kontkanen, Jerry Tessendorf, and Andrew
Clinton. “Volumetric Methods in Visual Effects.” SIGGRAPH Course, Los
Angeles, CA, July 25–29, 2010.
4.1 Introduction
As the quality and complexity of modern real-time lighting has steadily evolved,
increasingly more and more advanced and optimal methods are required in order
to hit performance targets. It is not merely enough nowadays to have a static
ambient term or simple cube-map reflections to simulate indirect light. The
environment needs to have lighting that fully matches the surroundings. The
shading needs to not only handle and properly process direct lighting coming
from the light source, but also lighting that bounces around the environment.
Lighting received by a surface needs to be properly reflected toward the camera
position as well. By generating and processing our lighting information entirely
on the GPU, we were able to achieve dynamic, physically based environment
lighting while staying well within our performance targets.
When we started working on FIFA 15, we decided that we require a physically
based system that can dynamically update indirect lighting for the players on the
pitch at runtime. The main goal was to generate the lighting information for the
pitch at level load time. Because FIFA has a playable loading screen, there are significant latency and performance constraints on these lighting computations.
When a player waits for the match to start, the system cannot cause any frame drops or stuttering. This means that each step of the light-generation
procedure needs to complete within a few milliseconds so we can completely
render the rest of the frame. The second goal was to give the artist the ability to iterate on the lighting conditions without waiting for a pass of the content pipeline to provide the relevant updates in lighting information. Under our approach,
each time the artist would change a light direction, color value, or sky texture,
he or she would immediately see an updated scene with the proper lighting.
Finally, our technique also allowed us to include many area lights directly into
the precalculated lighting information.
In computer graphics, there are different methods to solve the rendering equa-
tion (e.g., path tracing and photon mapping), all of which require the tracing of
many rays and performing heavy computations. This is simply not an option for
games and real-time 3D graphics. So, instead of computing lighting every frame
for every single shading point, we preintegrate it for some base variables and use
the results later. Such precomputation should give us the quality we require with
the real-time performance we need.
The composition of the preintegrated lighting information for a given position
in space is commonly called a light probe. (See Figure 4.1.) Again, we introduce a
separation. We define two light probes for both parts of the integral in Equation
(4.2): diffuse light probes and specular light probes.
during level load corresponding to our dynamic lighting conditions for each game
match. When rendering into the cube map is complete, we run the preintegration
step that will be described in the Sections 4.3.3, 4.3.4, and 4.3.5. After the
preintegration step is done for one probe, we move to the next light probe and
repeat the process. This process can incidentally increase the loading time of a
level because we cannot render dynamic objects using the light probes without
the completion of the preintegration step. It was thus important to make this
process as performant as possible without making significant quality tradeoffs.
After a cube map gets generated, we need to solve the rendering integral in
Equation (4.2). One well-known tool to generate the probes themselves is called
CubeMapGen [Lagarde 12]. This tool can be used in the pipeline to generate the
lighting information from an environment. It is open source, so it can be modified
if need be. However, this tool uses the CPU to prefilter specular cube maps
and takes a significant amount of time to process a high resolution environment
map.
Because our goal was to generate the light probes in runtime during the level
loading and we had graphics cycles to spare, a GPU solution appeared more
favorable.
$$\text{Lambertian BRDF} = \frac{1}{\pi}, \qquad (4.3)$$
$$\int_\Omega \mathrm{brdfD}(\omega_i, \omega_o) \times L_i(\omega_i) \times (\omega_i \cdot n)\, d\omega_i = \int_\Omega \frac{1}{\pi} \times L_i(\omega_i) \times (\omega_i \cdot n)\, d\omega_i.$$
The integral in Equation (4.3) depends on two vectors: normal and light
direction. While the normal is constant per shading point, the incoming light
(Li ) varies across the hemisphere. We treat each pixel in a cube map as a light
source. Because the diffuse BRDF does not depend on the view direction, we
integrate the rendering equation for every possible normal direction. We do this
by integrating and projecting the rendering equation onto spherical harmonic
coefficients [Ramamoorthi and Hanrahan 01] in real time using the GPU [King
05]. This method allows us to preintegrate the diffuse part of the integral in
0.5 ms on a GeForce GTX 760.
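As a rough illustration of this kind of GPU-friendly projection (not the exact FIFA 15 implementation), the nine SH coefficients of an environment map can be accumulated from a set of uniformly distributed sample directions; SampleEnvironment() and the sample set are assumptions:

#define SAMPLE_COUNT 256

// Sketch: Monte Carlo projection of an environment map onto nine SH
// coefficients (bands 0-2), in the spirit of [Ramamoorthi and Hanrahan 01].
void ProjectOntoSH9(in float3 directions[SAMPLE_COUNT], out float3 sh[9])
{
    for (int i = 0; i < 9; ++i)
        sh[i] = 0.0f;

    for (int s = 0; s < SAMPLE_COUNT; ++s)
    {
        float3 d = directions[s];
        float3 radiance = SampleEnvironment(d);

        // Real SH basis functions for bands 0, 1, and 2
        float basis[9];
        basis[0] = 0.282095f;
        basis[1] = 0.488603f * d.y;
        basis[2] = 0.488603f * d.z;
        basis[3] = 0.488603f * d.x;
        basis[4] = 1.092548f * d.x * d.y;
        basis[5] = 1.092548f * d.y * d.z;
        basis[6] = 0.315392f * (3.0f * d.z * d.z - 1.0f);
        basis[7] = 1.092548f * d.x * d.z;
        basis[8] = 0.546274f * (d.x * d.x - d.y * d.y);

        for (int i = 0; i < 9; ++i)
            sh[i] += radiance * basis[i];
    }

    // With uniform sphere sampling, each sample carries a solid angle
    // of 4*PI / SAMPLE_COUNT.
    for (int i = 0; i < 9; ++i)
        sh[i] *= 4.0f * 3.14159265f / SAMPLE_COUNT;
}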
Spherical harmonics and their usage in real-time 3D graphics is out of the
scope of this chapter. For more information, we recommend reading the great
article from Peter-Pike Sloan: “Stupid Spherical Harmonics (SH) Tricks”
[Sloan 08].
$$\text{Cook-Torrance specular BRDF} = \frac{D \times F \times G}{4 \times (V \cdot N) \times (N \cdot L)}, \qquad (4.4)$$
$$\text{GGX } D(H) = \frac{a^2}{\pi\left(\cos(\theta_H)^2 \times (a^2 - 1) + 1\right)^2}, \qquad (4.5)$$
$$\mathrm{brdfS}(\omega_i, \omega_o) = \frac{D}{4 \times (N \cdot L)}, \qquad (4.6)$$
$$\int_\Omega \mathrm{brdfS}(\omega_i, \omega_o) \times L_i(\omega_i) \times (\omega_i \cdot n)\, d\omega_i = \int_\Omega \frac{D}{4 \times (N \cdot L)} \times L_i(\omega_i) \times (\omega_i \cdot n)\, d\omega_i.$$
Figure 4.2. (a) When the function is regular, the Monte Carlo integration works well
with a small number of samples. (b) When the function is irregular, it gets harder to
estimate. (c) Importance sampling focuses on the difficult areas and gives us a better
approximation.
Monte Carlo importance sampling. To solve the integral from Equation (4.6), we
use the Monte Carlo importance sampling method [Hammersley and Handscomb
64] shown in Equation (4.7):
$$\int_\Omega f(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \frac{f(X_i)}{p(X_i)}, \qquad (4.7)$$
BRDF importance sampling. Using the BRDF shape as a PDF can result in a
sample distribution that matches the integrand well. For example, with a mirror-
like surface, it would make sense to focus on the directions around the reflection
direction (Figure 4.3) as this would be the area where most of the visible light
rays originate from.
In order to match the specular BRDF shape closely, we build the PDF based
on the distribution function D and the cosine between the half vector H and the
Figure 4.3. Illustration of the BRDF importance sampling. Most of the samples get
generated toward the reflection vector, where the specular BRDF commonly has higher
values.
normal N (Equation (4.8)) [Burley 12]. This is because D has the most effect
on the BRDF’s shape. The multiplication by the cosine term will help in further
calculations:
PDF(H) = D(H) × cos(θH ). (4.8)
$$\mathrm{PDF}(L) = \frac{\mathrm{PDF}(H)}{4\cos(\theta_H)}.$$
From Equation (4.8) and Equation (4.5), we can see that the PDF(H) does not
depend on the angle φ. So we can simply derive [Pharr and Humphreys 04] that
PDF(φ) becomes constant with a value of $\frac{1}{2\pi}$:
$$\mathrm{PDF}(\phi) = \frac{1}{2\pi}.$$
Figure 4.4. The illustration of the correlation between PDF and CDF. (a) The sample
with the higher PDF value has more space on the CDF. (b) The inverse CDF maps
the uniform distributed values to the samples. A given value on the [0 : 1] interval has
higher chance to get mapped to the sample S1.
Therefore,
$$\int_\Omega \mathrm{brdfS}(\omega_i, \omega_o) \times L_i(\omega_i) \times (\omega_i \cdot n)\, d\omega_i \approx \frac{1}{N}\sum_{i=1}^{N} L_i(\omega_i), \qquad (4.9)$$
where N is the number of samples, $\omega_i$ is the sampling light direction, and $L_i(\omega_i)$ is the sampling color in direction $\omega_i$; the BRDF and cosine terms cancel against the PDF of the generated samples.
The PDF only gives us the probability of a certain direction x. What we
actually require is the inverse; we need to be able to generate samples based on
a given probability. We start by computing a cumulative distribution function
(CDF) for our PDF [Papoulis 84, pp. 92–94] (Equation (4.10)). For a value x,
the CDF defines the uniformly distributed value ε on the [0 : 1] interval in a
proportion to PDF(x) [Papoulis 84, Pharr and Humphreys 04] (Figure 4.4(a)).
While the CDF has a uniform unit probability distribution, it is actually the
opposite of what we desire. To solve our problem, we simply need to calculate
the inverse CDF (Figure 4.4(b)).
The following equations show how to calculate the CDFs (CDF(φ) and CDF(θ))
for the PDF function derived from the original specular BRDF based on the for-
mal definition (Equation (4.10)) of the CDF:
$$\mathrm{CDF}(X) = \int \mathrm{PDF}(x)\,dx, \qquad (4.10)$$
$$\mathrm{CDF}(\phi) = \int_0^\phi \frac{1}{2\pi}\,dx = \frac{1}{2\pi}\phi,$$
$$\mathrm{CDF}(\theta) = \int_q^1 \frac{2 \times a^2 \times x}{\left(x^2 \times (a^2 - 1) + 1\right)^2}\,dx = \frac{1 - q^2}{1 + q^2(a^2 - 1)},$$
where $q = \cos(\theta_H)$.
We now invert our CDF functions to produce mappings from uniform values $\varepsilon_1$ and $\varepsilon_2$ to the angles $\phi$ and $\theta$, respectively:
$$\phi = 2\pi\,\varepsilon_1, \qquad (4.11)$$
$$\theta = \cos^{-1}\sqrt{\frac{1-\varepsilon_2}{1+(a^2-1)\,\varepsilon_2}}. \qquad (4.12)$$
Finally, we can now generate a direction $(\phi, \theta)$ based on Equations (4.11) and (4.12) from uniformly distributed random values $(\varepsilon_1, \varepsilon_2)$ in $[0:1]$.
Putting all this together we get the code in Listing 4.1.
// e1, e2 are a pair of random values
// Roughness is the current roughness we are integrating for
// N is the normal
float3 ImportanceSampleGGX(float e1, float e2, float Roughness,
    float3 N)
{
    float a = Roughness * Roughness;
    // Map the random values to angles using the inverse CDFs
    // (Equations (4.11) and (4.12))
    float phi = 2.0f * PI * e1;
    float cos_theta = sqrt((1.0f - e2) / (1.0f + (a * a - 1.0f) * e2));
    float sin_theta = sqrt(1.0f - cos_theta * cos_theta);
    // Build a half vector
    float3 H;
    H.x = sin_theta * cos(phi);
    H.y = sin_theta * sin(phi);
    H.z = cos_theta;
    // Transform the vector from tangent space to world space
    float3 up = abs(N.z) < 0.999f ? float3(0, 0, 1) : float3(1, 0, 0);
    float3 right = normalize(cross(up, N));
    float3 forward = cross(N, right);
    return right * H.x + forward * H.y + N * H.z;
}

// Per sample: the light direction L is the view vector V reflected
// around the sampled half vector H
float3 H = ImportanceSampleGGX(e1, e2, Roughness, N);
float3 L = 2.0f * dot(V, H) * H - V;
float NoL = saturate(dot(N, L));
float3 color = 0;
// we skip the samples that are not in the same hemisphere
// with the normal
if (NoL > 0)
{
    // Sample the cube map in the direction L
    color += SampleTex(L).rgb;
}
Figure 4.5. (a) The preintegrated specular BRDF for different roughness with 1024
samples per pixel. (b) The ground truth integration using 100,000 samples without the
importance sampling.
The main problem with BRDF importance sampling (and importance sam-
pling in general) is that a large number of samples are needed in order to reduce
noise and get a smooth image (Figure 4.6). This problem gets even worse when
Figure 4.7. Dark environment map with few bright light sources using 1024 samples.
there are high-frequency details in the environment map (Figure 4.7). Some of
our nighttime environments have area lights surrounded by dark regions (which
introduces a lot of high-frequency details). Having such noisy prefiltered maps is
a big issue. We needed some additional methods to help resolve this issue.
Figure 4.8. Source environment (left) and ground truth (right), with roughness 0.25
BRDF IS using 128 samples (middle left) and roughness 0.25 BRDF IS using 128 samples
with prefiltering (middle right).
Figure 4.9. Error heat map of the final result using BRDF importance sampling with
prefiltering: (a) a result using 128 samples, and (b) a result using 1024 samples.
Prefiltering the environment map solves most of the problems with noise. We
found that it works well for daytime environments, where the energy is relatively
similar in the local pixel neighborhood. However, for nighttime, although there is
no noise in the result, the error is still higher due to the extremely high frequency
details (Figure 4.8) that get excessively blurred. For example, a low-probability
sample might get a lower energy value than it would have gotten in the ground
truth (Figure 4.9).
We are thus faced with a problem. On one hand, if we don’t prefilter the
environment map, the result is too noisy. On the other hand, prefiltering produces
high error with a low number of samples. So we added another technique for the
preintegration of probes with high roughness values.
Generating samples based purely on the BRDF distribution is not always the best
strategy. For example, consider the case of a dark room with very few bright light
sources (or in FIFA 15 ’s case, a nighttime stadium with small but bright area
light sources). Sampling based on the BRDF distribution might generate many
samples that miss the light sources. This will create variance when the samples do
hit the light source (especially if the samples had low probability). In that case, it
would have been preferable to instead generate samples that tend to point toward
light sources (i.e., pixels with high energy values). Environment map importance
sampling [Colbert et al. 10] allows us to achieve exactly this. We use environment
map importance sampling to focus the sample generation on areas with higher
intensity (Figure 4.10).
First, we reduce the number of dimensions that we are working with to sim-
plify calculations. Cube-map texture sampling is based on a 3D vector, yet it
really only has a 2D dependency. We instead use spherical surface coordinates
to represent a direction. We also need to map our sphere to a linear rectangular
texture. In order to do that, we simply stack each cube map face one after the
other (Figure 4.11).
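To make the face-stacking mapping concrete, the sketch below converts an unwrapped (u, v) coordinate back into a cube-map direction, assuming the six faces are stacked along the u axis. The face order and orientations roughly follow common D3D conventions and are only an illustration; the chapter's actual uvToVector helper (used in Listing 4.5) may differ.

```hlsl
// One possible unwrapping: six cube-map faces stacked along u, so that
// u in [i/6, (i+1)/6) addresses face i. Face order/orientation assumed.
float3 uvToVectorSketch( float2 uv )
{
  uint face = (uint) min( floor( uv.x * 6.0 ), 5.0 );
  // local coordinates within the face, remapped to [-1, 1]
  float2 st = float2( frac( uv.x * 6.0 ), uv.y ) * 2.0 - 1.0;
  float3 dir;
  if      ( face == 0 ) dir = float3(  1.0, -st.y, -st.x );  // +X
  else if ( face == 1 ) dir = float3( -1.0, -st.y,  st.x );  // -X
  else if ( face == 2 ) dir = float3(  st.x,  1.0,  st.y );  // +Y
  else if ( face == 3 ) dir = float3(  st.x, -1.0, -st.y );  // -Y
  else if ( face == 4 ) dir = float3(  st.x, -st.y,  1.0 );  // +Z
  else                  dir = float3( -st.x, -st.y, -1.0 );  // -Z
  return normalize( dir );
}
```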
In order to generate sample directions with proper probabilities, we need to
define the PDF, CDF, and inverse CDF (similarly to the BRDF importance
sampling). However, in this case, because the environment map is not analytical,
we need to work with discrete versions of these functions.
We start with the PDF. We simply use the luminosity of each pixel as a basis
for generating the PDF. This allows us to catch the “brightest” pixels in the
image. We also need to define two types of PDFs: marginal and conditional
(Figure 4.12). We use the marginal PDF to find which row of pixels we will
sample from. The sum of the PDF for a given row is the probability that a
random sample will fall within that row; this is the marginal PDF. Then we use
the conditional PDF of this row to find which column the sample falls into. The
Figure 4.12. The structure of the marginal and conditional PDFs and CDFs. The
conditional PDF and CDF are unique for each row and are represented as a 1D array
for each row. However, there is only one marginal PDF and one marginal CDF for the
image, which are also represented as 1D arrays.
conditional and marginal PDFs can be calculated using the following equations:
$$\text{conditional PDF}(i, j) = \text{luminance}(i, j), \tag{4.13}$$

$$\text{marginal PDF}_j = \sum_{i=0}^{n} \text{luminance}(i, j). \tag{4.14}$$
For each type of PDF we define, there is a corresponding CDF: marginal and
conditional CDFs. When a PDF is purely discrete, the CDF can be calculated as
the sum of the PDF values from 0 to m for each location m [Pharr and Humphreys
04]:
$$\mathrm{CDF}_m = \sum_{k=0}^{m} \mathrm{PDF}_k. \tag{4.15}$$
The function that represents the summation of the rows’ probabilities is the
row-wise CDF for the image as a whole; this is the marginal CDF (Figure 4.12).
The individual row PDFs are unique and each also has its own column-wise CDF,
which is called the conditional CDF (Figure 4.12).
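To make this structure concrete, the following sketch builds the marginal and conditional PDFs and CDFs from the unwrapped environment map with straightforward serial loops. It is an illustration only; the output buffers and the Luminance helper are assumed names (only unwrapTex appears in the chapter's listings), and in practice the loops can run on the CPU or in a compute shader.

```hlsl
// Assumed resources (names are placeholders).
Texture2D<float4> unwrapTex;                 // unwrapped environment map
RWStructuredBuffer<float> conditionalPDF;    // width * height entries
RWStructuredBuffer<float> conditionalCDF;    // width * height entries
RWStructuredBuffer<float> marginalPDF;       // height entries
RWStructuredBuffer<float> marginalCDF;       // height entries

float Luminance( float3 c ) { return dot( c, float3( 0.2126, 0.7152, 0.0722 ) ); }

void BuildSamplingTables( uint width, uint height )
{
  // Equations (4.13)/(4.14): per-pixel luminance and per-row sums.
  for ( uint j = 0; j < height; ++j )
  {
    float rowSum = 0;
    for ( uint i = 0; i < width; ++i )
    {
      float lum = Luminance( unwrapTex.Load( int3( i, j, 0 ) ).rgb );
      conditionalPDF[ j * width + i ] = lum;
      rowSum += lum;
    }
    marginalPDF[ j ] = rowSum;

    // Equation (4.15): the conditional CDF is the running sum along the
    // row, normalized here so the last entry is 1.
    float running = 0;
    for ( uint i = 0; i < width; ++i )
    {
      running += conditionalPDF[ j * width + i ];
      conditionalCDF[ j * width + i ] = running / max( rowSum, 1e-6 );
    }
  }
  // The marginal CDF is the running sum over the rows, normalized likewise.
  float total = 0;
  for ( uint j = 0; j < height; ++j ) total += marginalPDF[ j ];
  float running = 0;
  for ( uint j = 0; j < height; ++j )
  {
    running += marginalPDF[ j ];
    marginalCDF[ j ] = running / max( total, 1e-6 );
  }
}
```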
The simple example in Figure 4.13 demonstrates the behavior of the discrete
CDF. Samples with high probabilities get mapped to a wide range on the Y axis.
For example, if we randomly choose a [0 : 1] value on the Y axis, the third sample
will be picked with a probability of 0.7.
The inverse CDF is thus simply a mapping between a random [0 : 1] value and
its corresponding sample. Since by definition the CDF is a sorted array (Equation
(4.15)), we can use a binary search to find the corresponding sample’s index. In
short, the algorithm can be described as follows:

1. Generate two uniformly distributed random values ε1 and ε2 in [0 : 1].

2. Binary-search the marginal CDF with ε1 to find the row the sample falls into.

3. Binary-search that row's conditional CDF with ε2 to find the column.

4. The resulting (row, column) is the generated sample position; its probability is
given by the corresponding marginal and conditional PDF values.
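A minimal sketch of that lookup is shown below, assuming normalized marginal and conditional CDF arrays; the buffer names and the BinarySearchCDF helper are assumptions rather than the chapter's actual code.

```hlsl
// Returns the first index whose CDF value is >= value.
uint BinarySearchCDF( StructuredBuffer<float> cdf, uint offset, uint count,
                      float value )
{
  uint lo = 0;
  uint hi = count - 1;
  while ( lo < hi )
  {
    uint mid = ( lo + hi ) / 2;
    if ( cdf[ offset + mid ] < value ) lo = mid + 1;
    else                               hi = mid;
  }
  return lo;
}

// Maps two uniform random values to a texel of the unwrapped map.
uint2 GenerateEnvSample( StructuredBuffer<float> marginalCDF,
                         StructuredBuffer<float> conditionalCDF,
                         uint width, uint height, float e1, float e2 )
{
  uint row    = BinarySearchCDF( marginalCDF, 0, height, e1 );
  uint column = BinarySearchCDF( conditionalCDF, row * width, width, e2 );
  return uint2( column, row );
}
```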
Figure 4.13. A discrete PDF over four samples (left) and the corresponding CDF (right).
// Get the PDF values of the sample for the further calculation
// of the integral on the GPU
float pdfRow = marginalPDF[ row ];
float pdfColumn = conditionalPDF[ row ][ column ];
Figure 4.14. (a) Random samples might produce bad coverage. (b) Stratified sampling
guarantees at least one sample in equally distributed areas.

By stratifying the samples, we guarantee that we have at least one sample in equally
distributed areas. This reduces the probability of sample “clumping” around a
specific location.
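As a small illustration of the idea, stratification can be applied directly to the uniform inputs before the CDF lookup. This is only a sketch, assuming one stratum per sample in an n × n grid, with rand01a/rand01b standing in for whatever random-number source is used.

```hlsl
// Stratified random pair for sample i out of an n x n grid of strata
// (sampleCount = n * n): each sample is jittered within its own cell.
float2 StratifiedRandom( uint i, uint n, float rand01a, float rand01b )
{
  uint cellX = i % n;
  uint cellY = i / n;
  float e1 = ( cellX + rand01a ) / n;
  float e2 = ( cellY + rand01b ) / n;
  return float2( e1, e2 );
}
```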
We then have an array of samples that we pass on to the GPU. The GPU
receives flattened (u, v) coordinates. In order to use those samples, we have to
first convert (u, v) coordinates to direction vectors, and transform the PDF from
the (u, v) distribution to a distribution over a solid angle [Pharr and Humphreys
04]. The PDF conversion can be derived from the environment map unwrapping
where we have six cube-map faces in a row and each face has the field of view
equal to π/2:

$$\text{solid angle PDF}_u = \text{PDF}_u \times 6 \times \frac{\pi}{2},$$

$$\text{solid angle PDF}_v = \text{PDF}_v \times \frac{\pi}{2}.$$
The final GPU code for the environment map importance sampling is shown
in Listing 4.5.
// Calculate the outgoing radiance for the sample direction L
float3 envMapSample( float Roughness, float2 uv, float3 L,
                     float3 N, float pdfV, float pdfU )
{
  // Cosine weight
  float NoL = saturate( dot( N, normalize( L ) ) );
  float3 color = unwrapTex.Load( int3( uv.xy, 0 ) ).rgb;
  float3 V = N;
  float3 H = normalize( L + V );
  float D = GGX( Roughness, H, N );
  float brdf = D / ( 4 * NoL );
  // Calculate the solid angle
  // dA (area of cube) = (6*2*2)/N^2
  // N is a face size
  // dw = dA / r^3 = dA * pow( x*x + y*y + z*z, -1.5 )
  float dw = ( 6 * 4.0 / ( CUBEMAP_SIZE * CUBEMAP_SIZE ) )
             * pow( L.x * L.x + L.y * L.y + L.z * L.z, -1.5 );
  // Monte Carlo estimator for this sample: radiance * BRDF * cosine
  // weight * solid angle, divided by the sample probability
  // (assumed combination of terms; the exact expression may differ)
  return color * brdf * NoL * dw / ( pdfU * pdfV );
}

// Fetch the i-th pregenerated sample and evaluate it
// (assumed signature and sample-buffer names, matching the SampleENV
// calls in the listings below)
float3 SampleENV( int i, float Roughness, float3 N, float3 V )
{
  float2 uv  = envSamplesUV[ i ];
  float pdfU = envSamplesPDF[ i ].x;
  float pdfV = envSamplesPDF[ i ].y;
  // Convert the uv sample position to a vector. We need this to
  // calculate the BRDF
  float3 L = normalize( uvToVector( uv ) );
  // Sample the light coming from the direction L
  // and calculate the specular BRDF for this direction
  float3 envIS = envMapSample( Roughness, uv, L, N, pdfV, pdfU );
  return envIS;
}
f l o a t 3 PreintegrateSpecularLightProbe( f l o a t Roughness ,
int numENVSamples , flo at 3 R )
{
// For t h e p r e i n t e g r a t i o n , we assume t h a t N=V=R
float3 N = R ;
float3 V = R ;
float3 finalColor = 0;
// Sample a l l o f t h e p r e g e n e r a t e d s a m p l e s
f o r ( i n t i = 0 ; i < n u m E N V S a m p l e s ; i++)
{
f i n a l C o l o r += S a m p l e E N V ( i , R o u g h n e s s , N , V ) ;
}
// The f i n a l c o l o r n e e d s t o be d i v i d e d by t h e number o f s a m p l e s
// b a se d on t h e Monte C a r l o i m p o r t a n c e s a m p l i n g d e f i n i t i o n
f i n a l C o l o r /= n u m E N V S a m p l e s ;
return finalColor ;
}
Figure 4.15. Environment map importance sampling error for the roughness value 1
using 128 samples at the nighttime-specific lighting condition.
Figure 4.16. Specular light probe using environment map importance sampling. The
first probe is black because it is a perfect mirror and none of the samples hit the reflection
ray exactly.
Furthermore, we only use environment map importance sampling when the roughness
is greater than 0.7. With lower roughness values, BRDF importance sampling
worked well alone. Figure 4.17 shows the number of samples for both methods
with different roughness values.
Listing 4.6 demonstrates the final preintegration function that uses the com-
bination of both methods.
Using the combined importance sampling gives us the required quality result
within less than 2 ms of GPU time. (See Figure 4.18.)
4.4 Conclusion
Implementing combined importance sampling on the GPU gave us the ability to
generate and prefilter the light probes during level loading. Each probe takes less
than 2 ms to preintegrate. (This time does not include the time it takes to gen-
erate the environment map itself.) However, we split the light probe generation
process across multiple frames in order to prevent frame drops.
Using environment importance sampling helps to reduce preintegration error
in nighttime situations with small and bright area lights. It also helped keep the
number of samples low in order to stay within performance restrictions. However,
we found that BRDF importance sampling works well for the majority of cases.
It is only during our specific case of nighttime lighting that BRDF importance
sampling (with prefiltering) alone was not enough.
// Final preintegration that combines both sampling methods
// (assumed signature and final combination; they did not survive in the
// original listing as reproduced here)
float3 PreintegrateCombinedLightProbe( float Roughness, int numENVSamples,
                                       int numBRDFSamples, float3 R )
{
  float3 N = R;
  float3 V = R;
  float3 finalColor = 0;
  float3 envColor = 0;
  float3 brdfColor = 0;
  // Solve the integral using environment importance sampling
  for ( int i = 0; i < numENVSamples; i++ )
  {
    envColor += SampleENV( i, Roughness, N, V );
  }
  // Solve the integral using BRDF importance sampling
  for ( int i = 0; i < numBRDFSamples; i++ )
  {
    // Generate the uniformly distributed random values using the
    // Hammersley quasirandom low-discrepancy sequence (Listing 4.2)
    float2 e1e2 = Hammersley( i, numBRDFSamples );
    brdfColor += SampleBRDF( e1e2.x, e1e2.y, Roughness, N, V );
  }
  // Divide each result by the number of samples used to compute it
  if ( numENVSamples > 0 )  envColor /= numENVSamples;
  if ( numBRDFSamples > 0 ) brdfColor /= numBRDFSamples;
  // Combine the two estimates (assumed: the per-roughness sample budgets
  // are chosen so that only one method, or a weighted mix, contributes)
  finalColor = envColor + brdfColor;
  return finalColor;
}
One positive side effect of having fast probe generation is quick feedback to
the artist. The artist is able to iterate on the lighting setup and see the results
almost instantaneously.
For future work, we would like to further optimize the shader code for specular
probe generation. This would allow us to place even more probes in the level
without affecting loading times.
4.5 Acknowledgments
I would like to express my gratitude to the people who supported me and proof-
read this chapter: Peter McNeeley of EA Canada and Ramy El Garawany of
Naughty Dog.
I would also like to thank the editors Michal Valient of Guerrilla Games and
Wolfgang Engel of Confetti FX.
Bibliography
[Burley 12] B. Burley. “Physically-Based Shading at Disney.” Practical Physi-
cally Based Shading in Film and Game Production, SIGGRAPH Course,
Los Angeles, CA, August 8, 2012.
[Colbert et al. 10] Mark Colbert, Simon Premože, and Guillaume François. “Im-
portance Sampling for Production Rendering.” SIGGRAPH Course, Los An-
geles, CA, July 25–29, 2010.
[Cook and Torrance 81] R. Cook and K. Torrance. “A Reflectance Model for
Computer Graphics.” Computer Graphics: Siggraph 1981 Proceedings 15:3
(1981), 301–316.
[Drobot 14] Michal Drobot. “Physically Based Area Lights.” In GPU Pro 5: Ad-
vanced Rendering Techniques, edited by Wolfgang Engel, pp. 67–100. Boca
Raton, FL: CRC Press, 2014.
[Křivánek and Colbert 08] Jaroslav Křivánek and Mark Colbert. “Real-Time
Shading with Filtered Importance Sampling.” Comp. Graph. Forum: Proc.
of EGSR 27:4 (2008), 1147–1154.
[Karis 13] Brian Karis. “Real Shading in Unreal Engine 4.” SIGGRAPH Course,
Anaheim, CA, July 21–25, 2013.
5. Real-Time Global Illumination Using Slices

5.1 Introduction
In this chapter, we’ll present a method for implementing real-time single-bounce
global illumination.
In recent years, several practical real-time global illumination techniques have
been demonstrated. These have all been built on voxel-based scene databases.
The common theme of all these approaches is to initialize the data structure
using the lit scene geometry. Then, a propagation or blurring step is applied, and
after that the structure is ready for irradiance or reflection queries.
The Light Propagation Volumes (LPV) method [Kaplanyan and Dachsbacher 10]
uses a voxel array, where each voxel contains a first-order spherical
harmonic representation of the irradiance. The array is initialized using reflective
shadow maps; the propagation step is to iteratively transfer irradiance from each
cell to its neighbors.
The voxel octrees algorithm [Crassin et al. 2011] converts the scene to an
octree representation, where each leaf holds radiance. Non-leaf nodes are calcu-
lated to have the average of child node colors. Sharp reflections are computed
by ray-tracing the octree and sampling the color from the leaf node hit; blurry
reflections and irradiance are found by sampling a parent node, whose generation
depends on blurriness.
The cascaded 3D volumes approach [Panteleev 2014] uses a sequence of 3D
volumes. They are all the same dimension, but each one’s side length doubles.
The algorithm is comparable to the octree approach, but it can be updated and
queried more efficiently.
Figure 5.2. Scene geometry with distorted cuboids (cells) fitted to it. The cuboid edges
are shown with red lines; dimmer lines indicate hidden edges.
Then, we’ll deal with the question of how to support multiple such surfaces, the
motivation for the distorted-cuboid approach, and how to set up the array of cells
to match the geometry for a given scene.
Second, the kernel scales up with distance from the plane. If we sample a
point k times farther from the plane, then the weighting function scales too:
The increase in texel size at each level corresponds to the increase in the
convolution kernel size, so the fidelity can be expected to be consistent for all
levels of the image pyramid.
This image pyramid can be implemented on the GPU by using a standard
mipmapped texture, sampled using trilinear filtering.
For a given distance d from the emissive plane, the mipmap parameter is
$$\text{mipmap parameter} = \log_2\!\left(\frac{d}{s}\right), \tag{5.3}$$

where s is the distance from the plane that corresponds to the first mipmap level.
Figure 5.3. Irradiance contribution of a flat plane: cross section of the contribution of
each point using the ideal irradiance integral and a paired-Gaussian approximation.
Figure 5.3 shows the comparison between the ideal and approximation.
The naive approach to generate the image pyramid is to consider each mipmap
in turn, and for the implied distance from the plane, calculate the Gaussian blur
radii, then blur and blend the radiance image correspondingly. Unfortunately,
this leads to prohibitively large tap counts for more distant images.
The solution we used is to generate a second image pyramid from the radi-
ance image as a preprocess—this image pyramid is like a mipchain; each layer
is constructed by down-sampling the previous layer by a factor of 2 using a box
filter.
Then, rather than blurring the full-resolution image, an appropriate mip level
is chosen as input to the aforementioned Gaussian blur. The standard deviation
of the Gaussian blur is specified in world units, so the number of taps will vary
depending which mip level is chosen—though obviously the quality will degrade
if the resolution and tap count are too low. This means the tap count for the
Gaussian blur can be controlled, and it’s possible to use even just 5–10 taps
without substantial quality loss.
1. Generate a standard mipmap chain for the texture map R. Each mipmap is
a half-size box-filtered down-sample of the previous. Let R′ be the resulting
image pyramid. It contains radiance in RGB and opacity in A. The opacity
value is not used for the irradiance pyramid generated here, but is required
for the cell-to-cell propagation step described later in Section 5.8.
2. Allocate the irradiance image pyramid I. It will have the same dimensions
and mipmap count as R′. Each mipmap corresponds to a certain distance
from the emissive plane, defined by Equation (5.2).
3. For each mip level m of the image pyramid I, we generate the image as
described by Equation (5.4).
For the current mip level, compute its distance from the emissive plane, and
find the standard deviations of the two Gaussian blurs as a distance in world
space. Find the two source mip levels to use as inputs for the blurs. (We found
that evaluating the Gaussian out to two standard deviations and using five taps
gave acceptable results.) Blur those two input images using the appropriate
Gaussian functions, rescale the results so they’re the same resolution, and blend
the resulting two images to build the mipmap for I.
Note that because the image pyramid I contains RGB only, these image-
processing steps can discard the alpha channel.
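The fragment below sketches the per-mip blur step just described, for one axis of the separable 5-tap Gaussian; the vertical pass is analogous, and the resource names, weights, and constant-buffer layout are assumptions chosen for this sketch rather than the chapter's implementation.

```hlsl
// One axis of the separable 5-tap Gaussian, evaluated on a chosen
// source mip of the box-filtered pyramid R'.
Texture2D<float4> sourcePyramid;   // R'
SamplerState linearClamp;

cbuffer BlurParams
{
  float2 texelStep;   // one texel along the blur axis, at srcMip
  float  srcMip;      // source mip chosen for this standard deviation
  float4 weights;     // x = center tap, y = +-1 taps, z = +-2 taps
};

float3 Gaussian5( float2 uv )
{
  float3 c = sourcePyramid.SampleLevel( linearClamp, uv, srcMip ).rgb * weights.x;
  c += sourcePyramid.SampleLevel( linearClamp, uv + texelStep,       srcMip ).rgb * weights.y;
  c += sourcePyramid.SampleLevel( linearClamp, uv - texelStep,       srcMip ).rgb * weights.y;
  c += sourcePyramid.SampleLevel( linearClamp, uv + 2.0 * texelStep, srcMip ).rgb * weights.z;
  c += sourcePyramid.SampleLevel( linearClamp, uv - 2.0 * texelStep, srcMip ).rgb * weights.z;
  return c;
}

// Mip level m of the irradiance pyramid I is then the blend of the two
// blurred results (each produced by a horizontal and a vertical pass):
//   I_m(uv) = lerp( blurNarrow(uv), blurWide(uv), blendWeight );
```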
$$\frac{E_1}{E_0} = \frac{1 + n_z}{2}. \tag{5.5}$$
We’ll use the relationship from Equation (5.5) to attenuate the value sampled
from the image pyramid, to support arbitrary query normals.
Proof: We’re interested in computing the irradiance for a scene consisting only
of a constant emissive plane. We may assume with no loss of generality that the
sample point is at the origin, and the upper hemisphere over which we gather
irradiance is (0, 0, 1). The plane is initially at z = −1, with surface normal
(0, 0, 1); it is rotated around the x axis by the angle θmax , where θmax [0, 180◦ ].
If θmax is 180◦ , then the plane is above the origin with surface normal (0, 0, −1),
so the irradiance integral covers the full hemisphere as usual. But if θmax < 180◦ ,
then areas on the hemisphere for which θ > θmax correspond to rays that will miss
the emissive plane. Therefore, the irradiance integral is restricted to directions
in the range [0, θmax ].
Figure 5.4 shows the situation. The grid represents the plane; it has been
faded out in the middle so it doesn’t obscure the rest of the diagram. The gray
truncated hemisphere represents the set of directions that intersect the emissive
plane.
The integral is identical to Equation (5.1), but expressed in spherical polar
coordinates (φ and θ) and restricted to directions that intersect the emissive
plane. For a given φ and θ, the z component is sin(φ) sin(θ), which is equivalent
to the cos(θ) term in Equation (5.1). We also scale by sin(φ) to compensate for
the change in area near the poles. Suppose the constant color radiated by the
plane is Lp . Then the integral is as follows:
$$\int_{\theta=0}^{\theta_{\max}} \int_{\phi=0}^{\pi} L_p \sin(\phi)\sin(\theta)\,\sin(\phi)\,d\phi\,d\theta
= L_p \int_{\theta=0}^{\theta_{\max}} \sin(\theta)\,d\theta \,\cdot\, \int_{\phi=0}^{\pi} \sin^2(\phi)\,d\phi$$

$$= L_p \bigl[-\cos(\theta)\bigr]_{0}^{\theta_{\max}} \cdot \frac{\pi}{2}
= L_p \bigl(1 - \cos(\theta_{\max})\bigr) \cdot \frac{\pi}{2}
= L_p \cdot \pi \cdot \frac{1 - \cos(\theta_{\max})}{2}.$$

The irradiance E0, when the surface normal points directly toward the plane,
can be found by substituting θmax = π. This gives E0 = Lp π. The ratio E1/E0
is therefore

$$\frac{E_1}{E_0} = \frac{1 - \cos(\theta_{\max})}{2}.$$

In the frame where the z axis points toward the emissive plane, cos(θmax) = −nz,
which recovers Equation (5.5).
For a sample point at distance d from the plane and transverse distance t past the
border of the valid region,

$$\text{fractional contribution of the valid region} = 0.5 - \frac{t}{2\sqrt{t^2 + d^2}}. \tag{5.7}$$
Dividing by 0.5, which is the value taken on the border, gives Equation (5.6).
Figure 5.7 is a graph of Equation (5.7), showing how the fractional contribu-
tion to irradiance changes for points past the border. The reason for using the
ratio t/d as an axis is that Equation (5.7) may be rewritten as a function of t/d.
1. Find the relative position of p within the image pyramid volume to obtain
the (u, v) for sampling the pyramid. Compute d, the distance to the plane.

2. Compute the mipmap parameter from d using Equation (5.3).

3. Sample the image pyramid using trilinear filtering, with the mipmap pa-
rameter calculated at the previous step. Clamp the (u, v) to the image
pyramid region. Let cRGB be the color sampled.

4. If the query point is not within the image pyramid volume, attenuate cRGB
using Equation (5.6).
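A compact sketch of these steps is given below. It assumes the slice's local frame (z toward the plane), that the (u, v) overshoot and d are expressed in the same units, and the attenuation factor obtained by dividing Equation (5.7) by 0.5; resource and parameter names are placeholders.

```hlsl
Texture2D<float4> irradiancePyramid;   // image pyramid I
SamplerState trilinearClamp;

// p_local: query point in the slice's local space (z = distance to plane)
// n_local: query normal in the same space (z toward the plane)
// s: distance that maps to the first mipmap level (Equation (5.3))
float3 SampleSliceIrradiance( float3 p_local, float3 n_local, float s )
{
  float2 uv = p_local.xy;                // relative position over the slice
  float d   = max( p_local.z, 1e-4 );    // distance to the emissive plane

  // Step 2: mipmap parameter (Equation (5.3))
  float mip = log2( d / s );

  // Step 3: trilinear lookup with clamped (u, v)
  float2 uvClamped = saturate( uv );
  float3 cRGB = irradiancePyramid.SampleLevel( trilinearClamp, uvClamped, mip ).rgb;

  // Step 4: attenuate when the query point lies outside the pyramid volume
  // (border falloff of Equation (5.7), divided by its on-border value 0.5)
  float t = length( uv - uvClamped );    // transverse distance past the border
  if ( t > 0.0 )
    cRGB *= ( 0.5 - t / ( 2.0 * sqrt( t * t + d * d ) ) ) / 0.5;

  // Arbitrary query normals: scale by (1 + n_z) / 2 (Equation (5.5))
  cRGB *= saturate( ( 1.0 + n_local.z ) * 0.5 );

  return cRGB;
}
```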
Figure 5.7. Contribution of the valid region to irradiance when sampling beyond the
border.
5.4.6 Results
Before moving on, it’s useful to compare the approximation that has been de-
scribed, with the ground truth result.
Figure 5.8 shows the scene. The black square with the cyan circle is the emis-
sive plane, at y = 0. We’ll evaluate the irradiance at points on the transparent
square, which are the points −32 ≤ x ≤ +32, −64 ≤ y ≤ 0, z = 0. For simplicity,
the surface normal used for sampling is n = (0, 1, 0), directly toward the emissive
plane.
Figure 5.9 to Figure 5.11 show the resulting irradiance. Figure 5.9 shows the
ground truth result where the indirect term at each point was evaluated using
16,000 importance-sampled rays. A lot of rays are needed because there’s only a
small bright region—even using 1000 rays per pixel gives noisy results.
Figure 5.10 shows the image-pyramid approximation described in this chapter.
The image pyramid only covers a subset of this space: −16 ≤ x, y ≤ +16,
between the yellow lines. Note that the values outside this region are still a good
approximation because the trilinear sample is attenuated using Equation (5.6).
For comparison, Figure 5.11 shows the standard boxfiltered mipchain without the
Gaussian blurs or attenuation.
Figure 5.10. Texture approximation: one trilinear filtered lookup from a 256 × 256 texture.

Figure 5.11. Comparison—trilinear lookup without Gaussian blurs or attenuation.
5.4.7 Limitations
It’s tempting to try using the slice approach with arbitrary heightfields. Unfor-
tunately, this can often give poor results: slices approximate the emissive surface
as a flat plane and weight the contributions of each part of the emissive plane ac-
cordingly. With a heightfield, it’s possible for there to be points quite close to our
sample point, which means they should contribute significantly to irradiance—but
with the slice approach, they have a very low contribution.
For instance, consider the situation shown in Figure 5.12. The green line is
a cross section through a heightfield. Suppose we’ve distorted a slice to match
it exactly, and use the slice to evaluate irradiance. The point p is the location
where we’d like to sample irradiance; the semicircle indicates the hemisphere for
the gather. The black line running vertically down from p to the heightfield
shows the distance to the heightfield; this will be the distance used to calculate
the mipmap parameter for the image pyramid lookup.
In Figure 5.12, point a is quite close to p, so it should contribute significantly
to irradiance. However, the slice approximation will weight it as if it were at b.
So, if the shape of a slice is distorted too much, the quality of the approxima-
tion will suffer.
In conclusion, if we restrict the emissive surfaces to near-planar surfaces, then
we can efficiently evaluate the indirect lighting contribution and, in addition, the
steps to build the image pyramid are cheap enough to run in real time.
In the next sections of the chapter, we’ll describe how to use this approach to
support more general scenes.
Figure 5.13. Two emissive squares; one completely occludes the other when irradiance
is gathered for the hemisphere indicated.
It’s tempting to combine the irradiance values sampled from different slices.
Unfortunately, the only way this can be reliably done is if the slices do not obscure
each other.
Imagine two nearly coincident squares, one red and one blue, with their ir-
radiance sampled, as shown in Figure 5.13. The red square does not contribute
to the irradiance at all when they are combined; it’s completely obscured by the
blue square.
In this case, occlusion completely changes the result—and occlusion isn’t ac-
counted for with the slice approximation. However, it is possible to sum the
irradiance and opacity sampled from two slices if they do not obscure each other
at all. Recall the irradiance definition from Equation (5.1):

$$E(p, n) = \int_{H^+} L_i(p, \omega) \cos\theta \, d\omega,$$

where L_i(p, ω) is the incoming light falling on p from the direction ω. Define
L_i^A(p, ω) to be the incoming light if there was only object A in the scene, and
L_i^B(p, ω) similarly. These will give the RGB value (0, 0, 0) for directions that
don’t hit the corresponding object.

If there are no p and ω such that L_i^A(p, ω) and L_i^B(p, ω) are simultaneously
nonzero, then

$$E(p, n) = \int_{H^+} L_i(p, \omega) \cos\theta \, d\omega
= \int_{H^+} \bigl(L_i^A(p, \omega) + L_i^B(p, \omega)\bigr) \cos\theta \, d\omega$$

$$= \int_{H^+} L_i^A(p, \omega) \cos\theta \, d\omega + \int_{H^+} L_i^B(p, \omega) \cos\theta \, d\omega$$

$$= [\text{irradiance due to object A}] + [\text{irradiance due to object B}].$$
So, we can sum the irradiance from different objects/slices if they never occlude
each other.
However, the relation L_i(p, ω) = L_i^A(p, ω) + L_i^B(p, ω) will still be a reasonable
approximation as long as the two objects only rarely occlude each other.
These heightfields define the split between neighboring cells. They produce a
collection of cells that can be indexed using three integers (i, j, k), where 0 ≤ i ≤
nx , 0 ≤ j ≤ ny , and 0 ≤ k ≤ nz . The resulting arrangement of cells are like a
distortion of an nx × ny × nz array of voxels.
The region covered by the cell (i, j, k) occupies the set of points (fi (y, z),
gj (x, z), hk (x, y)) where x ∈ (i, i + 1), y ∈ (j, j + 1), and z ∈ (k, k + 1).
Defining the distortion in this way does not allow arbitrary distortions to be
represented. For instance, a twist distortion (like the “twirl” distortion in Adobe
Photoshop) with an angle of more than 90 degrees cannot be expressed as a series
of layered heightfields.
The relative position (u, v, w) of a point (x, y, z) within its cell is obtained from the
boundary heightfields:

$$u = \frac{x - f_i(y, z)}{f_{i+1}(y, z) - f_i(y, z)}, \qquad
v = \frac{y - g_j(x, z)}{g_{j+1}(x, z) - g_j(x, z)}, \qquad
w = \frac{z - h_k(x, y)}{h_{k+1}(x, y) - h_k(x, y)}. \tag{5.8}$$
The point (u, v, w) can be used directly for evaluating the slice irradiance. For
example, to evaluate the contribution of the slice on the fi side of the cell, the
texture coordinate to sample the slice texture is (v, w), and the value u can be
directly used to calculate the mipmap parameter (Equation (5.3)) if the distance
to the first mipmap is also expressed in that space. The other five slices making
up the cell can be evaluated in a similar way.
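As an illustration of Equation (5.8) and the per-slice lookup just described, the fragment below computes (u, v, w) from sampled boundary heightfield values and evaluates the f_i-side slice. It is a sketch only; the heightfield struct, the slice pyramid, and the parameter names are assumptions.

```hlsl
// Boundary heightfield values for the cell (i, j, k) at the query point,
// assumed to be fetched elsewhere (e.g., from small heightfield textures).
struct CellBounds
{
  float f0, f1;   // f_i(y,z),  f_{i+1}(y,z)
  float g0, g1;   // g_j(x,z),  g_{j+1}(x,z)
  float h0, h1;   // h_k(x,y),  h_{k+1}(x,y)
};

float3 CellRelativePosition( float3 p, CellBounds b )
{
  // Equation (5.8)
  float u = ( p.x - b.f0 ) / ( b.f1 - b.f0 );
  float v = ( p.y - b.g0 ) / ( b.g1 - b.g0 );
  float w = ( p.z - b.h0 ) / ( b.h1 - b.h0 );
  return float3( u, v, w );
}

Texture2D<float4> sliceFi;        // image pyramid of the f_i-side slice
SamplerState trilinearClamp;

float3 EvaluateFiSlice( float3 uvw, float sFirstMip )
{
  // (v, w) addresses the slice texture; u acts as the distance from the
  // slice, so it drives the mipmap parameter (Equation (5.3)).
  float mip = log2( max( uvw.x, 1e-4 ) / sFirstMip );
  return sliceFi.SampleLevel( trilinearClamp, uvw.yz, mip ).rgb;
}
```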
So at runtime, the full process of evaluating irradiance for a given point in
world space involves these steps:
• Search the layered heightfields for each of the three axes to find the cell
(i, j, k).
• Find the relative position (u, v, w) within the cell using Equation (5.9).
• Sample the image pyramid of each of the six slices associated with that cell.
• Scale and attenuate those six samples based on surface normal and distance
outside border (if need be) to evaluate irradiance using Equations (5.4) and
(5.5).
• Sum the resulting six irradiance values.
Figure 5.15. Scene divided into equal parts. The geometry is shown in green; the arrows
on the left indicate the ray collection.
The heightfields within each series must not cross each other—they’ll form a series
of layers. So fi (y, z) < fi+1 (y, z) always, with similar restrictions for the g and h series.
3. For each ray in turn, trace the ray through the scene, finding each of the
surface intersections. Within each of the nx subset spaces defined at step
1, find the front-facing intersection with the minimum x component, and
the back-facing intersection with maximum x. (Front-facing means the
surface the ray intersected has a surface normal whose x component is
< 0—i.e., the ray hit the surface from the front. Back-facing means the
opposite.) Let Sj be the resulting set of intersection points for ray j.
Record the intersection by storing the ray parameter, i.e., the value p for
which (ray origin) + (ray direction × p) = intersection point. Figure 5.16
shows the relevant front- and back-faces for each scene subset.
5. Build a 2D table of value and weight for each scene subset. Within each of
the nx scene subsets, for each ray i we have up to two intersection points
(a front-facing one and a back-facing one) and an associated significance
value.
Compute the pair {v ∗ w, w}, where v is the sum of the ray parameters
and w is the sum of the associated significance value for those intersection
points— there will be zero, one, or two of them. Figure 5.18 shows the
average positions.
If there were no intersection points for the ray in this scene subset, let the
pair be {0, 0}.
Because the rays are defined to be a regular grid, indexed by the integers p
and q (see Step 2) assign the pair {v, w} to the table entry Ti (p, q), where
i indicates the scene subset.
6. Smooth and extrapolate the table associated with each of the scene subsets.
Let O(p, q) be the point at which the ray (p, q) originates. Then

$$T_i(p, q) = \sum_{0 \le r \le p_{\max},\; 0 \le s \le q_{\max}} T_i(r, s) \cdot c^{-|O(p, q) - O(r, s)|},$$

where c is a constant controlling the locality of the blur. Note that the values
Ti(p, q) are pairs of real values {a, b}; they are scaled and added like 2D vectors.

Figure 5.16. Red markers indicate front and back faces for each scene subset.

Figure 5.18. Average position of the intersection points in each scene subset.
7. Output the heightfield. Define the 2D table Ui with the same dimensions
as Ti . Ti (p, q) is the pair {a, b}; the corresponding entry Ui (p, q) is defined
to be a/b. Note that b will be zero if and only if the b entry is zero for
all entries in table Ti (see Figure 5.19). Note that the heightfields follow
the scene geometry where possible, and there is no heightfield in the s − s1
subset.
5.7.2 Discussion
This approach works well for scenes which have strong axis-aligned features. That
is, scenes where the main surfaces are roughly parallel to the xy, xz, and yz
planes. Architectural scenes and game environments built out of prefabs usually
meet this requirement.
Large curved surfaces run the risk of problems. If the surface needs to be
represented by slices from more than one axis, parts of that surface may be
missed (leading to light leakage) or have regions included in two slices (leading
to too much indirect light).
These problem cases can usually be fixed by manually editing the distortion
function to ensure that the surface always aligns with one and only one slice.
The slice placement step was intended to run offline, as a preprocess, to gen-
erate slice geometry that does not change at runtime.
Again, this definition means that if a slice is opaque, it will block the lateral
flow of light. If a slice is transparent, then the light will flow laterally from cell
to cell with no discontinuities or artifacts.
With this approach, the light will radiate outward in a plausible fashion, with
no discontinuities at the cell borders.
5.9 Results
Figure 5.21 shows the simplest case. Light radiated from a wall is captured by
a single slice. The floor plane samples that irradiance. The images demonstrate
how the irradiance softens and widens with distance from the emissive surface.
It also underlines that the emissive surface can radiate light in any pattern; it’s
not a point or line but a texture.
Figure 5.22 shows a scene with two cells separated by a perforated wall. This
scene shows light propagation from cell to cell and how the attenuation is affected
by the occluding wall. Note that the light that passes through the holes and falls
on the floor gives a sharper pool of light for the hole near the floor. The farther
hole gives a pool of light on the floor that’s blurrier and farther from the wall.
Figure 5.22. Light propagating from one cell to another, with attenuation.
Figure 5.23 shows a production scene where the irradiance is provided using
slices. The inset image shows the direct lighting. The blue skylight was created
by adding constant blue light into the transparent areas of the topmost layer.
For this scene, the cells are roughly 5 m along each side, and each slice is 64 × 64
resolution.
Figure 5.24 illustrates the sort of changes possible at runtime: the left-hand
image shows a scene with a single light; on the right, a square hole has been
cut out of the balcony and the roof above is now lit by bounce light. The slice
shapes were not altered in any way: opening the hole in the balcony only changed
the transparency of one of the slices, allowing light to propagate from the cell
beneath the balcony to the cell above.
5.10 Conclusion
We have presented a method for efficiently evaluating irradiance from a flat sur-
face using an image pyramid, and an efficient method for rebuilding the image
pyramid at runtime.
The biggest weakness of this approach is the requirement that the scene be
represented by a set of slices. If this requirement can’t be met, there will be light
leakage and other quality problems. While most architectural and man-made
scenes can be adequately represented by slices, more organic environments are a
challenge.
In addition, this method won’t support dynamic objects moving through the
scene. While it can support limited changes to the architecture (e.g., doors
opening, a wall collapsing, a roof opening up), it isn’t a solution for real-time
irradiance effects due to characters moving through the scene.
In conclusion, the best use case for this approach is an architectural or other
man-made scene that can be accurately represented by slices, where the lighting
is dynamic, the effects of dynamic objects on irradiance are not significant, and
the runtime changes to the scene geometry are limited (e.g., doors opening and
closing). In this case, it will yield high-quality irradiance that updates in real
time.
Bibliography
[Crassin et al. 2011] Cyril Crassin, Fabrice Neyret, Miguel Sainz, Simon Green,
and Elmar Eisemann. “Interactive Indirect Illumination Using Voxel Cone
Tracing.” Computer Graphics Forum: Proc. of Pacific Graphics 2011 30:7
(2011), 1921–1930.
[d’Eon and Luebke 07] Eugene d’Eon and David Luebke. “Advanced Techniques
for Realistic Real-Time Skin Rendering.” In GPU Gems 3, edited by
Hubert Nguyen, Chapter 14. Reading, MA: Addison-Wesley Professional,
2007. (Available online at http://http.developer.nvidia.com/GPUGems3/
gpugems3 ch14.html.)
Shadows are the dark companions of lights, and although both can exist on their
own, they shouldn’t exist without each other in games. Achieving good visual
results in rendering shadows is considered one of the particularly difficult tasks
of graphics programmers.
The first article in the section, “Practical Screen-Space Soft Shadows” by
Márton Tamás and Viktor Heisenberger, describes how to implement a shadow
filter kernel in screen space while preserving the shadow color data in layers.
The next article, “Tile-Based Omnidirectional Shadows” by Hawar Doghra-
machi, shows how to implement efficient shadows in combination with a tiled de-
ferred shading system by using programmable draw dispatches, the programmable
clipping unit, and tetrahedron shadow maps.
The third and last article, “Shadow Map Silhouette Revectorization” by Vladi-
mir Bondarev, utilizes MLAA to reconstruct the shadow penumbra, concealing
the perspective aliasing with an additional umbra surface. This is useful for hard
shadow penumbras.
—Wolfgang Engel
1. Practical Screen-Space Soft Shadows
Márton Tamás and Viktor Heisenberger
1.1 Introduction
This article describes novel techniques that extend the original screen-space soft
shadows algorithm [Gumbau et al. 10] in order to make sure that the speed of ren-
dering is optimal and that we take into consideration overlapping and translucent
shadows. We introduce layers, an essential component to filtering overlapping
shadows in screen space. We aim to render near one hundred properly filtered,
perceptually correct shadows in real time. We also aim to make this technique
easy to integrate into existing rendering pipelines.
1.2 Overview
Shadows are important to establish spatial coherency, establish relationships be-
tween objects, enhance composition, add contrast, and indicate offscreen space
that is there to be explored. As a gameplay element, they are used to project
objects onto walls with the intent to create new images and signs that may tell
a story. Shadows are often used to either lead the viewer’s eye or obscure unim-
portant parts of the scene.
In computer graphics, light emitters are often represented as a single point
with no definite volume. These kinds of mathematical lights cast only hard-
edged shadows (a point is entirely obscured by a shadow caster or not) called an
umbra. However, in the real world, lights usually have volume (like the sun), and
therefore they cast soft-edged shadows that consist of an umbra, penumbra (a
point is partially obscured by shadow caster), and antumbra (the shadow caster
appears entirely contained by the light source, like a solar eclipse). Figure 1.1
shows a real-world umbra, penumbra, and antumbra.
Figure 1.1. A real-life umbra, penumbra, and antumbra. The objects are lit by a desk
spot lamp.
1.3 History
Traditionally, umbras have been represented by either shadow mapping [Williams 78]
or shadow volumes [Crow 77]. Shadow mapping works by rendering the
scene depth from the point of view of the light source and later in the lighting
pass sampling it and comparing the reprojected scene depth to it to determine if
a point is in a shadow. Shadow volumes work by creating shadow geometry that
divides space into shadowed and unshadowed regions. However, shadow volumes
are often bottlenecked by fill rate, leading to lower performance [Nealen 02].
Thus, we use shadow mapping.
While shadow volumes can achieve pixel-perfect hard shadows, shadow map-
ping’s quality depends on the allocated shadow map’s (depth texture’s) size. If
there’s not enough shadow map resolution, under-sampling will occur, leading to
aliasing. If there’s more than enough shadow map resolution, over-sampling will
occur, leading to wasted memory bandwidth. Shadow maps also suffer from pro-
jective aliasing, perspective aliasing, and erroneous self-shadowing, which needs
to be properly addressed.
To simulate penumbra, shadow mapping is often extended with shadow fil-
tering. In order to render soft shadows, percentage closer filtering (PCF) was
introduced by [Reeves et al. 87]. This technique achieves soft shadows by imple-
menting blurring in shadow space. Later, PCF was extended by a screen-space
Figure 1.2. Hard shadows (left), a uniform penumbra rendered using PCF (middle), and
a perceptually correct variable penumbra rendered using SSSS. When using a variable
penumbra, shadow edges become sharper as they approach the shadow caster.
blurring pass [Shastry 05] that enables the use of large filter kernels. However,
these techniques can only achieve uniform penumbras. Figure 1.2 shows a com-
parison of hard shadows, shadows with uniform penumbras, and shadows with
variable-sized penumbras.
Percentage-closer soft shadows (PCSS) was introduced to properly render
variable-sized penumbras [Fernando 05]. PCSS works by varying the filter size
of the PCF blurring. It does a blocker search in order to estimate the size
of the penumbra at the given pixel, then uses that information to do variable-
sized blurring. However, PCSS still does the blurring step in shadow space, and,
depending on the shadow map and kernel size, this step can be a bottleneck,
especially when multiple lights are involved. Screen-space soft shadows (SSSS)
[Gumbau et al. 10] aims to combat this by deferring the blurring to a screen-space
pass so that it will be independent of the actual shadow map size. In screen space,
however, we need to account for the varying view angle and therefore we need
to use an anisotropic filter. Because the blocker search is still an expensive step
(O(n²)), SSSS was extended by [Gumbau et al. 10] with an alternate way to
estimate the penumbra size by doing a min filter on the shadow map. In addition,
this filter is separable and the result only needs to be coarse, so a low-resolution
result is acceptable (O(n + n), for a much smaller n). [Engel 10] extends SSSS by
adding exponential shadow maps and an improved distance function. This allows
for soft shadows free of self-shadowing artifacts and better use of the same filter size
when viewed from far away.
Mipmapped screen-space soft shadows (MSSSS) [Aguado and Montiel 11] also
tries to further improve the speed of filtering. It transforms the shadow map
Figure 1.3. Not handling overlapping shadows properly by using layers can lead to
artifacts (left), and correct overlapping shadows (right).
we decided to separate this to make sure this technique can be easily integrated
into any rendering pipeline. It is possible to go with any G-buffer layout, provided
it contains at least the depth buffer and the normals of the scene, as we will need
these later. It is important to state that it doesn’t matter whether deferred
shading or deferred lighting or any of the other popular techniques is being used.
We decided to use deferred shading because of its simplicity and speed.
Our G-buffer layout consists of
• D24 depth buffer (stores distance between the viewer and the point being
processed),
Figure 1.4. Contents of the translucency map: a red pole rendered from the point of
view of the light source.
Figure 1.5. Point lights colored according to their respective layer. Each layer is
represented by a color (red, green, blue, and yellow). The white cubes illustrate the
lights’ positions.
Figure 1.6. Lights are numbered and represented by circles (left), where each color
represents a layer (red, green, and blue). Lights and their intersections with each other
are represented on a graph (right). We can see that Light 1 has a vertex degree of 5, so
we would need a maximum of six layers to render these lights; however, in this case, by
using a good graph coloring algorithm, we can reduce the number of needed layers to
three.
budget allows for. In addition, in order to speed up the light intersection process,
one can use an arbitrary space division data structure such as an octree. The
actual layer layout is dependent on the exact technique being used (covered later).
Figure 1.5 illustrates the shadow layers and Figure 1.6 illustrates the graph.
$$\begin{pmatrix}
\dfrac{n}{r} & 0 & 0 & 0 \\
0 & \dfrac{n}{t} & 0 & 0 \\
0 & 0 & \dfrac{-(f + n)}{f - n} & \dfrac{-2fn}{f - n} \\
0 & 0 & -1 & 0
\end{pmatrix}$$

Figure 1.7. The (symmetric perspective) OpenGL projection matrix, where n is the
near plane distance, f is the far plane distance, t = n × tan(fov × 0.5), and r = aspect × t.
// # of bits in depth texture per pixel
unsigned bits = 16;
unsigned precision_scaler = pow( 2, bits ) - 1;
// generates a perspective projection matrix
mat4 projmat = perspective( radians( fov ), aspect, near, far );
// arbitrary position in view space
vec4 vs_pos = vec4( 0, 0, 2.5, 1 );
// clip-space position
vec4 cs_pos = projmat * vs_pos;
// perspective divide
vec4 ndc_pos = cs_pos / cs_pos.w;
float zranged = ndc_pos.z * 0.5f + 0.5f; // range: [0...1]
// this goes into the depth buffer
unsigned z_value = floor( precision_scaler * zranged );
// helper variables to convert back to view space
float A = -( far + near ) / ( far - near );
float B = -2 * far * near / ( far - near );
// get depth from the depth texture, range: [0...1]
float depth = texture( depth_tex, texcoord ).x;
float zndc = depth * 2 - 1; // range: [-1...1]
// reconstructed view-space z
float vs_zrecon = -B / ( zndc + A );
// reconstructed clip-space z
float cs_zrecon = zndc * -vs_zrecon;
We have two options for generating penumbra information for many lights:

• We can generate the penumbra information for each light in a separate pass.

• We can batch them and generate all of them at once (covered later).

When generating the penumbra information for each light in a separate pass, at
each pixel there will be multiple layers in the penumbra map; therefore, we need
to store the penumbra information of each layer separately. We can achieve this
by using additive hardware blending.
where z is the scene depth from the point of view of the viewer, d is the scene
depth from the point of view of the light source, and k is an empirical value (scale
factor) that is used to tweak the exponential shadow map.
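The exponential comparison itself is not reproduced above; a common formulation using these terms (a sketch of a standard exponential shadow-map test, written in HLSL-style syntax and not necessarily the chapter's exact variant) looks like this:

```hlsl
// Standard exponential shadow-map style test (assumed formulation):
// the result is 1 (fully lit) when the stored occluder depth d is not
// closer to the light than the receiver depth z, and it falls off
// exponentially otherwise; k controls the sharpness of the falloff.
float ExponentialShadowTest( float z, float d, float k )
{
  return saturate( exp( k * ( d - z ) ) );
}
```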
// input: float value in range [0...1]
uint float_to_r8( float val )
{
  const uint bits = 8;
  uint precision_scaler = uint( pow( uint( 2 ), bits ) ) - uint( 1 );
  return uint( floor( precision_scaler * val ) );
}
float threshold = 0.25;
float filter_size =
    // account for light size (affects penumbra size)
    light_size *
    // anisotropic term, varies with viewing angle
    // added threshold to account for diminishing filter size
    // at grazing angles
    sqrt( max( dot( vec3( 0, 0, 1 ), normal ), threshold ) ) *
    // distance correction term, so that the filter size
    // remains constant no matter where we view the shadow from
    ( 1 / ( depth ) );
In order to account for various viewing angles, we need to modify the filter
size. To do this, we approximate the filter size by projecting the Gaussian filter
kernel into an ellipse following the orientation of the geometry. We used the
method described in [Geusebroek and Smeulders 03], which shows how to do
anisotropic Gaussian filtering while still keeping the kernel separable. We also
need to consider that if we are viewing the shadows from far away, the filter
size needs to be decreased to maintain the effective filter width. Because of the
nature of the dot product, the filter size can diminish at grazing angles, so we
need to limit the minimum filter size. The value that we are comparing to is
chosen empirically (usually 0.25 works well). Listing 1.3 shows how this variable
filter size is implemented.
Next, all we need to do is evaluate the Gaussian filter. Because this is sep-
arable, the blurring will actually take two passes: a horizontal and a vertical.
We will modify the filter size by the anisotropy value. We also need to sample
all layers at once at each iteration and unpack the individual layers. The layer
When lighting, one needs to find out which layer the light belongs to, sample
the blurred shadow maps accordingly, and multiply the lighting with the shadow
value. Note that you only need to sample the screen-space soft shadows once for
each pixel and then use the appropriate layer.
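A minimal sketch of that lookup (in HLSL-style syntax, while the chapter's own listings use GLSL) is shown below, assuming the four blurred layers are packed into the RGBA channels of a single screen-space texture; the names and packing are assumptions.

```hlsl
Texture2D<float4> blurredShadowLayers;   // one soft-shadow layer per channel
SamplerState linearClamp;

float3 ApplyLayeredShadow( float3 lighting, float2 screenUV, uint lightLayer )
{
  // Sample all layers once per pixel...
  float4 layers = blurredShadowLayers.Sample( linearClamp, screenUV );
  // ...then pick the layer this light was assigned to
  float shadow = layers[ lightLayer ];
  return lighting * shadow;
}
```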
1.11 Results
The tests were run on a PC that has a Core i5 4670 processor, 8-GB DDR3 RAM,
and a Radeon 7770 1-GB graphics card. We used an untextured Sponza scene
Figure 1.8. Shadows rendered using SSSS (left), and reference image rendered with
Blender (right).
with 16 colored lights each having a 1024 × 1024 shadow texture to illustrate
overlapping shadows.
Figure 1.9. Screenshots with light sources rendered over the scene as boxes.
Table 1.1. Performance results (frame times) from the Sponza scene.
1.14 Conclusion
We showed that, by using layered shadow buffers, we can correctly handle overlapping
shadows, and that we can use layered translucency maps to allow for colored
shadows cast by translucent shadow casters. We also showed that this technique
can be implemented in real time while still being perceptually correct.
Bibliography
[Aguado and Montiel 11] Alberto Aguado and Eugenia Montiel. “Mipmapped
Screen-Space Soft Shadows.” In GPU Pro 2: Advanced Rendering Tech-
niques, edited by Wolfgang Engel, pp. 257–274. Natick, MA: A K Peters,
2011.
[Anichini 14] Steve Anichini. “Bioshock Infinite Lighting.” Solid Angle, http:
//solid-angle.blogspot.hu/2014/03/bioshock-infinite-lighting.html, March 3,
2014.
[Engel 10] Wolfgang Engel. “Massive Point Light Soft Shadows.” Presented at
Korean Game Developer Conference, September 14, 2010. (Available at http:
//www.slideshare.net/WolfgangEngel/massive-pointlightsoftshadows.)
[Gumbau et al. 10] Jesus Gumbau, Miguel Chover, and Mateu Sbert. “Screen
Space Soft Shadows.” In GPU Pro: Advanced Rendering Techniques, edited
by Wolfgang Engel, pp. 477–491. Natick, MA: A K Peters, 2010.
[Nealen 02] Andrew V. Nealen. “Shadow Mapping and Shadow Volumes: Recent
Developments in Real-Time Shadow Rendering.” Project Report for Ad-
vanced Computer Graphics: Image-Based Rendering, University of British
Columbia, 2002.
[Reeves et al. 87] William T. Reeves, David H. Salesin, and Robert L. Cook.
“Rendering Antialiased Shadows with Depth Maps.” Computer Graphics:
Proc. SIGGRAPH ’87 21:4 (1987), 283–291.
[Shastry 05] Anirudh S. Shastry. “Soft-Edged Shadows.” GameDev.net, http://
www.gamedev.net/page/resources/ /technical/graphics-programming-and
-theory/soft-edged-shadows-r2193, January 18, 2005.
[Williams 78] Lance Williams. “Casting Curved Shadows on Curved Surfaces.”
Computer Graphics: Proc. SIGGRAPH ’78 12:3 (1978), 270–274.
2. Tile-Based Omnidirectional Shadows
Hawar Doghramachi
2.1 Introduction
Efficiently rendering a massive number of local light sources has already been
solved by methods such as tiled deferred shading [Andersson 09], tiled forward
shading [Billeter et al. 13], and clustered deferred and forward shading [Olsson
et al. 12]. However, generating appropriate shadows for a large number of light
sources in real time is still an ongoing topic. Since accurate shadows from direct
lights significantly improve the final image and give the viewer additional infor-
mation about the scene arrangement, their generation is an important part of
real-time rendering.
This chapter will demonstrate how to efficiently generate soft shadows for a
large number of omnidirectional light sources where each light casts individual
shadows. It will be further shown that this is accomplished without introducing
new artifacts, such as shadow flickering. The underlying algorithm is based on
shadow mapping, introduced in [Williams 78], thus it benefits from the architec-
ture of current rasterizer-based graphics hardware as well as from a wide range
of existing techniques to provide high-quality soft shadows.
For this, the concepts of programmable draw dispatch [Riccio and Lilley 13]
and tetrahedron shadow mapping [Liao 10] are combined via a novel usage of the
programmable clipping unit, which is present in current consumer graphics hard-
ware. For each light source a separate shadow map is generated, so a hierarchical
quad-tree is additionally utilized, which efficiently packs shadow maps of all light
sources as tiles into a single 2D texture map. In this way, significantly more
shadow maps can be stored in a limited amount of texture memory than with
traditional shadow mapping methods.
2.2 Overview
The main target of this work is to utilize recently available features of common
consumer graphics hardware, exposed by the OpenGL graphics API, to acceler-
ate the computation of high-quality soft shadows for a high number of dynamic
omnidirectional light sources.
Traditional shadow map rendering typically first determines the meshes that
are overlapping the volumes of all relevant light sources, which is already an
O(nm) time complexity task. After this information has been computed, for
each relevant mesh and light source, one GPU draw command is dispatched. For
omnidirectional lights, the situation is even more problematic: e.g., for a cube
map–based approach [Gerasimov 04], we need to do the visibility determination for
six cube map faces and dispatch up to six GPU draw commands per mesh and
light source. The large amount of submitted draw calls can cause a significant
CPU overhead. The first part of the proposed algorithm bypasses this problem by
using the concept of programmable draw dispatch [Riccio and Lilley 13]. In this
way, the entire visibility determination and draw command generation process
is shifted to the GPU, avoiding almost the entire CPU overhead of traditional
methods.
The second part of the proposed technique makes use of the idea that for om-
nidirectional light sources it is not necessary to subdivide the 3D space into six
view volumes, as done for cube map–based approaches [Gerasimov 04]. Accord-
ing to tetrahedron shadow mapping [Liao 10], it is entirely enough to subdivide
the 3D space into four view volumes by a regular tetrahedron to produce accu-
rate shadows for omnidirectional light sources. In this way up to a third of the
draw call amount of cube map–based approaches can be saved. In contrast to
the tetrahedron shadow mapping algorithm as proposed in [Liao 10], the entire
process of creating shadow maps for four separate view directions is efficiently
moved to the GPU by introducing a novel usage of the programmable clipping
unit, which is part of current consumer graphics hardware. Furthermore, the
original method is extended in order to provide soft shadows.
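As a sketch of how the programmable clipping unit can be driven from a shader, the fragment below outputs three hardware clip distances against the planes bounding one tetrahedron face's view volume (in OpenGL the equivalent output is gl_ClipDistance). The constant-buffer contents, names, and per-face setup are assumptions rather than the chapter's actual implementation.

```hlsl
// Assumed per-face constants: view-projection matrix for the current
// tetrahedron face and the three planes (xyz = normal, w = offset)
// bounding that face's view volume.
cbuffer FaceConstants
{
  float4x4 lightFaceViewProj;
  float4   clipPlane0;
  float4   clipPlane1;
  float4   clipPlane2;
};

struct VSOut
{
  float4 position : SV_Position;
  float3 clip     : SV_ClipDistance0;  // three hardware clip distances
};

VSOut ShadowVS( float3 worldPos : POSITION )  // assumed world-space input
{
  VSOut o;
  o.position = mul( lightFaceViewProj, float4( worldPos, 1.0 ) );
  // Signed distances to the bounding planes; the fixed-function clipper
  // discards any fragment for which one of these becomes negative.
  o.clip.x = dot( float4( worldPos, 1.0 ), clipPlane0 );
  o.clip.y = dot( float4( worldPos, 1.0 ), clipPlane1 );
  o.clip.z = dot( float4( worldPos, 1.0 ), clipPlane2 );
  return o;
}
```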
Finally, this work takes advantage of the observation that the required shadow
map resolution is proportional to the screen area that the corresponding light
source influences—i.e., the smaller the radius of the light source and the larger
its distance to the viewer camera, the smaller the required shadow map resolution.
After determining the required resolution, the shadow maps of all relevant light
sources are inserted as tiles into one large 2D texture map, which will be called
the tiled shadow map. To make optimal use of the available texture space, a
hierarchical quad-tree is used. This concept not only saves memory bandwidth
at writing and reading of shadow maps, but further enables the use of a large
amount of shadow-casting light sources within a limited texture space.
The entire process of tile-based omnidirectional shadows can be subdivided
into four distinct steps:
• In a preparation step on the CPU, the relevant shadow-casting light sources
are determined, a tile in the tiled shadow map is assigned to each of them,
and the required matrices are set up.

• A compute shader performs the visibility determination between meshes and
light sources on the GPU and generates the corresponding draw commands
in an indirect draw buffer.

• By the use of a single indirect draw call submitted from the CPU, all GPU-
generated draw commands within the indirect draw buffer are executed.
In this way, shadow maps are generated for all relevant light sources and
written into corresponding tiles of the tiled shadow map.
• Finally, the tiled shadow map is sampled during the shading process by
all visible screen fragments for each relevant light source to generate soft
shadows.
2.3 Implementation
In the following subsections, each step will be described in detail. All explanations
assume a column-major matrix layout, right-handed coordinate system with the
y axis pointing upward, left-bottom corner as texture and screen-space origin, and
clip-space depth-range from −1.0 to 1.0. This work only focuses on generating
shadows for point lights, but as will be demonstrated in Section 2.5.2, it can be
easily extended to additionally support spotlights.
2.3.1 Preparation
In this step, it is first determined which lights are relevant for further processing.
Typically these are all shadow-casting light sources that are visible to the viewer
camera—that is, their light volume overlaps the view frustum and is not totally
occluded by opaque geometry. This can be accomplished by view frustum culling
and GPU hardware occlusion queries.
Tile resolution. After finding all relevant light sources, we need to determine how
large the influence of each light source on the final image is. For this, we first
compute the screen-space axis-aligned bounding box (AABB) of the spherical
light volume. Care must be taken not to clip the AABB against the boundaries
Figure 2.1. First four levels of a quad-tree that manages the tiles of an 8192 × 8192
tiled shadow map; the tile resolutions at these levels are 8192², 4096², 2048², and 1024².
of the screen; for example, a large point light that is near to the viewer but only
a small portion of which is visible on the screen still requires a high-resolution
shadow map tile. After finding the width and height of the AABB, the larger of
these two values will be taken as an approximation for the required shadow map
tile resolution. However, to avoid extremely small or large values, the acquired
resolution should be clamped to a reasonable range. In case more shadow-map
tiles would be inserted than the tiled shadow map can handle, the lights
are sorted by their acquired tile resolution. In this way, light sources with
the smallest tile resolution will be at the end of the sorted light list and are the
first to be excluded from shadow-map rendering when the tiled shadow map runs
out of space.
Tile management. A typical texture resolution that should suffice in most cases
for a tiled shadow map is 8192 × 8192. When using a 16-bit depth buffer texture
format at this resolution, we can keep the required amount of video memory
under 135 MB, which should be a reasonable value on modern graphics cards.
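(8192 × 8192 texels × 2 bytes per texel = 134,217,728 bytes, i.e., roughly 134 MB.)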
For the quad-tree implementation, a cache-friendly approach is chosen, where
all nodes are stored in a preallocated linear memory block. Instead of pointers,
indices are used to identify each node. Keeping all nodes in a linear list has
the further advantage that resetting the quad-tree is a very fast operation, since
we only have to iterate linearly over the node list. Each level of the quad-tree
corresponds to a power-of-two shadow map tile resolution and each node holds
the texture-space position of a tile in the tiled shadow map (Figure 2.1). To
increase runtime performance, the quad-tree nodes are already initialized with the
corresponding position values for a user-specified number of levels. The previously
acquired tile resolution should be clamped to a reasonable range since, on
the one hand, too small values would increase the runtime cost of finding an
appropriate node and, on the other hand, too large values would rapidly occupy
the available texture space.
At runtime, each light source requests, in the order of the sorted light list,
a tile inside the quad-tree with the calculated tile resolution. For this, first we
must determine the lowest quad-tree level that still has a tile resolution higher
than the specified value:

level = ⌊log2(s/x)⌋,

where s is the resolution of the entire tiled shadow map and x the specified
resolution. However, after finding a corresponding free tile node, the initially
acquired resolution is used instead of the power-of-two node value. Thus, popping
artifacts at shadow edges can be avoided, which would otherwise occur when the
distance of the viewer camera to the light source changes. Performance-wise,
the cost of the tile lookup is negligible; on an Intel Core i7-4810MQ 2.8 GHz
CPU, the average required time for 128 light sources is about 0.16 ms. Lights that
cannot acquire a free tile due to an excessive light count are flagged as non–
shadow-casting and are ignored during shadow generation. Because such lights have
the smallest influence on the output image anyway, in general, visual artifacts
are hard to notice.
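For example, with s = 8192 and a requested tile resolution of x = 300, this yields level = ⌊log2(8192/300)⌋ = 4; a free node is therefore searched on the level whose tiles are 512 × 512 texels, and the tile is then used with the originally acquired resolution of 300 × 300.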
Matrix setup. After all relevant lights are assigned to a corresponding shadow
map tile, for each light source, the matrices that are used during shadow-map
rendering and shading have to be correctly set up. As initially described, a reg-
ular tetrahedron is used to subdivide the 3D space for omnidirectional shadows.
Because this part of the system builds upon tetrahedron shadow mapping as pro-
posed in [Liao 10], only the modifications introduced here will be described in
detail.
First, for each of the four tetrahedron faces, a view matrix needs to be found
that consists of a rotational and a translational part. The rotational part can be
precomputed since it is equal for all lights and never changes; yaw, pitch, and
roll values for constructing these matrices are listed in Table 2.1.
The translational part consists of the vector from the point light center to the
origin and must be recalculated whenever the light position changes. Concate-
nating the translation matrix with each of the rotation matrices yields the final
four view matrices.
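As a rough sketch, with illustrative names that do not necessarily match the accompanying demo code, the four view matrices could be assembled as follows:

// Precomputed rotation matrices for the four tetrahedron faces (Table 2.1).
mat4 faceRotations[4];

// Translation that moves the point light center to the origin. With
// column-major matrices, the fourth column holds the translation.
mat4 lightTranslation = mat4(1.0);
lightTranslation[3] = vec4(-lightPosition, 1.0);

// Final view matrix per tetrahedron face: first translate, then rotate.
mat4 viewMatrices[4];
for (int face = 0; face < 4; face++)
  viewMatrices[face] = faceRotations[face] * lightTranslation;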
In the next step, appropriate perspective projection matrices have to be cal-
culated. For this, the far plane is set to the radius of the point light. Table 2.2
shows the horizontal and vertical field of view (FOV) for each tetrahedron face.
Table 2.1. Yaw, pitch, and roll in degrees to construct the rotation matrices for the
four tetrahedron faces.
Table 2.2. Horizontal and vertical FOV in degrees to construct the perspective pro-
jection matrices for the four tetrahedron faces. As can be seen, faces A and C and,
respectively, faces B and D share the same values. In order to provide soft shadows, the
values from the original paper have to be adjusted by α and β.
vec3 centers[4] = { vec3(-1, 0, -1), vec3(1, 0, -1), vec3(0, -1, -1),
                    vec3(0, 1, -1) };
vec3 offsets[4] = { vec3(-r, 0, 0), vec3(r, 0, 0), vec3(0, -r, 0),
                    vec3(0, r, 0) };
for (uint i = 0; i < 4; i++)
{
  centers[i] += offsets[i];
  v[i] = normalize(invProjMatrix * centers[i]);
}
dilatedFovX = acos(dot(v[0], v[1])) * 180 / PI;
dilatedFovY = acos(dot(v[2], v[3])) * 180 / PI;
alpha = dilatedFovX - originalFovX;
beta = dilatedFovY - originalFovY;

Listing 2.1. Pseudocode for computing α and β, which are used to extend the original FOV values in order to provide soft shadows.
Because the original paper [Liao 10] did not take into account that soft shad-
ows require a slightly larger texture area for filtering, the original horizontal and
vertical FOV values must be increased by α and β (Table 2.2). These two angles
can be computed by first offsetting the center points of each clip-space edge at
the near plane with a dilation radius r. In practice, using r = 0.0625 provides
enough space for reasonable filter kernels while avoiding an unnecessary reduction
of the effective texture resolution. The offset center points are transformed
into view space with the inverse projection matrix of tetrahedron face A, which is
built with the original FOV values and normalized to form the vectors v0 , . . . , v3
that point from the view-space origin to the transformed points. With the help
of these vectors, α and β can be calculated as shown in Listing 2.1.
Fortunately, the projection matrices are equal for all lights and never change;
thus, they can be precomputed.
Finally, the texture transformation matrices have to be calculated, which
will position the projected tetrahedron views correctly within the tiled shadow
map. Because the projected view area of each tetrahedron face corresponds to a
triangle (Figure 2.2), these areas can be packed together into squared tiles.
Figure 2.2. (a) A perspective view of the used tetrahedron, where face B is facing away
from the camera. (b) The triangular-shaped projected views of the four tetrahedron
faces packed together into a squared tile.
For faces C and D, for example, the corresponding texture transformation matrices take the form

$$M_C = \begin{pmatrix} s & 0 & 0 & p_x \\ 0 & s/2 & 0 & p_y + s/2 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad M_D = \begin{pmatrix} s/2 & 0 & 0 & p_x - s/2 \\ 0 & s & 0 & p_y \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$
Light buffer. In the order of the sorted light list, the position, radius, and four
shadow matrices of each light source have to be uploaded to the GPU, for which
a shader storage buffer (GL_SHADER_STORAGE_BUFFER) is used.
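A possible GLSL-side declaration of this buffer, matching the fields accessed in the later listings (the std430 layout and binding point are assumptions of this sketch), is:

struct Light
{
  vec3 position;           // world-space center of the point light
  float radius;            // light radius, also used as shadow far plane
  mat4 shadowMatrices[4];  // one shadow matrix per tetrahedron face
};

layout(std430, binding = 0) buffer LightBuffer
{
  Light lights[];
} lightBuffer;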
Mesh-info buffer. As with the light sources, one first needs to determine
which meshes are relevant for further processing. Typically these are all shadow-
casting meshes that overlap the volumes of the point lights that are found to
be visible to the viewer camera. Because the actual light-mesh overlap test will
be done later on the GPU, at this stage, only a fast preexclusion of irrelevant
meshes should be performed. This could be done for instance by testing the
AABB of the meshes for overlap with the AABB that encloses all relevant light
sources. An important prerequisite of the proposed technique is that commonly
processed meshes have to share the same vertex and index buffer. However, this
is strongly recommended anyway, since frequent switching of GPU resources has
a significant impact on the runtime performance due to a driver CPU overhead
[Riccio and Lilley 13]. As with the light buffer, the required information for
each relevant mesh is written into a shader storage buffer on the GPU.
For each mesh, its first index into the common index buffer, number of indices
required to draw the mesh, and minimum and maximum corners of the enclosing
AABB have to be uploaded.
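Analogously, a sketch of the mesh-info buffer on the GLSL side could look like this (member names follow the compute shader listing shown later in this section; the exact layout in the demo may differ, and the std430 padding around the vec3 members must be mirrored on the CPU side):

struct MeshInfo
{
  uint firstIndex;  // first index of the mesh in the shared index buffer
  uint numIndices;  // number of indices required to draw the mesh
  vec3 mins;        // minimum corner of the enclosing AABB
  vec3 maxes;       // maximum corner of the enclosing AABB
};

layout(std430, binding = 1) buffer MeshInfoBuffer
{
  MeshInfo infos[];
} meshInfoBuffer;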
Indirect draw buffer. The first required output buffer is the command buffer it-
self. The first member of this buffer is an atomic counter variable that keeps
track of the number of indirect draw commands that are stored subsequently.
The indirect draw command structure is already predefined by the OpenGL
specification and contains the number of required mesh indices (count), num-
ber of instances to be rendered (instanceCount ), first index into the bound index
buffer (firstIndex), offset to be applied to the indices fetched from the bound
index buffer (baseVertex ), and offset for fetching instanced vertex attributes
(baseInstance ).
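A sketch of how this output buffer might be declared on the GLSL side is shown below; the command layout follows the OpenGL DrawElementsIndirectCommand definition, while the buffer name and binding point are illustrative:

struct DrawElementsIndirectCommand
{
  uint count;          // number of indices of the mesh
  uint instanceCount;  // one instance per overlapping light
  uint firstIndex;     // first index into the bound index buffer
  uint baseVertex;     // offset applied to fetched indices (GLint on the API side)
  uint baseInstance;   // start offset into the light-index buffer
};

layout(std430, binding = 2) buffer IndirectDrawBuffer
{
  uint drawCount;                          // atomic counter for stored commands
  DrawElementsIndirectCommand commands[];  // GPU-generated draw commands
} indirectDrawBuffer;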
Light-index buffer. The second required output buffer stores the indices of all
relevant lights that overlap the processed meshes. Corresponding to the indirect
draw buffer, an atomic counter variable keeps track of the number of subsequently
stored light indices.
which would be more expensive [Harada et al. 13]. After all relevant lights are
processed for a mesh, a new indirect draw command is added to the indirect draw
buffer, but only if at least one light overlaps the AABB of the processed mesh.
This is done by incrementing the atomic counter of the indirect draw buffer and
writing the new draw command to the corresponding location. At this point,
we additionally increment the atomic counter of the light-index buffer by the
number of overlapping lights. This will return a start index into the light-index
buffer, which resides in the global video memory, from where the acquired light
indices in the shared thread group memory can be copied into the light-index
buffer. The copying process is done in parallel by each thread of a thread group
at the end of the compute shader.
Besides passing the firstIndex and count of the current mesh to the new in-
direct draw command, the number of overlapping lights is forwarded as instance
Count—i.e., later on, when the indirect draw command is executed, for each light
source a new mesh instance will be rendered. However, at that stage it is nec-
essary to acquire for each instance the corresponding light index. For this, we
write the obtained start index, which points into the light-index buffer, into the
baseInstance member of the draw command. This member is only used by
the OpenGL pipeline if instanced vertex attributes are utilized—that is, vertex
attributes with a nonzero divisor. Since traditional instancing (e.g., to create
multiple instances of the same mesh at various locations) does not make much
sense in the proposed method, we can relinquish instanced vertex attributes,
which enables the use of the valuable baseInstance parameter. Fortunately, in
the context of OpenGL 4.4, the GL_ARB_shader_draw_parameters extension has
been introduced, which allows a shader to fetch various draw command related
parameters such as the baseInstance one. In this way, when the indirect draw
commands are executed later on, for each instance, an offset into the light-index
buffer can be retrieved in the vertex shader by summing the OpenGL supplied
draw parameters gl_BaseInstanceARB and gl_InstanceID . At this offset, the cor-
responding light index can be fetched from the light-index buffer. This approach
significantly reduces the required amount of video memory space in contrast to
generating for each overlapping light source a new indirect draw command, which
requires five times more space than a single light index. Listing 2.2 shows how
this can be done for OpenGL in GLSL.
layout(local_size_x = LOCAL_SIZE_X) in;

void main()
{
  // initialize group counter
  if (gl_LocalInvocationIndex == 0)
    groupCounter = 0;
  barrier();
  memoryBarrierShared();

  // iterate over all relevant light sources
  uint meshIndex = gl_WorkGroupID.x;
  for (uint i = 0; i < uniformBuffer.numLights; i += LOCAL_SIZE_X)
  {
    uint lightIndex = gl_LocalInvocationIndex + i;
    if (lightIndex < uniformBuffer.numLights)
    {
      vec3 lightPosition = lightBuffer.lights[lightIndex].position;
      float lightRadius = lightBuffer.lights[lightIndex].radius;
      vec3 mins = meshInfoBuffer.infos[meshIndex].mins;
      vec3 maxes = meshInfoBuffer.infos[meshIndex].maxes;

      // perform AABB-sphere overlap test
      vec3 distances = max(mins - lightPosition, 0.0) +
                       max(lightPosition - maxes, 0.0);
      if (dot(distances, distances) <= (lightRadius * lightRadius))
      {
        // For each overlap increment groupCounter and add
        // lightIndex to light-index array in shared thread
        // group memory.
        uint index = atomicAdd(groupCounter, 1);
        groupLightIndices[index] = lightIndex;
      }
    }
  }
  barrier();
  memoryBarrierShared();
Programmable clipping. There is still one major obstacle that needs to be solved
before all shadow map tiles can be rendered indirectly into the tiled shadow
map. As demonstrated, the previously generated shadow matrices will create tri-
angular projected areas that can be theoretically tightly packed as squared tiles,
but since we are rendering into a 2D texture atlas, these areas will overlap and
cause major artifacts. One possible solution could be the use of a viewport array.
However, since the maximum number of simultaneously set viewports is usually
limited to a small number, typically around 16, and the viewports are rectan-
gular and not triangular, this approach is not viable. Another possible solution
could be to discard in a fragment shader all fragments outside the projected tri-
angular areas, but this would be far too slow to be feasible. Fortunately, with
Figure 2.3. The green arrows show the tetrahedron face vectors fA, fC, and fD. Face
vector fB is pointing away from the camera. The four corners of the tetrahedron are
marked as c0, . . . , c3, and the center of the tetrahedron that coincides with the point
light position is shown as p. The three clipping planes that separate the view volume
of tetrahedron face D from its neighbors are depicted in blue, green, and yellow.
In Listing 2.3, the normal of the yellow clipping plane illustrated in Figure 2.3
will be calculated, which separates the view volumes of faces A and D. All other
clipping plane normals can be calculated correspondingly.

Face    x             y             z
A       0.0          −0.57735026    0.81649661
B       0.0          −0.57735026   −0.81649661
C      −0.81649661    0.57735026    0.0
D       0.81649661    0.57735026    0.0

Table 2.3. The x, y, and z components of the four normalized tetrahedron face vectors.

normal = normalize(cross(v1, v));
rotationAxis = normalize(cross(fA, fD));
// quat(rotationAxis, alpha) is a quaternion that rotates alpha
// degrees around rotationAxis
rotatedNormal = quat(rotationAxis, alpha) * normal;
Since later on it should be possible to generate soft shadows by applying,
e.g., percentage closer filtering (PCF), the plane normals have to be adjusted
appropriately. For this the plane normals are rotated in order to increase the
aperture of the tetrahedron view volumes. The angle α used for this is the same
as derived in the section “Matrix setup” on page 319; this angle ensures, on the
one hand, that a sufficient amount of primitives pass the clipping stage to account
for shadow map filtering and, on the other hand, that the projected tetrahedron
view areas do not overlap in the effective sampling area. Since the resulting 12
clipping plane normals are equal for all lights and never change at runtime, they
can be precalculated and added as constants into the corresponding shader.
At runtime, the precalculated normals are combined each time with the posi-
tion of the processed light source to construct the appropriate clipping planes.
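In shader code, constructing a clipping plane from a precalculated normal and testing a vertex against it boils down to a few lines; a minimal sketch with illustrative variable names is:

// planeNormal is one of the 12 precalculated (and rotated) clipping plane
// normals; lightPosition is the center of the processed point light.
vec4 clipPlane = vec4(planeNormal, -dot(planeNormal, lightPosition));

// Signed distance of a world-space vertex to the plane; positive values
// lie inside the view volume of the processed tetrahedron face.
float clipDistance = dot(clipPlane, vec4(worldSpacePosition, 1.0));
gl_ClipDistance[sideIndex] = clipDistance;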
Vertex processing. To render the shadow maps indirectly, a simple vertex shader is
required that fetches the vertex attributes (typically the vertex position), calculates
the light index (as already described in the section “Computation” on page 322),
and passes this value to a subsequent geometry shader.
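A minimal vertex shader for this step might look like the following sketch; the interface block, the buffer binding, and the assumption that vertex positions are already in world space are illustrative and not taken from the chapter's demo code:

#version 440
#extension GL_ARB_shader_draw_parameters : require

layout(location = 0) in vec3 vertexPosition;

layout(std430, binding = 3) readonly buffer LightIndexBuffer
{
  uint numLightIndices;  // atomic counter written in the compute pass
  uint lightIndices[];
};

out VertexData
{
  flat uint lightIndex;
} outputVS;

void main()
{
  // baseInstance points to the start of this mesh's light-index range;
  // gl_InstanceID selects the light for the current instance.
  uint offset = uint(gl_BaseInstanceARB + gl_InstanceID);
  outputVS.lightIndex = lightIndices[offset];

  // The shadow matrix is applied later in the geometry shader.
  gl_Position = vec4(vertexPosition, 1.0);
}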
percentage of cases, less than four primitives have to be emitted and that the
light buffer data has to be fetched only once for each incoming primitive in the
loop-based approach.
An alternative strategy would be to cull the AABBs of the relevant meshes
against the four tetrahedron view volumes for each light in the indirect draw buffer
generation step and add for each overlap a new indirect draw command, thus
avoiding the later use of the geometry shader. However, it has been shown that this
approach not only requires more video memory for storing the increased number of
indirect draw commands, but also runs notably slower than the geometry shader
approach. A reason for this can be that the geometry shader performs culling on
a per-triangle basis, in contrast to culling the AABBs of the relevant meshes.
Since back-face culling as implemented by the graphics hardware is performed
after the vertex and primitive processing stage, it is done manually at the begin-
ning of the geometry shader. By reducing the amount of processed primitives,
runtime performance can be further increased [Rákos 12]. This can be an additional
reason why geometry shader instancing performs more slowly: the back-face
culling code has to be executed four times, in contrast to the loop-based solution,
where this code is shared by all four primitives.
Though the clip distances are passed via gl_ClipDistance to the clipping unit
of the graphics hardware, it has proven beneficial to additionally cull primitives in the
shader, which further improves runtime performance. This can be done by only emitting
a new primitive when at least one of the calculated clip distances of the three
processed triangle vertices is greater than zero for all three clipping planes of the
processed tetrahedron face.
Finally, transforming the incoming vertices boils down to performing for each
relevant tetrahedron face one matrix multiplication with the matching shadow
matrix. Listing 2.4 shows the corresponding GLSL geometry shader.
layout(triangles) in;
layout(triangle_strip, max_vertices = 12) out;

void main()
{
  const uint lightIndex = inputGS[0].lightIndex;
  const vec3 lightPosition = lightBuffer.lights[lightIndex].position;

  // perform back-face culling
  vec3 normal = cross(gl_in[2].gl_Position.xyz - gl_in[0].gl_Position.xyz,
                      gl_in[0].gl_Position.xyz - gl_in[1].gl_Position.xyz);
  vec3 view = lightPosition - gl_in[0].gl_Position.xyz;

  // iterate over tetrahedron faces
  for (uint faceIndex = 0; faceIndex < 4; faceIndex++)
  {
    uint inside = 0;
    float clipDistances[9];

    // Calculate for each vertex distance to clipping planes and
    // determine whether processed triangle is inside view
    // volume.
    for (uint sideIndex = 0; sideIndex < 3; sideIndex++)
    {
      const uint planeIndex = (faceIndex * 3) + sideIndex;
      const uint bit = 1 << sideIndex;

    // If triangle is inside volume, emit primitive.
    if (inside == 0x7)
    {
      const mat4 shadowMatrix =
        lightBuffer.lights[lightIndex].shadowMatrices[faceIndex];
Tiled shadow map. After the draw commands in the indirect draw buffer are
executed, the shadow map tiles of all relevant light sources are tightly packed
together into the tiled shadow map. Figure 2.4 shows this texture that was
generated for the scene in Figure 2.5.
As can be seen in Figure 2.4, the shadow map tiles of all light sources in
the processed scene are tightly packed; thus, shadow maps for significantly more
Figure 2.4. A tiled shadow map (generated for the scene in Figure 2.5) with a resolution
of 8192 × 8192. The tile size is clamped between 64 and 512. Since the scene is rendered
with view frustum culling of invisible light sources, for 117 out of the 128 medium-sized
moving point lights, an individual shadow map tile is generated. With this texture and
clamped tile resolution, in the worst case, shadow map tiles for 256 light sources can
still be stored in the tiled shadow map.
omnidirectional light sources can be stored in a limited texture space than with
traditional shadow mapping systems.
2.3.4 Shading
Finally, the tiled shadow map can be used in the shading stage to produce high-
quality soft shadows. Shading methods such as tiled deferred shading [Andersson
09], tiled forward shading [Billeter et al. 13], or clustered deferred and forward
shading [Olsson et al. 12] require the shadow maps for all relevant light sources
to be created prior to the shading process, as the proposed algorithm does. How-
// matrix of tetrahedron face vectors
mat4x3 faceMatrix;
faceMatrix[0] = faceVectors[0];
faceMatrix[1] = faceVectors[1];
faceMatrix[2] = faceVectors[2];
faceMatrix[3] = faceVectors[3];

// determine face that is closest to specified light vector
vec4 dotProducts = -lightVecN * faceMatrix;
float maximum = max(max(dotProducts.x, dotProducts.y),
                    max(dotProducts.z, dotProducts.w));
uint index;
if (maximum == dotProducts.x)
  index = 0;
else if (maximum == dotProducts.y)
  index = 1;
else if (maximum == dotProducts.z)
  index = 2;
else
  index = 3;

// project fragment world-space position
vec4 projPos =
  lightBuffer.lights[lightIndex].shadowMatrices[index] * position;
projPos.xyz /= projPos.w;
projPos.xyz = (projPos.xyz * 0.5) + 0.5;
Listing 2.5. Generating the shadow term with a tiled shadow map.
ever, lighting methods such as deferred shading [Hargreaves and Harris 04] that
theoretically can reuse shadow map textures for multiple lights by alternating be-
tween shadow map rendering and shading, can profit as well from the proposed
method, since frequent switching of render states and GPU resources can be an
expensive operation.
Generating shadows with the help of a tiled shadow map is straightforward
and follows [Liao 10]. After acquiring the world-space position of the currently
shaded screen fragment, for each relevant light source it is first determined inside
which of the four tetrahedron view volumes the processed fragment is located.
The acquired fragment position is then multiplied with the corresponding shadow
matrix to yield the projected fragment position with which a shadow comparison
is done. See Listing 2.5 for details.
Besides performing a hardware-filtered shadow comparison, various filtering
approaches such as PCF [Reeves et al. 87] or percentage-closer soft shadows
(PCSS) [Fernando 05] can be used to produce high-quality soft shadows. Since,
as already described earlier in this section, the shadow projection matrices and
tetrahedron clipping plane normals are properly adapted, such filtering techniques
will not produce any artifacts by sampling outside of the appropriate shadow map
areas.
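For illustration, a basic 16-tap PCF loop over the tiled shadow map could look like the following sketch; the sampler setup, depth bias, and filter radius are assumptions, and projPos is the projected fragment position from Listing 2.5:

uniform sampler2DShadow tiledShadowMap;  // depth texture with comparison enabled

float shadowTerm = 0.0;
vec2 texelSize = 1.0 / vec2(textureSize(tiledShadowMap, 0));
for (int y = -2; y < 2; y++)
{
  for (int x = -2; x < 2; x++)
  {
    // 2D offsets are applied directly to the tile-local lookup coordinates.
    vec2 offset = (vec2(x, y) + 0.5) * texelSize;
    shadowTerm += texture(tiledShadowMap,
                          vec3(projPos.xy + offset, projPos.z - depthBias));
  }
}
shadowTerm /= 16.0;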
2.4 Results
To capture the results, the Crytek Sponza scene was used, which, without the
central banner, contains 103 meshes and ∼280,000 triangles. The test machine had an
Intel Core i7-4810MQ 2.8 GHz CPU and an NVIDIA GeForce GTX 880 Mobile
GPU, and the screen resolution was set to 1280 × 720. For the lighting system,
tiled deferred shading [Andersson 09] is used.
A layered cube map–based shadowing solution is used as the reference for the
proposed technique. For this, the shadow maps of each point light are rendered
into a cube map texture array with 128 layers and a 16-bit depth buffer texture
format; each cube map face has a texture resolution of 256 × 256. For each point
light, the 3D space is split up into six view frustums that correspond to the six
faces of a cube map. Each mesh is tested for overlap with each of the six view
frustums. Every time an overlap is detected, a new indexed draw call is submitted
to the GPU. To speed up rendering performance, all meshes share the same vertex
and index buffer and the cube map face selection is done in a geometry shader.
For a large number of light sources, it has proven more performant to
submit a separate draw call for each overlap rather than to always amplify the
input geometry six times in the geometry shader and use a single draw call. To
improve the quality of the generated shadows, GL_TEXTURE_CUBE_MAP_SEAMLESS is
enabled, and besides performing hardware shadow filtering, 16× PCF is used for
soft shadows. In the remaining part of this section, the reference technique will
be referred to as the cube solution.
For the proposed method, an 8192 × 8192 tiled shadow map is used with a
16-bit depth buffer texture format. The tile size is clamped between 64 and 512
(see Figure 2.4). As in the reference method, hardware shadow filtering
in combination with 16× PCF is used to produce soft shadows. In the remain-
ing part of this section, this proposed technique will be referred to as the tiled
technique.
It can be seen in the comparison screenshots in Figure 2.5 that the quality
of both images is nearly equal while the proposed method runs more than three
times faster than the reference solution. In the close-up comparison screenshots
shown in Figure 2.6, we can also see that quality-wise the technique described
here comes very close to the reference solution.
For the performance measurements, the same scene configuration was used
as in Figure 2.5 with the exception that view frustum culling of invisible lights
was disabled; hence, for all 128 point lights in the scene, shadow maps were
generated. The measured frame times in Figure 2.7 show that the tiled technique
gets significantly faster compared to the reference cube solution as the number
of shadow-casting point lights increases. Figure 2.8 shows the number of draw
calls that were submitted for each frame from the CPU to render the shadow
maps. In the proposed method, the number of draw calls is constantly one due
to the indirect shadow map rendering, whereas the number of draw calls rapidly
Figure 2.5. Real-time rendering (on an NVIDIA GeForce GTX 880 Mobile at 1280×720
resolution) of the Crytek Sponza scene (∼280,000 triangles) with 128 medium-sized
moving point lights, which all cast omnidirectional shadows via shadow maps. The
upper image is rendered with the proposed tiled method at 28.44 fps; the lower image is
the reference with the cube approach at 8.89 fps. Both methods use hardware shadow
filtering in combination with 16× PCF for providing high-quality soft shadows.
increases in the reference technique. Finally, in Table 2.4, CPU and GPU times
for shadow map rendering and shading are compared.
According to Table 2.4, the CPU times for rendering shadow maps with the
proposed technique are at a constant low value since only one indirect draw call
is submitted each frame. However, the CPU times for the reference technique are
drastically increasing with the light count due to the rising number of CPU draw
calls. When comparing the times taken by the GPU to render the shadow maps,
the proposed technique is significantly faster than the reference method, which
can be primarily attributed to the reduced number of primitives processed in the
Figure 2.6. One shadow-casting point light is placed directly in front of the lion-head
model in the Crytek Sponza scene. The images on the left are rendered with the tiled
technique, and the images on the right with the reference cube technique. While the
images at the bottom show the final shading results, the images at the top visualize the
partitioning of the tetrahedron and cube volumes, respectively. As can be seen, the
shadow quality of the proposed solution comes close to that of the reference method.
tiled solution. Considering the times taken by the GPU to shade all visible screen
fragments using tiled deferred shading, it first seems unexpected that the cube
solution would have higher execution times than the tiled technique. Though
Figure 2.7. Frame times (in ms) of the tiled versus the cube technique for 1 to 128
shadow-casting point light sources.
Figure 2.8. Number of CPU-submitted draw calls to render shadow maps in the tiled
and cube techniques for 1 to 128 shadow-casting point lights.
doing a hardware texture lookup in a cube map is faster than doing the proposed
lookup, this is not true for performing PCF to produce soft shadows. While for
Table 2.4. Comparison of CPU and GPU times (ms) for shadow map rendering and
shading with an increasing number of shadow-casting point lights.
the tiled method it is enough to apply 2D offsets to the lookup coordinates, for
the cube technique a 3D direction vector, which is used for the texture lookup,
has to be rotated in 3D space.
According to the presented performance values, the proposed technique is faster
than the reference technique in all aspects and for any number of shadow-casting
point lights. On the one hand, the driver CPU overhead, present in the
reference method due to the high number of draw calls, is nearly completely
eliminated; on the other hand, the time taken by the GPU to render the shadow
maps is significantly reduced.
2.5 Discussion
We now discuss some important aspects of this technique that are relevant
for real-time applications such as computer games.
2.5.2 Spotlights
Though this chapter focuses on point lights, it is trivial to include support for
spotlight shadows as well. Actually, it is easier to handle spotlight sources since
only one view volume that corresponds to the view frustum of the spotlight has
to be taken into account. However, when clipping the primitives while rendering
into the tiled shadow map, the clipping planes must be set to the four side planes
of the spotlight view frustum.
For the cases discussed above, the indirect draw buffer generation as well
as the indirect shadow map rendering step should be handled separately where
applicable to avoid dynamic shader branching. In most cases, this only means
dispatching the compute shader for generating the indirect draw buffer and sub-
mitting an indirect draw call a few times per frame, which will have only a slight
negative impact on the driver CPU overhead. Nevertheless, in all cases, one
unique tiled shadow map can be used.
2.6 Conclusion
This chapter presented a comprehensive system for generating high-quality soft
shadows for a large number of dynamic omnidirectional light sources without the
need for approximations such as merging the shadows of multiple lights. It has been
demonstrated that this method is competitive quality-wise with a reference cube
map–based approach and performs faster for every tested number of shadow-casting
point lights. Furthermore, due to the usage of a tiled shadow map, significantly
more shadow maps for point light sources can be stored in a limited amount of
texture space than with a cube map–based approach.
2.7 Acknowledgments
I would like to thank Nikita Kindt for porting the accompanying demo application
to Linux.
Bibliography
[Andersson 09] J. Andersson. “Parallel Graphics in Frostbite: Current and
Future.” Beyond Programmable Shading, SIGGRAPH Course, New Or-
leans, LA, August 3–7, 2009. (Available at http://s09.idav.ucdavis.edu/
talks/04-JAndersson-ParallelFrostbite-Siggraph09.pdf.)
[Riccio and Lilley 13] C. Riccio and S. Lilley. “Introducing the Programmable
Vertex Pulling Rendering Pipeline.” In GPU Pro 4: Advanced Rendering
Techniques, edited by Wolfgang Engel, pp. 21–38. Boca Raton, FL: CRC
Press, 2013.
[Williams 78] L. Williams. “Casting Curved Shadows on Curved Surfaces.” Com-
puter Graphics: Proc. SIGGRAPH ’78 12:3 (1978), 270–274.
3
IV
Shadow Map Silhouette Revectorization
3.1 Introduction
Shadow mapping [Williams 78] is known for its compatibility with rasterization
hardware, low implementation complexity, and ability to handle any kind of ge-
ometry. However, aliasing is also a very common problem in shadow mapping.
This chapter introduces a shadow map filtering technique that approximates an
additional umbra surface (space completely occluded from the direct light) based
on linear interpolation in projected view space.
Projection and perspective aliasing [Lloyd et al. 08] are the two main dis-
continuity types that deteriorate the quality of a projected shadow. Since the
introduction of shadow mapping, many algorithms have been developed to re-
duce or even completely remove shadow map aliasing. Most algorithms developed
to remove aliasing are not suited to run in real time [Johnson et al. 05] and in
some cases propose additional hardware changes to allow for real-time application
[Lloyd et al. 08].
Most real-time shadow-mapping techniques can be divided into two main categories:
sample redistribution (PSM, TSM, LiSPSM, and CSM) and filter-based
techniques (VSM, PCF, and BFSM). Shadow Map Silhouette Revectorization
Figure 3.1. From left to right, the shadow silhouette revectorization process.
Figure 3.2. An uncompressed image (left), and the encoded shadow discontinuity buffer
(right). See Table 3.1 for color definition.
3.2 Implementation
The SMSR technique consists of two fullscreen passes and requires access to the
depth buffer, shadow map, lighting buffer, view matrix, light matrix, and inverse
of the light matrix.
Two types of discontinuity are distinguished: exterior discontinuity, where
the current fragment sample is inside the umbra and the next neighboring sample
is outside the umbra, and interior discontinuity, where the current fragment
sample is outside the umbra and the next neighboring sample is inside the umbra.
SMSR is only concerned with the exterior discontinuity of the shadow silhou-
ette edge. When an exterior discontinuity is detected, the direction from the
current fragment sample toward the discontinuity is encoded into one of the out-
put channels (used in the second pass to determine discontinuity orientation).
Horizontal discontinuities are stored into the red channel and vertical discontinu-
ities are stored into the green channel. Each channel has four possible states: for
example, the red channel uses the value 0.0 to indicate no discontinuity, 0.5 dis-
continuity to the left, 0.75 discontinuity to the left and right, and 1.0 discontinuity
to the right. The green channel uses the value 0.0 to indicate no discontinuity,
0.5 discontinuity to the bottom, 0.75 discontinuity to the bottom and top, and
1.0 discontinuity to the top.
To reduce the memory footprint, the discontinuity encoding can be stored in
a 4-bit channel. However, for the sake of simplicity, we do not do so in this
implementation.
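As a sketch, the per-channel encoding described above could be written as follows; the discontinuity flags are assumed to come from shadow-map comparisons of the neighboring samples:

// Maps the two neighbor tests of one axis to the encoded channel value.
float encodeDiscontinuity(bool negativeDir, bool positiveDir)
{
  if (negativeDir && positiveDir) return 0.75;  // both directions
  if (negativeDir)                return 0.5;   // left or bottom
  if (positiveDir)                return 1.0;   // right or top
  return 0.0;                                   // no discontinuity
}

// red channel: horizontal, green channel: vertical
outColor.r = encodeDiscontinuity(discontinuityLeft,   discontinuityRight);
outColor.g = encodeDiscontinuity(discontinuityBottom, discontinuityTop);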
Figure 3.3. Orientated normalized discontinuity space (ONDS) stretches from 0.0 to
1.0 on the y-axis over eight shadow-map samples and on the x-axis over just one. The last
ONDS sample located near y = 1.0 indicates the discontinuity end.
3.3 Results
SMSR successfully hides the visual perspective aliasing (see Figure 3.1, rightmost
image, and Figure 3.4) in under-sampled areas of the shadow map, and the
unoptimized version takes less than 1.5 ms to process in full HD on a GTX 580,
regardless of the shadow-map resolution.
3.3.1 Inconsistencies
SMSR doesn’t come without its drawbacks, which are categorized into special
cases, absence of data, and mangled silhouette shape.
Figure 3.4. Configuration of the Crytek Sponza scene with a 1024 × 1024 shadow map:
without SMSR (top) and with SMSR (bottom).
Figure 3.5. A closeup with SMSR (left) and without SMSR (right). Point 1 is a
discontinuity in more than two directions, a special case that is hard for SMSR
to handle. The current solution is to fill those areas completely with an umbra.
right image). SSM suffers from the same problem. The SMSR kernel has a dedi-
cated portion of code that fills all single shadow-map spacing with an additional
umbra, yielding less visually noticeable artifacts (see Figure 3.5, left image).
Figure 3.6. A mangled silhouette shape with SMSR (top) and without SMSR (bottom).
Due to edge generalization and lack of shape understanding, SMSR changes the desired
object shape.
3.5 Conclusion
Shadow Map Silhouette Revectorization particularly shines in scenes with many
large polygons, where it can utilize a lower shadow-map resolution (reducing the
GPU memory footprint) without sacrificing much visual quality, and it effectively
helps to conserve the GPU fill rate. However, the technique is in its early stages
and can be improved in many areas, such as interpolation based on shape patterns
(to improve edge revectorization), soft shadows (to improve realism), and temporal
aliasing (to reduce jagged edges). It can also be combined with sample-redistribution
techniques such as cascaded shadow maps (to optimize the use of shadow sample
density where it is needed).
Bibliography
[Jimenez et al. 11] J. Jimenez, B. Masia, J. Echevarria, F. Navarro, and D.
Gutierrez. “Practical Morphological Antialiasing.” In GPU Pro 2: Advanced
Rendering Techniques, edited by Wolfgang Engel, pp. 95–114. Natick, MA:
A K Peters, 2011.
[Johnson et al. 05] Gregory S. Johnson, Juhyun Lee, Christopher A. Burns, and
William R. Mark. “The Irregular Z-Buffer: Hardware Acceleration for Irreg-
ular Data Structures.” ACM Transactions on Graphics 24:4 (2005), 1462–
1482.
[Lloyd et al. 08] D. Brandon Lloyd, Naga K. Govindaraju, Cory Quammen,
Steven E. Molnar, and Dinesh Manocha. “Logarithmic Perspective Shadow
Maps.” ACM Transactions on Graphics 27:4 (2008), Article no. 106.
[Lopez-Moreno et al. 10] Jorge Lopez-Moreno, Veronica Sundstedt, Francisco
Sangorrin, and Diego Gutierrez. “Measuring the Perception of Light Incon-
sistencies.” In Proceedings of the 7th Symposium on Applied Perception in
Graphics and Visualization, pp. 25–32. New York: ACM Press, 2010.
[Pan et al. 09] Minghao Pan, Rui Wang, Weifeng Chen, Kun Zhou, and Hujun
Bao. “Fast, Sub-pixel Antialiased Shadow Maps.” Computer Graphics Forum
28:7 (2009), 1927–1934.
[Sen et al. 03] Pradeep Sen, Mike Cammarano, and Pat Hanrahan. “Shadow Sil-
houette Maps.” ACM Transactions on Graphics 22:3 (2003), 521–526.
[Williams 78] Lance Williams. “Casting Curved Shadows on Curved Surfaces.”
Computer Graphics: Proc. SIGGRAPH ’78 12:3 (1978), 270–274.
V
Mobile Devices
Features of the latest mobile GPUs and the architecture of tile-based GPUs pro-
vide new and interesting ways to solve existing rendering problems. In this sec-
tion we will cover topics ranging from hybrid ray tracing to HDR computational
photography.
“Hybrid Ray Tracing on a PowerVR GPU” by Gareth Morgan describes how
an existing raster-based graphics engine can use ray tracing to add high-quality
effects like hard and soft shadows, reflection, and refraction while continuing to
use rasterization as the primary rendering method. The chapter also gives an
introduction to the OpenRL API.
“Implementing a GPU-Only Particle-Collision System with ASTC 3D Tex-
tures and OpenGL ES 3.0” by Daniele Di Donato shares how the author used
OpenGL ES 3.0 and ASTC 3D textures to do bandwidth-friendly collision de-
tection of particles on the GPU. The 3D texture stores a voxel representation of
the scene, which is used to do direct collision tests as well as to look up the nearest
surface.
“Animated Characters with Shell Fur for Mobile Devices” by Andrew Girdler
and James L. Jones presents how the authors were able to optimize a high-quality
animation system to run efficiently on mobile devices. With OpenGL ES 3.0, they
made use of transform feedback and instancing in order to reach the performance
target.
“High Dynamic Range Computational Photography on Mobile GPUs” by Si-
mon McIntosh-Smith, Amir Chohan, Dan Curran, and Anton Lokhmotov ex-
plores HDR computational photography on mobile GPUs using OpenCL and
shares some very interesting results.
I would like to thank all the contributors in this section for their great work
and excellent articles.
—Marius Bjørge
1
V
Hybrid Ray Tracing on a PowerVR GPU
Gareth Morgan
1.1 Introduction
Ray tracing and rasterization are often presented as a dichotomy. Since the
early days of computer graphics, ray tracing has been the gold standard for
visual realism. By allowing physically accurate simulation of light transport, ray
tracing renders extremely high-quality images. Real-time rendering, on the other
hand, is dominated by rasterization. In spite of being less physically accurate,
rasterization can be accelerated by efficient, commonly available GPUs and has
mature, standardized programming interfaces.
This chapter describes how an existing raster-based game engine renderer can
use ray tracing to implement sophisticated light transport effects like hard and
soft shadows, reflection, refraction, and transparency, while continuing to use
rasterization as the primary rendering method. It assumes no prior knowledge of
ray tracing.
The PowerVR Wizard line of GPUs adds hardware-based ray tracing accel-
eration alongside a powerful rasterizing GPU. Ray tracing acceleration hardware
vastly improves the efficiency and therefore the performance of the techniques
described.
1.2 Review
1.2.1 Conceptual Differences between Ray Tracing and Rasterization
In a ray tracer, everything starts with the initial rays (often called primary rays).
Typically, these rays emulate the behavior of a camera, where at least one ray
is used to model the incoming virtual light that gives color to each pixel in a
framebuffer. The rays are tested against the scene’s geometry to find the closest
intersection, and then the color of the object at the ray’s intersection point is
evaluated. More precisely, the outgoing light that is reflecting and/or scattering
from the surface in the direction of the ray is computed. These calculations
often involve creating more secondary rays because the outgoing light from a
surface depends on the incoming light to that surface. The process can continue
recursively until the rays terminate by hitting a light-emitting object in the scene
or when there is no light contributed from a particular ray path.
Contrast this with rasterization, where the driving action is the submission
of vertices describing triangles. After the triangles are projected to screen space,
they are broken into fragments and the fragments are shaded. Each datum is
processed independently and there is no way for the shading of one triangle to
directly influence another unrelated triangle in the pipeline.
Ray tracing enables inter-object visibility, but the tradeoff is that every piece
of the scene that could possibly be visible to any ray must be built and resident
prior to sending the first ray into the scene.
(Figure: in the ray-tracing pipeline, geometry (triangles) is submitted to build the scene, and shaders are used to define ray behavior.)
objects. Each primitive object represents a conceptual object within the scene—
for example, the glass top of a coffee table could be a primitive object. They are
defined in world space, and their state is retained from one frame to the next.
Each primitive object needs to know how to handle rays that intersect it. This
is done by attaching a ray shader to the object. The ray shader runs whenever a
ray intersects a piece of geometry. It can be used to define the look of the object’s
material or, more specifically, the behavior of the material when interacting with
rays. A ray shader can be thought of as a conceptual analogy to a fragment shader
in rasterization. There is, however, one big difference between OpenRL shaders
and traditional raster shaders: OpenRL shaders can emit rays, and hence trigger
future shader invocations. This feedback loop, where one ray intersection results
in secondary rays being emitted, which in turn causes more ray intersections,
is a vital part of the ray-tracing process. In OpenRL shaders, this process is
implemented via the built-in functions createRay() and emitRay(). The built-in
variable rl_OutRay represents the newly created ray. This ray structure is made
up of ray attributes, some of which are built-in, such as direction and origin,
and some of which can be user defined.
In the aforementioned glass coffee table example, the ray shader would de-
fine the appearance of a glass tabletop by emitting secondary rays based on the
material properties stored in the primitive object (such as color and density).
Those secondary rays will intersect other objects in the scene, defining how those
objects (for example, the base of the table or the floor it is resting on) contribute
to the final color of the glass tabletop.
The final step in our simple ray tracer is to create the primary rays. In
OpenRL, a frame shader is invoked once for every pixel and is used to program-
matically emit the primary rays.
The simplest camera is called a pinhole camera. This name comes from the
fact that every light ray passes through the exact same point in space, or pinhole
aperture, and therefore the entire scene is in perfect focus.
void main()
{
  vec3 direction = vec3((rl_FrameCoord / rl_FrameSize - 0.5).xy, 1.0);
  createRay();
  rl_OutRay.origin = cameraPosition;
  rl_OutRay.direction = direction;
  emitRay();
}
Figure 1.3. G-buffer contents: (a) normals, (b) positions, and (c) material IDs.
Some highly specular materials, like glass, propagate light in a direction that
is largely dependent on the direction of the incoming light, while diffuse materials
like plaster will scatter incoming light across a whole hemisphere.
A ray is fundamentally a line. It has zero thickness and its intersection with a
surface is therefore a point.1 In order to approximate a diffuse material, renderers
often emit many rays to estimate the continuous function of incoming light from
all directions.
ray shaders to perform mipmapping during texture samples and differential functions within
the shader. However, a ray will only intersect one point on one surface.
void main()
{
  vec2 uv = rl_FrameCoord.xy / rl_FrameSize.xy;
  vec4 normal = texture2D(normalTexture, uv);
  vec4 position = texture2D(positionTexture, uv);
from the surface. The results from the ray tracer render are then returned to
the rasterizer, where they are composited, along with the albedo color from the
original G-buffer, to produce the final frame.
Each G-buffer component is bound to a 2D texture uniform in the frame
shader, and those textures are sampled for each pixel. This provides the world-
space surface properties required to start tracing rays directly from the surface
defined by that pixel, without emitting any camera rays.
On a pixel-by-pixel basis, the frame shader can then decide which effects to
implement for that fragment and how many rays each effect uses based on the
material properties stored in the G-buffer. This allows the application to use its
ray budget on surfaces where raytraced effects will add most to the look or the
user experience.
Currently, hybrid ray tracing requires using two different APIs—one for ray
tracing (OpenRL) and one for rasterization; a separate OpenRL render context
must be created for the ray-tracing operations. Every frame, the contents of the
G-buffer must be transferred to the ray tracer, and the results must be returned
to the rasterizer for final frame render. On platforms where it is available, EGL
can be used to avoid this extra copy by sharing the contents of these textures
between the ray tracer and the rasterizer. Listing 1.3 shows how each OpenRL
texture object is bound to an EGL image object to achieve this.
The rest of this chapter will discuss some effects that can be added to your
raster-based renderer by taking advantage of the light simulation provided by ray
tracing.
createRay();
rl_OutRay.maxT = length(toLight);
rl_OutRay.direction = normalize(toLight);
rl_OutRay.occlusionTest = true;
rl_OutRay.defaultPrimitive = lightPrimitive;
emitRay();
if the ray fails to hit any geometry. Finally, the distance to the light is calculated
and assigned to the ray’s maxT attribute.2 These attributes collectively mean that
the shader will run when there is no occluding geometry in the way, so light can
be accumulated into the framebuffer. If occluding geometry is encountered, the
ray is dropped and no light is accumulated. The shader fragment in Listing 1.4
shows how to implement hard shadows using these ray attributes.
origin and the intersection point. In OpenRL, maxT is a far clipping distance, past which no
objects are evaluated for intersection.
createRay();
rl_OutRay.maxT = length(toLight);
rl_OutRay.direction = toLight / rl_OutRay.maxT;
rl_OutRay.color = vec3(weight);
rl_OutRay.occlusionTest = true;
rl_OutRay.defaultPrimitive = lightPrimitive;
emitRay();
  }
}
1.6 Reflections
Reflections are another optical phenomenon that are well suited for simulation
with ray tracing. They are an important aspect of rendering many material types,
not just perfectly reflective materials such as chrome and mirrors.
Reflections are caused by light bouncing off of a surface in a manner defined
by the law of the reflection. This is an ancient physical law first codified by Euclid
in the third century BC. It says that when light hits a perfectly reflective surface,
it is reflected at the same angle as the incident angle.
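In vector form, an incoming direction d hitting a surface with unit normal n is reflected as r = d − 2(d · n)n, which is exactly what the reflect() built-in used later in this section computes.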
Rendering reflections using ray tracing is very simple, and in fact how to do
so is suggested by looking at any textbook diagram of the law of reflection. When
shading the reflective surface, we simply emit an extra ray from the surface to
generate the reflection color. The direction of this reflection ray is calculated
by reflecting the direction of the incoming ray about surface normal. When the
reflection ray collides with objects in the scene, it should be shaded as if it were a
primary ray; in this way, the surface that is visible in the reflection will contribute
its color to the original surface.
When rendering reflections using a hybrid approach, there are several addi-
tional implementation details that must be handled. Firstly, we have to decide
whether the pixel we are shading is reflective. We can do this by encoding our
reflectivity in the G-buffer when we rasterize our fragments into it, then reading
it back in our frame shader to decide if we need a reflection ray.
Another issue is that we are emitting our primary rays from a surface defined
by a G-buffer pixel, so we don’t have an incoming ray to reflect. Therefore, we
have to calculate a “virtual” incoming ray based on the view frustum used by
the rasterizer. In this example, we pass in the corners of the view frustum as
four normalized vec3s, and then we can calculate the virtual ray’s direction by
interpolating between the corners based on the pixel position. We then reflect
this ray around the normal defined by the G-buffer producing our reflection ray
direction. The built-in RLSL function reflect is used to perform this calculation.
Finally, when our reflection ray hits a surface, we need to make sure the result
is the same as when the same surface is viewed directly. So the output from the
ray shader for a reflection ray must match the result of the compositing fragment
shader that produces the final color for directly visible surfaces.
1.7 Transparency
Transparency is a fundamental physical property that is not handled well by
rasterization. Rasterization approximates transparency using alpha blending.
Transparent objects are sorted by distance from the camera and rendered after
the opaque objects, in an order starting at the most distant. Transparency is ap-
proximated in the raster pipeline by having each fragment combine a percentage
of its color with the value already in the framebuffer.
vec3 CalcVirtualInRay()
{
  vec2 uv = rl_FrameCoord.xy / rl_FrameSize.xy;
  vec3 left = mix(frustumRay[0], frustumRay[1], uv.y);
  vec3 right = mix(frustumRay[2], frustumRay[3], uv.y);
  vec3 cameraRay = mix(left, right, uv.x);
  return cameraRay;
}

createRay();
rl_OutRay.direction = reflection;
rl_OutRay.origin = position;
emitRay();
}
Alpha blending causes many artifacts, as it bears little relation to how transparency
works in real life. Transparency is caused by light traveling through a
transparent medium, where some wavelengths are absorbed and some are not.
Ray tracing can be used to simulate transparency, independent of vertex sub-
mission order and without any of the artifacts and problems inherent in alpha
blending.
To render a transparent surface, we emit a transparency ray from the back
side of the surface, with the same direction as the incoming ray. If the surface
is translucent, the ray will have its color attribute modulated by the color of
the surface. This transparency ray is treated exactly the same as a
reflection ray. The final color that the transparency ray contributes to the pixel
will be modulated by the color of the transparent surface it traveled through.
In this example, the surface transparency is stored in the alpha channel of the
surface color. If the surface is completely transparent, the ray has 100% intensity,
and as the surface becomes opaque, the ray’s intensity approaches zero.
Simple ray-traced transparency of this kind does not take into account the
behavior of many transparent materials. The physics of what happens when
light travels from one transparent medium to another is more complicated than
presented above. Some light is reflected off the surface (according to the law of
reflection discussed earlier) and some light bends, or refracts, changing its direction
based on the relative speed of light in the two media.
This too can be represented in a ray tracer using a simple combination of a
transparency ray and a reflection ray. The percentage of the light that is reflected
versus refracted is defined by Fresnel’s equations and can be approximated using
a power function.
createRay();
rl_OutRay.direction = inRay;
rl_OutRay.origin = position;
rl_OutRay.color = (1.0 - color.a) * rl_InRay.color * color.rgb;
emitRay();
}

if (rl_FrontFacing)
  ior = 1.0 / ior;
else {
  /* Beer's Law to approximate attenuation. */
  atten = vec3(1.0) - materialColour;
  atten *= materialDensity * -rl_IntersectionT;
  atten = exp(atten);
}

createRay();
rl_OutRay.direction = refract(rl_InRay.direction, normal, ior);
/* For Total Internal Reflection, refract() returns 0.0 */
if (rl_OutRay.direction == vec3(0.0)) {
  rl_OutRay.direction = reflect(inRay, normal);
}
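One common choice for such a power function is Schlick's approximation; a minimal sketch, with F0 (the reflectance at normal incidence) and the cosine term as illustrative inputs not taken from the chapter's code, is:

// Approximate fraction of light that is reflected at a dielectric boundary.
// cosTheta = dot(-incomingDirection, surfaceNormal).
float fresnelSchlick(float cosTheta, float F0)
{
  return F0 + (1.0 - F0) * pow(1.0 - cosTheta, 5.0);
}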
1.8 Performance
Performance in a ray-tracing GPU is a big topic that cannot be covered by one
section of this chapter. Hopefully this section contains enough information to
provide a framework to begin to optimize your engine.
Figure 1.5. A sample heat map showing the most expensive pixels. Note that the internal ray bouncing on the refractive glass objects can generate many rays.
Figure 1.6. An unfiltered soft shadow computation using (a) one, (b) two, (c) four, and (d) eight rays per pixel. The quality gradually improves as more rays are used.
are created, removed, or modified, and that uniforms that affect a vertex shader
are not modified.
The cost of ray traversal is a direct function of the geometric complexity of
the scene. If reducing the complexity of your meshes yields a large performance
gain, then the ray tracer may be bottlenecked on traversal.
The most difficult factor to isolate is the shader. As in a rasterizer, one
valuable test may be to reduce the complexity of the shaders in the scene, for ex-
ample, by replacing the material ray shaders with a simple shader that visualizes
the normal at the ray intersection point. In ray tracing, however, this may also mask the problem by avoiding the emission of secondary rays (and hence reducing ray traversal).
Keep in mind that, in the Wizard architecture, ray and frame shaders execute
on exactly the same shading hardware as the vertex and fragment shaders used
in rasterization. Furthermore, they share exactly the same interface to memory.
This means that a heavy raster shader could bog down the system for ray tracing
or vice versa.
1.9 Results
All of the screenshots in Figures 1.7–1.12 were rendered with between one and
four rays per pixel, measured as a frame-wide average.
1.10 Conclusion
This chapter described one way of adding the sophisticated light transport sim-
ulation of ray tracing to a raster-based renderer. By using ray tracing as a tool
like this, the physically accurate rendering techniques that have long been used
in ray-tracing production renderers can be added to real-time renderers. As ray-
tracing acceleration becomes more widespread in consumer GPUs, many other
techniques will likely be developed as computer graphics developers explore in-
novative ways to add ray tracing to their products.
2
Implementing a GPU-Only Particle-Collision System with ASTC 3D Textures and OpenGL ES 3.0
Daniele Di Donato
2.1 Introduction
Particle simulation has always been a part of games, used to realize effects that are difficult to achieve with standard rasterization. As the name suggests, particles are associated with the concept of small elements that appear in huge numbers. To avoid the complexity of real-world physics, the particles used in graphics tend to be simplified so they can be easily used in real-time applications. One of these simplifications is to treat each particle as independent and not interacting with the others, which makes particles well suited to parallelization across multiple processors.
The latest mobile GPUs support OpenGL ES 3.0, and the new features it adds give us the right tools for implementing this simulation on the GPU. We also wanted to enable more realistic behavior, especially concerning collisions with objects in the scene. Collision detection can be computationally expensive and memory intensive if we simply parallelize the traditional CPU approach, since the scene geometry needs to be passed to the GPU and traversed at every simulation step. With the introduction of ASTC [Nystad et al. 12] and its support for 3D textures, we are now able to store voxelized data on mobile devices with large memory savings. This texture can be used in the OpenGL pipeline to read information about the scene and modify each particle's trajectory at the cost of a single texture access per particle. The following sections describe all the steps of the particle-system simulation in detail (Figure 2.1).
Figure 2.1. Overview of the simulation: the particles' positions and velocities and the ASTC 3D texture are the inputs to a simulation vertex shader that updates the particle physics.
Compared with other compression algorithms, ASTC offers more parameters to tune the quality of the final image (more details are available in [Smith 14]). The main options are the block size, the quality settings, and an indication of the correlation between the color channels (and the alpha channel if present). For the 3D format, ASTC allows the block sizes described in Table 2.1.
Because the compressed block size is always 128 bits for all block dimensions and input formats, the bit rate is simply 128/(number of texels in a block). This specifies the tradeoff between quality and size of the generated compressed texture. In Figure 2.2, various ASTC-compressed 3D textures have been rendered using slicing planes and various block sizes.
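As a quick check of this formula for the block sizes shown in Figure 2.2:

128 / (3 × 3 × 3) = 128/27 ≈ 4.74 bits per texel,
128 / (4 × 4 × 4) = 128/64 = 2.00 bits per texel,
128 / (5 × 5 × 5) = 128/125 ≈ 1.02 bits per texel.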
The other parameter to choose is the amount of time spent finding a good
match for the current block. From a high-level view, this option is used to increase
the quality of the compression at the cost of more compression time. Because
this is typically done as an offline process, we can use the fastest option for debug
Figure 2.2. From left to right: uncompressed 3D texture, ASTC 3D 3×3×3 compressed
texture, ASTC 3D 4 × 4 × 4 compressed texture, and ASTC 3D 5 × 5 × 5 compressed
texture.
purposes and compress using the best one for release. The options supported by the free ARM ASTC evaluation codec [ARM Mali 15a, ARM Mali 15c] are very fast, fast, standard, thorough, and exhaustive. The last parameter to set is the correlation between the color channels. The freely available tools also allow us to use various preset configuration options based on the data we want to compress. For example, the tool has a preset for 2D normal-map compression that treats the channels as uncorrelated and also uses a different error metric for the conversion. This preset is not available for 3D textures, so we mark the channels as uncorrelated using the fine-grained options available. Note that the ASTC compression tool used does not store negative numbers, even in the case of the half-float format. This is due to the internal implementation of the ASTC algorithm. Because our data contains mostly unit vectors, we shifted the origin to be at [1, 1, 1] so that the vectors reside in the [0, 0, 0] to [2, 2, 2] 3D cube.
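A minimal sketch of this offset encoding in plain C++ (the helper names are illustrative, not the chapter's actual tool chain): the shift is applied before compression, and the shader undoes it after sampling.

#include <array>

// Shift a unit (or near-unit) vector from [-1,1]^3 into the non-negative
// [0,2]^3 cube so the ASTC encoder never sees negative values.
std::array<float, 3> EncodeForASTC(const std::array<float, 3>& v)
{
    return { v[0] + 1.0f, v[1] + 1.0f, v[2] + 1.0f };
}

// The matching decode after sampling the 3D texture: original = stored - 1.
std::array<float, 3> DecodeFromASTC(const std::array<float, 3>& stored)
{
    return { stored[0] - 1.0f, stored[1] - 1.0f, stored[2] - 1.0f };
}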
Table 2.2. ASTC 3D texture compression examples with various block sizes.
The simulation itself runs in a vertex shader, using the transform feedback feature available through OpenGL ES 3.0. For the purpose of the demo, we implemented a simple Euler integration, and each shader execution computes one step of the integration. This implementation is good enough for the demo, but for more advanced purposes a variable time step can be used, and each shader execution can split this time step further and compute smaller integration sub-steps inside the shader itself.
The physical state at step N + 1 is therefore a function of the state at step N and the time delta (Δt) that elapsed between the simulation steps. Because position, velocity, and acceleration are related through their time dependency, this method is suitable for use in our simulation.
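Written out, such an explicit Euler step takes the standard form below (p is position, v velocity, a acceleration); the shader excerpt later in this section implements the position update in this form, with the Δt and Δt² factors (and any constant coefficients) precomputed into the uDeltaT and uDeltaTSquared uniforms:

p_{N+1} = p_N + v_N\,\Delta t + \tfrac{1}{2} a_N\,\Delta t^2, \qquad v_{N+1} = v_N + a_N\,\Delta t.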
glGenBuffers(2, m_XformFeedbackBuffers);

glBindBuffer(GL_ARRAY_BUFFER, m_XformFeedbackBuffers[0]);
glBufferData(GL_ARRAY_BUFFER,
             sizeof(XFormFeedbackParticle) * totalNumberOfParticles,
             NULL,
             GL_STREAM_DRAW);

glBindBuffer(GL_ARRAY_BUFFER, m_XformFeedbackBuffers[1]);
glBufferData(GL_ARRAY_BUFFER,
             sizeof(XFormFeedbackParticle) * totalNumberOfParticles,
             NULL,
             GL_STREAM_DRAW);

// Initialize the first buffer with the particles
// data from the emitters
unsigned int offset = 0;
for (unsigned int i = 0; i < m_Emitters.Length(); i++)
{
    glBindBuffer(GL_ARRAY_BUFFER, m_XformFeedbackBuffers[0]);
    glBufferSubData(GL_ARRAY_BUFFER,
                    offset,
                    m_Emitters[i]->MaxParticles() * sizeof(XFormFeedbackParticle),
                    m_Emitters[i]->Particles());
    offset += m_Emitters[i]->MaxParticles() * sizeof(XFormFeedbackParticle);
}

glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER,
                 0,
                 m_XformFeedbackBuffers[1]);
2. Set which buffer is the source buffer and how the data is stored in it.
glEnableVertexAttribArray(m_ParticlePositionLocation);
glEnableVertexAttribArray(m_ParticleVelocityLocation);
glEnableVertexAttribArray(m_ParticleAttribLocation);
glEnableVertexAttribArray(m_ParticleLifeLocation);

// We store in one buffer the 4 fields that represent a particle:
//   Position: 3 float values for a total of 12 bytes
//   Velocity: 3 float values for a total of 12 bytes
//   Attrib:   4 float values for a total of 16 bytes (matching the offsets below)
//   Life:     1 float value for a total of 4 bytes
glVertexAttribPointer(m_ParticlePositionLocation,
                      3,
                      GL_FLOAT,
                      GL_FALSE,
                      sizeof(XFormFeedbackParticle),
                      (void *)0);
glVertexAttribPointer(m_ParticleVelocityLocation,
                      3,
                      GL_FLOAT,
                      GL_FALSE,
                      sizeof(XFormFeedbackParticle),
                      (void *)12);
glVertexAttribPointer(m_ParticleAttribLocation,
                      4,
                      GL_FLOAT,
                      GL_FALSE,
                      sizeof(XFormFeedbackParticle),
                      (void *)24);
glVertexAttribPointer(m_ParticleLifeLocation,
                      1,
                      GL_FLOAT,
                      GL_FALSE,
                      sizeof(XFormFeedbackParticle),
                      (void *)40);
3. Enable transform feedback and disable the rasterizer step. The former is
done using the glBeginTransformFeedback function to inform the OpenGL
pipeline that we are interested in saving the results of the vertex shader
execution. The latter is achieved using the GL_RASTERIZER_DISCARD flag
specifically added for the transform feedback feature. This flag disables
the generation of fragment jobs so that only the vertex shader is executed.
We disabled fragment execution because the rendering of the particles required two different approaches depending on the scene rendered, and splitting the simulation from the rendering gave us a cleaner code base to work with.
glEnable(GL_RASTERIZER_DISCARD);
glBeginTransformFeedback(GL_POINTS);
// ... issue the draw call that runs the simulation vertex shader here,
//     e.g., a glDrawArrays(GL_POINTS, ...) over all particles ...
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);
F_d = −6πηrv,
where η is the dynamic viscosity coefficient of the air, equal to 18.27 μPa·s, r is the radius of the particle (we used 5 μm in our simulation), and v is the velocity of the particle. Since the first part of the product remains constant, we computed it in advance to avoid computing it per particle.
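For instance, the constant factor of the drag force can be evaluated once on the host and uploaded as a uniform, leaving only a multiplication by the per-particle velocity in the shader. A small C++ sketch (the struct and constant names are illustrative, not the chapter's actual code):

// Precompute the constant part k = 6 * pi * eta * r of Stokes' drag so that
// only F_d = -k * v remains to be evaluated per particle.
struct StokesDrag
{
    float k; // drag coefficient in N*s/m

    StokesDrag(float dynamicViscosity /* Pa*s */, float particleRadius /* m */)
    {
        const float kPi = 3.14159265358979f;
        k = 6.0f * kPi * dynamicViscosity * particleRadius;
    }
};

// Values from the text: air viscosity ~18.27e-6 Pa*s, particle radius 5e-6 m.
static const StokesDrag kAirDragOnDust(18.27e-6f, 5.0e-6f);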
vec3 totalAcceleration = totalForce / uParticleMass;
oparticlePos_worldSpace = iparticle_Pos +
                          (iparticle_Vel * uDeltaT) +
                          (totalAcceleration * uDeltaTSquared);
The new position is then transformed using the transformation matrix derived from the bounding box of the model. This matrix is computed so that the bounding-box minimum maps to the origin (0, 0, 0) of the reference frame and the area of world space inside the bounding box is mapped to the unit cube (0, 0, 0)–(1, 1, 1). Applying this matrix to the particle's position in world space gives us the particle's coordinate in a space with the origin at the minimum corner of the bounding box, scaled according to the dimensions of the model. This means that particles whose bounding-box-space position lies within (0, 0, 0) and (1, 1, 1) have a chance to collide with the object, and this position is also the 3D texture coordinate we will use to sample the 3D texture of the model.
• Host code.
uBoundingBoxMatrix = ( (1.0/(max.x - min.x), 0.0, 0.0, -min.x),
                       (0.0, 1.0/(max.y - min.y), 0.0, -min.y),
                       (0.0, 0.0, 1.0/(max.z - min.z), -min.z),
                       (0.0, 0.0, 0.0, 1.0) )
                     * inverse(ModelMatrix);
• Vertex shader.
The surface’s normal will be encoded in a 32-bit field and stored to be used
later in the rendering pass to orient the particles in case of collisions. Due to
the discrete nature of the simulation, it can happen that a particle goes inside
the object. We recognize this event when sampling the 3D texture since we store
a flag plus other data in the alpha channel of the 3D texture. When this event
happens, we use the gradient direction stored in the 3D texture plus the amount
of displacement that needs to be applied and we “push” the particle to the nearest
surface. The push is applied to the particles in the bounding-box space, and the
inverse of the uBoundingBoxMatrix is then used to move the particles back to the
world space. Discrete time steps can cause issues when colliding with completely
planar surfaces since a sort of swinging can appear, but at interactive speeds
(≥ 30 fps), this is almost unnoticeable. For particles colliding with the surface
of the object, we compute the new velocity direction and magnitude using the
previous velocity magnitude, the surface normal, the surface tangent direction,
and a bouncing resistance to simulate different materials and particle behavior.
We use the particle's mass as a sliding factor so that heavier particles will bounce while lighter particles such as dust and smoke will slide along the surface. A
check needs to be performed for the tangent direction since the normal and the
velocity can be parallel, and in that case, the cross product will give an incorrect
result (see Listing 2.3).
The velocity is then used to move the particle to its new position. Because
we want to avoid copying memory within the GPU and CPU, the lifetime of
all the particles should be managed in the shader itself. This means we check whether the lifetime has reached 0 and, if so, reinitialize the particle attributes such as the initial position, initial velocity, and total particle duration.
float slidingFactor = clamp(uParticleMass, 0.0, 1.0);
vec3 velocityDir = normalize(iparticle_Vel);
vec3 tangentDir = cross(surfaceNormal.xyz, velocityDir);
To make the simulation more interesting, some randomness can be added while the particles are flowing and no collision has occurred. The fragment shader of the simulation is actually empty. This is understandable since we do not need to execute any fragment work to obtain the simulation results; moreover, we have enabled GL_RASTERIZER_DISCARD so that no fragment work is executed. Unlike desktop OpenGL, OpenGL ES requires a fragment shader to be attached to the program even if it is not going to be used.
1. Render the object without color enabled so that its depth is stored in the
depth buffer. We need to do this step to prevent particles behind the object
(from the point of view of the light) from casting shadows on the object.
2. Render the particles with depth testing on, but not depth writing.
3. Render the object normally using the texture generated at Step 2 for the
shadows.
5. Render the floor with the result of Step 4 for the shadows.
This approach can be optimized. For example, we can use two different framebuffers for the shadows on the floor and on the object so that we avoid incremental renderings (refer to [Harris 14] for more information). To achieve this, we copy the texture created at the end of Step 2 into the other framebuffer and then render the object as a shadow on it.
#extension GL_ARM_shader_framebuffer_fetch_depth_stencil : enable

#ifdef GL_ARM_shader_framebuffer_fetch_depth_stencil
    float dla = (2.0 * uNear) /
                (uFar + uNear - gl_LastFragDepthARM * (uFar - uNear));
#else
    // Texture read fallback
#endif
This feature makes it easier to achieve soft particles, and in the demo, we use a
simple approach. First, we render all the solid objects so that the Z-value will be
written into the depth buffer. Afterward, when rendering the smoke, we read the depth value of the object, compare it with the depth of the current particle fragment (to see whether the fragment is behind the object), and fade the color accordingly. This technique
eliminates the sharp profile that is formed by the particle quad intersecting the
geometry due to the Z-test. During development, the smoke effect looked nice,
but we wanted it to be more dense and blurry. To achieve all this, we decided to
render the smoke in an offscreen render buffer with a lower resolution compared
to the main screen. This gives us the ability to have a blurred smoke (since the
lower resolution removes the higher frequencies) as well as lets us increase the
number of particles to get a denser look. The current implementation uses a
640 × 360 offscreen buffer that is up-scaled to 1080p resolution in the final image.
2. Implementing a GPU-Only Particle-Collision System with ASTC 3D Textures and OpenGL ES 3.0 381
A naïve approach causes jaggedness on the outline of the object when the smoke
is flowing near it due to the blending of the up-sampled low-resolution buffer.
To minimize this effect, we apply a bilateral filter. The bilateral filter is applied
to the offscreen buffer and is given by the product of a Gaussian filter in the
color texture and a linear weighting factor given by the difference in depth. The
depth factor is useful on the edge of the model because it gives a higher weight
to neighbor texels with depth similar to the one of the current pixel and lower
weight when this difference is higher. (If we consider a pixel on the edge of a
model, some of the neighbor pixels will still be on the model while others will be
far in the background.)
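The weight used for each neighbor texel in such a filter could look like the following sketch (plain C++; the linear depth falloff and the parameter names are assumptions, since the exact kernel used in the demo is not shown here):

#include <algorithm>
#include <cmath>

// Combined bilateral weight for one neighbor texel: a spatial Gaussian on the
// color buffer multiplied by a linear falloff in the depth difference.
float BilateralWeight(float dx, float dy,      // offset from the center, in texels
                      float depthCenter,       // linear depth of the center pixel
                      float depthNeighbor,     // linear depth of the neighbor texel
                      float sigmaSpatial,      // Gaussian radius of the spatial term
                      float depthFalloff)      // how quickly weight drops with depth difference
{
    float spatial = std::exp(-(dx * dx + dy * dy) /
                             (2.0f * sigmaSpatial * sigmaSpatial));
    float depth = std::max(0.0f, 1.0f - depthFalloff *
                                        std::fabs(depthCenter - depthNeighbor));
    return spatial * depth;
}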
1. Bind the buffers that we will use as the template source data.
// Set up quad texture coordinate buffer
glBindBuffer(GL_ARRAY_BUFFER, m_TexCoordBuffer);
glEnableVertexAttribArray(m_QuadTexCoordLocation);
glVertexAttribPointer(m_QuadTexCoordLocation,
                      2,
                      GL_FLOAT,
                      GL_FALSE,
                      0,
                      (void *)0);
2. Set a divisor for each vertex attribute array. The divisor specifies how the
vertex attributes advance in the array when rendering instances of primi-
tives in a single draw call. Setting it to 0 will make the attribute advance
once per vertex, restarting at the start of each instance rendered. This is
what we want to happen for the initial quad position and texture coordinate
since they will be the same for each particle (instance) rendered.
glVertexAttribDivisor ( m_QuadPositionLocation , 0 ) ;
glVertexAttribDivisor ( m_QuadTexCoordLocation , 0 ) ;
3. For the attributes computed in the simulation step, we would like to shift
the vertex buffer index for each of the particles (instances) to be rendered.
This is achieved using a divisor other than zero. The divisor then specifies
how many instances should be rendered before we advance the index in the
arrays. In our case, we wanted to shift the attributes after each instance is
rendered, so we used a divisor of 1.
glVertexAttribDivisor ( m_UpdatedParticlePosLocation , 1 ) ;
glVertexAttribDivisor ( m_UpdatedParticleLifeLocation , 1 ) ;
glVertexAttribDivisor ( m_UpdatedParticleAttribLocation , 1 ) ;
4. Bind the buffer that was output from the simulation step. Set up the vertex
attributes to read from this buffer.
glVertexAttribPointer(m_UpdatedParticlePosLocation,
                      3,
                      GL_FLOAT,
                      GL_FALSE,
                      sizeof(XFormFeedbackParticle),
                      (void *)0);
glVertexAttribPointer(m_UpdatedParticleAttribLocation,
                      4,
                      GL_FLOAT,
                      GL_FALSE,
                      sizeof(XFormFeedbackParticle),
                      (void *)24);
glVertexAttribPointer(m_UpdatedParticleLifeLocation,
                      1,
                      GL_FLOAT,
                      GL_FALSE,
                      sizeof(XFormFeedbackParticle),
                      (void *)40);
5. Render the particles (instances). The draw function lets us specify how many vertices belong to each instance and how many instances we want to render; a minimal sketch of such a call appears after Step 6. Note that when using instancing, we are able to access the built-in variable gl_InstanceID inside the vertex shader. This variable specifies the ID of the instance currently being rendered and can be used to index into uniform buffers.
6. Always set the divisor for all the vertex attribute arrays back to 0, since a nonzero divisor can affect subsequent rendering even when instancing is not used.
glDisableVertexAttribArray( m_QuadPositionLocation ) ;
glDisableVertexAttribArray( m_QuadTexCoordLocation ) ;
glVertexAttribDivisor ( m_UpdatedParticlePosLocation , 0 ) ;
glVertexAttribDivisor ( m_UpdatedParticleAttribLocation , 0 ) ;
glVertexAttribDivisor ( m_UpdatedParticleLifeLocation , 0 ) ;
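For reference, here is a minimal sketch of the instanced draw described in Step 5; glDrawArraysInstanced is the OpenGL ES 3.0 entry point for this kind of call, while the vertex count and primitive type below are placeholder choices:

#include <GLES3/gl3.h>

// Issue one instanced draw: each instance re-uses the 4-vertex quad template,
// while the per-particle attributes advance once per instance (divisor = 1).
void DrawParticlesInstanced(GLsizei particleCount)
{
    const GLint   firstVertex     = 0;
    const GLsizei verticesPerQuad = 4; // quad rendered as a triangle strip
    glDrawArraysInstanced(GL_TRIANGLE_STRIP, firstVertex,
                          verticesPerQuad, particleCount);
    // Inside the vertex shader, gl_InstanceID identifies the particle being
    // drawn and can be used to index uniform buffers or per-instance data.
}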
2.5 Conclusion
Combining OpenGL ES 3.0 features enabled us to realize a GPU-only particle
system that is capable of running at interactive speeds on current mobile devices.
The techniques proposed are experimental and have some drawbacks, but the
reader can take inspiration from this chapter and explore other options using
ASTC LDR/HDR/3D texture as well as OpenGL ES 3.0. In case there is need to
sort the particles, the compute shader feature recently announced in the OpenGL
ES 3.1 specification will enable sorting directly on the GPU.
An issue arising from the use of a texture is its resolution. This technique can describe a whole static 3D scene in a single 3D texture, but the resolution needs to be chosen carefully: too small a resolution can cause parts of objects not to collide properly, because multiple surface parts with different normals end up stored in the same voxel. Space is also wasted if the voxelized 3D scene contains regions with no actual geometry in them that nevertheless fall inside the voxelized volume. Finally, since we are simulating with a discrete time step, issues can appear if we change the system too quickly; for example, we can miss collision detection in narrow parts of the object if we rotate it too fast.
Bibliography
[ARM Mali 15a] ARM Mali. “ASTC Evaluation Codec.” http://malideveloper.
arm.com/develop-for-mali/tools/astc-evaluation-codec, 2015.
[ARM Mali 15b] ARM Mali. “Mali Developer Center.” http://malideveloper.
arm.com, 2015.
[ARM Mali 15c] ARM Mali. “Mali GPU Texture Compression Tool.”
http://malideveloper.arm.com/develop-for-mali/tools/asset-creation/
mali-gpu-texture-compression-tool/, 2015.
[Björge 14] Marius Björge. “Bandwidth Efficient Graphics with ARM Mali
GPUs.” In GPU Pro 5: Advanced Rendering Techniques, edited by Wolfgang
Engel, pp. 275–288. Boca Raton, FL: CRC Press, 2014.
[Harris 14] Peter Harris. “Mali Performance 2: How to Correctly Han-
dle Framebuffers.” ARM Connected Community, http://community.
arm.com/groups/arm-mali-graphics/blog/2014/04/28/mali-graphics
-performance-2-how-to-correctly-handle-framebuffers, 2014.
[Morris 13] Dan Morris. “Voxelizer: Floodfilling and Distance Map Generation
for 3D Surfaces.” http://techhouse.brown.edu/~dmorris/voxelizer/, 2013.
[Nystad et al. 12] J. Nystad, A. Lassen, A. Pomianowski, S. Ellis, and T. Olson.
“Adaptive Scalable Texture Compression.” In Proceedings of the Fourth
ACM SIGGRAPH/Eurographics Conference on High-Performance Graph-
ics, pp. 105–114. Aire-la-ville, Switzerland: Eurographics Association, 2012.
[Smith 14] Stacy Smith. “Adaptive Scalable Texture Compression.” In GPU Pro
5: Advanced Rendering Techniques, edited by Wolfgang Engel, pp. 313–326.
Boca Raton, FL: CRC Press, 2014.
[Wikipedia 15] Wikipedia. “Stokes’ Law.” http://en.wikipedia.org/wiki/Stokes%27_law, 2015.
3
Animated Characters with Shell Fur for Mobile Devices
3.1 Introduction
Fur effects have traditionally presented a significant challenge in real-time graph-
ics. On the desktop, the latest techniques employ DirectX 11 tessellation to
dynamically create geometric hair or fur strands on the fly that number in the
hundreds of thousands [Tariq and Bavoil 08, Lacroix 13]. On mobile platforms, de-
velopers must make do with a much smaller performance budget and significantly
reduced memory bandwidth. To compound this, mobile devices are increasingly
featuring equal or higher resolution screens than the average screens used with
desktop systems.
Many artists are today able to create very detailed models of creatures with
advanced animations to be used in 3D applications. This chapter will describe
a system to animate and render fully detailed meshes of these creatures with a
shell fur effect in real time on mobile platforms. This is made possible by utilizing
new API features present in OpenGL ES 3.0, including transform feedback and
instancing.
We used this technique in the creation of our SoftKitty technical demo, which
was first shown at Mobile World Congress 2014. It enabled a high-polygon
model of a cat to be animated with 12-bone-per-vertex skinning and then ren-
dered with shell fur at native resolution on an Apple iPad Air. Thanks to the
optimizations in this chapter, the device was able to render the cat and a high
detail environment in excess of 30 fps.
3.2 Overview
This approach is an optimization of the shell fur technique presented by [Kajiya
and Kay 89]. Traditionally, combining a shell fur effect with a skinned mesh
would require the skinned positions to be recomputed for every layer of fur. In
addition to this, there would be a separate draw call per layer, resulting in the base
mesh being transferred to the GPU multiple times per frame. This is inefficient
and, depending on model complexity, possibly not viable on bandwidth-limited
platforms.
This approach avoids these issues by first skinning the mesh in a separate
transform feedback pass and then using instancing to submit the mesh and create
the offset layers of fur with a single draw call. We have also simplified the design
of the textures used to create the fur, transitioning from one texture per layer
to a single texture for all. There are two approaches to implementing this, the
choice of which is decided by model complexity and platform limitations. (See
Figure 3.1.)
glGenTransformFeedbacks(1, &m_TransformFeedbackObject);
glGenBuffers(BONE_BATCHES + 1, m_ModelDataBuffer);
// m_ModelDataBuffer[0] is the output buffer
glGenBuffers(BONE_BATCHES, m_SkinningDataBuffer);

// Load each batch of vertices into its own buffer
for (unsigned int Batch = 0; Batch < BONE_BATCHES; ++Batch)
{
    // Calculate or retrieve BatchVertexCount
    InputModelData = new ModelDataStruct[BatchVertexCount];
    for (int i = 0; i < BatchVertexCount; ++i)
    {
        // Copy data into InputModelData
    }
    glBindBuffer(GL_ARRAY_BUFFER, m_ModelDataBuffer[Batch + 1]);
    glBufferData(GL_ARRAY_BUFFER, sizeof(ModelDataStruct) * BatchVertexCount,
                 InputModelData, GL_STATIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    delete[] InputModelData;
}
Listing 3.1 creates two kinds of input buffer: one for vertices and normals and one for skinning data. These input buffers were created per bone batch, containing only the data specific to that bone batch.
If using a single bone batch, the code path in Listing 3.1 can still be used with a batch count of 1. When using the single-buffer approach, we created our UBO in the following manner:
glGenBuffers(1, &uiUBO);
glBindBuffer(GL_UNIFORM_BUFFER, uiUBO);
uiIndex = glGetUniformBlockIndex(ShaderId, szBlockName);
glUniformBlockBinding(ShaderId, uiIndex, uiSlot);
glBindBufferBase(GL_UNIFORM_BUFFER, uiSlot, uiUBO);
for (unsigned int Batch = 0; Batch < BONE_BATCHES; ++Batch)
{   // Calculate or retrieve BatchVertexCount
    glBindBufferRange(GL_TRANSFORM_FEEDBACK_BUFFER, 0,
                      m_ModelDataBuffer[0],
                      iTotalVerts * sizeof(ModelDataStruct),
                      BatchVertexCount * sizeof(ModelDataStruct));
    glBeginTransformFeedback(GL_POINTS);
    // Enable AttribArrays
    glBindBuffer(GL_ARRAY_BUFFER, m_ModelDataBuffer[Batch + 1]);
    // Set Vertex and Normal Attrib pointers
    glBindBuffer(GL_ARRAY_BUFFER, m_SkinningDataBuffer[Batch]);
    // Set Bone Weight and Index Attrib pointers
    glUniform1i(m_TransformFeedback.uiBoneCount, pMesh.sBoneIdx.n);

#if defined(UBO)
    m_matrixPaletteUBO.UpdateData(m_BoneMatrixPalette[0].ptr());
#else
    glUniformMatrix4fv(m_TransformFeedback.uiBoneMatrices,
                       BONE_PALETTE_SIZE, GL_FALSE,
                       m_BoneMatrixPalette[0].ptr());
#endif
#if defined(UBO)
layout(std140) uniform BoneMatrixStruct
{ highp mat4 BoneMatrixArray[NUM_BONE_MATRICES]; };
#endif

void main()
{
    gl_Position = vec4(inVertex, 1.0); // required
    for (int i = 0; i < BoneCount; ++i)
    {
        // perform skinning normally
    }
    oPosition = position.xyz;
    oNormal = normalize(worldNormal);
}
When issuing the instanced draw call, we also specify the number of instances to draw, which should be the same as the number of layers used in creating the fur texture. We found with our model that, depending on platform and resolution, a count of between 11 and 25 gave good visual results while maintaining workable performance. We bind the TexCoord array to a structure that is created when we load our model from disk. (The vertices have not been reordered, so this data is unchanged by the process.) (See Listing 3.4.)
The shell position is then calculated in the shader as shown in Listing 3.5.
Having calculated a base alpha value per layer in the vertex shader, we sample
the StrandLengthTexture to establish where fur should be drawn and how long
it should be. We leave the base layer solid, and we alpha out strands that the
random distribution decided should have ended:
InstanceID = gl_InstanceID;
oShellDist = float(InstanceID) / (float(LayerCount) - 1.0);
oAlpha = (1.0 - pow(oShellDist, 0.6)); // tweaked for nicer falloff
shellDist = texture(ShellHeightTexture, inTexCoord).r;
highp vec3 shellPos = inVertex + inNormal * oShellDist;
gl_Position = ProjectionFromModel * vec4(shellPos, 1.0);
3.9 Conclusion
In moving from skinning every shell individually to using transform feedback, we
saw a dramatic performance increase. With a low-polygon early test model on
a mobile platform using 18 layers, performance increased from 29 fps to being
Vsync limited at 60 fps. We were then able to increase to 30 layers and maintain
a framerate above 30 fps. When we later incorporated the changes to the fur texture and added instancing, we saw performance rise to 50 fps. With
our final, full-detail model, on a high-performance mobile platform, we were able
to run 17 shells on a 1920 × 1080 display. This gave more than sufficient visual
quality and allowed us to render a surrounding scene and other effects, all in
excess of 30 fps.
We were able to achieve a pleasing result without the additional use of fins,
and our implementation also did not include any force, intersection, or self-
shadowing effects. These are all additional avenues that could be explored on
higher-performance platforms in the future.
Bibliography
[Kajiya and Kay 89] James T. Kajiya and Timothy L. Kay. “Rendering Fur with
Three Dimensional Textures.” In Proceedings of the 16th Annual Conference
on Computer Graphics and Interactive Techniques, pp. 271–280. New York:
ACM Press, 1989.
[Lacroix 13] Jason Lacroix. “Adding More Life to Your Characters with
TressFX.” In ACM SIGGRAPH 2013 Computer Animation Festival, p. 1.
New York: ACM Press, 2013.
[Lengyel et al. 01] Jerome Lengyel, Emil Praun, Adam Finkelstein, and Hugues
Hoppe. “Real-Time Fur over Arbitrary Surfaces.” In Proceedings of the
2001 Symposium on Interactive 3D Graphics, pp. 227–232. New York: ACM
Press, 2001.
[Schüler 09] Christian Schüler. “An Efficient and Physically Plausible Real-Time
Shading Model.” In ShaderX 7, edited by Wolfgang Engel, pp. 175–187.
Boston: Cengage, 2009.
[Tariq and Bavoil 08] Sarah Tariq and Louis Bavoil. “Real Time Hair Simulation
and Rendering on the GPU.” In ACM SIGGRAPH 2008 Talks, Article No. 37. New York: ACM Press, 2008.
4
High Dynamic Range Computational Photography on Mobile GPUs
4.1 Introduction
Mobile GPU architectures have been evolving rapidly, and are now fully pro-
grammable, high-performance, parallel-processing engines. Parallel programming
languages have also been evolving quickly, to the point where open standards such
as the Khronos Group’s OpenCL now put powerful cross-platform programming
tools in the hands of mobile application developers.
In this chapter, we will present our work that exploits GPU computing via
OpenCL and OpenGL to implement high dynamic range (HDR) computational
photography applications on mobile GPUs. HDR photography is a hot topic in
the mobile space, with applications to both stills photography and video.
We explore two techniques. In the first, a single image is processed in order
to enhance detail in areas of the image at the extremes of the exposure. In the
second technique, multiple images taken at different exposures are combined to
create a single image with a greater dynamic range of luminosity. HDR can be
applied to an image to achieve a different goal too: as an image filter to create
a range of new and exciting visual effects in real time, somewhat akin to the
“radioactive” HDR filter from Topaz Labs [Topaz Labs 15].
These HDR computational photography applications are extremely compute-
intensive, and we have optimized our example OpenCL HDR code on a range of
GPUs. In this chapter, we shall also describe the approach that was taken during
code optimization for the ARM Mali mobile GPUs and give the performance
results we achieved on these platforms.
We also share the OpenCL/OpenGL interoperability code we have developed,
which we believe will be a useful resource for the reader, as surprisingly little is publicly available in this area.
Figure 4.1. Images taken with different exposures: (a) −4 stops, (b) −2 stops, (c) +2 stops, and (d) +4 stops. [Image from [McCoy 08].]
4.2 Background
Real-world scenes contain a much higher dynamic range of brightness than can
be captured by the sensors available in most cameras today. Digital cameras use
8 bits per pixel for each of the red, green, and blue channels, therefore storing
only 256 different values per color channel. Real-world scenes, however, can have
a dynamic range on the order of about 10^8 : 1, therefore requiring up to 32 bits
per pixel per channel to represent fully.
To compensate for their relatively low dynamic range (LDR), modern digital
cameras are equipped with advanced computer graphics algorithms for producing
high-resolution images that meet the increasing demand for more dynamic range,
color depth, and accuracy. In order to produce an HDR image, these cameras
either synthesize inputs taken concurrently from multiple lenses with different
exposures, or they take multiple-exposure images in sequential order and combine
them into a single scene. Figure 4.1 shows a set of over- and underexposed images
of a scene that can be captured in such a way.
The synthesis process produces a 32-bit image encoding the full HDR of the
scene. Standard displays, such as computer monitors, TVs, and smartphone or
tablet screens, however, only have a dynamic range of around 256 : 1, which
means that they are not capable of accurately displaying the rendered HDR
image. Therefore, to display HDR images on a standard display, the images first
Figure 4.2. HDR images obtained using (a) global and (b) local tone-mapping operators. [Image from [McCoy 08].]
Figure 4.3. HDR look achieved by Topaz Adjust: (a) original image and (b) HDR image.
Some cameras can acquire multiple exposures simultaneously by pointing several sensors at the same scene with the use of a beam splitter. Unfortunately, most mobile phones and other handheld cameras do not yet come with the multiple lenses that would be required to acquire multiple-exposure images in real time. For this reason, the HDR TMOs we present in this chapter not only perform well on 32-bit HDR images but also bring out details in a single-exposure LDR image, giving it an HDR look.
Figure 4.3 shows the results of an HDR effect on a single image as obtained by Topaz Adjust, a plug-in for Adobe Photoshop [Topaz Labs 15]. The plug-in is able to enhance local gradients that are hard to see in the original image. Furthermore, photographers often manually apply a pseudo-HDR effect to an LDR image to make it more aesthetically pleasing. One way to achieve such a pseudo-HDR effect, as described by Kim Y. Seng [Seng 10], is to create under- and overexposed versions of a well-exposed LDR image. Seng then uses these artificial under- and overexposed images as the basis for creating a 32-bit HDR image before tone mapping it using a TMO.
On mobile devices, the GPU's greater energy efficiency can translate into improved battery life, and thus using general-purpose computing on GPUs (GPGPU) has become a hot topic for mobile applications.
One aim of this chapter is to describe an efficient, open source implementation
of a pipeline that can be used to capture camera frames and display output
of HDR TMOs in real time. The second aim of the example presented in this
chapter is to demonstrate an efficient code framework that minimizes the amount
of time taken to acquire the camera frames and render the display to output. The
pipeline should be such that it can be used for any image-processing application
that requires input from a camera and renders the output to a display. This
pipeline should also make it possible to create HDR videos.
We present our example pipeline in OpenCL to serve as a real, worked ex-
ample of how to exploit GPU computing in mobile platforms. We also exploit
OpenCL/OpenGL interoperability with the goal of equipping the reader with a
working template from which other OpenCL/OpenGL applications can be quickly
developed.
Reinhard’s global TMO. The tonal range of an image describes the number of
tones between the lightest and darkest part of the image. Reinhard et al. im-
plemented one of the most widely used global TMOs for HDRI, which computes
the tonal range for the output image [Reinhard et al. 02]. This tonal range is
computed based on the logarithmic luminance values in the original images.
The algorithm first computes the average logarithmic luminance of the entire
image. This average, along with another parameter, is then used to scale the
original luminances. Then, to further allow for more global contrast in the image, this approach often lets the high luminances “burn out” by clamping them to pure white. This burning-out step is accomplished by choosing the smallest luminance value that should be burnt out and then scaling all of the pixels accordingly.
For many HDR images, this operator is sufficient to preserve details in low-
contrast areas, while compressing higher luminances to a displayable range. How-
ever for very high dynamic range images, especially where there is varying local
contrast, important detail can still be lost.
Reinhard’s global TMO uses the key value of the scene to set the tonal range
for the output image. The key of a scene can be approximated using the loga-
rithmic average luminance L̄w :
\bar{L}_w = \exp\left(\frac{1}{N}\sum_{x,y}\log\left(\delta + L_w(x, y)\right)\right),
where Lw (x, y) is the luminance of pixel (x, y), N is the total number of pixels
in the image, and δ is a very small value to avoid taking the logarithm of 0 in
case there are pure black pixels in the image. Having approximated the key of
the scene, we need to map this to middle-gray. For well-lit images, Reinhard
proposes a value of 0.18 as middle-gray on a scale of 0 to 1, giving rise to the
following equation:
L(x, y) = \frac{a}{\bar{L}_w}\, L_w(x, y), \qquad (4.1)
where L(x, y) is the scaled luminance and a = 0.18. Just as in film-based pho-
tography, if the image has a low key value, we would like to map the middle-gray
value, i.e, L̄w , to a high value of a to bring out details in the darker parts of the
image. Similarly, if the image has a high key value, we would like to map L̄w to a
lower value of a to get contrast in the lighter parts of the scene. In most natural
scenes, occurrences of high luminance values are quite low, whereas the majority
of the pixel values have a normal dynamic range. Equation (4.1) doesn’t take
this into account and scales all the values linearly.
Reinhard’s global TMO can now be defined as
L_d(x, y) = \frac{L(x, y)\left(1 + \frac{L(x, y)}{L_{white}^2}\right)}{1 + L(x, y)}, \qquad (4.2)
where Lwhite is the smallest luminance that we would like to be burnt out. Al-
though Lwhite can be another user-controlled parameter, in this implementation
we will set it to the maximum luminance in the image, Lmax . This will prevent
any burn out; however, in cases where Lmax < 1, this will result in contrast
enhancement, as previously discussed.
The operator has a user-controlled parameter, a. This is the key value and
refers to the subjective brightness of a scene: the middle-gray value that the scene
is mapped to. Essentially, setting a to a high value has an effect of compressing
the dynamic range for darker areas, thus allowing more dynamic range for lighter
areas and resulting in more contrast over that region. Similarly, decreasing a
reduces the dynamic range for lighter areas and shows more contrast in darker
parts of a scene. Since the brightness of a scene is very much subjective to the
photographer, in this implementation a will be a controllable parameter that can
be changed by the user.
The global TMO is one of the most widely implemented TMOs because of its
simplicity and effectiveness. It brings out details in low-contrast regions while
compressing high luminance values. Furthermore, Equation (4.1) and Equation
(4.2) are performed on each pixel independently, and therefore it is fairly straight-
forward to implement a data parallel version using OpenCL in order to exploit
the compute capability of the GPU.
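To illustrate how little per-pixel work is involved, the following scalar C++ sketch applies Equations (4.1) and (4.2) to a single luminance value; it is a direct transcription of the formulas rather than the chapter's OpenCL kernel, and the function name is illustrative.

// Apply Reinhard's global TMO to one world luminance value Lw.
// logAvgLum is the log-average luminance of the image, a is the key value
// (e.g., 0.18), and Lwhite is the smallest luminance to be mapped to white.
float ReinhardGlobal(float Lw, float logAvgLum, float a, float Lwhite)
{
    float L  = (a / logAvgLum) * Lw;                 // Equation (4.1)
    float Ld = L * (1.0f + L / (Lwhite * Lwhite))    // Equation (4.2)
               / (1.0f + L);
    return Ld;
}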
Reinhard’s local TMO. Although the global TMO works well in bringing out de-
tails in most images, detail is still lost for very high dynamic range images. Rein-
hard’s local TMO proposes a tone reproduction algorithm that aims to emphasize
these details by applying dodging and burning.
Dodging and burning is a technique used in traditional photography that
involves restraining light (dodging) or adding more light (burning) to parts of
the print during development. Reinhard et al. extended this idea for digital
images by automating the process for each pixel depending on its neighborhood.
This equates to finding a local key, i.e., a in Equation (4.1), for each pixel, which
can then be used to determine the amount of dodging and burning needed for the
region. Along with the key value a, the size of each region can also vary depending
on the contrast in that area of the image. This size depends on the local contrast
of the pixel. To find the optimal size region over which to compute a, Reinhard’s
approach uses a center-surround function at multiple scales. Center-surround
functions are often implemented by subtracting two Gaussian blurred images. For
this TMO, Reinhard chose to implement the center-surround function proposed
for Blommaert’s model for brightness perception [Blommaert and Martens 90].
This function is constructed using Gaussian profiles of the form
R_i(x, y, s) = \frac{1}{\pi(\alpha_i s)^2}\exp\left(-\frac{x^2 + y^2}{(\alpha_i s)^2}\right).
This circularly symmetric profile is constructed at different scales s around each pixel. The center and surround responses are then obtained by convolving the scaled luminance with these profiles:
V_i(x, y, s) = L(x, y) \otimes R_i(x, y, s). \qquad (4.3)
Because the response requires convolving two functions, the convolution can either be performed in the spatial domain or, for improved efficiency, computed as a multiplication in the Fourier domain. The example HDR GPU pipeline described in this chapter
makes use of mipmaps as an alternative to the Gaussian profile. Equation (4.4)
is the final building block required for Blommaert's center-surround function:
V(x, y, s) = \frac{V_1(x, y, s) - V_2(x, y, s)}{2^{\phi} a / s^2 + V_1(x, y, s)}, \qquad (4.4)
where V1 is the center response function and V2 is the surround response function
obtained using Equation (4.3). The 2^φ a/s² term in the denominator prevents V from getting too large when V_1 approaches zero. The motive behind having V_1 in the denominator is discussed later. As in the global TMO, a is the key value of the scene, φ is a sharpening parameter, and s is the scale used to
compute the response function.
The center-surround function expressed in Equation (4.4) is computed over
several scales s to find the optimal scale sm . This equates to finding the suitably
sized neighborhood for each pixel, and therefore plays an important role in the
dodging-and-burning technique. An ideal-sized neighborhood would have very
little contrast changes in the neighborhood itself; however, the area surrounding
the neighborhood would have more contrast. The center-surround function com-
putes the difference between the center response V1 and surround response V2 .
For areas with similar luminance values, these will be much the same, however
they will differ in higher-contrast regions. Starting at the lowest scale, the local TMO selects the first (i.e., smallest) scale s_m such that
|V(x, y, s_m)| < \epsilon, \qquad (4.5)
where ε is a threshold chosen by the user. The local operator is then defined as
L_d(x, y) = \frac{L(x, y)}{1 + V_1(x, y, s_m)}. \qquad (4.6)
A dark pixel in a relatively bright area will satisfy L < V_1. In such cases, Equation (4.6) will decrease the L_d of that pixel, which has the effect of increasing the contrast at that pixel.
Figure 4.4. The OpenCL platform model with a single host and multiple devices.
Each device has one or more compute units, each of which has one or more processing
elements.
The platform model is presented in Figure 4.4 and consists of a host and one
or more devices. The host is a familiar CPU-based system supporting file I/O,
user interaction, and other functions expected of a system. The devices are where
the bulk of the computing takes place in an OpenCL program. Example devices
include GPUs, many-core coprocessors, and other devices specialized to carry out
the OpenCL computations. A device consists of one or more compute units (CUs)
each of which presents the programmer with one or more processing elements
(PEs). These processing elements are the finest-grained units of computation
within an OpenCL program.
The platform model gives programmers a view of the hardware they can use
when optimizing their OpenCL programs. Then, by understanding how the plat-
form model maps onto different target platforms, programmers can optimize their
software without sacrificing portability.
OpenCL programs execute as a fine-grained SPMD (single program, multiple
data) model. The central idea behind OpenCL is to define an index space of
one, two, or three dimensions. Programmers map their problem onto the indices
of this space and define a block of code, called a kernel, an instance of which runs
at each point in the index space.
Consider the matrix multiplication OpenCL kernel in Listing 4.1. Here we
have mapped the outermost two loops of the traditional sequential code onto a
2D index space and run the innermost loop (over k) within a kernel function.
We then ran an instance of this kernel function, called a work item in OpenCL
terminology, for each point in the index space.
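A kernel of the kind being described might look like the following sketch, shown here as a C++ string ready to be passed to clCreateProgramWithSource; it is illustrative and not necessarily identical to Listing 4.1.

// A minimal OpenCL matrix-multiplication kernel: the two outer loops are
// replaced by a 2D NDRange, and only the innermost loop remains in the kernel.
static const char* kMatMulKernel = R"CLC(
__kernel void matmul(const int N,
                     __global const float* A,
                     __global const float* B,
                     __global float* C)
{
    int i = get_global_id(0);   // row index in the 2D index space
    int j = get_global_id(1);   // column index in the 2D index space
    if (i < N && j < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)   // innermost loop runs inside the work item
            acc += A[i * N + k] * B[k * N + j];
        C[i * N + j] = acc;
    }
}
)CLC";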
Figure 4.5. A problem is decomposed onto the points of an N-dimensional index space (N = 1, 2, or 3), known in OpenCL as an NDRange. A kernel instance runs at each point in the NDRange to define a work item. Work items are grouped together into work groups, which evenly tile the full index space.
Figure 4.6. The memory model in OpenCL 1.X and its relationship to the platform model. Here, P devices exist in a single context and therefore have visibility into the global/constant memory.
The memory model (Figure 4.6) defines a hierarchy of memory regions, including host memory, global memory, and constant memory. Global and constant memories hold OpenCL memory objects
and are visible to all the OpenCL devices involved in a computation (i.e., within
the context defined by the programmer). The onboard DRAM of a discrete
GPU or FPGA will typically be mapped as global memory. It is worth noting
that, for discrete devices, moving data between host memory and global memory
usually requires transferring data across a bus, such as PCI Express, which can
be relatively slow.
Within an OpenCL device, each compute unit has a region of memory local
to the compute unit called local memory. This local memory is visible only to
the processing elements within the compute unit, which maps nicely onto the
OpenCL execution model, with one or more work groups running on a compute
unit and one or more work items running on a processing element. The local
memory within a compute unit corresponds to data that can be shared inside a
work group. The final part of the OpenCL memory hierarchy is private memory,
which defines a small amount of per work-item memory visible only within a work
item.
Another important OpenCL buffer type for any application that wants to mix OpenCL and OpenGL functionality is the textured image buffer. These
are available in 2D and 3D and are a global memory object optimized for image
processing, supporting multiple image formats and channels. There is a one-to-
one correspondence between an OpenCL textured image and certain OpenGL
textures. In fact, as discussed later, this correspondence can be taken advantage
of to optimize the framework we will present in this chapter.
Data movement among the layers in the memory hierarchy in OpenCL is
explicit—that is, the user is responsible for the transfer of data from host mem-
ory to global memory and so on. Commands in the OpenCL API and kernel
programming language must be used to move data from host memory to global
memory, and from global memory to either local or private memory.
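For example, moving input data from host memory into global memory is an explicit pair of API calls, sketched below in C++ (error handling omitted; the helper name is illustrative):

#include <CL/cl.h>

// Copy numBytes of host data into a newly created global-memory buffer.
cl_mem UploadToGlobalMemory(cl_context context, cl_command_queue queue,
                            const void* hostData, size_t numBytes)
{
    cl_int err = CL_SUCCESS;
    cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_ONLY, numBytes, NULL, &err);
    // Blocking write: returns once the data has been copied to global memory.
    clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, numBytes, hostData,
                         0, NULL, NULL);
    return buffer;
}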
Although the camera input is stored in the GPU's memory, existing image-processing applications tend to transfer the data to the host device (the CPU), where they serially process the data and render it to the display using OpenGL. Clearly this process can cause several inefficient transfers of data back and forth between the CPU and GPU.
What is required is an approach that avoids any unnecessary memory transfers
between the GPU’s memory and the host’s memory. OpenCL/OpenGL interop-
erability supports this approach. Input from the camera can be acquired in the
form of an OpenGL ES texture using Android’s SurfaceTexture object. OpenCL
then allows a programmer to create a textured image from an OpenGL texture,
which means that the camera data doesn’t need to be transferred to the host,
instead staying resident in the GPU from image acquisition all the way to ren-
dering the output of the OpenCL kernels to the screen. Furthermore, even on
the GPU, the data doesn’t actually move as we switch between OpenCL and
OpenGL; instead it just changes ownership from OpenGL to OpenCL and back
again. To achieve this pipeline, interoperability between OpenCL and OpenGL
ES needs to be established.
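Once a shared context exists (see the next section), this change of ownership is expressed with a pair of enqueue calls around the OpenCL work. The sketch below uses clCreateFromGLTexture from OpenCL 1.2 (older versions expose clCreateFromGLTexture2D); the texture target, kernel, and argument index are placeholder assumptions.

#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GLES3/gl3.h>

// Wrap an existing OpenGL ES texture as an OpenCL image, hand ownership to
// OpenCL for the duration of the kernel, then give it back to OpenGL.
void ProcessCameraTexture(cl_context context, cl_command_queue queue,
                          cl_kernel kernel, unsigned int glTextureId)
{
    cl_int err = CL_SUCCESS;
    cl_mem image = clCreateFromGLTexture(context, CL_MEM_READ_ONLY,
                                         GL_TEXTURE_2D, 0, glTextureId, &err);

    clEnqueueAcquireGLObjects(queue, 1, &image, 0, NULL, NULL);   // GL -> CL
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &image);
    // ... enqueue the kernel here (NDRange sizes depend on the image) ...
    clEnqueueReleaseGLObjects(queue, 1, &image, 0, NULL, NULL);   // CL -> GL
    clFinish(queue);

    clReleaseMemObject(image);
}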
4.6.2 EGL
To enable OpenCL and OpenGL ES interoperability, the OpenCL context must
be initialized using the current display and context being used by OpenGL ES.
OpenGL ES contexts are created and managed by platform-specific windowing
APIs. EGL is an interface between OpenGL ES and the underlying windowing
system, somewhat akin to GLX, the X11 interface to OpenGL with which many
readers might already be familiar.
To avoid unnecessary use of memory bandwidth, the implementation makes
use of OpenGL ES to bind the input from the camera to a texture. An OpenGL
ES display and context is then created by acquiring a handle to the Android
Surface. The context and display are then used to create a shared OpenCL
context. This shared context allows OpenCL to have access to the camera texture
and therefore to perform computations upon it. Because the Android OS has
only recently included support for such APIs, to date not many examples have
appeared in this area.
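A shared context of this kind can be created by passing the current EGL context and display as context properties; the sketch below assumes the cl_khr_gl_sharing extension (which defines CL_GL_CONTEXT_KHR and CL_EGL_DISPLAY_KHR) is available and omits device selection and error checking.

#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <EGL/egl.h>

// Create an OpenCL context that shares resources with the OpenGL ES context
// that is current on this thread.
cl_context CreateSharedContext(cl_platform_id platform, cl_device_id device)
{
    cl_context_properties properties[] =
    {
        CL_GL_CONTEXT_KHR,   (cl_context_properties)eglGetCurrentContext(),
        CL_EGL_DISPLAY_KHR,  (cl_context_properties)eglGetCurrentDisplay(),
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        0
    };
    cl_int err = CL_SUCCESS;
    return clCreateContext(properties, 1, &device, NULL, NULL, &err);
}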
Figure 4.7. Pseudo-HDR pipeline that takes multiple input images with a range of contrasts.
where N is the number of LDR images, w(pij ) is the weight of pixel ij, and ti is
the exposure time of the LDR image i.
The serial implementation of Equation (4.7) is straightforward and is therefore
not presented here. The algorithm simply iterates over each pixel, computes its
luminance and weight, and uses those with the image exposure to calculate the
32-bit HDR pixel color.
Clearly the calculation of each pixel is independent of all the others, and so
this natural data parallelism is ideal for a GPU implementation.
Listing 4.2 shows the OpenCL kernel implemented for the image synthesis
process. Since the number of LDR images can vary, the kernel is passed a 1D ar-
ray, LDRimages, containing all the images. The 1D array is of type unsigned char,
which is sufficient to store each 8-bit color value per pixel.
The if statement on line 10 ensures that the work items don’t access out-of-
bound memory. The for loop on line 14 uses Equation (4.7) to synthesize the
LDR pixels from different images into an HDR pixel. Once an HDR pixel is
calculated, it is stored in the 1D array HDRimage. The array HDRimage is of type
float, which is sufficient to store the higher dynamic range of the pixel.
Step 3: Automating contrast adjustment. We now have a 32-bit HDR image. The
three tone-mapping algorithms we have previously described can now be used
to tone map the 32-bit HDR image, producing an 8-bit HDR image that can be
rendered on most displays. Their implementation is discussed in more detail later
on in this chapter.
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE |
                          CLK_FILTER_NEAREST;

// Kernel to perform histogram equalization using the modified
// brightness CDF
kernel void histogram_equalisation(read_only image2d_t input_image,
                                   write_only image2d_t output_image,
                                   __global uint *brightness_cdf)
{
    int2 pos;
    uint4 pixel;
    float3 hsv;
    for (pos.y = get_global_id(1); pos.y < HEIGHT; pos.y += get_global_size(1)) {
        for (pos.x = get_global_id(0); pos.x < WIDTH; pos.x += get_global_size(0)) {
            pixel = read_imageui(input_image, sampler, pos);

            hsv = RGBtoHSV(pixel); // Convert to HSV to get hue and saturation

            hsv.z = ((HIST_SIZE - 1) *
                     (brightness_cdf[(int)hsv.z] - brightness_cdf[0])) /
                    (HEIGHT * WIDTH - brightness_cdf[0]);

            pixel = HSVtoRGB(hsv); // Convert back to RGB with the
                                   // modified brightness for V

            write_imageui(output_image, pos, pixel);
        }
    }
}
input and, allocating one pixel to each work item, compute the brightness
value for each pixel. Once the brightness value is computed, the index
corresponding to that value is incremented in the local histogram array
l_hist. To ensure correct synchronization among different work items, a
barrier call is made just before writing to the shared l_hist array. Once
the l_hist array has been modified, the results are written to the global
partial histogram array. The merge_hist kernel then merges the partial
histograms together. This kernel is executed with global size of 256, so as
to have a one-to-one correspondence between the work items and the indices
of the image histogram. For this last kernel, each work item computes the
sum over all the partial histograms for the index value corresponding to the
work item’s ID. Once the sum is computed, the final histogram value for
this work item is then set to this sum.
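A minimal OpenCL sketch of this approach (not the chapter's exact kernel) is shown below: a per-work-group partial histogram is accumulated in local memory with atomic increments and then flushed to a global array. Names are illustrative, and WIDTH, HEIGHT, and HIST_SIZE are assumed to match the macros used by the other kernels.

#define HIST_SIZE 256

kernel void partial_histogram(read_only image2d_t input_image,
                              global uint *partial_hist)  // num_groups * HIST_SIZE entries
{
  const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE |
                        CLK_FILTER_NEAREST;
  local uint l_hist[HIST_SIZE];

  int lid   = get_local_id(1) * get_local_size(0) + get_local_id(0);
  int lsize = get_local_size(0) * get_local_size(1);

  for (int i = lid; i < HIST_SIZE; i += lsize)   // zero the local histogram
    l_hist[i] = 0;
  barrier(CLK_LOCAL_MEM_FENCE);

  int2 pos = (int2)(get_global_id(0), get_global_id(1));
  if (pos.x < WIDTH && pos.y < HEIGHT) {
    uint4 pixel = read_imageui(input_image, smp, pos);
    uint brightness = max(max(pixel.x, pixel.y), pixel.z);  // V channel of HSV
    atomic_inc(&l_hist[min(brightness, (uint)(HIST_SIZE - 1))]);
  }
  barrier(CLK_LOCAL_MEM_FENCE);

  int group = get_group_id(1) * get_num_groups(0) + get_group_id(0);
  for (int i = lid; i < HIST_SIZE; i += lsize)   // flush this group's partial histogram
    partial_hist[group * HIST_SIZE + i] = l_hist[i];
}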
Cumulative distribution function. Computing the cumulative distribution function
is an operation that is not so well suited for GPGPU, due to the sequential
nature of the algorithm required to compute it. Several OpenCL SDKs provide
parallel prefix-sum (scan) implementations that can be used for this purpose.
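Since the histogram has only 256 bins, the prefix sum itself is also trivial to compute serially; an illustrative sketch in C (not the chapter's code) is:

#define HIST_SIZE 256

/* Running sum over the brightness histogram. With only HIST_SIZE bins this is
   cheap enough to run serially (on the host or in a single work item). */
void computeCdf(const unsigned int hist[HIST_SIZE], unsigned int cdf[HIST_SIZE])
{
  unsigned int running = 0;
  for (int i = 0; i < HIST_SIZE; ++i) {
    running += hist[i];
    cdf[i] = running;
  }
}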
Computing Lmax and L̄w . As discussed previously, the Lmax of a scene is the
largest luminance value, whereas L̄w is the average logarithmic luminance of
a scene. Calculating these values serially is straightforward; however, to obtain
them using an OpenCL kernel, we will need to perform a reduction over the entire
image. As described in [Catanzaro 10], the fastest way to perform a reduction
is a two-stage process. Here, each work item i performs reduction operations
over the array indices i, i + g, i + 2g, . . . , where g is the total number of work
items (the global size).
The result from this equation is then stored in the local array, and reduction
is then performed over this local array. The output of this stage of the reduction
is one partial reduction value for each work group. The second stage of the two-
stage reduction requires execution of a separate kernel, which simply performs
reduction over these partial results.
The input image to the kernel is a 2D texture image, so it is natural to
want to run this kernel in 2D. However, this requires implementing a novel 2D
version of the above two-stage reduction. The main difference is that now each
work item (x, y) performs reduction operations over the image pixels at positions
(x + n gx, y + m gy) for n, m = 0, 1, 2, . . . ,
where gx and gy are the global sizes in the x and y dimensions, respectively.
 1  const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE |
                              CLK_FILTER_NEAREST;
 2
 3  // This kernel computes logAvgLum by performing reduction
 4  // The results are stored in an array of size num_work_groups
 5  kernel void computeLogAvgLum(__read_only image2d_t image,
 6                               __global float *lum,
 7                               __global float *logAvgLum,
 8                               __local float *logAvgLum_loc) {
 9
10    float lum0;
11    float logAvgLum_acc = 0.f;
12
13    int2 pos;
14    uint4 pixel;
15    for (pos.y = get_global_id(1); pos.y < HEIGHT; pos.y += get_global_size(1)) {
16      for (pos.x = get_global_id(0); pos.x < WIDTH; pos.x += get_global_size(0)) {
17        pixel = read_imageui(image, sampler, pos);
18        // lum0 = pixel.x * 0.2126f + pixel.y * 0.7152f + pixel.z * 0.0722f;
19        lum0 = dot(GLtoCL(pixel.xyz), (float3)(0.2126f, 0.7152f, 0.0722f));
20
21        logAvgLum_acc += log(lum0 + 0.000001f);
22        lum[pos.x + pos.y * WIDTH] = lum0;
23      }
24    }
25
26    pos.x = get_local_id(0);
27    pos.y = get_local_id(1);
28    const int lid = pos.x + pos.y * get_local_size(0);  // Local ID in
29                                                        // one dimension
30    logAvgLum_loc[lid] = logAvgLum_acc;
31
32    // Perform parallel reduction
33    barrier(CLK_LOCAL_MEM_FENCE);
34
35    for (int offset = (get_local_size(0) * get_local_size(1)) / 2; offset > 0;
         offset = offset / 2) {
36      if (lid < offset) {
37        logAvgLum_loc[lid] += logAvgLum_loc[lid + offset];
38      }
39      barrier(CLK_LOCAL_MEM_FENCE);
40    }
41
42    // Number of work groups in x dim
43    const int num_work_groups = get_global_size(0) / get_local_size(0);
44    const int group_id = get_group_id(0) + get_group_id(1) * num_work_groups;
45    if (lid == 0) {
46      logAvgLum[group_id] = logAvgLum_loc[0];
47    }
48  }
Each work item computes the sum and maximum of luminances over a range of
image pixels (lines 17–25). This sum and maximum are then stored in local arrays
at an index corresponding to the pixel's position. A wave-front reduction is then
performed over these local arrays (lines 36–42), and the result is then stored in
the global array for each work group.
The finalReduc kernel is then used to perform reduction over the partial re-
sults, where num_reduc_bins is the number of work groups in the execution of the
computeLogAvgLum kernel. Once the sum over all the luminance values is com-
puted, we take its average and calculate its exponential.
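A rough sketch of such a second-stage kernel (not the chapter's exact finalReduc implementation) is shown below, under the assumption that a single work group is launched and that WIDTH and HEIGHT are the image-size macros used by the other kernels.

kernel void final_reduction(global float *partialLogAvgLum,  // one partial sum per work group
                            global float *result,
                            uint num_reduc_bins,
                            local float *scratch)
{
  uint lid = get_local_id(0);
  float acc = 0.f;
  for (uint i = lid; i < num_reduc_bins; i += get_local_size(0))
    acc += partialLogAvgLum[i];
  scratch[lid] = acc;
  barrier(CLK_LOCAL_MEM_FENCE);

  for (uint offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
    if (lid < offset)
      scratch[lid] += scratch[lid + offset];
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  if (lid == 0)
    result[0] = exp(scratch[0] / (float)(WIDTH * HEIGHT));  // average log luminance, exponentiated
}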
Once we have calculated Lmax and L̄w, these values are plugged into Equation
(4.7), with L(x, y) = (a/L̄w) Lw(x, y), Lwhite = Lmax, and Lw(x, y) the luminance of
pixel (x, y). The rest of the computation is fully data parallel, thus benefiting
from a GPGPU implementation. Due to limited space, the OpenCL kernel is not
presented here, as it only requires a simple modification of the serial
implementation. The code can be found in the example pipeline source code
accompanying this chapter (available on the CRC Press website).
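For reference, a minimal sketch of the per-pixel mapping (the standard Reinhard global operator; the function and parameter names here are chosen for illustration) is:

/* L is the scaled luminance (a divided by the log-average luminance) * Lw(x, y);
   Lwhite is the smallest luminance mapped to pure white (the chapter uses Lmax). */
float reinhard_global(float L, float Lwhite)
{
  return L * (1.0f + L / (Lwhite * Lwhite)) / (1.0f + L);
}

Each RGB channel is then rescaled using the new luminance, in the same spirit as the write-back at the end of Listing 4.5.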
 1  const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE |
                              CLK_FILTER_NEAREST;
 2
 3  // Computes the mapping for each pixel as per Reinhard's Local TMO
 4  kernel void reinhardLocal(__read_only image2d_t input_image,
 5                            __write_only image2d_t output_image,
 6                            __global float *lumMips,
 7                            __global int *m_width,
 8                            __global int *m_offset,
 9                            __global float *logAvgLum_acc) {
10
11    float factor = logAvgLum_acc[0];
12
13    // Assumes Phi is 8.0
14    constant float k[7] = {
15      256.f * KEY / (1.f * 1.f),
16      256.f * KEY / (2.f * 2.f),
17      256.f * KEY / (4.f * 4.f),
18      256.f * KEY / (8.f * 8.f),
19      256.f * KEY / (16.f * 16.f),
20      256.f * KEY / (32.f * 32.f),
21      256.f * KEY / (64.f * 64.f)
22    };
23
24    int2 pos, centre_pos, surround_pos;
25    for (pos.y = get_global_id(1); pos.y < HEIGHT; pos.y += get_global_size(1)) {
26      for (pos.x = get_global_id(0); pos.x < WIDTH; pos.x += get_global_size(0)) {
27        surround_pos = pos;
28        float local_logAvgLum = 0.f;
29        for (uint i = 0; i < NUM_MIPMAPS - 1; i++) {
30          centre_pos = surround_pos;
31          surround_pos = centre_pos / 2;
32
33          int2 m_width_01, m_offset_01;
34          m_width_01 = vload2(0, &m_width[i]);
35          m_offset_01 = vload2(0, &m_offset[i]);
36
37          int2 index_01 = m_offset_01 + (int2)(centre_pos.x, surround_pos.x);
38          index_01 += m_width_01 * (int2)(centre_pos.y, surround_pos.y);
39
40          float2 lumMips_01 = factor;
41          lumMips_01 *= (float2)(lumMips[index_01.s0], lumMips[index_01.s1]);
42
43          float centre_logAvgLum, surround_logAvgLum;
44          centre_logAvgLum = lumMips_01.s0;
45          surround_logAvgLum = lumMips_01.s1;
46
47          float cs_diff = fabs(centre_logAvgLum - surround_logAvgLum);
48          if (cs_diff > (k[i] + centre_logAvgLum) * EPSILON) {
49            local_logAvgLum = centre_logAvgLum;
50            break;
51          } else {
52            local_logAvgLum = surround_logAvgLum;
53          }
54        }
55
56        uint4 pixel = read_imageui(input_image, sampler, pos);
57
58        float3 rgb = GLtoCL(pixel.xyz);
59        float3 xyz = RGBtoXYZ(rgb);
60
61        float Ld = factor / (1.f + local_logAvgLum) * xyz.y;
62        pixel.xyz = convert_uint3((float3)255.f *
63            clamp(pow(rgb.xyz / xyz.y, (float3)SAT) * (float3)Ld, 0.f, 1.f));
64
65        write_imageui(output_image, pos, pixel);
66      }
67    }
68  }
Computing HDR luminance. To recap, for each pixel the local TMO creates a
Gaussian kernel to compute the average logarithmic luminance in a neighbor-
hood. However, Gaussian kernels are expensive to compute, so this implemen-
tation makes use of OpenGL mipmaps instead.
Mipmaps of the luminance values are created at different scales and then
used as an approximation to the average luminance value at that scale. Using
OpenGL’s API, mipmaps up to Level 7 are computed. The reinhardLocal kernel
in Listing 4.5 gets passed these seven mipmaps in the array lumMips. The for
loop on line 29 is the core of this TMO. Each mipmap is iterated over to obtain
the average logarithmic luminance at that scale. Lines 37 to 45 compute the
center and surround functions V1 and V2 used in Equation (4.4). Lines 47 to 53
compute V as in Equation (4.4) and check whether it is less than ε (Equation
(4.5)) to determine the appropriate average logarithmic luminance, V1 (x, y, sm ),
for that pixel. Once the optimal center function V1 is computed, the remaining
code implements Equation (4.6) to obtain the HDR luminance for that pixel.
Writing to output. Having computed the HDR luminance array Ld , the local
tone-map kernel simply modifies the luminance values to reflect the new dynamic
range. We first obtain the original RGB pixel, convert it to XYZ, modify its
luminance, and convert it back to RGB.
Java Native Interface. Our example Android application is written in Java; how-
ever, the OpenCL kernel execution code is in C++ for performance reasons.
Therefore, to call various C++ functions from the Android application we make
use of the Java Native Interface (JNI).
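As an illustration (the package, class, and function names below are hypothetical and not those of the example application), a native method bridging Java and the C++ pipeline looks like this:

#include <jni.h>

/* Hypothetical native entry point; the Java side would declare
   "public static native void processFrame(int textureId);" and load the
   library with System.loadLibrary(). */
JNIEXPORT void JNICALL
Java_com_example_hdr_HdrPipeline_processFrame(JNIEnv *env, jclass clazz,
                                              jint textureId)
{
  (void)env; (void)clazz;
  /* Forward the OpenGL texture ID to the C++/OpenCL pipeline here, e.g.
     runToneMapKernels(textureId); (also hypothetical). */
}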
GLSurfaceView. This class provides a canvas where we can draw and manipulate
objects using OpenGL ES API calls. More importantly, GLSurfaceView manages
an EGL display that enables OpenGL ES to render onto a surface. Therefore, by
using GLSurfaceView , we don’t have to worry about managing the EGL windowing
life cycle.
OpenCL texture image. After the OpenCL context has been successfully initial-
ized, OpenCL image textures can be created for the kernels from the camera
input.
To sample the camera image, the GL_OES_EGL_image_external extension and a
samplerExternalOES uniform must be declared in the fragment shader. This results in the contents of the
GL_TEXTURE_EXTERNAL_OES target texture being copied to the GL_TEXTURE_2D tex-
ture rather than being rendered to the display. At this point we now have an
OpenGL ES GL_TEXTURE_2D texture on the GPU which contains the camera data.
Listing 4.6. Fragment shader that samples from a GL_TEXTURE_EXTERNAL_OES texture.
Create an OpenCL image from OpenGL 2D texture. Using JNI, the C++ global
state is then instructed to use the previously created OpenCL context to create an
OpenCL texture image from the provided input texture ID. Combining OpenCL
and OpenGL allows OpenCL kernels to modify the texture image on the GPU,
but before the kernels can access the texture data, the host needs to create an
OpenCL memory object specifically configured for this purpose (line 1 in Listing
4.7).
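A minimal sketch of this step using the standard OpenCL/OpenGL interop API follows; clCreateFromGLTexture is from the cl_khr_gl_sharing extension (OpenCL 1.1 uses clCreateFromGLTexture2D instead), and the wrapper name is illustrative.

#include <CL/cl.h>
#include <CL/cl_gl.h>

/* Wrap an existing OpenGL ES GL_TEXTURE_2D in an OpenCL image object.
   'context' must be the shared CL/GL context created earlier; 'texId' is the
   texture holding the camera frame. */
cl_mem createImageFromGLTexture(cl_context context, cl_GLuint texId, cl_int *err)
{
  /* CL_MEM_READ_ONLY suits the camera input; the output texture the kernels
     write to would use CL_MEM_WRITE_ONLY instead. Before a kernel touches the
     image, clEnqueueAcquireGLObjects() must be called (and
     clEnqueueReleaseGLObjects() afterwards). */
  return clCreateFromGLTexture(context, CL_MEM_READ_ONLY,
                               0x0DE1 /* GL_TEXTURE_2D */, 0, texId, err);
}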
precision mediump float;
uniform sampler2D sTexture;
varying vec2 texCoord;
void main() {
  gl_FragColor = texture2D(sTexture, texCoord);
}
Listing 4.8. Java code to render the result texture to the display.
4.8.2 GLSurfaceView.Renderer
Extending GLSurfaceView.Renderer requires implementation of the following meth-
ods.
onSurfaceChanged. This method is called when the surface size changes. How-
ever, the method is redundant here as the orientation is locked in our example
application.
Reinhard global TMO performance. Reinhard’s global TMO iterates over the entire
image twice: once to compute Lmax and L̄w, and a second time to adjust each pixel
according to these values and the key value (a) of the scene. To achieve a real-
time implementation, the kernels need to be executed in less than 33 milliseconds
(30 fps). Figure 4.8 compares the execution times of different-sized images, all
running the same OpenCL code on the ARM Mali T604 and NVIDIA GTX 760
platforms.
The NVIDIA GTX 760, being a fast, discrete, desktop GPU, executes the
OpenCL kernels on all image sizes in less than 2.5 ms, achieving more than 400 fps
at 1080p. This is much faster than the equivalent OpenGL implementation by
Akyuz, which achieved 103 fps on a 1024 × 768 image, albeit on much slower
hardware. The ARM Mali T604 GPU can process the two smaller images fast
enough to render the output in real time. However, processing a 1080p image is
slightly borderline, coming in at about 28 fps. With a little more optimization,
30 fps is probably achievable on this platform.
Figure 4.8. Execution times (ms) of the Reinhard global TMO kernels on the NVIDIA
GTX 760 and ARM Mali T604 at 640×480, 1280×720, and 1920×1080.
Figure 4.9. Execution times (ms) of the Reinhard local TMO kernels on the NVIDIA
GTX 760 and ARM Mali T604 at 640×480, 1280×720, and 1920×1080.
Reinhard local TMO performance. Because Reinhard’s Local TMO requires iter-
ating over multiple sizes of neighborhood for each pixel, the algorithm is much
more computationally expensive than its global TMO counterpart. Analyzing
the results in Figure 4.9, we see that the desktop GPU again processes all the
images in less than 33 ms to achieve a real-time implementation. Our OpenCL
implementation achieves 250 fps (4.0 ms) on a 1920 × 1080 image compared to
Akyuz’s OpenGL implementation, which has a frame rate of 103 fps on a slightly
smaller (1024 × 768) image.
Figure 4.10. The fraction of total time spent in each kernel (reduction, mipmaps, new
luminance mappings, writing to the output texture) within Reinhard's local TMO.
For the more data-expensive and computationally expensive local TMO, the
ARM Mali T604 GPU achieves real-time performance for the 640 × 480 image
size (30.8 fps), but doesn’t exceed our 30 fps goal for the two larger image sizes,
instead achieving 15.9 fps on a 1280 × 720 image and 7.8 fps for the 1920 × 1080
HD image.
A closer look at the execution time of each kernel shows that most of the time
is spent in computing the luminance mappings used to scale each luminance value
from the original image (Figure 4.10). These mappings are computed based on
the luminance of the scene.
When recording a video or taking a picture, the luminance of the scene doesn’t
vary much between frames. We could therefore take advantage of this to achieve
a higher frame rate by only computing a new set of mappings once every few
frames, as opposed to computing them for every frame.
Histogram equalization performance. Execution time in ms (lower is better) on the
NVIDIA GTX 760, Intel i3-3217U, and Qualcomm Adreno 330 at 640×480, 1280×720,
and 1920×1080.
4.10 Conclusions
The main contributions of this chapter are
• a pipeline that captures camera images, tone-maps them using the above
OpenCL implementations, and renders the output to display;
For a scene where the overall luminance is very low, Reinhard’s TMOs work very
well by adjusting the luminance of the image to highlight details in both the
dark and the bright regions. The OpenCL implementations of these algorithms
have been demonstrated to be efficient and portable. An Android pipeline was
also described, which acquired camera frames, tone-mapped them using OpenCL
kernels, and rendered the output to a display. Using OpenGL ES and OpenCL
interoperability, this pipeline was further optimized to avoid any data transfer
of the camera frames. The pipeline can be used for other image-processing ap-
plications that require input from the camera. To demonstrate this, an OpenCL
histogram equalization program has also been provided.
Bibliography
[Akyüz 12] Ahmet Oǧuz Akyüz. “High Dynamic Range Imaging Pipeline on the
GPU.” Journal of Real-Time Image Processing: Special Issue (2012), 1–15.
[Chiu et al. 11] Ching-Te Chiu, Tsun-Hsien Wang, Wei-Ming Ke, Chen-Yu
Chuang, Jhih-Siao Huang, Wei-Su Wong, Ren-Song Tsay, and Cyuan-Jhe
Wu. “Real-Time Tone-Mapping Processor with Integrated Photographic and
Gradient Compression Using 0.13 μm Technology on an ARM SoC Plat-
form.” Journal of Signal Processing Systems 64:1 (2011), 93–107.
[Huang et al. 09] Song Huang, Shucai Xiao, and Wu-chun Feng. “On the En-
ergy Efficiency of Graphics Processing Units for Scientific Computing.” In
Proceedings of the 2009 IEEE International Symposium on Parallel & Dis-
tributed Processing, pp. 1–8. Washington, DC: IEEE Computer Society, 2009.
[Kang et al. 03] Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and
Richard Szeliski. “High Dynamic Range Video.” ACM Transactions on
Graphics 22:3 (2003), 319–325.
[Kiser et al. 12] Chris Kiser, Erik Reinhard, Mike Tocci, and Nora Tocci. “Real-
Time Automated Tone Mapping System for HDR Video.” In Proceedings
of the IEEE International Conference on Image Processing, pp. 2749–2752.
Piscataway, NJ: IEEE, 2012.
[Krawczyk et al. 05] Grzegorz Krawczyk, Karol Myszkowski, and Hans-Peter Sei-
del. “Perceptual Effects in Real-Time Tone Mapping.” In Proceedings of
the 21st Spring Conference on Computer Graphics, pp. 195–202. New York:
ACM, 2005.
[Khronos 15] Khronos Group. “Khronos OpenCL Standard.” http://www.
khronos.org/opencl/, accessed May 6, 2015.
[Kuang et al. 07] Jiangtao Kuang, Hiroshi Yamaguchi, Changmeng Liu, Garrett
M. Johnson, and Mark D. Fairchild. “Evaluating HDR Rendering Algorithms.” ACM
Trans. Appl. Perception 4:2 (2007), Article no. 9.
[McClanahan 11] Chris McClanahan. “Single Camera Real Time HDR Tonemap-
ping.” mcclanahoochie’s blog, http://mcclanahoochie.com/blog/portfolio/
real-time-hdr-tonemapping/, April 2011.
[McCoy 08] Kevin McCoy. “St. Louis Arch Multiple Exposures.” Wikipedia,
https://en.wikipedia.org/wiki/File:StLouisArchMultExpEV-4.72.JPG, May
31, 2008.
[Reinhard et al. 02] Erik Reinhard, Michael Stark, Peter Shirley, and James Fer-
werda. “Photographic Tone Reproduction for Digital Images.” ACM Trans-
actions on Graphics 21:3 (2002), 267–276.
[Seng 10] Kim Seng. “Single Exposure HDR.” HDR Photography by Captain
Kimo, http://captainkimo.com/single-exposure-hdr/, February 25, 2010.
[Topaz Labs 15] Topaz Labs. “Topaz Adjust.” http://www.topazlabs.com/
adjust, 2015.
[UCI iCAMP 10] UCI iCAMP. “Histogram Equalization.” Math 77C, http://
www.math.uci.edu/icamp/courses/math77c/demos/hist eq.pdf, August 5,
2010.
VI
Compute
Short and sweet is this section, presenting three rendering techniques that make
intensive use of the compute functionality of modern graphics pipelines. GPUs,
including those in new game consoles, can nowadays execute general-purpose
computation kernels, which opens doors to new and more efficient rendering tech-
niques and to scenes of unseen complexity. The articles in this section leverage
this functionality to enable large numbers of dynamic lights in real-time ren-
dering, more complex geometry in ray tracing, and fast approximate ambient
occlusion for direct volume rendering in scientific visualization applications.
“Compute-Based Tiled Culling,” Jason Stewart’s chapter, focuses on one chal-
lenge in modern real-time rendering engines: they need to support many dynamic
light sources in a scene. Both forward and deferred rendering can struggle with
problems such as efficient culling, batch sizes, state switching, or bandwidth con-
sumption, in this case. Compute-based (tiled) culling of lights reduces state
switching and avoids culling on the CPU (beneficial for forward rendering), and
computes lighting in a single pass that fits deferred renderers well. Jason details
his technique, provides a thorough performance analysis, and deduces various
optimizations, all documented with example code.
In “Rendering Vector Displacement-Mapped Surfaces in a GPU Ray Tracer,”
Takahiro Harada’s work targets the rendering of vector displacement-mapped
surfaces using ray-tracing–based methods. Vector displacement is a popular and
powerful means to model complex objects from simple base geometry. However,
ray tracing such geometry on a GPU is nontrivial: pre-tessellation is not an op-
tion due to the high (and possibly unnecessary) memory consumption, and thus
efficient, GPU-friendly algorithms for the construction and traversal of accel-
eration structures and intersection computation with on-the-fly tessellation are
required. Takahiro fills this gap and presents his method and implementation of
an OpenCL ray tracer supporting dynamic tessellation of vector displacement-
mapped surfaces.
“Smooth Probabilistic Ambient Occlusion for Volume Rendering” by Thomas
Kroes, Dirk Schut, and Elmar Eisemann covers a novel and easy-to-implement
solution for ambient occlusion for direct volume rendering (DVR). Instead of ap-
plying costly ray casting to determine the accessibility of a voxel, this technique
employs a probabilistic heuristic in concert with 3D image filtering. This way,
ambient occlusion can be approximated at a fraction of the cost of full ray casting.
—Carsten Dachsbacher
1
VI
Compute-Based Tiled Culling
Jason Stewart
1.1 Introduction
Modern real-time rendering engines need to support many dynamic light sources
in a scene. Meeting this requirement with traditional forward rendering is prob-
lematic. Typically, a forward-rendered engine culls lights on the CPU for each
batch of scene geometry to be drawn, and changing the set of lights in use requires
a separate draw call. Thus, there is an undesirable tradeoff between using smaller
pieces of the scene for more efficient light culling versus using larger batches and
more instancing for fewer total draw calls. The intersection tests required for
light culling can also be a performance burden for the CPU.
Deferred rendering better supports large light counts because it decouples
scene geometry rendering and material evaluation from lighting. First, the scene
is rendered and geometric and material properties are stored into a geometry
buffer or G-buffer [Saito and Takahashi 90]. Lighting is accumulated separately,
using the G-buffer as input, by drawing light bounding volumes or screen-space
quads. Removing lighting from the scene rendering pass eliminates the state
switching for different light sets, allowing for better batching. In addition, CPU
light culling is performed once against the view frustum instead of for each batch,
reducing the performance cost. However, because each light is now accumulated
separately, overlapping lights increase bandwidth consumption, which can de-
crease GPU performance [Lauritzen 10].
This chapter presents a better method for supporting large light counts:
compute-based tiled culling. Modern GPUs, including those in Xbox One and
Playstation 4, can execute general-purpose computation kernels. This capability
allows light culling to be performed on the GPU. The technique can be used with
both forward and deferred rendering. It eliminates light state switching and CPU
culling, which helps forward rendering, and it calculates lighting in a single pass,
which helps deferred rendering. This chapter presents the technique in detail, in-
cluding code examples in HLSL and various optimizations. The companion code
implements the technique for both forward and deferred rendering and includes
a benchmark.
Figure 1.1. Partitioning the scene into tiles. (a) Example screen tiles. (b) Fitting view
frustum partitions to the screen tiles. For clarity, the tiles shown in this figure are very
large. They would typically be 16 × 16 pixels.
1.2 Overview
Compute-based tiled culling works by partitioning the screen into fixed-size tiles,
as shown in Figure 1.1(a). For each tile, a compute shader1 loops over all lights in
the scene and determines which ones intersect that particular tile. Figure 1.1(b)
gives a 2D, top-down example of how the tile bounding volume is constructed.
Four planes are calculated to represent the left, right, top, and bottom of an
1 This chapter uses Direct3D 11 terminology. In Direct3D 11, the general-purpose computa-
tion technology required for tiled culling is called DirectCompute 5.0, and the general-purpose
kernel is called a compute shader.
asymmetric partition of the view frustum that fits exactly around the tile. To
allow for tighter culling, the minimum and maximum scene depths are calculated
for the tile, as shown in Figure 1.1(b) for Tile 0. These depth values form the
front and back of the frustum partition. This gives the six planes necessary for
testing the intersection between light bounding volumes and the tile.
Figure 1.2 provides an overview of the algorithm. Figure 1.2(a) shows a 2D
representation of a tile bounding volume, similar to that shown for Tile 0 in
Figure 1.1(b). Several scene lights are also shown. Figure 1.2(b) shows the input
buffer containing the scene light list. Each entry in the list contains the center
and radius for that light’s bounding sphere.
The compute shader is configured so that each thread group works on one tile.
It loops over the lights in the input buffer and stores the indices of those that
intersect the tile into shared memory.2 Space is reserved for a per-tile maximum
number of lights, and a counter tracks how many entries were actually written,
as shown in Figure 1.2(c).
Algorithm 1.1 summarizes the technique.
Referring back to Figure 1.2 as a visual example of the loop in Algorithm 1.1,
note from Figure 1.2(a) that two lights intersect the frustum partition: Light 1
and Light 4. The input buffer index (Figure 1.2(b)) of each intersecting light is
written to shared memory (Figure 1.2(c)). To make this thread safe, so that lights
can be culled in parallel, a counter is stored in shared memory and incremented
using the atomic operations available in compute shaders.
1.3 Implementation
This section gives an implementation in HLSL of the compute-based tiled-culling
algorithm discussed in the previous section. The three parts of Algorithm 1.1
will be presented in order: depth bounds calculation, frustum planes calculation,
and intersection testing.
2 Compute shader execution is organized into thread groups. Threads in the same thread
group can share data through thread group shared memory.
 1  Texture2D<float> g_SceneDepthBuffer;
 2
 3  // Thread Group Shared Memory (aka local data share, or LDS)
 4  groupshared uint ldsZMin;
 5  groupshared uint ldsZMax;
 6
 7  // Convert a depth value from postprojection space
 8  // into view space
 9  float ConvertProjDepthToView(float z)
10  {
11    return (1.f / (z * g_mProjectionInv._34 + g_mProjectionInv._44));
12  }
13
14  #define TILE_RES 16
15  [numthreads(TILE_RES, TILE_RES, 1)]
16  void CullLightsCS(uint3 globalIdx : SV_DispatchThreadID,
17                    uint3 localIdx  : SV_GroupThreadID,
18                    uint3 groupIdx  : SV_GroupID)
19  {
20    float depth = g_SceneDepthBuffer.Load(uint3(globalIdx.x,
21                                                globalIdx.y, 0)).x;
22    float viewPosZ = ConvertProjDepthToView(depth);
23    uint z = asuint(viewPosZ);
24
25    uint threadNum = localIdx.x + localIdx.y * TILE_RES;
26
27    // There is no way to initialize shared memory at
28    // compile time, so thread zero does it at runtime
29    if (threadNum == 0)
30    {
31      ldsZMin = 0x7f7fffff;  // FLT_MAX as a uint
32      ldsZMax = 0;
33    }
34    GroupMemoryBarrierWithGroupSync();
35
36    // Parts of the depth buffer that were never written
37    // (e.g., the sky) will be zero (the companion code uses
38    // inverted 32-bit float depth for better precision).
39    if (depth != 0.f)
40    {
41      // Calculate the minimum and maximum depth for this tile
42      // to form the front and back of the frustum
43      InterlockedMin(ldsZMin, z);
44      InterlockedMax(ldsZMax, z);
45    }
46    GroupMemoryBarrierWithGroupSync();
47
48    float minZ = asfloat(ldsZMin);
49    float maxZ = asfloat(ldsZMax);
50
51    // Frustum planes and intersection code goes here
52    ...
53  }
The view-space depth is reinterpreted as an unsigned integer (line 23), and the
minimum and maximum are performed against these unsigned bits. This
works because the floating point depth is always positive, and the raw bits of a
32-bit floating point value increase monotonically in this case.
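A small C illustration of why this is safe (nothing here is from the chapter; it simply checks the bit-pattern ordering property that asuint relies on):

#include <stdint.h>
#include <string.h>

/* Same reinterpretation as HLSL's asuint(): copy the float's bits into a uint. */
static uint32_t float_bits(float f)
{
  uint32_t u;
  memcpy(&u, &f, sizeof u);
  return u;
}

/* For non-negative IEEE-754 floats, bit-pattern order equals numeric order,
   so integer InterlockedMin/Max on the bits yields the float min/max. */
int ordering_matches(float a, float b)   /* assumes a >= 0 and b >= 0 */
{
  return (a < b) == (float_bits(a) < float_bits(b));   /* always 1 */
}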
// Plane equation from three points, simplified
// for the case where the first point is the origin.
// N is normalized so that the plane equation can
// be used to compute signed distance.
float4 CreatePlaneEquation(float3 Q, float3 R)
{
  // N = normalize(cross(Q-P, R-P)),
  // except we know P is the origin
  float3 N = normalize(cross(Q, R));
  // D = -(N dot P), except we know P is the origin
  return float4(N, 0);
}

// Convert a point from postprojection space into view space
float3 ConvertProjToView(float4 p)
{
  p = mul(p, g_mProjectionInv);
  return (p / p.w).xyz;
}

void CullLightsCS(uint3 globalIdx : SV_DispatchThreadID,
                  uint3 localIdx  : SV_GroupThreadID,
                  uint3 groupIdx  : SV_GroupID)
{
  // Depth bounds code goes here
  ...
  float4 frustumEqn[4];
  {  // Construct frustum planes for this tile
    uint pxm = TILE_RES * groupIdx.x;
    uint pym = TILE_RES * groupIdx.y;
    uint pxp = TILE_RES * (groupIdx.x + 1);
    uint pyp = TILE_RES * (groupIdx.y + 1);
    uint width  = TILE_RES * GetNumTilesX();
    uint height = TILE_RES * GetNumTilesY();

    // Four corners of the tile, clockwise from top-left
    float3 p[4];
    p[0] = ConvertProjToView(float4(pxm / (float)width * 2.f - 1.f,
                                    (height - pym) / (float)height * 2.f - 1.f, 1.f, 1.f));
    p[1] = ConvertProjToView(float4(pxp / (float)width * 2.f - 1.f,
                                    (height - pym) / (float)height * 2.f - 1.f, 1.f, 1.f));
    p[2] = ConvertProjToView(float4(pxp / (float)width * 2.f - 1.f,
                                    (height - pyp) / (float)height * 2.f - 1.f, 1.f, 1.f));
    p[3] = ConvertProjToView(float4(pxm / (float)width * 2.f - 1.f,
                                    (height - pyp) / (float)height * 2.f - 1.f, 1.f, 1.f));

    // Create plane equations for the four sides, with
    // the positive half-space outside the frustum
    for (uint i = 0; i < 4; i++)
      frustumEqn[i] = CreatePlaneEquation(p[i], p[(i + 1) & 3]);
  }
  // Intersection code goes here
  ...
}
Buffer<float4> g_LightBufferCenterAndRadius;

#define MAX_NUM_LIGHTS_PER_TILE 256
groupshared uint ldsLightIdxCounter;
groupshared uint ldsLightIdx[MAX_NUM_LIGHTS_PER_TILE];

// Point-plane distance, simplified for the case where
// the plane passes through the origin
float GetSignedDistanceFromPlane(float3 p, float4 eqn)
{
  // dot(eqn.xyz, p) + eqn.w, except we know eqn.w is zero
  return dot(eqn.xyz, p);
}

#define NUM_THREADS (TILE_RES * TILE_RES)
void CullLightsCS(...)
{
  // Depth bounds and frustum planes code goes here
  ...
  if (threadNum == 0)
  {
    ldsLightIdxCounter = 0;
  }
  GroupMemoryBarrierWithGroupSync();

  // Loop over the lights and do a
  // sphere versus frustum intersection test
  for (uint i = threadNum; i < g_uNumLights; i += NUM_THREADS)
  {
    float4 p = g_LightBufferCenterAndRadius[i];
    float r = p.w;
    float3 c = mul(float4(p.xyz, 1), g_mView).xyz;

    // Test if sphere is intersecting or inside frustum
    if ((GetSignedDistanceFromPlane(c, frustumEqn[0]) < r) &&
        (GetSignedDistanceFromPlane(c, frustumEqn[1]) < r) &&
        (GetSignedDistanceFromPlane(c, frustumEqn[2]) < r) &&
        (GetSignedDistanceFromPlane(c, frustumEqn[3]) < r) &&
        (-c.z + minZ < r) && (c.z - maxZ < r))
    {
      // Do a thread-safe increment of the list counter
      // and put the index of this light into the list
      uint dstIdx = 0;
      InterlockedAdd(ldsLightIdxCounter, 1, dstIdx);
      ldsLightIdx[dstIdx] = i;
    }
  }
  GroupMemoryBarrierWithGroupSync();
}
memory directly. Even if lights overlap, the G-buffer is only read once for each
pixel, and the lighting results are accumulated into shader registers instead of
blended into a render target, reducing bandwidth consumption.
1.4 Optimization
This section covers various optimizations to the compute-based tiled-culling tech-
nique. Common pitfalls to avoid are presented first, followed by several optimiza-
tions to the basic implementation from the previous section.
struct LightArrayData
{
  float4 v4CenterAndRadius;
  float4 v4Color;
};
StructuredBuffer<LightArrayData> g_LightBuffer;
For the second change, recall that line 14 in Listing 1.1 defines TILE_RES
as 16, resulting in 16 × 16 threads per thread group, or 256 threads. For AMD
GPUs, work is executed in 64-thread batches called wavefronts, while on NVIDIA
GPUs, work is executed in 32-thread warps. Thus, efficient compute shader
execution requires the number of threads in a thread group to be a multiple of
64 for AMD or 32 for NVIDIA. Since every multiple of 64 is a multiple of 32,
standard performance advice is to configure the thread count to be a multiple of
64. Because 256 is a multiple of 64, setting TILE_RES to 16 follows this advice.
Alternatively, setting TILE_RES to 8 (resulting in 8×8-pixel tiles) yields 64 threads
per thread group, which is certainly also a multiple of 64, and the smaller tile
size might result in tighter culling.
Although these two changes seem minor, both decrease performance, as shown
in Figure 1.3. The “unoptimized” curve contains both changes (combined light
data in a StructuredBuffer and 8 × 8 tiles). For the “cache friendly” curve, the
light center-and-radius data is kept in its own buffer (reverting the first change),
while the 8 × 8 tiles remain.
3 All performance data in this chapter was gathered on an AMD Radeon R7 260X GPU.
The R7 260X was chosen because its performance characteristics are roughly comparable to the
Xbox One and Playstation 4.
Figure 1.3. Basic optimizations.3 Tiled-culling compute shader execution time versus
number of lights for Forward+ rendering at 1920 × 1080 using the companion code for
this chapter.
Figure 1.4. Depth discontinuity optimization strategies. (a) Scene depth discontinuities
can cause a large depth range in the tile bounding volume. (b) The Half Z method splits
the depth range in half and culls against the two ranges. (c) The Modified Half Z method
calculates a second minimum and maximum, bounded by the Half Z value.
Modified Half Z. Figure 1.4(c) shows a second strategy called the Modified Half
Z method. It performs additional atomic operations to find a second maximum
(Max Z2) between Min Z and Half Z and a second minimum (Min Z2) between
Half Z and Max Z. This can result in tighter bounding volumes compared to the
Half Z method, but calculating the additional minimum and maximum is more
expensive than simply calculating Half Z, due to the additional atomic operations
required.
Light count reduction results. Figure 1.5 shows the reduction in per-tile light count
at depth discontinuities from the methods discussed in this section. Note the
// Test if sphere is intersecting or inside frustum
if ((GetSignedDistanceFromPlane(c, frustumEqn[0]) < r) &&
    (GetSignedDistanceFromPlane(c, frustumEqn[1]) < r) &&
    (GetSignedDistanceFromPlane(c, frustumEqn[2]) < r) &&
    (GetSignedDistanceFromPlane(c, frustumEqn[3]) < r))
{
  if (-c.z + minZ < r && c.z - halfZ < r)
  {
    // Do a thread-safe increment of the list counter
    // and put the index of this light into the list
    uint dstIdx = 0;
    InterlockedAdd(ldsLightIdxCounterA, 1, dstIdx);
    ldsLightIdxA[dstIdx] = i;
  }
  if (-c.z + halfZ < r && c.z - maxZ < r)
  {
    // Do a thread-safe increment of the list counter
    // and put the index of this light into the list
    uint dstIdx = 0;
    InterlockedAdd(ldsLightIdxCounterB, 1, dstIdx);
    ldsLightIdxB[dstIdx] = i;
  }
}
column in the foreground of the left side of the scene in Figure 1.5(a). This causes
depth discontinuities for tiles along the column, resulting in the high light counts
shown in red in Figure 1.5(c) for the baseline implementation in Section 1.3.
The results for the Half Z method are shown in Figure 1.5(d). Note that the
light counts for tiles along the column have been reduced. Then, for the Modified
Half Z method, note that light counts have been further reduced in Figure 1.5(e).
Performance results. Figure 1.6 shows the performance of these methods. Note
that, while Figure 1.3 measured only the tiled-culling compute shader, Figure 1.6
measures both the compute shader and the forward pixel shader for Forward+
rendering. More time spent during culling can still be an overall performance
win if enough time is saved during lighting, so it is important to measure both
here.
The “Baseline” curve is from the implementation in Section 1.3. The “Half
Z” curve shows this method at a slight performance disadvantage for lower light
counts, because the savings during lighting do not yet outweigh the extra cost
of testing two depth ranges and maintaining two lists. However, this method
becomes faster at higher light counts. The “Modified Half Z” curve starts out
with a bigger deficit, due to the higher cost of calculating the additional minimum
and maximum with atomics. It eventually pulls ahead of the baseline method,
but never catches Half Z. However, this method’s smaller depth ranges can still
be useful if additional optimizations are implemented, as shown next.
Figure 1.5. Tiled-culling optimization results using the companion code for this chapter.
(a) Scene render. (b) Log scale lights-per-tile legend. (c) Baseline. (d) Half Z. (e)
Modified Half Z. (f) Modified Half Z with AABBs.
Figure 1.6. Tiled-culling optimizations. GPU execution time versus number of lights
using the companion code for this chapter. The vertical axis represents the combined
time for the tiled-culling compute shader and the forward pixel shader in Forward+
rendering at 1920 × 1080.
Figure 1.7. Frustum planes versus AABBs. False positive intersections will occur in
the shaded regions. (a) Frustum intersection testing. (b) AABB intersection testing.
(c) AABB intersection with a small depth range.
Referring back to Figure 1.6, the “Modified Half Z, AABB, Parallel Reduc-
tion” curve is the fastest method throughout. For 1024 lights, the baseline code
executes in 3.97 ms, whereas this final optimized version takes 3.52 ms, a reduc-
tion of roughly half a millisecond. This represents an 11% decrease in execution
time compared to the baseline.
Figure 1.8. Unreal Engine 4 Infiltrator demo: Example 1. (a) Scene render. (b) Baseline
tiled culling. (c) Modified Half Z with AABBs.
Figure 1.9. Unreal Engine 4 Infiltrator demo: Example 2. (a) Scene render. (b) Baseline
tiled culling. (c) Modified Half Z with AABBs.
Figure 1.10. Unreal Engine 4 tiled-culling execution time improvement for the optimized
version compared to the baseline implementation. Performance was measured over the
entire Infiltrator demo at 1920 × 1080.
Tiled deferred does incur the additional cost
of calculating the depth bounds and performing the per-tile culling. However, av-
eraged over the entire demo, tiled deferred is still faster overall. Specifically, the
average cost of standard deferred is 4.28 ms, whereas the optimized tiled-deferred
average cost is 3.74 ms, a reduction of 0.54 ms, or roughly 13% faster.
It is natural to wonder exactly how many lights are needed in a scene with
“many lights” before tiled deferred is consistently faster than standard deferred.
The answer will depend on several factors including the depth complexity of the
scene and the amount of light overlap. For the Infiltrator demo, Figure 1.12 is a
scatterplot of the data used to generate Figure 1.11 plotted against the number
of lights processed during that particular frame. The demo uses a wide range of
light counts, from a low of 7 to a high of 980. The average light count is 299 and
the median is 218.
For high light counts (above 576), tiled deferred has either comparable or
better performance, and is often significantly faster. For example, for counts
above 640, tiled deferred is 1.65 ms faster on average. Conversely, for low light
counts (below 64), standard deferred is faster. For light counts above 64 but
below 576, the situation is less clear from just looking at the chart. Standard
deferred values appear both above and below tiled deferred in this range. How-
ever, it is worth noting that tiled deferred comes out ahead on average over each
interval on the “Number of Lights” axis (i.e., [0, 64], [64, 128], [128, 192], etc.)
except [0, 64].
Figure 1.11. Tiled deferred versus standard deferred GPU execution time over the
Unreal Engine 4 Infiltrator real-time demo.
Figure 1.12. Unreal Engine 4 optimized tiled deferred versus standard deferred. GPU
execution time versus number of lights. Performance was measured over the entire
Infiltrator demo at 1920 × 1080.
Figure 1.13. Unreal Engine 4 optimized tiled deferred versus standard deferred. GPU
execution time versus number of lights. A moving average was applied to the data in
Figure 1.12 to show overall trends.
The moving average in Figure 1.13 shows that tiled deferred is
on par with or faster than standard deferred above 70 lights. Thus, for the
particular case of the Infiltrator demo, 70 is the threshold for when tiled deferred
is consistently faster than (or at least comparable to) standard deferred.
Referring back to Figure 1.12, another thing to note about the data is that the
standard deviation is lower for tiled deferred. Specifically, the standard deviation
is 1.79 ms for standard deferred and 0.90 ms for tiled deferred, a 50% reduction.
Note that worst-case performance is also much better for tiled deferred, with no
tiled deferred data point appearing above the 6.0 ms line. That is, in addition to
getting faster performance on average, tiled deferred also offers more consistent
performance, making it easier to achieve a smooth framerate.
1.6 Conclusion
This chapter presented an optimized compute-based tiled-culling implementation
for scenes with many dynamic lights. The technique allows forward rendering to
support such scenes with high performance. It also improves the performance
of deferred rendering for these scenes by reducing the average cost to calculate
lighting, as well as the worst-case cost and standard deviation. That is, it provides
both faster performance (on average) and more consistent performance, avoiding
the bandwidth bottleneck from blending overlapping lights. For more details, see
the companion code.
1.7 Acknowledgments
Many thanks to the rendering engineers at Epic Games, specifically Brian Karis
for the idea to use AABBs to bound the tiles and Martin Mittring for the initial
implementation of AABBs and for the Modified Half Z method. Thanks also go
out to Martin for providing feedback for this chapter. And thanks to the Epic
rendering team and Epic Games in general for supporting this work.
The following are either registered trademarks or trademarks of the listed
companies in the United States and/or other countries: AMD, Radeon, and
combinations thereof are trademarks of Advanced Micro Devices, Inc.; Unreal
is a registered trademark of Epic Games, Inc.; Xbox One is a trademark of
Microsoft Corporation; NVIDIA is a registered trademark of NVIDIA Corporation;
Playstation 4 is a trademark of Sony Computer Entertainment, Inc.
Bibliography
[Andersson 09] Johan Andersson. “Parallel Graphics in Frostbite—Current and
Future.” Beyond Programmable Shading, SIGGRAPH Course, New Orleans,
LA, August 3–7, 2009.
[Engel 14] Wolfgang Engel. “Compute Shader Optimizations for
AMD GPUs: Parallel Reduction.” Diary of a Graphics Pro-
grammer, http://diaryofagraphicsprogrammer.blogspot.com/2014/03/
compute-shader-optimizations-for-amd.html, March 26, 2014.
[Harada et al. 12] Takahiro Harada, Jay McKee, and Jason C. Yang. “Forward+:
Bringing Deferred Lighting to the Next Level.” Paper presented at Euro-
graphics, Cagliari, Italy, May 13–18, 2012.
[Harris 07] Mark Harris. “Optimizing Parallel Reduction in CUDA.”
NVIDIA, http://developer.download.nvidia.com/compute/cuda/1.1-Beta/
x86_website/projects/reduction/doc/reduction.pdf, 2007.
[Lauritzen 10] Andrew Lauritzen. “Deferred Rendering for Current and Future
Rendering Pipelines.” Beyond Programmable Shading, SIGGRAPH Course,
Los Angeles, CA, July 25–29, 2010.
[Lauritzen 12] Andrew Lauritzen. “Intersecting Lights with Pixels: Reasoning
about Forward and Deferred Rendering.” Beyond Programmable Shading,
SIGGRAPH Course, Los Angeles, CA, August 5–9, 2012.
[Saito and Takahashi 90] Takafumi Saito and Tokiichiro Takahashi. “Compre-
hensible Rendering of 3-D Shapes.” Computer Graphics: Proc. SIGGRAPH
24:4 (1990), 197–206.
2
VI
Rendering Vector Displacement-Mapped Surfaces in a GPU Ray Tracer
Takahiro Harada
2.1 Introduction
Ray tracing is an elegant solution to render high-quality images. By combining
Monte Carlo integration with ray tracing, we can solve the rendering equation.
However, a disadvantage of ray tracing is its high computational cost, which
makes render times long. To improve performance, GPUs have been used.
However, GPU ray tracers typically do not have as many features as CPU ray
tracers, and vector displacement mapping is one feature that is rarely found in
GPU ray tracers. When vector displacement mapping is evaluated on the
fly (i.e., without creating a large number of polygons in the preprocess and storing
them in the memory), it allows us to render a highly geometric detailed scene
from a simple mesh. Since geometric detail is an important factor for realism,
vector displacement mapping is an important technique in ray tracing. In this
chapter, we describe a method to render vector displacement-mapped surfaces in
a GPU ray tracer.
Figure 2.1. The “Party” scene with vector displacement-mapped surfaces rendered
using the proposed method. The rendering time is 77 ms/frame on an AMD FirePro
W9100 GPU. Instancing is not used to stress the rendering algorithm. If pretessellated,
the geometry requires 52 GB of memory.
Figure 2.2. The base mesh used for the “Party” scene.
Figure 2.3. Illustration of vector displacement mapping. (a) Simple geometry (a quad).
(b) A vector displacement map. (c) Surface after applying vector displacement.
Data structure. We could find the closest intersection by testing primitives in the
scene one by one, but it is better to create a spatial acceleration structure to do
this efficiently. As we build it on the fly, the build performance is as important
as the intersection performance. Therefore, we employed a simple acceleration
structure. A patch is split into four patches recursively to build a complete
quad BVH. At the lowest level of the BVH, four vertex positions and texture
coordinates are linearly interpolated from the values of the root patch. The
displaced vertex position is then calculated by adding the displacement vector
value, which is fetched from a texture using the interpolated texture coordinate.
Next, the AABB enclosing these four vertices is computed and used as the
geometry at each leaf rather than a quad, because we subdivide the patch until
it is smaller than a pixel. This allows us to store only the BVH and no geometry
(e.g., vertices), which reduces the data size for a VD patch. A texture
coordinate and normal vector are also computed and stored within a node. Once
leaf nodes are computed, the build ascends the tree level by level and constructs
the nodes of the next level up. It does this by computing the union of the AABBs and averaging
normal vectors and texture coordinates of the four child nodes. This process is
repeated until it reaches the root node.
For better performance, the memory footprint for the BVH has to be reduced
as much as possible. Thus, an AABB is compressed by quantizing its maximum
and minimum values into 2-byte integers (maxq, minq) relative to the root AABB
as follows:
maxq = 65535 (maxf − minroot) / (maxroot − minroot),
minq = 65535 (minf − minroot) / (maxroot − minroot),
where maxf and minf are the uncompressed maximum and minimum values, respec-
tively, of the AABB, and maxroot and minroot are the values of the root AABB. We
considered compressing them into 1-byte integers, but the accuracy was not high
enough since the subdivision level can easily go higher than the resolution limit
of 1-byte integers (i.e., eight levels). We also quantized texture coordinates and
the normal vectors into 4 bytes each. Therefore, the total memory footprint for
a node is 20 bytes (Figure 2.4).
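A sketch of this quantization in C is shown below. The 16-bit range follows from the 2-byte encoding; rounding the maximum up and the minimum down is an assumption made here to keep the compressed box conservative (the chapter only states that 2-byte quantization is used).

#include <math.h>
#include <stdint.h>

/* Quantize one AABB coordinate into 16 bits relative to the root AABB. */
uint16_t quantizeMax(float maxf, float minroot, float maxroot)
{
  float t = (maxf - minroot) / (maxroot - minroot);          /* in [0, 1] */
  return (uint16_t)fminf(ceilf(t * 65535.0f), 65535.0f);     /* round up (assumption) */
}

uint16_t quantizeMin(float minf, float minroot, float maxroot)
{
  float t = (minf - minroot) / (maxroot - minroot);
  return (uint16_t)fmaxf(floorf(t * 65535.0f), 0.0f);        /* round down (assumption) */
}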
We separate the hierarchy of the BVH from the node data (i.e., a node does
not store links to other nodes such as children). This is to keep the memory
footprint for nodes small. We only store one hierarchy data structure for all VD
Figure 2.4. Quad BVH. Each node stores two links: one pointing to the first child
(red), and one pointing to the skip node (green). To check if a node is a leaf of level
i, the node index is compared to (4^i − 1)/3, e.g., leaf nodes of a level 2 BVH are nodes
whose index is greater than 5. The data layout in a node (20 bytes) holds the quantized
AABB maximum and minimum, the normal, and the UV.
patches because we always create a complete quad BVH so that the hierarchy
structure is the same for all the BVHs we construct. Although we build a BVH at
different depths (i.e., levels), we only compute and store the hierarchy structure
for the maximum level we might build. As nodes are stored in breadth-first order,
leaf nodes can be identified easily by checking their index. Leaf nodes at the ith
level are nodes with indices larger than (4^i − 1)/3, as shown in Figure 2.4.
We use stackless traversal for BVH traversal. Thus, a node in the hierarchy
structure stores two indices of the first child and the skip node (Figure 2.4). These
two indices are packed and stored in 4 bytes of data.
To summarize, the data structure consists of the per-patch node data (a quantized
AABB, normal, and texture coordinate for each node) and a single hierarchy
structure, shared by all VD patches, that stores the first-child and skip-node indices.
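A sketch of how the shared hierarchy drives stackless traversal is given below; the types and helper functions are hypothetical and stand in for the quantized node tests described above.

#include <stdint.h>

typedef struct { uint32_t firstChild; uint32_t skipNode; } HierarchyEntry;
#define END_OF_TREE 0xFFFFFFFFu

/* Hypothetical helpers: test the ray against node i's dequantized AABB and
   record a hit at a leaf. Only the traversal order matters for this sketch. */
int  hitNodeAABB(uint32_t nodeIdx);
void recordLeafHit(uint32_t nodeIdx);

void traverseStackless(const HierarchyEntry *hier, uint32_t leafStart)
{
  uint32_t i = 0;                       /* start at the root */
  while (i != END_OF_TREE) {
    if (hitNodeAABB(i)) {
      if (i >= leafStart) {             /* leaves begin at (4^depth - 1) / 3 */
        recordLeafHit(i);
      } else {
        i = hier[i].firstChild;         /* descend into the first child */
        continue;
      }
    }
    i = hier[i].skipNode;               /* skip the subtree / move to the next node */
  }
}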
Before we start intersecting rays with VD patches, we gather all the rays
hitting the AABB of any VD patches. When a ray hits multiple VD patches, we
store multiple hits. These hits are sorted by a VD patch index. This results in a
list of VD patches, each of which has a list of rays.
We implemented a kernel doing both BVH build and its traversal. Work
groups are launched with the number of work items optimal for the respective
GPU architecture. We use AMD GPUs, which are 64-wide SIMD, so 64 work
items are executed for a work group. A work group first fetches a VD patch
from the list of unprocessed VD patches. This work group is responsible for the
intersection of all rays hitting the AABBs of the root patch. First, we use work
items executing in parallel for building the BVH. However, as we build a BVH
for the patch that has to be stored somewhere, we need to allocate memory for
it and therefore the question is where to allocate. The first candidate is in the
local data share (LDS), but it is too small if we build a BVH with six levels
(64 × 64 leaf nodes), which requires 108 KB (= 5400 nodes × 20 B). If we limit
the number of levels to five (32 × 32 leaf nodes), we only require 26 KB. Although
this is smaller than the maximum allocation size for the LDS (32 KB) for an
AMD FirePro W9100 GPU, we can only schedule two work groups per compute
unit. (A compute unit has 4 SIMD engines.) Thus, it cannot schedule enough
work groups for a SIMD to hide latencies, which results in poor performance.
Instead of storing it in the LDS, we store it in the global memory, whose access
Figure 2.5. Overview of the algorithm. In this illustration, the VD patch has 24 rays
intersecting the root AABB; it builds a BVH with depth 3.
latency is higher than the LDS, but we do not have such a restriction in the size
for the global memory. Since we do not use the LDS for the storage of the BVH
data in this approach, the LDS usage is not the limiting factor for concurrent
work group execution in a SIMD. The limiting factor is now the usage of vector
general purpose registers (VGPRs). Our current implementation allows us to
schedule 12 work groups in a compute unit (CU), which is 3 per SIMD, as the
kernel uses 72 VGPRs per SIMD lane.
Because we know the maximum number of work groups executed concurrently
in a CU for this kernel, we can calculate the number of work groups executed in
parallel on the GPU. We used an AMD FirePro W9100 GPU, which has 44 CUs.
Thus, 528 work groups (44 CUs × 12 work groups) are launched for the kernel.
A work group processes VD patches one after another and executes until no VD
patch is left unprocessed. As we know the number of work groups executed, we
allocate memory for the BVH storage in global memory before execution and
assign each chunk of memory for a work group as a work buffer. In all the test
cases, we limit the maximum subdivision level to 5, and thus a 13-MB (= 26 KB
× 528) work buffer is allocated.
After work groups are launched and a VD patch is fetched, we first compute
the required subdivision level for the patch by comparing the extent of the AABB
of the root node to the area of a pixel at the distance from the camera. As we
allow instancing for shapes with vector displacement maps (e.g., the same patch
can be at multiple locations in the world), we need to compute the subdivision
level for all the rays. Work items are used to process rays in parallel at this step.
Once a subdivision level is computed for a ray, the maximum value is selected
using an atomic operation to an LDS value.
Then, work items compute the leaf node data, i.e., the AABB, texture coordinate,
and normal vector of each leaf, in parallel. If the number of leaf nodes is higher
than the number of work items executed, a work item processes multiple nodes
sequentially. Once the leaf level of the BVH is built, the work group ascends the
hierarchy one step and computes the nodes at the next level, again using work
items in parallel. Since we write node data to global memory at one
level and then read it at the next level, we need to guarantee that the write and
read order is kept. This is enforced by placing a global memory barrier, which
guarantees the order in a work group only; thus, it can be used for this purpose.
This process is repeated until it reaches the root of the hierarchy. Pseudocode
for the parallel BVH build is shown in Listing 2.2.
{
  int nc = (1 << level);
  int nf = (1 << (level + 1));
  int oc = getOffset( level );
  int of = getOffset( level + 1 );

  while( localIdx < nc*nc )
  {
    int ii = localIdx % nc;
    int jj = localIdx / nc;

    GridCell g  = myCells[ of + (2*ii)   + (2*jj)  *nf ];
    GridCell g1 = myCells[ of + (2*ii+1) + (2*jj+1)*nf ];
    GridCell g2 = myCells[ of + (2*ii+1) + (2*jj)  *nf ];
    GridCell g3 = myCells[ of + (2*ii)   + (2*jj+1)*nf ];

    myCells[ oc + ii + jj*nc ] = merge( g, g1, g2, g3 );

    localIdx += WG_SIZE*WG_SIZE;
  }
  GLOBAL_BARRIER;
}
Listing 2.2. BVH build, starting with the leaf-level build and then the upper-level build.
Once the hierarchy is built, we switch the work item usage from a work item
for a node to a work item for a ray. A work item reads a ray from the list of rays
hitting the AABB of the VD patch. A ray is then transformed to the object space
of the model and traversed using the hierarchy information. If the current hit
is closer than the last found hit, the hit distance, element index, normal vector,
and texture coordinate at the hit point are updated. However, we cannot simply
write this hit information because a ray can be processed by more than one work
item in different work groups. The current OpenCL programming model does
not have a mechanism to have a critical section, which would be necessary for our
case.1 Instead, we used 64-bit atomic operations, which are not optimal in terms
of performance, but at least we avoid the write hazard. As the element index,
quantized normal vector, and quantized texture coordinate are each 32-bit values,
the hit distance is converted into a 32-bit integer and placed in the upper 32 bits
to form a 64-bit integer. By using an atomic min operation, we
can store the closest hit information (Figure 2.5).
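To make the packing concrete, the following CPU-side C++ sketch shows the same idea; the field layout and helper names are illustrative, and the kernel itself relies on a 64-bit atomic min rather than the compare-exchange loop emulated here:

#include <atomic>
#include <cstdint>
#include <cstring>

// Pack a positive hit distance into the upper 32 bits of a 64-bit key so that a
// minimum over keys also selects the closest hit; the lower 32 bits carry one of
// the 32-bit payloads (element index, quantized normal, or quantized UV).
static uint64_t makeHitKey(float t, uint32_t payload)
{
    uint32_t tBits;
    std::memcpy(&tBits, &t, sizeof(tBits)); // positive IEEE-754 floats order like uints
    return (uint64_t(tBits) << 32) | payload;
}

// Emulate a 64-bit atomic min with a compare-exchange loop (CPU-side stand-in).
static void atomicMin64(std::atomic<uint64_t>& slot, uint64_t key)
{
    uint64_t current = slot.load(std::memory_order_relaxed);
    while (key < current &&
           !slot.compare_exchange_weak(current, key, std::memory_order_relaxed))
    {
        // current was refreshed by compare_exchange_weak; retry while still closer
    }
}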
Pseudocode for the entire kernel is shown in Algorithm 2.1.
1A barrier guarantees the order of memory access within a work group but not across different work groups.
2.5.2 Preparation
Before rendering starts, we compute AABBs for primitives and build top- and
middle-level BVHs. For VD patches, the computation of an accurate AABB
is expensive as it requires tessellation and displacement. Instead, we compute
the maximum displacement amount from a displacement texture and expand the
AABB of the quad by that amount. Although this results in a loosely fitted AABB,
which makes ray tracing less efficient than when tight AABBs are computed, it
keeps the preparation time short.

Figure 2.6. Three-level hierarchy. A leaf of the top-level BVH stores an object, which
is a middle-level BVH and a transform. A leaf of the middle-level BVH stores primitives
such as a triangle, a quad, or a VD patch. A bottom-level BVH is built on the fly
during rendering for a leaf storing a VD patch.
Figure 2.7. Some of our test scenes with and without vector displacement mapping.
We then execute the kernel described in Section 2.4, which computes the in-
tersections with VD patches. The minimum number of work groups that fills the
GPU is launched, and each work group fetches unprocessed VD patches from the
queue and processes them one after another.

2.6 Results

We did not use instancing for these tests, although we could use it to improve the
performance for a scene in which the same geometry is placed several times.
We used an AMD FirePro W9100 GPU for all the tests.
The biggest advantage of using vector displacement maps is their small mem-
ory footprints, as they create highly detailed geometry on the fly rather than
preparing a high-resolution mesh. The memory usages with the proposed method
and with pretessellation are shown in Table 2.1. The “Party” scene requires the
most memory and does not fit into any existing GPU’s memory with pretessel-
lation. Even if we could store such a large scene in memory, it takes time to
start the rendering because of the preprocess for rendering, such as IO and spa-
tial acceleration structure build. This prevents a fast iteration of modeling and
rendering. On the other hand, those overheads are low when direct ray tracing
of vector displacement maps is used. The difference is noticeable, even for the
simplest “Pumpkin” scene.
The advantage in memory footprint is obvious, but the question is, “What
is the cost at runtime (i.e., the impact on rendering speed)?” Despite its
complexity in the ray-casting algorithm, direct ray tracing of vector displacement
maps was faster for most of the experiments. We rendered direct illumination of
the scene under an environment light (i.e., one primary ray cast and one shadow
ray cast) and measured the breakdown of the rendering time, which is shown
in Figure 2.8.2 Pretessellation is faster only for the “Pumpkin” scene whose
geometric complexity is the lowest among all tests. Pretessellation is slower for
the “Bark” scene and it fails to render the other two larger scenes. This is
interesting because direct ray tracing is doing more work than pretessellation.
This performance comes from the less divergent computation of direct ray tracing
(i.e., the top- and middle-level hierarchies are relatively shallow, and we batch the
rays intersecting a VD patch).
To understand the ray-casting performance for direct ray tracing better, we
analyzed the breakdown of each ray-cast operation for the scenes (Figure 2.9).
These timings include kernel launch overhead, which is substantial especially for
sorting that requires launching many kernels. Computation time for sorting is
roughly proportional to the number of hit pairs, although it includes the over-
head. Most of the time is spent on bottom-level BVH build and ray casting for
VD patches.

2The renderer is a progressive path tracer, and thus all screenshots are taken after it casts

Figure 2.8. Breakdown of computational time for a frame. There are two graphs for
each scene: one with pretessellation and the other (VD) with the proposed method. The
“Barks” scene cannot be rendered without using instancing with VD patches.

Figure 2.9. Time for top and middle ray casts, sort, and bottom ray cast.

The time does not change much when we compare primary and
shadow ray casts for the “Barks” scene, although the number of shadow rays is
smaller than the number of primary rays. This indicates the weakness of the
method, which is that the bottom-level BVH construction cost can be amortized
when there are a large number of rays intersecting with a VD patch, but it cannot
be amortized if this number is too low. This is why the ray casting for shadow
rays in the “Pumpkin” scene is so slow compared to the time with pretessella-
tion. The situation gets worse as the ray depth increases. We rendered indirect
illumination with five ray bounces (depths) for the “Bark” scene (Figure 2.10).
Figure 2.11 shows the ray casting time measured for each ray bounce. Although
the number of active rays decreases as it goes deeper, the ray casting time did
not decrease much. This could be improved by caching the generated bottom-level
BVHs, which are currently discarded and rebuilt for each ray-casting operation. This
is an opportunity for future research.
Figure 2.10. The “bark” scene rendered with five-bounce indirect illumination.
Figure 2.11. Ray casting time for each ray depth in indirect illumination computation.
Those marked (sh) are ray casts for shadow rays.
2.7 Conclusion
In this chapter, we have presented a method to ray-trace vector displacement-
mapped surfaces on the GPU. Our experiments show that direct ray tracing
requires only a small memory footprint, and its ray-tracing performance is com-
petitive with or faster than ray tracing with pretessellation. The advantage grows
as more VD patches are added to the scene.
From the breakdown of the rendering time, we think that optimizing the BVH
build for the scene and ray casting for simple geometries such as triangles and
quads are not as important as optimizing the bottom-level hierarchy build and
ray casting because the complexity of the bottom-level hierarchy easily becomes
higher than the complexity of the top- and middle-level hierarchies once we start
adding vector displacement to the scene.
3
VI
Smooth Probabilistic Ambient Occlusion for Volume Rendering
Thomas Kroes, Dirk Schut, and Elmar Eisemann
3.1 Introduction
Ambient occlusion [Zhukov et al. 98] is a compelling approach to improve depth
and shape perception [Lindemann and Ropinski 11, Langer and Bülthoff 99],
to give the illusion of global illumination, and to efficiently approximate low-
frequency outdoor lighting. In principle, ambient occlusion computes the light
accessibility of a point, i.e., it measures how much a point is exposed to its sur-
rounding environment.
An efficient and often-used version of ambient occlusion is screen-space am-
bient occlusion [Kajalin 09]. It uses the depth buffer to compute an approximate
visibility. This method is very appealing because its computational overhead
is minimal. However, it cannot be applied to direct volume rendering (DVR)
because voxels are typically semitransparent (defined via a transfer function).
Consequently, a depth buffer would be ambiguous and is not useful in this con-
text.
The first method to compute ambient occlusion in DVR, called vicinity shad-
ing, was developed by Stewart [Stewart 03]. This method computes the ambi-
ent occlusion in each voxel by taking into account how much the neighboring
voxels obscure it. The resulting illumination is stored in an additional volume,
which needs to be recomputed after each scene modification. Similarly, Hernell
et al. [Hernell et al. 10] computed ambient occlusion by ray tracing inside a small
neighborhood around the voxel. Kroes et al. extended this method by taking the
entire volume into account [Kroes et al. 12].
Our approach tries to avoid costly ray tracing and casts the problem into a
filtering process. In this sense, it is similar in spirit to Penner and Mitchell’s
Figure 3.1. The hemisphere around a point that determines ambient occlusion (left).
The blue part is unoccluded. Volumetric obscurance relies on a full sphere (right).
method [Penner and Mitchell 08], which uses statistical information about the
neighborhood of the voxels to estimate ambient occlusion, as well as the method
by Ropinski et al., which is similar and also adds color bleeding [Ropinski et al. 08].
Furthermore, our approach relates to Crassin et al.’s [Crassin et al. 10], which
proposes the use of filtering for shadow and out-of-focus computations.
Our Smooth Probabilistic Ambient Occlusion (SPAO) is a novel and easy-
to-implement solution for ambient occlusion in DVR. Instead of applying costly
ray casting to determine the accessibility of a voxel, this technique employs a
probabilistic heuristic in concert with 3D image filtering. In this way, ambient
occlusion can be efficiently approximated and it is possible to interactively modify
the transfer function, which is critical in many applications, such as medical and
scientific DVR. Furthermore, our method offers various tradeoffs among memory,
performance, and visual quality. Very few texture lookups are needed in
comparison to ray-casting solutions, and the interpretation as a filtering process
ensures a noise-free, smooth appearance.
only the notion of blocked and unblocked rays in the following. Please notice
that we can interpret intermediate values of V as a probability for a ray to be
blocked. For example, if V returns a value of 0.5, there is a 50% chance for a ray
to be blocked.
It is also possible to integrate the visibility function over the whole sphere
around a point, making Ω a full sphere, instead of a hemisphere and making
it independent of n. The result is called obscurance and denoted A(p), and it
produces similar effects. Calculating obscurance instead of ambient occlusion has
the advantage that it does not require a normal. However, with this definition,
parts of the volume located behind the point also intervene in the computation.
This property can be a disadvantage for standard scenes, as the
result might become too dark, but in the context of DVR, it is sometimes even
preferable, as it will unveil information below the surface, which is often desired.
Both ambient occlusion and obscurance only depend on the geometry of the
volume. Therefore, they can be stored in an additional volume that is then
used to modulate the original volume’s illumination. The occlusion values can
be calculated directly from the opacity of the original volume. Nonetheless, the
values have to be recomputed when the original volume changes—for example,
when the user changes the transfer function. This latter step can be very costly
and makes it impossible to interact with transfer functions while maintaining a
high visual fidelity. Our approach is fast to compute and enables a user to quickly
apply such modifications without having to wait a long time for the result.
Initially, our solution will be explained in the context of obscurance, but
in Section 3.3, we will extend our algorithm to approach ambient occlusion by
making use of the normals to reduce the influence of the part of the volume below
the surface.
3.2.1 Overview
To approximate obscurance at a certain point in the volume, we avoid ray casting.
Instead, we introduce an approximation that is based on the probability of the
rays being blocked by the volume. Instead of solving A(p) and its integral entirely,
we consider a limited region around p, formed by volumes of increasing size.
The volume between successive volumes forms a layer of voxels, a so-called shell
(Figure 3.2). We will show how to derive the probability of a random ray to be
blocked by a shell. From this result, we deduce an approximation of the integral
A(p) assuming that the entire volume is represented by a single shell. Finally, the
results for these various shells are combined heuristically to yield our occlusion
approximation for the entire volume.
First, we consider shells being represented by a sphere with a one-voxel-wide
boundary S. These shells are formed by a set of successive spheres, which each
grow in radius by one voxel. In this situation, if we consider one independent
shell, any random ray sent from its center will intersect exactly one voxel. If all
directions are equally likely, the probability for a ray to be blocked then boils
down to an average of all voxel values in the shell, averageS(p).

Figure 3.2. A shell is a layer of voxels formed by the difference between two differently
sized volumes. By creating a cascade of these volumes, a set of shells is formed. For
each shell, we approximate the probability of a ray to be blocked and combine these
probabilities heuristically to form the final obscurance value.

Looking carefully
at this definition, it turns out that this probability is equivalent to solving for A
in the presence of a single shell.
If we now decompose the volume into such a set of shells around a point, we
can compute the probability of the rays to be blocked by each shell, but still
need to combine all these blocking contributions together. In order to do so, we
make use of a heuristic. We assume a statistical independence between the value
distributions in the various shells. The probability of rays originating at p to be
blocked by a set of n enclosing shells \(\{S_i\}_{i=1}^{n}\), ordered from small to large, is then
given by
\[
1 - \prod_{i=1}^{n} \bigl(1 - \mathrm{average}_{S_i}(p)\bigr).
\]
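A direct transcription of this combination heuristic, as a minimal CPU-side C++ sketch that assumes the per-shell averages have already been gathered, looks as follows:

#include <vector>

// Combine per-shell blocking probabilities (the averaged opacities), ordered from
// the smallest to the largest shell, into the probability that a random ray
// starting at p is blocked.
float combineShells(const std::vector<float>& shellAverages)
{
    float unblocked = 1.0f;
    for (float average : shellAverages)
        unblocked *= 1.0f - average;   // independence assumption between shells
    return 1.0f - unblocked;
}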
Figure 3.3. In this 2D illustration, the shell on the right is a one-voxel-thick hull that
is formed by subtracting the average opacity from level 1 (in the middle) from level 2
(on the left).
Figure 3.5. Volumetric obscurance using (a) ray tracing (256 rays/voxel), (b) mipmap
filtering, and (c) N-buffer filtering.
Consequently, only the average and the relative change in size \(S_1/S_2\) are needed
to deduce \(\mathrm{average}_S\), which facilitates computations further. Imagine that each
cube is obtained by doubling the length of each edge of its predecessor. Then,
the volume ratio is 1:8, resulting in \(\mathrm{average}_S = \frac{8}{7}\bigl(A_2 - \frac{1}{8}A_1\bigr)\).
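Spelling out where this fraction comes from: the larger cube has eight times the volume of its predecessor, so the shell between them covers the remaining seven volume units and its average follows as
\[
\mathrm{average}_S = \frac{8\,A_2 - A_1}{8 - 1} = \frac{8}{7}\left(A_2 - \frac{1}{8}A_1\right).
\]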
Figure 3.6. A 2D example of how N-buffers are calculated. A dataset is shown on the
left, with the first two N-buffer levels next to it. In each level, the average of four values
of the previous level is combined into one value.
The N-buffer construction is efficient, as each new level can be computed from
the previous using only eight lookups. A 2D example of the calculation is shown
in Figure 3.6. Nonetheless, N-buffers result in higher memory consumption, so
it can be useful to apply a few mipmap levels before processing the rest using
N-buffers.
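As an illustration of the eight-lookup update, the following minimal CPU-side C++ sketch computes one N-buffer level for a dim³ opacity volume; the flat indexing helper and the clamping at the border are assumptions of this sketch rather than details taken from the implementation:

#include <algorithm>
#include <cstddef>
#include <vector>

// Level l stores, per voxel, the average opacity of the (2^l)-wide cube anchored
// at that voxel; each level is built from eight lookups into the previous one.
static std::size_t at(int x, int y, int z, int dim)
{
    return (static_cast<std::size_t>(z) * dim + y) * dim + x;
}

std::vector<float> buildNextNBufferLevel(const std::vector<float>& prev,
                                         int dim, int level) // level >= 1
{
    const int step = 1 << (level - 1);          // offset between the eight sub-cubes
    std::vector<float> next(prev.size(), 0.0f);
    for (int z = 0; z < dim; ++z)
    for (int y = 0; y < dim; ++y)
    for (int x = 0; x < dim; ++x)
    {
        float sum = 0.0f;
        for (int k = 0; k < 8; ++k)             // eight lookups into the previous level
        {
            int sx = std::min(x + ((k >> 0) & 1) * step, dim - 1);
            int sy = std::min(y + ((k >> 1) & 1) * step, dim - 1);
            int sz = std::min(z + ((k >> 2) & 1) * step, dim - 1);
            sum += prev[at(sx, sy, sz, dim)];
        }
        next[at(x, y, z, dim)] = sum * 0.125f;  // average of the eight samples
    }
    return next;
}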
Figure 3.7. The lookups of the cubes from a point with a normal of length 0.75 in the
upward direction.
However, in DVR, a normal is not always clearly defined, e.g., inside a ho-
mogeneous semitransparent volume like jelly pudding. Similarly, between two
different semitransparent voxels it is less clear how to define a normal than at
the interface between an opaque and a transparent material. Consequently, we pro-
pose to scale the cube offset based on how strong the gradient is. Interestingly,
while most techniques derive normals from the normalized gradient via central
differences, we can use the gradient magnitude to determine if a normal is clearly
defined. Hence, we propose to remove the normalization operation and instead
normalize the voxel values themselves to the range [0,1], which will lead to the
gradient becoming an appropriately scaled normal. Additionally, we allow the
user to specify a global scale to either pronounce or reduce the impact of this
ambient-occlusion approximation (Figure 3.8).
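A minimal sketch of this idea in C++ follows; the caller-supplied sampler for the [0,1]-normalized volume and the user scale are assumptions of the sketch, and the sampler is expected to handle border clamping:

// Central-difference gradient of the [0,1]-normalized opacity volume, used directly
// as a magnitude-scaled normal: where the gradient is weak, the resulting "normal"
// is short and the offset of the cube lookups shrinks accordingly.
struct Vec3 { float x, y, z; };

typedef float (*VolumeSampler)(int x, int y, int z);

Vec3 scaledNormal(VolumeSampler volume, int x, int y, int z, float userScale)
{
    Vec3 g;
    g.x = 0.5f * (volume(x + 1, y, z) - volume(x - 1, y, z));
    g.y = 0.5f * (volume(x, y + 1, z) - volume(x, y - 1, z));
    g.z = 0.5f * (volume(x, y, z + 1) - volume(x, y, z - 1));
    g.x *= userScale;
    g.y *= userScale;
    g.z *= userScale;
    return g;
}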
Table 3.1. Performance measurements for the Macoessix data set (512 × 512 × 512) for
N-buffers and mipmap-based SPAO. For each technique we show the time it takes to
compute the individual levels and to combine them into an ambient occlusion volume.
3.4 Results
Our method has been implemented in a CUDA-based stand-alone software pro-
gram for DVR. The program and its source code are available under the original
BSD license. It is shipped with sample datasets. The transfer function and, thus,
the visual representation can be changed on the fly. Also, the user can select from
three different methods of ambient occlusion computation: mipmaps, N-buffers,
and ray tracing. Our program makes use of CUDA 3.0 texture objects and will
not support lower CUDA versions.
We tested the performance of our technique using the publicly available Ma-
coessix dataset from the Osirix website1 (see Table 3.1). All tests were performed
on an Intel Xeon W3530 (2.80 GHz) workstation with 12 GB RAM and a GeForce
GTX TITAN Graphics Card with 4 GB of RAM. N-buffers are slightly more
costly than mipmaps, but both are orders of magnitude faster than a volumet-
ric ambient-occlusion ray tracer. The latter takes more than four minutes, see
Table 3.1.
Figure 3.9 shows some results of our approach on the Backpack and Manix
datasets.
3.5 Conclusion
This chapter presents a novel approach to compute ambient occlusion for DVR.
We demonstrate that by considering the ambient-occlusion computation as a
filtering process, we can significantly improve efficiency and make it usable in a
real-time DVR application. Such an approach is useful for medical visualization
applications, where transfer functions are very often subject to change.
1 http://www.osirix-viewer.com/datasets/
Figure 3.9. SPAO applied to the Backpack (512×512×461) and Manix (512×512×460)
data sets.
Our approach is efficient and simple to implement and leads to a very good
quality/performance tradeoff. Nonetheless, we also experimented with more com-
plex combinations of the shells, especially as the assumption of independence of
the occlusion probabilities does not hold in most datasets. In practice,
it turns out that our solution seems to be a good choice, and any increase in
complexity also led to a significant performance impact. Nonetheless, this topic
remains interesting for future work. Furthermore, we would like to investigate
approximating physically plausible light transport, such as global illumination,
with our filtering technique, which could further enhance the volume depiction.
Bibliography
[Crassin et al. 10] Cyril Crassin, Fabrice Neyret, Miguel Sainz, and Elmar Eise-
mann. “Efficient Rendering of Highly Detailed Volumetric Scenes with Giga-
Voxels.” In GPU Pro: Advanced Rendering Techniques, edited by Wolfgang
Engel, pp. 643–676. Natick, MA: A K Peters, 2010.
[Décoret 05] Xavier Décoret. “N-Buffers for Efficient Depth Map Query.” Com-
puter Graphics Forum 24:3 (2005), 393–400.
[Hernell et al. 10] Frida Hernell, Patric Ljung, and Anders Ynnerman. “Local
Ambient Occlusion in Direct Volume Rendering.” IEEE Transactions on
Visualization and Computer Graphics 16:4 (2010), 548–559.
Welcome to the 3D Engine Design section of this edition of GPU Pro. The
selection of chapters you will find in here covers a range of engine design problems.
First, Holger Gruen examines the benefits of a block-wise linear memory lay-
out for binary 3D grids in the chapter “Block-Wise Linear Binary Grids for
Fast Ray-Casting Operations.” This memory layout allows mapping a number
of volumetric intersection algorithms to binary AND operations. Bulk-testing a
subportion of the voxel grid against a volumetric stencil becomes possible. The
chapter presents various use cases for this memory layout optimization.
Second, Michael Delva, Julien Hamaide, and Ramses Ladlani present the
chapter “Semantic-Based Shader Generation Using Shader Shaker.” This chapter
offers one solution for developing and efficiently maintaining shader permutations
across multiple target platforms. The proposed technique produces shaders auto-
matically from a set of handwritten code fragments, each responsible for a single
feature. This particular version of the proven divide-and-conquer methodology
differs in the way the fragments are being linked together by using a path-finding
algorithm to compute a complete data flow through shader fragments from the
initial vertex attributes to the final pixel shader output.
Finally, Shannon Woods, Nicolas Capens, Jamie Madill, and Geoff Lang
present the chapter “ANGLE: Bringing OpenGL ES to the Desktop.” ANGLE
is a portable, open-source, hardware-accelerated implementation of OpenGL ES
2.0 used by software like Google Chrome. The chapter provides a detailed look
at the Direct3D 11 backend implementation of ANGLE, explains how certain
challenges were handled, and gives recommended practices for application
developers using ANGLE.
I hope you enjoy this edition’s selection, and I hope you find these chapters
inspiring and enlightening to your rendering and engine development work.
Welcome!
—Wessam Bahnassi
1
VII
Block-Wise Linear Binary Grids for Fast Ray-Casting Operations
Holger Gruen
1.1 Introduction
Binary grids only contain one bit of information per cell. Even reasonably high
grid resolutions (e.g., 4096 × 4096 × 256 amount to 512 MB of memory) still fit
into GPU memory and are thus practical in real-time situations.
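The quoted figure follows directly from storing one bit per cell:
\[
4096 \times 4096 \times 256 \ \text{bits} = 2^{32}\ \text{bits} = 2^{29}\ \text{bytes} = 512\ \text{MB}.
\]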
This chapter examines the benefits of a block-wise linear memory layout for
binary 3D grids. This memory layout allows mapping a number of volumetric
intersection algorithms to binary AND operations. Bulk-testing a subportion of
the voxel grid against a volumetric stencil becomes possible. The number of
arithmetic operations and the amount of memory words to be accessed is lower
than for regular sampling schemes.
Below, techniques for rendering binary grids are discussed. The text then
describes how to use block-wise linear grids to cast rays through the grid to
detect occluded light sources in the context of an indirect illumination rendering
technique as a real-world use case. Finally, various other use cases for using
block-wise linear grids are discussed.
1.2 Overview
There is a wealth of work regarding the use of binary voxel grids in 3D graph-
ics: [Eisemann and Décoret 06] lists various applications, specifically some from
the area of shadowing; [Penmatsa et al. 10] describes a volumetric ambient occlu-
sion algorithm; and [Kasik et al. 08] presents the use for precomputed visibility
applications, to name a few.
The rendering of binary voxel grids (BVGs) is often realized by mapping the
third axis (e.g., the z-axis) of the grid to the bits of the pixels of a multiple render
target (MRT) setup. During rendering, voxels/bits along the z-axis are set using
Figure 1.1. A 4 × 4 × 4 voxel grid fits into two consecutive 32-bit words.
These bigger blocks store each of the 4 × 4 × 4 subblocks of which they are
composed in two consecutive 32-bit integer locations. Figure 1.2 depicts this
idea for an 8 × 8 × 8 block that maps to sixteen 32-bit integer words.
In order for readers to start using the described memory layout, Listing 1.1
provides an implementation of a function that can be used to compute the buffer
address and bit-value for a given grid size and block size.
Please note that the number of bits that are set in each 4 × 4 × 4 portion of
the grid can be used to compute a value for volumetric coverage. Modern GPUs
have operations that can count the nonzero bits in integer values—thus mapping
bits to coverage is efficient.
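For instance, a small C++ sketch of this bit-count-to-coverage mapping, with std::popcount standing in for the GPU's bit-count instruction, could look like this:

#include <bit>
#include <cstdint>

// Fractional coverage of one 4 x 4 x 4 cell group stored in two 32-bit words:
// the number of set bits divided by the 64 voxels of the block.
float coverage4x4x4(uint32_t lowWord, uint32_t highWord)
{
    int occupied = std::popcount(lowWord) + std::popcount(highWord);
    return static_cast<float>(occupied) / 64.0f;
}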
Another way to store the 4 × 4 × 4 blocks of bits in a memory-coherent way
under Direct3D 11, instead of using a 1D buffer of unsigned integers, is to use
a RWTexture3D<uint2>. In this case, each texel encodes one 4 × 4 × 4 block of
the grid.
// Return the offset into the buffer in bytes in .x and the
// value to OR into the 32-bit integer to set the grid pos in .y
uint2 computeOffsetAndVal( float3 pos,      // 3D pos in the grid
                           float GridSize,  // size of the grid
                           float BlockRes ) // block size,
                                            // e.g., 8 to pack 8x8x8
{
  // Compute which of the BlockRes x BlockRes x BlockRes blocks
  // pos is in
  float3 block_pos = floor( floor( pos ) * ( 1.0f / BlockRes ) );

  // Compute 3D position within subblock
  float3 sub_pos = floor( floor( pos ) % BlockRes );

  // Compute the size of a grid with grid cells each BlockRes wide
  float RGS = GridSize / BlockRes;

  // Block size in bytes
  uint block_size = uint( BlockRes * BlockRes * BlockRes ) / 8;

  // Byte offset to the BlockRes x BlockRes x BlockRes block pos is in
  uint block_off = block_size * uint( block_pos.x +
                                      block_pos.y * RGS +
                                      block_pos.z * RGS * RGS );

  // Compute which of the final 4x4x4 blocks the voxel resides in
  float3 sub_block_pos = floor( sub_pos * 0.25f );

  // Compute the bit position inside the final 4x4x4 block
  float3 bit_pos = sub_pos % 4.0f;

  // Compute the size of a block in 4x4x4 units
  float FBS = BlockRes * 0.25f;

  // Compute byte offset for the final 4x4x4 subblock in the current
  // BlockRes x BlockRes x BlockRes block
  uint off = 8.0f * ( sub_block_pos.x +
                      sub_block_pos.y * FBS +
                      sub_block_pos.z * FBS * FBS );

  return uint2(
    // Add memory offsets and add final offset based on z
    block_off + off + ( bit_pos.z > 1.0f ? 0x4 : 0x0 ),
    // Bit to OR into the selected 32-bit word
    0x1 << uint( bit_pos.x + bit_pos.y * 4.0f +
                 ( bit_pos.z % 2.0f ) * 16.0f ) );
}
Listing 1.1. Compute the offset and bit position for a position in a block-linearly stored
binary grid.
struct GS_RenderGridInput
{
  float3 f3WorldSpacePos : WSPos;
  ...
};

GS_RenderGridInput VS_BinaryGrid( VS_RenderSceneInput I )
{
  GS_RenderGridInput O;

  // Pass on world-space position---assuming WS is passed in
  O.f3WorldSpacePos = I.f3Position;

  // Compute/pass on additional stuff
  ...

  return O;
}

struct PS_RenderGridInput
{
  float4 f4Position     : SV_POSITION;
  float3 f3GridPosition : GRIDPOS;
};

[ maxvertexcount( 3 ) ]
void GS_BinaryGrid( triangle GS_RenderGridInput input[ 3 ],
                    inout TriangleStream<PS_RenderGridInput> Triangles )
{
  PS_RenderGridInput output;

  // g_WorldSpaceGridSize contains the world-space size of the grid
  float3 f3CellSize = g_WorldSpaceGridSize.xyz *
                      ( 1.0f / float( BINARY_GRID_RES ).xxx );

  float3 gv[ 3 ], v[ 3 ];

  gv[ 0 ] = ( input[ 0 ].f3WorldSpacePos - g_SceneLBFbox.xyz ) /
            f3CellSize;
  gv[ 1 ] = ( input[ 1 ].f3WorldSpacePos - g_SceneLBFbox.xyz ) /
            f3CellSize;
  gv[ 2 ] = ( input[ 2 ].f3WorldSpacePos - g_SceneLBFbox.xyz ) /
            f3CellSize;

  // Compute triangle edges
  float3 d0 = gv[ 1 ] - gv[ 0 ];
  float3 d1 = gv[ 2 ] - gv[ 0 ];

  // Compute triangle normal
  float3 N = normalize( cross( d0, d1 ) );
  float3 C = ( 1.0f / 3.0f ) * ( gv[ 0 ] + gv[ 1 ] + gv[ 2 ] );

  // Set up view axes for looking along the triangle normal
  float3 xaxis = normalize( d1 );
  float3 yaxis = cross( N, xaxis );

  // Set up view matrix for looking along the triangle normal
  float4x4 ViewMatrix = {
    xaxis.x, xaxis.y, xaxis.z, -dot( xaxis, Eye ),
    yaxis.x, yaxis.y, yaxis.z, -dot( yaxis, Eye ),
    N.x,     N.y,     N.z,     -dot( N, Eye ),
    0.0f,    0.0f,    0.0f,    1.0f
  };

  // Set up a projection matrix using a constant;
  // g_ViewportResolution is a constant set by the application
  float4x4 ProjMatrix =
  {
    ...
  };

  // Project vertices and pass on grid-space position
  [ unroll ] for ( int i = 0; i < 3; ++i )
  {
    output.f4Position     = mul( ProjMatrix, float4( v[ i ], 1.0f ) );
    output.f3GridPosition = gv[ i ];
    Triangles.Append( output );
  }
  Triangles.RestartStrip( );
}

RWByteAddressBuffer BinaryGrid : register( u0 );

  // Turn on the bit for the current grid position
  BinaryGrid.InterlockedOr( off_val.x, off_val.y, old );
}
Listing 1.2. Vertex and geometry shader fragments for one-pass voxelization under
Direct3D 11.
For all voxels along the ray, start an iterator V (I) at the start point of
the ray.
1. Determine which 2N × 2N × 2N block B that V (I) sits in.
2. Determine which 4 × 4 × 4 subblock S of B that V (I) sits in.
3. Reserve two 32-bit integer registers R[2] to hold a ray subsection.
4. Build a ray segment in R.
(a) For all voxels v along the ray starting at V (I) that are still inside S,
i. set the bit in R to which v maps,
ii. advance I by 1.
5. Load two 32-bit integer words T [2] from the buffer holding G that
contain S.
6. Perform the following bitwise AND operations:
(a) R[0] & T [0],
(b) R[1] & T [1].
7. If any of the tests in Steps 6(a) or 6(b) returns a nonzero result, the ray
has hit something (see the sketch after this list).
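The bulk test at the heart of Steps 5 and 6 amounts to two bitwise ANDs; a minimal CPU-side C++ sketch with illustrative names is:

#include <cstdint>

// Test a precomputed ray-segment mask for one 4x4x4 subblock against the two
// 32-bit grid words of that subblock; any nonzero AND means the ray segment
// touches an occupied voxel.
bool raySegmentHits(uint32_t rayMaskLow, uint32_t rayMaskHigh,
                    uint32_t gridLow,    uint32_t gridHigh)
{
    return ((rayMaskLow & gridLow) | (rayMaskHigh & gridHigh)) != 0u;
}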
Listing 1.3 provides the implementation details. In order to hide the fact that
a discrete binary grid is used, the edges cast through the grid are randomized
using pseudorandom numbers. Also, instead of computing unblocked and blocked
indirect light separately, the shaders in Listing 1.3 cast a ray segment toward each
VPL that is considered.
// Compute a long-word-sized offset into the grid for a grid
// position pos
uint compute4x4x4BlockLWOffset( float3 pos, float GridRes, float BlockRes )
{
  float3 block_pos = floor( floor( pos ) * ( 1.0f / BlockRes ) );

  // Local address in block
  float3 sub_pos = floor( floor( pos ) % BlockRes );

  ...
// Trace an edge through the binary grid in 4x4x4 blocks
float traceEdgeBinaryGrid( float3 f3CPos, // start pos of ray
                           float3 f3CN,   // normal at start pos of ray
                           float3 f3D,    // normalized direction of ray
                           float3 f3Pos,  // end pos of ray/edge
                           float3 f3N )   // normal at end pos
{
  float fCount = 0.0f;

  // g_SceneBoxSize is the world-space size of the scene
  float3 f3CellSize = g_SceneBoxSize.xyz *
                      ( 1.0f / float( BINARY_GRID_RES ).xxx );

  // Step along the normal to get out of the current cell
  // to prevent self-occlusion;
  // g_SceneLBFbox is the left, bottom, and front pos of the world box
  float3 f3GridPos    = ( f3CPos + ( f3CN * f3CellSize ) -
                          g_SceneLBFbox.xyz ) / f3CellSize;
  float3 f3DstGridPos = ( f3Pos + ( f3N * f3CellSize ) -
                          g_SceneLBFbox.xyz ) / f3CellSize;

  // Clamp to the grid;
  // BINARY_GRID_RES holds the resolution/size of the binary grid
  float3 f3GridCoord    = max( ( 0.0f ).xxx,
                               min( ( BINARY_GRID_RES - 1 ).xxx, floor( f3GridPos ) ) );
  float3 f3DstGridCoord = max( ( 0.0f ).xxx,
                               min( ( BINARY_GRID_RES - 1 ).xxx, floor( f3DstGridPos ) ) );

  // Compute position in a grid of 4x4x4 blocks
  float3 f3SubPos = f3GridCoord % 4.0f;

  float3 f3Dg   = f3DstGridCoord - f3GridCoord;
  float3 f3AbsD = abs( f3Dg );
  float  fMaxD  = max( max( f3AbsD.x, f3AbsD.y ), f3AbsD.z );

  // Scale step to step 1 pixel ahead
  f3Dg *= rcp( fMaxD );

  // Where do we step out of the local 4x4x4 grid?
  float3 f3LocalDest = ( f3Dg < 0.0f ? -1.0f : 4.0f );

  float fLoopCount = 0.0f;

  ...

    fLoopCount += 1.0f;

    // Load the local 4x4x4 grid
    grid.x = g_bufBinaryGrid[ offset++ ];

    ...

    // Build line mask for current 4x4x4 grid
    [ unroll ] for ( int ss = 0; ss < 4; ++ss )
    {
      [ flatten ] if ( fSteps > 0.5f )
      {
        uint bitpos = uint( f3SubPos.x + ( f3SubPos.y * 4.0f ) +
                            ( ( f3SubPos.z % 2.0f ) * 16.0f ) );

        lineseg.x |= f3SubPos.z > 1.0f ? 0x0 : ( 0x1 << bitpos );
        lineseg.y |= f3SubPos.z < 2.0f ? 0x0 : ( 0x1 << bitpos );

        f3SubPos    += f3Dg;
        f3GridCoord += f3Dg;
        fMaxD       -= 1.0f;
        fSteps      -= 1.0f;
      }
    }

    if ( ( ( lineseg.x & grid.x ) | ( lineseg.y & grid.y ) ) != 0x0 )
    {
      fCount += 1.0f;
      break;
    }

    // Recompute sub pos
    f3SubPos = f3GridCoord % 4.0f;
  }

  return fCount;
}
// Publicly available pseudorandom number algorithm
uint rand_xorshift( uint uSeed )
{
  uint rng_state = uSeed;
  rng_state ^= ( rng_state << 13 );
  rng_state ^= ( rng_state >> 17 );
  rng_state ^= ( rng_state << 5 );
  return rng_state;
}
// Compute the indirect light at f3CPosOrg, casting rays to test
// for blocked VPLs
float3 computeIndirectLight( float2 tc,        // RSM texture coord
                             float2 fc,        // fractional texture coord
                             int2   i2Off,     // offset for dithering
                             float3 f3CPosOrg, // current pos
                             float3 f3CN )     // normal at current pos
{
  float2 tmp;
  float3 f3IL = ( 0.0f ).xxx;
  int3   adr;

  adr.z = 0;
  adr.y = int( tc.y * g_vRSMDimensions.y + ( -LFS ) + i2Off.y );

  // Loop over sparse VPL kernel
  for ( float row = -LFS; row <= LFS; row += 6.0f, adr.y += 6 )
  {
    adr.x = int( tc.x * g_vRSMDimensions.x + ( -LFS ) + i2Off.x );

    for ( float col = -LFS; col <= LFS; col += 6.0f, adr.x += 6 )
    {
      // Unpack G-buffer data
      float3 f3Col, f3Pos, f3N;
      GetGBufferData( f3Col, f3Pos, f3N );

      // Compute indirect light contribution
      float3 f3D     = f3Pos.xyz - f3CPosOrg.xyz;
      float  fLen    = length( f3D );
      float  fInvLen = rcp( fLen );
      float  fDot1   = dot( f3CN, f3D );
      float  fDot2   = dot( f3N, -f3D );

      float fDistAtt = saturate( fInvLen * fInvLen );

      // Form-factor-like term
      fDistAtt *= saturate( fDot1 * fInvLen ) *
                  saturate( fDot2 * fInvLen );

      // Compute noise for casting a noisy ray
      float fNoise1 = 0.15f * computeFakeNoise( uint( adr.x + fc.x * 100 ) );
      float fNoise2 = 0.15f * computeFakeNoise( uint( adr.y + fc.y * 100 ) );

      f3Pos  -= f3D * fInvLen * fNoise1;
      f3CPos += f3D * fInvLen * fNoise2;
Listing 1.3. Compute indirect light tracing rays through a binary grid for each VPL.
Please note that the noisy indirect light is computed at a reduced resolution,
as described in [Gruen 11]. The resulting indirect light gets blurred bilaterally
and is then up-sampled to the full resolution.
The screenshots in Figures 1.3, 1.4, and 1.5 have been generated with and
without the detection of occluded VPLs.
1.7 Results
One goal of this chapter is to show that using block-wise binary grids does help
to speed up ray casting through a binary voxel grid.
In order to prove this, a standard implementation of traversing the grid has
been implemented as well.
Table 1.1 shows the performance of both methods on a 64 × 64 × 64 grid on an
NVIDIA GTX680 at 1024 × 768. In the final test, the standard implementation
is also allowed to operate on a packed grid in order to show that just the ability
to perform block-wise tests is already enough to generate a speedup.
In the test scene and the test application, block-wise tests allow for a speedup
of around 20%.
Figure 1.4. Screenshot 2: the scene with indirect light but without detecting occluded
VPLs.
Figure 1.5. Screenshot 3: the scene with indirect light from only unoccluded VPLs.
the voxel grid. If this is not intended, it is possible to change the code to test
step by step. Please note that the coherency of memory accesses for this is still
higher than performing texture lookups for each step along the ray.
strategies on how to down-sample each 2 × 2 × 2 block into just one bit do vary
depending on the application.
Similar in spirit to [Crassin et al. 11], one could switch to testing a lower mip
for intersections after a certain distance when, e.g., testing ray segments. This
would speed up the testing of longer rays.
Bibliography
[Crassin and Green 12] Cyril Crassin and Simon Green. “Octree-Based Sparse
Voxelization Using the GPU Hardware Rasterizer.” In OpenGL Insights,
edited by P. Cozzi and C. Riccio, pp. 259–278. Boca Raton, FL: CRC Press,
2012.
[Crassin et al. 11] Cyril Crassin, Fabrice Neyret, Miguel Sainz, Simon Green,
and Elmar Eisemann. “Interactive Indirect Illumination Using Voxel Cone
Tracing.” In Symposium on Interactive 3D Graphics and Games, p. 207. New
York: ACM, 2011.
[Dachsbacher and Stamminger 05] Carsten Dachsbacher and Marc Stamminger.
“Reflective Shadow Maps.” In Proceedings of the 2005 Symposium on Inter-
active 3D Graphics and Games, pp. 203–231. New York: ACM Press, 2005.
[Eisemann and Décoret 06] Elmar Eisemann and Xavier Décoret. “Fast Scene
Voxelization and Applications.” In Proceedings of the 2006 Symposium on
Interactive 3D Graphics and Games, pp. 71–78. New York: ACM, 2006.
[Gruen 11] Holger Gruen. “Real-Time One-Bounce Indirect Illumination and
Shadows using Ray Tracing.” In GPU Pro 2: Advanced Rendering Tech-
niques, edited by Wolfgang Engel, pp. 159–172. Natick, MA: A K Peters,
2011.
[Gruen 12] Holger Gruen. “Vertex Shader Tessellation.” In GPU Pro 3: Advanced
Rendering Techniques, edited by Wolfgang Engel, pp. 1–12. Boca Raton, FL:
A K Peters/CRC Press, 2012.
[Kasik et al. 08] David Kasik, Andreas Dietrich, Enrico Gobbetti, Fabio Mar-
ton, Dinesh Manocha, Philipp Slusallek, Abe Stephens, and Sung-Eui Yoon.
“Massive Model Visualization Techniques.” SIGGRAPH course, Los Ange-
les, CA, August 12–14, 2008.
[Penmatsa et al. 10] Rajeev Penmatsa, Greg Nichols, and Chris Wyman. “Voxel-
Space Ambient Occlusion.” In Proceedings of the 2010 ACM SIGGRAPH
Symposium on Interactive 3D Graphics and Games, Article No. 17. New
York: ACM, 2010.
2
VII
Semantic-Based Shader Generation Using Shader Shaker
Michael Delva, Julien Hamaide, and Ramses Ladlani
2.1 Introduction
Maintaining shaders in a production environment is hard, as programmers have
to manage an always increasing number of rendering techniques and features,
making the amount of shader permutations grow exponentially. As an example,
allowing six basic features, such as vertex skinning, normal mapping, multitex-
turing, lighting, and color multiplying, already requires 64 shader permutations.
Supporting multiple platforms (e.g., HLSL, GLSL) does not help either. Keep-
ing track of the changes made for a platform and manually applying them to the
others is tedious and error prone.
This chapter describes our solution for developing and efficiently maintaining
shader permutations across multiple target platforms. The proposed technique
produces shaders automatically from a set of handwritten code fragments, each
responsible for a single feature. This divide-and-conquer methodology was al-
ready proposed and used with success in the past, but our approach differs from
the existing ones in the way the fragments are being linked together. From a
list of fragments to use and thanks to user-defined semantics that are used to
tag their inputs and outputs, we are using a pathfinding algorithm to compute
the complete data flow from the initial vertex attributes to the final pixel shader
output.
Our implementation of this algorithm is called Shader Shaker. It is used in
production at Fishing Cactus on titles such as Creatures Online and is open
source for you to enjoy.
Code reuse. This should be the solution that is the most familiar to program-
mers. It consists of implementing a library of utility functions that will be made
available to the shaders thanks to an inclusion mechanism (e.g., include prepro-
cessor directive) allowing code to be reused easily. The main function of the
shader can then be written using calls to these functions and manually feeding
the arguments. This is a natural way of editing shaders for programmers, but it
gets difficult for the less tech savvy to author new permutations and still requires
maintaining all permutations by hand.
A related solution is the one described in [Väänänen 13], where the Python-
based Mako templating engine is used to generate GLSL shaders.
Additive solutions. These work the other way around by defining a series of el-
ementary nodes (or functions) to be aggregated later (either online or offline)
to produce the shader. The aggregation is performed by wiring nodes’ inputs
and outputs together, either visually using a node-based graph editor or pro-
grammatically. This approach has seen lots of implementations [Epic Games
Inc. 15, Holmér 15] largely because of its user friendliness, allowing artists to
produce visually pleasing effects without touching a single line of code. Its
main drawback remains the difficulty to control the efficiency of the generated
shaders [Ericson 08, Engel 08b].
A complete system for generating shaders from HLSL fragments is described
in [Hargreaves 04] in which each shader fragment is a text file containing shader
code and an interface block describing its usage context. In this framework,
fragments are combined without actually parsing the HLSL code itself. The
system was flexible enough to support adaptive fragments, which could change
their behavior depending on the context in which they were used, but lacked the
support of a graph structure (i.e., the system was restricted to a linear chain of
operations). Tim Jones implemented this algorithm for XNA 4.0 in [Jones 10].
Trapp and Döllner have developed a system based on code fragments, typed
by predefined semantics that can be combined at runtime to produce an über-
shader [Trapp and Döllner 07].
In [Engel 08a], Wolfgang Engel proposes a shader workflow based on maintain-
ing a library of files, each responsible for a single functionality (e.g., lighting.fxh,
utility.fxh, normals.fxh, skinning.fxh), and a separate list of files responsible for
stitching functions calls together (e.g., metal.fx, skin.fx, stone.fx, eyelashes.fx,
eyes.fx). This is similar to the node-based approach, but it is targeted more at
programmers. As will be shown later, our approach is based on the same idea but
differs from it (and the other node-based solutions) by the fact that the wiring is
done automatically based on user-defined semantics.
Template-based solutions. The last category finds its roots in the famous template
method pattern [Wikipedia 15b], where the general structure of an algorithm (the
program skeleton) is well defined but one is still allowed to redefine certain steps.
This is one of the higher-level techniques adopted by Unity (alongside the
regular vertex and fragment shaders), which is itself borrowed from Renderman:
the surface shader [Pranckevičius 14b]. By defining a clear interface (predefined
function names, input and output structures), the surface shader approach allows
the end user to concentrate on the surface properties alone, while all the more
complex lighting computations (which are much more constant across a game
title) remain the responsibility of the über-shader into which it will be injected.
It should be noted that it would be possible to combine this with any of the
previous three methods for handling permutations at the surface level only.
Taking the idea a bit further, [Yeung 12] describes his solution where he
extends the system with interfaces to edit also the vertex data and the lighting
formula. Unnecessary code is stripped by generating an abstract syntax tree and
traversing it to obtain the variables’ dependencies.
[Pranckevičius 12, Pranckevičius 14a]. We refer the reader to these articles for more
information, but we summarize the approaches to handling this problem into the
following four families.
The manual way. This could possibly be done with the help of macros where the
languages differ, but it does not scale well. It is still tricky because
of subtle language differences and is hard to maintain.
Use another language. Use a language (possibly a graphical one) that will com-
pile into the target shader language as output.
Cross-compile from one language to another. Lots of tools are available to trans-
late from one language to the other at source code level. The problem can be
considered as solved for DirectX 9–level shaders, but there is still work to do
for supporting the new features that have appeared since then (e.g., compute,
geometry, etc.).
Compile HLSL to bytecode and convert it to GLSL. This is easier to do than the
previous technique but suffers from a partly closed tool chain that will run on
Windows only.
2.3 Definitions
Our technique is based around the concepts of fragments and user-defined seman-
tics (not to be confused with the computer graphics fragment used to generate a
single pixel's data).
• Fragment: In this context, a fragment is a single file written in HLSL that
is responsible for implementing a single feature and that contains all the
information required for its execution, including uniform and sampler
declarations, as well as code logic. A fragment example is provided in
Listing 2.1.
• User-defined semantic: A user-defined semantic is a string literal used to tag
a fragment input or output (e.g., MeshSpacePosition, ProjectedPosition).
This tag will be used during shader generation to match one fragment's out-
put to another one's input. User-defined semantics rely on the existing HLSL
semantic feature, which is normally used for mapping shader inputs and outputs.
2.4 Overview
Shader Shaker, our shader generator, uses a new idea to generate the shader.
User-defined semantics are added to intermediate variables, as shown in List-
ing 2.1. The generation algorithm uses those intermediate semantics to gener-
ate the list of call functions. The algorithm starts from expected output, e.g.,
LitColor, and creates a graph of the functions required to generate that semantic,
up to the vertex attributes.

float4x4 WvpXf;

void GetPosition(
    in  float3 position           : VertexPosition,
    out float4 projected_position : ProjectedPosition
    )
{
    projected_position = mul( float4( position, 1 ), WvpXf );
}

Listing 2.1. A fragment example.
To generate a shader, one has to provide the system with a list of fragments
to use (vertex_skinning + projected_world_space_position + diffuse_texturing
+ normal_mapping + blinn_lighting , for example). Thanks to the semantics, it
is possible to link the desired fragments together to produce the final output
semantic required by the system (e.g., LitColor) and generate the corresponding
complete shader.
Fragments are completely decoupled; code can be written without considera-
tion of where the data comes from. For example, for a fragment that declares a
function that needs an input argument with a semantic of type ViewSpaceNormal ,
the tool will search another fragment with a function that has an output argu-
ment of the very same semantic to link to this one. In deferred rendering, the
fragment that provides this output argument with the semantic ViewSpaceNormal
would read the geometry buffer to fetch that value, whereas in forward render-
ing, a function could, for example, just return the value of the view-space normal
coming from the vertex shader. In any case, the fragment in the pixel shader that
uses this ViewSpaceNormal is agnostic to where the data it needs comes from.
To achieve this, the code generator adopts a compiler architecture, going
through separate phases:
• HLSL fragments are processed by Shader Shaker to generate for each of
them an abstract syntax tree (AST).
• The ASTs are processed to create a final AST, which contains all the needed
code (functions/uniforms/samplers). The algorithm (explained in detail in
the following section) generates this final AST from the required output
semantics (the output of the pixel shader), then goes upward to the input
semantics, calling successively all functions whose output semantic match
the input semantic of the previous function.
• Eventually, this final AST is converted to the expected output language
(e.g., HLSL, GLSL, etc.).
As the concept has been introduced, let’s dig into the algorithm.
struct FunctionDefinition
{
    set<string> InSemantic;
    set<string> OutSemantic;
    set<string> InOutSemantic;
};

Listing 2.2. The FunctionDefinition structure.
• the list of required output semantics (each of them will be mapped to a sys-
tem semantic such as COLOR0; multiple render target code can be generated
by defining multiple final output semantics);
• the list of available input semantics (this can change from mesh to mesh,
creating tailored shaders for a given vertex format).
After the parsing of all fragments, the AST is inspected to extract the signa-
ture of functions. Each function that declares one or more semantics for its argu-
ments is processed; the others are considered helper functions. A FunctionDef-
inition structure describing the function is filled with this semantic informa-
tion (see Listing 2.2). A fragment is then defined by a map of definitions
addressed by function names. It is important to note that inout function argu-
ments are supported. They are useful when a fragment wants to contribute to a
variable, e.g., summing different lighting terms into a final lit color or transforming a
vertex position through several fragments. When processing an inout semantic,
the semantic is kept in the open set. As each function can only be used once,
another function outputting the semantic is required.
The code generation is done in two steps. The first step consists of the creation
of the call graph. The algorithm is described in Listing 2.3. This algorithm
generates a directed acyclic graph of all function calls from the output to the
input. The second step consists of code generation from the call graph. As the
graph represents the calls from the output, it must be traversed depth first.
    report error, semantics in open set do not resolve
  end

  add_function( f )
  node = { f, f.InSemantic, f.InOutSemantic }

  open -= f.InSemantic
  open += f.OutSemantic
  // Add inout semantics back into the open set
  open += f.InOutSemantic
  closed += f.InSemantic

  // Link nodes that required the semantic
  for each semantic in { f.OutSemantic, f.InOutSemantic }
    node[ semantic ].children.add( node )
  end

  // Report as requiring those semantics
  for each semantic in { f.InSemantic, f.InOutSemantic }
    node[ semantic ] = node
  end

  // Remove semantics provided by vertex attributes
  open -= Vertex.AttributeSemantics;
end
To simplify code generation and debugging, the semantic is used as the variable name. The code generation algorithm is described in Listing 2.4. Finally, a map of user semantics to system semantics is generated; this information is used in the engine to bind the vertex attributes accordingly.
To illustrate this algorithm, a toy example will be executed step by step. The
fragments are defined as shown in Listing 2.5, Listing 2.6, and Listing 2.7. The
function definitions are created as shown in Listing 2.8. The required semantic is
LitColor. The algorithm generates a graph as shown in Figure 2.1. One can see
the open and closed set populated as the algorithm creates the graph. Finally,
the graph is processed to create the code shown in Listing 2.9. Note that the generated code only uses functions declared in the fragments. The final code aggregates all of the fragment code, with only the semantic information removed. Pruning unused code is not the purpose of this module; that step can be left to later stages.
Texture DiffuseTexture;
sampler2D DiffuseTextureSampler
{
    Texture = <DiffuseTexture>;
};

void ComputeNormal( in float3 vertex_normal : VertexNormal,
                    out float3 pixel_normal : PixelNormal )
{
    pixel_normal = normalize( vertex_normal );
}

float4 SomeLighting( in float4 color : DiffuseColor,
                     in float3 normal : PixelNormal ) : LitColor
{
    return ( AmbientLight
             + ComputeLight( normal ) ) * color;
}
GetDiffuseColor:
{
    InSemantic:  { "DiffuseTexCoord" }
    OutSemantic: { "DiffuseColor" }
}
ComputeNormal:
{
    InSemantic:  { "VertexNormal" }
    OutSemantic: { "PixelNormal" }
}
SomeLighting:
{
    InSemantic:  { "DiffuseColor", "PixelNormal" }
    OutSemantic: { "LitColor" }
}
Figure 2.1. The call graph generated for the toy example, showing the open and closed sets being populated as the lighting, ComputeNormal, and GetDiffuseColor nodes are added.
float4 main( in float3 VertexNormal    : NORMAL,
             in float2 DiffuseTexCoord : TEXCOORD0 )
{
    float4 DiffuseColor;
    GetDiffuseColor( DiffuseColor, DiffuseTexCoord );
    float3 PixelNormal;
    ComputeNormal( VertexNormal, PixelNormal );
    float4 LitColor
        = SomeLighting( DiffuseColor, PixelNormal );
    return LitColor;
}
Graphics quality. The same principle can be used to manage graphics quality settings. Depending on user settings or on device capabilities, appropriate fragments can be selected to balance quality against performance, as in the sketch below.
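The following is a minimal, hypothetical sketch of such a selection; the fragment file names and the Quality enumeration are assumptions for illustration and are not part of Shader Shaker.

#include <map>
#include <string>
#include <vector>

enum class Quality { Low, High };

// Returns the fragment files that provide the LitColor semantic for a given
// quality setting; cheaper fragments are substituted at low quality.
std::vector<std::string> SelectLightingFragments( Quality quality )
{
    static const std::map<Quality, std::vector<std::string>> fragment_sets =
    {
        { Quality::Low,  { "lambert_lighting.hlsl" } },
        { Quality::High, { "ggx_lighting.hlsl", "normal_mapping.hlsl" } },
    };
    return fragment_sets.at( quality );
}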
2.7.4 Programming
The metadata of generated shaders can be leveraged for data-driven features, e.g., binding the vertex attributes and the uniforms without using the rendering API to enumerate them. This is even more useful when the graphics API doesn't allow such enumeration at all.
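The C++ sketch below shows one way the generated user-semantic to system-semantic map could be consumed to bind vertex attributes. The data layout and the BindVertexAttribute call are assumptions for illustration, not Shader Shaker's actual output format.

#include <cstdio>
#include <map>
#include <string>

struct VertexAttribute
{
    int buffer_slot;
    int offset;
};

// Placeholder for the engine-specific call that wires a system semantic
// (e.g., TEXCOORD0) to a vertex stream location.
void BindVertexAttribute( const std::string &system_semantic, int slot, int offset )
{
    std::printf( "bind %s -> slot %d, offset %d\n",
                 system_semantic.c_str(), slot, offset );
}

void BindGeneratedShaderInputs(
    const std::map<std::string, std::string> &user_to_system_semantic,
    const std::map<std::string, VertexAttribute> &mesh_attributes )
{
    for ( const auto &entry : user_to_system_semantic )
    {
        // entry.first is the user semantic (e.g., "DiffuseTexCoord");
        // entry.second is the system semantic the generator assigned to it.
        const VertexAttribute &attribute = mesh_attributes.at( entry.first );
        BindVertexAttribute( entry.second, attribute.buffer_slot, attribute.offset );
    }
}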
2.7.5 Debugging
Programmers can easily debug shaders that are generated by Shader Shaker.
Indeed, the output semantics are provided as arguments to the generation process.
If an issue is suspected at any level, the shader can be regenerated with an
intermediate semantic as the output semantic. For example, if we want to display
the view-space normal, the ViewSpaceNormal semantic is provided as the output
semantic. If the semantic's variable type is too small to be output directly (e.g., a float2 where the output must be a float4), conversion code is inserted.
• Use custom semantics for uniforms and samplers. For now, semantic resolution is only applied to functions and their input/output arguments. Applying it to uniforms could also be convenient, allowing some values to be passed either at the vertex level or as uniforms.
2.9 Conclusion
This technique and its user-semantic linking algorithm bring a new way of creating shaders and of managing their combinatorial complexity. Each feature can be developed independently, depending only on the choice of semantics. Shader Shaker, our implementation, is distributed as open source software [Fishing Cactus 15].
Bibliography
[Engel 08a] Wolfgang Engel. “Shader Workflow.” Diary of a Graph-
ics Programmer, http://diaryofagraphicsprogrammer.blogspot.pt/2008/09/
shader-workflow.html, September 10, 2008.
[Engel 08b] Wolfgang Engel. “Shader Workflow—Why Shader Generators are
Bad.” Diary of a Graphics Programmer, http://diaryofagraphicsprogrammer.
blogspot.pt/2008/09/shader-workflow-why-shader-generators.html,
September 21, 2008.
[Epic Games Inc. 15] Epic Games Inc. “Materials.” Unreal Engine 4 Doc-
umentation, https://docs.unrealengine.com/latest/INT/Engine/Rendering/
Materials/index.html, 2015.
[Ericson 08] Christer Ericson. “Graphical Shader Systems Are Bad.” http://
realtimecollisiondetection.net/blog/?p=73, August 2, 2008.
[Jones 10] Tim Jones. “Introducing StitchUp: ‘Generating Shaders from HLSL
Shader Fragments’ Implemented in XNA 4.0.” http://timjones.tw/blog/
archive/2010/11/13/introducing-stitchup-generating-shaders-from-hlsl
-shader-fragments, November 13, 2010.
[Trapp and Döllner 07] Matthias Trapp and Jürgen Döllner. “Automated Com-
bination of Real-Time Shader Programs.” In Proceedings of Eurographics
2007, edited by P. Cignoni and J. Sochor, pp. 53–56. Eurographics, Aire-la-
Ville, Switzerland: Eurographics Association, 2007.
[Väänänen 13] Pekka Väänänen. “Generating GLSL Shaders from Mako Templates.” http://www.lofibucket.com/articles/mako_glsl_templates.html, October 28, 2013.
[Wikipedia 15a] Wikipedia. “A* Search Algorithm.” http://en.wikipedia.org/wiki/A*_search_algorithm, 2015.
[Wikipedia 15b] Wikipedia. “Template Method Pattern.” http://en.wikipedia.org/wiki/Template_method_pattern, 2015.
[Yeung 12] Simon Yeung. “Shader Generator.” http://www.altdev.co/2012/08/
01/shader-generator/, August 1, 2012.
3
ANGLE: Bringing OpenGL ES to the Desktop
3.1 Introduction
The Almost Native Graphics Layer Engine (ANGLE) is a portable, open source,
hardware-accelerated implementation of OpenGL ES 2.0 used by software like
Google Chrome to allow application-level code to target a single 3D API, yet ex-
ecute on platforms where native OpenGL ES support may not be present. As of
this writing, ANGLE’s OpenGL ES 3.0 implementation is under active develop-
ment. Applications may choose among ANGLE’s multiple rendering backends at
runtime, targeting systems with varying levels of support. Eventually, ANGLE
will target multiple operating systems.
ANGLE’s original development was sponsored by Google for browser support
of WebGL on Windows systems, which may not have reliable native OpenGL
drivers. ANGLE is currently used in several browsers, including Google Chrome
and Mozilla Firefox. Initially, ANGLE provided only an OpenGL ES 2.0 imple-
mentation, using Direct3D 9 as its rendering backend. D3D9 was a good initial
target since it’s supported in Windows systems running XP or newer for a very
large range of deployed hardware.
Since that time, WebGL has been evolving, and ANGLE has evolved along
with it. The WebGL community has drafted new extensions against the current
WebGL specification, as well as draft specifications for WebGL 2.0. Some of the
features contained within these, such as sRGB textures, pixel buffer objects, and
3D textures, go beyond the feature set available to ANGLE in Direct3D 9. For
this reason, it was clear that we would need to use a more modern version of
Direct3D to support these features on Windows systems, which led us to begin
work on a Direct3D 11 rendering backend.
3.2 Direct3D 11
Of the API differences we encountered while implementing ANGLE’s new Di-
rect3D 11 backend, some were relatively minor. In the case of fragment coordi-
nates, for example, Direct3D 11 more closely aligns with OpenGL ES and related
APIs, in that pixel centers are now considered to be at half-pixel locations—i.e.,
(0.5, 0.5)—just as they are in OpenGL. This eliminates the need for half-pixel
offsets to be applied to fragment coordinates as in our Direct3D 9 implementa-
tion. There are quite a few places, however, where Direct3D 11 differs from both
Direct3D 9 and OpenGL, requiring ANGLE to find new workarounds for this
rendering backend.
3.2.1 Primitives
Direct3D 9’s available set of primitive types for draw calls is more limited than
OpenGL’s, and Direct3D 11’s is reduced slightly further by removing triangle
fans. ANGLE enables GL_TRIANGLE_FAN by rewriting the application-provided
index buffer to express the same polygons as a list of discrete triangles. This is
a similar tactic to the one we employed to support GL_LINE_LOOP in Direct3D 9
(and which is still necessary in Direct3D 11), although the modification required
to index buffers for line loops is considerably simpler—we need only repeat the
initial point to close the line loop.
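As a minimal illustration (not ANGLE's actual implementation), the following sketch shows the two index-buffer rewrites just described: expanding a triangle fan into a triangle list, and closing a line loop by repeating its first index.

#include <cstddef>
#include <cstdint>
#include <vector>

// Expands a GL_TRIANGLE_FAN index buffer into a GL_TRIANGLES list.
std::vector<uint16_t> FanToTriangleList( const std::vector<uint16_t> &fan )
{
    std::vector<uint16_t> triangles;
    if ( fan.size() < 3 )
        return triangles;

    triangles.reserve( ( fan.size() - 2 ) * 3 );
    for ( std::size_t i = 2; i < fan.size(); ++i )
    {
        // Every triangle of a fan shares the first index.
        triangles.push_back( fan[ 0 ] );
        triangles.push_back( fan[ i - 1 ] );
        triangles.push_back( fan[ i ] );
    }
    return triangles;
}

// For GL_LINE_LOOP the fix is simpler: append the first index so the
// resulting line strip closes the loop.
std::vector<uint16_t> LoopToLineStrip( std::vector<uint16_t> loop )
{
    if ( !loop.empty() )
        loop.push_back( loop.front() );
    return loop;
}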
Direct3D 11 also removes support for large points, commonly used for ren-
dering point sprites. While point lists themselves are still supported, the size of
points is no longer configurable. This is a less trivial problem for ANGLE to
solve. Thankfully, Direct3D 11 also introduces geometry shaders, which allow us
to expand each point into a billboarded quad, achieving the same effect without
CPU overhead.
Figure. The AST produced for “while (i < 4) { ... }”: a TIntermLoop(ELoopWhile) node with a condition and a body; the condition is a TIntermBinary(EOpLessThan) whose left child is TIntermSymbol(“i”) and whose right child is TIntermConstantUnion(4).
continuing (which is especially useful when you’ve set your WebGL application
as the startup page) or use --new-window yoursite.com. For HLSL compilation
issues, you can set a breakpoint at HLSLCompiler::compileToBinary() (function
name subject to change).
You can also retrieve the HLSL code from within WebGL through the WEBGL_
debug_shaders extension, as in Listing 3.1. Note that the format returned by this
extension is implementation specific.
You may notice that the original variable names have been replaced by hardly
legible _webgl_<hexadecimal> names. This circumvents bugs in drivers that can’t
handle long variable names, but makes the HLSL difficult to debug. To disable
this workaround, you can use Chrome’s --disable-glsl-translation flag. Note
that this merely disables Chrome’s ESSL-to-ESSL translation, meant only for
validation and driver workaround purposes, not ANGLE’s ESSL-to-HLSL trans-
lation. This may change in the future as more of the validation becomes ANGLE’s
responsibility and duplicate translation is avoided. Even with the aforementioned
flag, some variable names may have been modified to account for differences in
scoping rules between GLSL and HLSL.
More recently, we’ve started dealing with these differences at the AST level
itself instead of at the string output level. When a short-circuiting operator is
encountered, we replace it with a temporary variable node and move the node
representing the short-circuiting operator itself up through the tree before the
most recent statement and turn it into an if...else node. When the child
nodes are visited and they themselves contain short-circuiting operators, the same process takes place, so this approach takes advantage of the recursive nature of AST traversal. It would not work at the string level, because it would require inserting code into a part of the string that has already been written.
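To make the rewrite concrete, here is a small, self-contained C++ sketch operating on toy expression structs rather than ANGLE's TIntermNode classes: a short-circuiting && is replaced by a fresh temporary, and the operator itself is hoisted into statements emitted before the statement that used it. A full implementation would also need equivalent handling (with an else branch) for || and the ternary operator.

#include <memory>
#include <string>
#include <vector>

struct Expr
{
    std::string text;                 // e.g., "a()" or "b()" for leaf expressions
    std::shared_ptr<Expr> lhs, rhs;   // set only for "&&" nodes
    bool is_logical_and = false;
};

// Returns the expression text that should replace 'expr'; the hoisted
// statements are appended to 'out' in the order they must be emitted.
std::string RewriteShortCircuit( const Expr &expr,
                                 std::vector<std::string> &out,
                                 int &temp_counter )
{
    if ( !expr.is_logical_and )
        return expr.text;

    // Children may themselves contain short-circuiting operators, so recurse.
    std::string lhs = RewriteShortCircuit( *expr.lhs, out, temp_counter );

    std::string temp = "s" + std::to_string( temp_counter++ );
    out.push_back( "bool " + temp + " = " + lhs + ";" );

    // The right-hand side must only be evaluated when the left-hand side is true.
    out.push_back( "if (" + temp + ") {" );
    std::vector<std::string> rhs_statements;
    std::string rhs = RewriteShortCircuit( *expr.rhs, rhs_statements, temp_counter );
    for ( const std::string &statement : rhs_statements )
        out.push_back( "    " + statement );
    out.push_back( "    " + temp + " = " + rhs + ";" );
    out.push_back( "}" );

    return temp; // the original "a && b" expression is replaced by the temporary
}

Applied to x = a() && b(), this produces bool s0 = a(); if (s0) { s0 = b(); } followed by x = s0, which preserves the short-circuit evaluation order.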
the closest neighboring texels (potentially across multiple faces)! In any case, it
was an interesting exercise in software rendering on the GPU, and we expect to
encounter more occurrences like this in the future as graphics APIs become more
low level and the operations become more granular and software controlled.
makes this nontrivial. Although issues like these are eventually addressed by the
graphics card vendors, it takes a while for these fixes to be deployed to all users,
so thus far we’ve always left these kinds of workarounds enabled.
Driver bugs are even less under our control than HLSL compiler issues, but
hopefully graphics APIs will continue to become more low level so that eventually
we get access to the bare instructions and data. Just as on a CPU, the behavior of elementary operations would then be tightly defined and verifiable, so that compilers can generate code that produces dependable results.
Figure. Packing of the sampleBlock members (m, v, a, f) into constant registers.
uniform sampleBlock
{
    mat2  m;
    vec2  v;
    float a[ 2 ];
    float f;
};
The GL API defines three layouts for unpacking data from UBOs to the shader. We treat the packed and shared layouts identically; in both, the details are left to the GL implementation. The std140 layout, however, is defined precisely by the GL specification. As an application, you can choose either the simplicity of the standardized layout or the memory savings of a packed layout. As a GL implementation, you must at the very least support the std140 layout.
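As a reference point for the sampleBlock above, the following sketch spells out the std140 offsets implied by the specification's rules (a mat2 is laid out as two vec2 columns with a 16-byte stride, and each float array element is likewise padded to 16 bytes). The struct is only a hand-written mirror of that layout, not ANGLE code.

#include <cstddef>
#include <cstdio>

// CPU-side mirror of the std140 layout of sampleBlock, with explicit padding.
struct Std140SampleBlock
{
    float m_col0[ 2 ]; float pad0[ 2 ];  // mat2 column 0 at offset 0
    float m_col1[ 2 ]; float pad1[ 2 ];  // mat2 column 1 at offset 16
    float v[ 2 ];      float pad2[ 2 ];  // vec2 v at offset 32, padded up to the array
    float a0;          float pad3[ 3 ];  // a[0] at offset 48 (array stride is 16)
    float a1;          float pad4[ 3 ];  // a[1] at offset 64
    float f;                             // f at offset 80
};

int main()
{
    // Prints 0, 32, 48, 64, 80: the offsets a std140-conformant
    // implementation must use for m, v, a[0], a[1], and f.
    std::printf( "m: %zu  v: %zu  a[0]: %zu  a[1]: %zu  f: %zu\n",
                 offsetof( Std140SampleBlock, m_col0 ),
                 offsetof( Std140SampleBlock, v ),
                 offsetof( Std140SampleBlock, a0 ),
                 offsetof( Std140SampleBlock, a1 ),
                 offsetof( Std140SampleBlock, f ) );
    return 0;
}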
UBOs map relatively closely to Direct3D 11’s concept of constant buffers
[MSDN 14d]. We chose to implement UBOs on top of constant buffers and
offer the memory-saving benefits of the packed layout, while maintaining the
necessary std140 layout. In both cases, good performance is also a requirement.
Unsurprisingly, HLSL’s default unpacking scheme for constant buffers differs from
Figure. Asynchronous readback paths: (a) rendering to a texture, asynchronously copying it to a staging texture, and reading it back on the CPU after a fence; (b) rendering to a render target, packing the pixels into a PBO with an asynchronous copy, and reading the buffer back after a fence.
Figure. ANGLE's layered architecture: the application issues OpenGL ES calls, which are implemented on top of an ANGLE backend.
1. The entry point/validation layer exports all of the EGL and OpenGL ES
entry point functions and handles validation of all parameters. All values
passed to the layers below this are assumed to be valid.
2. The object layer contains C++ representations of all EGL and OpenGL ES
objects and models their interactions. Each object contains a reference to
a native implementation for forwarding actions to the native graphics API.
3. The renderer layer provides the implementation of the EGL and GL objects
in the native graphics API; the interfaces are simplified to only action calls
such as drawing, clearing, setting buffer data, or reading framebuffer data.
All queries and validation are handled by the layers above.
class BufferImpl
{
public:
    virtual void setData( size_t size, void *data,
                          GLenum usage ) = 0;
    virtual void setSubData( size_t offset, size_t size,
                             void *data ) = 0;
    virtual void map( GLenum access ) = 0;
    virtual void unmap( ) = 0;
};
The buffer object. A simple example of a renderer layer object that requires a
native implementation is the OpenGL ES 3.0 Buffer (see Listing 3.2). ANGLE’s
Direct3D 9 implementation simply stores the supplied memory in CPU-side mem-
ory until the first use of the Buffer in a draw call, when the data is uploaded to an IDirect3DVertexBuffer9 or IDirect3DIndexBuffer9. The Direct3D 11 implementation stores the data in an ID3D11Buffer with the D3D11_USAGE_STAGING flag and
will copy the buffer data lazily to one of several specialized buffers for use as an
index buffer, vertex buffer, transform feedback buffer, or pixel buffer.
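The lazy-copy idea can be sketched as follows in plain C++; the LazyBuffer class and BufferUse enumeration are illustrative stand-ins, not ANGLE's actual implementation, which works with ID3D11Buffer resources.

#include <cstddef>
#include <map>
#include <vector>

enum class BufferUse { VertexBuffer, IndexBuffer, TransformFeedback, PixelBuffer };

class LazyBuffer
{
public:
    void setData( const void *data, std::size_t size )
    {
        const char *bytes = static_cast<const char *>( data );
        staging_.assign( bytes, bytes + size );
        specialized_.clear(); // all cached copies are now stale
    }

    // Returns the copy specialized for 'use', creating it on first request.
    const std::vector<char> &getForUse( BufferUse use )
    {
        auto it = specialized_.find( use );
        if ( it == specialized_.end() )
            it = specialized_.emplace( use, staging_ ).first; // lazy copy
        return it->second;
    }

private:
    std::vector<char> staging_;                           // stand-in for the staging resource
    std::map<BufferUse, std::vector<char>> specialized_;  // stand-ins for the specialized buffers
};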
class Renderer
{
public:
    virtual BufferImpl *createBuffer( ) = 0;
    virtual TextureImpl *createTexture( ) = 0;
    ...
    virtual void drawArrays( const gl::State &state, GLenum mode,
                             size_t first, size_t count ) = 0;
    ...
    virtual void clear( const gl::State &state,
                        GLbitfield mask ) = 0;
    ...
};
const char *ex = eglQueryString( EGL_NO_DISPLAY, EGL_EXTENSIONS );
if ( strstr( ex, "EGL_ANGLE_platform_angle" ) != NULL &&
     strstr( ex, "EGL_ANGLE_platform_angle_d3d" ) != NULL )
{
    EGLint renderer = EGL_PLATFORM_ANGLE_TYPE_D3D11_ANGLE;
    const EGLint attribs[] =
    {
        EGL_PLATFORM_ANGLE_TYPE_ANGLE, renderer,
        EGL_NONE,
    };
    display = eglGetPlatformDisplayEXT( EGL_PLATFORM_ANGLE_ANGLE,
                                        nativeDisplay, attribs );
}
The renderer object. The Renderer object is the main interface between the object
layer and the renderer layer. It handles the creation of all the native implementa-
tion objects and performs the main actions, such as drawing, clearing, or blitting.
See Listing 3.3 for a snippet of the Renderer interface.
Runtime renderer selection. Specific renderers can be selected in EGL by using the
EGL_ANGLE_platform_angle extension. Each renderer implemented by ANGLE has
an enum that can be passed to eglGetPlatformDisplayEXT, or a default enum can be used to allow ANGLE to select the best renderer for the specific platform it is
running on. See Listing 3.4 for an example of selecting the Direct3D 11 renderer
at runtime.
platforms and allow users to write OpenGL ES applications that run on all mobile
and desktop platforms.
Despite the ANGLE project originally being created to work around the poor quality of OpenGL drivers on the Windows desktop, driver quality has improved enough over the last five years that offering an OpenGL renderer is now viable. With the Direct3D renderer available as a fallback, ANGLE will be able to offer an OpenGL renderer on driver versions that are known to be stable and fast, with less CPU overhead than the Direct3D renderer.
Dealing with the enormous number of permutations of client version and ex-
tension availability in desktop OpenGL will be a complicated aspect of imple-
menting an OpenGL renderer. Loading function pointers or using texture format
enumerations may involve checking a client version and up to three extensions.
For example, creating a framebuffer object could be done via glGenFramebuffers ,
glGenFramebuffersEXT , glGenFramebuffersARB , or glGenFramebuffersOES (when
passing through to OpenGL ES), depending on the platform.
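The sketch below illustrates the kind of selection logic involved for just this one entry point; GetProc stands in for whichever loader function is available (eglGetProcAddress, wglGetProcAddress, and so on), and the checks shown cover only the desktop GL cases.

#include <cstring>

// Placeholder for the platform loader; a real implementation would call
// eglGetProcAddress, wglGetProcAddress, or glXGetProcAddress.
static void *GetProc( const char * ) { return nullptr; }

typedef void ( *GenFramebuffersFunc )( int n, unsigned int *framebuffers );

GenFramebuffersFunc LoadGenFramebuffers( int major_version, const char *extensions )
{
    // Core GL 3.0+ and GL_ARB_framebuffer_object expose the unsuffixed name.
    if ( major_version >= 3 ||
         std::strstr( extensions, "GL_ARB_framebuffer_object" ) != nullptr )
        return reinterpret_cast<GenFramebuffersFunc>( GetProc( "glGenFramebuffers" ) );

    // Older drivers may only expose the EXT variant.
    if ( std::strstr( extensions, "GL_EXT_framebuffer_object" ) != nullptr )
        return reinterpret_cast<GenFramebuffersFunc>( GetProc( "glGenFramebuffersEXT" ) );

    return nullptr; // no framebuffer object support at all
}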
Driver bugs are notoriously common in OpenGL drivers, and working around
them will be necessary. In order to promise a conformant OpenGL ES imple-
mentation, ANGLE will have to maintain a database of specific combinations of
driver versions, video card models, and platform versions that have known conformance issues, and attempt to work around them by avoiding the problematic path or by manipulating inputs and outputs. In the worst case, when a driver bug cannot be
hidden, EGL offers the EGL_CONFORMANT configuration field to warn the user that
there are issues that cannot be fixed.
• Avoid line loops and triangle fans. Instead try using line lists and triangle
lists.
• Wide lines are not supported. Many native OpenGL implementations also
don’t support them, because there’s no consensus on how to deal with corner
cases (pun intended). Implement wide lines using triangles.
• Test your WebGL application with early releases of Chrome (Beta, Dev, and Canary). It's the best way to catch bugs early, get them fixed, and have a conformance test added so they never affect your users.
Bibliography
[3Dlabs 05] 3Dlabs. “GLSL Demos and Source Code from the 3Dlabs OpenGL
2.” http://mew.cx/glsl/, 2005.
[Koch and Capens 12] Daniel Koch and Nicolas Capens. “The ANGLE Project:
Implementing OpenGL ES 2.0 on Direct3D.” In OpenGL Insights, edited by
Patrick Cozzi and Christophe Riccio, pp. 543–570. Boca Raton, FL: CRC
Press, 2012.
[MSDN 14c] MSDN. “Direct3D Feature Levels.” Windows Dev Center, http:
//msdn.microsoft.com/en-us/library/windows/desktop/ff476876(v=vs.85)
.aspx, 2014.
[MSDN 14d] MSDN. “How to: Create a Constant Buffer.” Windows Dev
Center, http://msdn.microsoft.com/en-us/library/windows/desktop/ff476
896(v=vs.85).aspx, 2014.
[MSDN 14e] MSDN. “Stream-Output Stage.” Windows Dev Center, http:
//msdn.microsoft.com/en-us/library/windows/desktop/bb205121(v=vs.85)
.aspx, 2014.
About the Editors
Marius Bjørge is a Graphics Research Engineer at ARM’s office in Trondheim,
Norway. Prior to ARM he worked in the games industry as part of Funcom’s
core engine team. He has presented research at SIGGRAPH, HPG, and GDC
and is keenly interested in anything graphics-related. He’s currently looking at
new ways of enabling advanced real-time graphics on current and future mobile
GPU technology.
Wessam Bahnassi is a software engineer and an architect (that is, for buildings
not software). This combination drives Wessam’s passion for 3D engine design.
He has written and dealt with a variety of engines throughout a decade of game
development. Currently, he is leading the programming effort at IN—Framez
Technology, the indie game company he cofounded with his brother Homam.
Their first game (Hyper Void) is a live showcase of shaders (some of which have been featured in previous GPU Pro volumes), and it is in the final development stages.
Wolfgang Engel is the CEO of Confetti (www.conffx.com), a think tank for ad-
vanced real-time graphics for the game and movie industry. Previously he worked
for more than four years in Rockstar’s core technology group as the lead graph-
ics programmer. His game credits can be found at http://www.mobygames.com/
developer/sheet/view/developerId,158706/. He is the editor of the ShaderX and
GPU Pro book series, the author of many articles, and a regular speaker at com-
puter graphics conferences worldwide. He is also a DirectX MVP (since 2006),
teaches at UCSD, and is active in several advisory boards throughout the indus-
try. You can find him on twitter at @wolfgangengel.
About the Contributors
Dan Curran is a researcher in the HPC group at the University of Bristol, where
his work focuses on the development of efficient algorithms for many-core com-
puter architectures. He has worked on a range of different applications, including
computational fluid dynamics, de-dispersion for the SKA, lattice Boltzmann, and
computational photography. He is an expert in GPU computing, with a partic-
ular focus on OpenCL. Dan graduated with an MEng in computer science in
2012.
Michael Delva always thought he would be a sports teacher until he realized after
his studies that his way was in programming. He learned C++ on his own and created his own company to develop and sell basketball video and statistical analysis software, until he had to end that venture four years later. Then,
he worked for a few years at NeuroTV, where he participated in the development
of real-time 3D solutions and interactive applications for the broadcast industry.
He is now happy to be able to mix his passion for programming and video games
at Fishing Cactus, where he works as an engine/gameplay programmer.
Alex Dunn, as a developer technology engineer for NVIDIA, spends his days pas-
sionately working toward advancing real-time visual effects in games. A graduate of Abertay University's Games Technology course, Alex got his first taste of graphics programming on consoles. Now working for NVIDIA, his
time is spent working on developing cutting-edge programming techniques to
ensure the highest quality and best player experience possible is achieved.
Holger Gruen ventured into creating real-time 3D technology over 20 years ago
writing fast software rasterizers. Since then he has worked for games middleware
vendors, game developers, simulation companies, and independent hardware ven-
dors in various engineering roles. In his current role as a developer technology
engineer at NVIDIA, he works with games developers to get the best out of
NVIDIA’s GPUs.
James L. Jones graduated with a degree in computer science from Cardiff Uni-
versity and works on real-time graphics demos in the demo team at Imagination
Technologies. He is currently focused on physically based rendering techniques for
modern embedded graphics platforms and research for demos with Imagination’s
real-time ray-tracing technology.
Ramses Ladlani is lead engine programmer at Fishing Cactus, the video game
company he co-founded in 2008 with three former colleagues from 10tacle Studios
Belgium (a.k.a. Elsewhere Entertainment). When he is not working on the next
feature of Mojito, Fishing Cactus’s in-house cross-platform engine, he can be
found playing rugby or learning his new role as a father. He received his master’s
degree in computer engineering from Université Libre de Bruxelles.
Hongwei Li received his PhD in computer science from Hong Kong University of
Science and Technology. He was a researcher in AMD's advanced graphics research group, focusing on real-time rendering and GPGPU applications. He is also very
active in the open source community and is the main contributor of a rendering
engine for mobile platforms.
Anton Lokhmotov has been working in the area of programming languages and
tools for 15 years, both as a researcher and engineer, primarily focussing on pro-
ductivity, efficiency, and portability of programming techniques for heterogeneous
systems. In 2015, Anton founded dividiti to pursue his vision of efficient and
reliable computing everywhere. In 2010–2015, Anton led development of GPU
Compute programming technologies for the ARM Mali GPU series, including
production (OpenCL, RenderScript) and research (EU-funded project CARP)
compilers. He was actively involved in educating academic and professional de-
velopers, engaging with partners and customers, and contributing to open source
Jamie Madill works on Google Chrome’s GPU team to help Chrome’s OpenGL
backend work uniformly across every device and API, via ANGLE. His back-
ground is in simulation and rendering, with which he still tinkers in his spare
time. He graduated with a master’s degree in computer science from Carleton
University in 2012.
Simon McIntosh-Smith leads the HPC research group at the University of Bristol
in the UK. His background is in microprocessor architecture, with a 15-year ca-
reer in industry at companies including Inmos, STMicroelectronics, Pixelfusion,
and ClearSpeed. Simon co-founded ClearSpeed in 2002 where, as director of ar-
chitecture and applications, he co-developed the first modern many-core HPC
accelerators. In 2003 he led the development of the first accelerated BLAS/LA-
PACK and FFT libraries, leading to the creation of the first modern accelerated
Top500 system, TSUBAME-1.0, at Tokyo Tech in 2006. He joined the University
of Bristol in 2009, where his research focuses on efficient algorithms for heteroge-
neous many-core architectures and performance portability. He is a joint recipient
of an R&D 100 award for his contribution to Sandia’s Mantevo benchmark suite,
and in 2014 he was awarded the first Intel Parallel Computing Center in the UK.
Simon actively contributes to the Khronos OpenCL heterogeneous many-core
programming standard.
Doug McNabb is currently a game developer's voice inside Intel. He's creating new technologies to help advance the state of the art in visual computing.
He was previously the CTO and rendering system architect at 2XL Games and
the rendering system architect at Rainbow Studios. He contributed to more than
a dozen games, with the most recent being Baja: Edge of Control. You can find
him on twitter @mcnabbd.
Gareth Morgan has been involved in games and 3D graphics since 1999, starting
at Silicon Graphics followed by several games companies including Activision and
BAM Studios. Since 2008 he has been a leading software engineer at Imagination
Gustavo Bastos Nunes is a graphics engineer in the Engine team at Microsoft Turn
10 Studios. He received his BSc in computer engineering and MSc in computer
graphics from Pontifícia Universidade Católica do Rio de Janeiro, Brazil. He
has several articles published in the computer graphics field. He is passionate
about everything graphics related. Gustavo was part of the teams that shipped
Microsoft Office 2013, Xbox One, and Forza Motorsport 5.
Benjamin Rouveyrol has been working in the game industry for the past ten years,
working on the Far Cry and Assassin's Creed series. He is currently working at Ubisoft Montreal on Rainbow Six Siege, making pixels faster and prettier.
Shannon Woods is the project lead for ANGLE at Google. Prior to her current
work, she explored other corners of the 3D graphics world, developing software
for game portability and real-time distributed simulation. She is a graduate of
the University of Maryland and enjoys close specification analysis, music, and
teapots.