

The GeForce 6800

Article in IEEE Micro · April 2005


DOI: 10.1109/MM.2005.37 · Source: IEEE Xplore


2 authors, including:

Henry P. Moreton
NVIDIA

All content following this page was uploaded by Henry P. Moreton on 03 January 2017.



THE GEFORCE 6800
GRAPHICS PROCESSING UNITS (GPUS) CONTINUE TO TAKE ON INCREASING
COMPUTATIONAL WORKLOADS AND TODAY SUPPORT INTERACTIVE
RENDERING THAT APPROACHES CINEMATIC QUALITY. THE ARCHITECTURAL
DRIVERS FOR GPUS ARE PROGRAMMABILITY, PARALLELISM, BANDWIDTH, AND
MEMORY CHARACTERISTICS. THIS ARTICLE DESCRIBES HOW ONE TEAM
APPROACHED THE DESIGN PROBLEM.

The graphics processing unit (GPU) market is large, growing, and varied, shipping more than 500 million units per year. Table 1 profiles this market. The core GPU market is interactive gaming on the PC platform, where the goal is film-quality rendering with real-time response. Game releases rival movie openings in revenue. The release of Halo 2, an Xbox title, grossed $125 million in the first 24 hours (www.pcmag.com). In contrast, The Incredibles grossed $70.5 million during its first three days (www.the-numbers.com).

In addition to workstations used to develop motion pictures and games, GPU markets include traditional professional workstations, flight and driving simulators, and various consumer devices. General-purpose computing using GPUs is both an area of research and an emerging market. GPUs are well suited for large data-parallel problems such as fluid dynamics, weather simulation, and financial option price modeling.

The computational load on GPUs keeps growing, and image quality has made huge strides during the last 15 years. Figure 1 illustrates the evolution of image quality.

John Montrym
Henry Moreton
Nvidia

Table 1. Graphics processing unit market breakdown.

Sector                                    Millions of units
Interactive gaming                        50
Digital content creation
  professional                            1
  home                                    50
Computer-aided design and manufacture     1
Visual simulations                        0.1
General computing                         3
Consumer
  handheld devices                        50
  consoles                                100
  media centers                           5
  cell phones                             600
Total                                     860.1

The graphics problem
What does a GPU do? Under the control of an application generically called a renderer, the GPU computes the color of each pixel. This image synthesis entails resampling a scene described by triangles of materials simulated using sampled images (textures) and numerically approximated properties. The GPU performs image synthesis calculations in three steps. First, it processes the triangles’ vertices, computing screen positions and attributes

2 Published by the IEEE Computer Society 0272-1732/05/$20.00 © 2005 IEEE


such as color and surface orientation. Next, a rasterizer samples each triangle to identify fully and partially covered pixels, called fragments. Finally, it processes the fragments using texture sampling, color calculation, visibility, and blending. The vertex and fragment processing steps enjoy a high degree of independent programmable processing.
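As a rough illustration of these three steps, the following Python sketch pushes one triangle through a toy software pipeline. Everything in it is an assumption made for illustration: the function names, the 8 × 8 target, the pixel-center sampling rule, and the constant-color fragment program. It does not describe how the GeForce 6800 implements the pipeline, and it treats every sampled pixel as fully covered.

```python
# Toy software model of the three steps: vertex processing,
# rasterization, and fragment processing. All names are illustrative.

def vertex_stage(vertices, width, height):
    """Map vertices from [-1, 1] normalized coordinates to screen positions."""
    return [((x + 1.0) * 0.5 * width, (y + 1.0) * 0.5 * height)
            for x, y in vertices]

def edge(a, b, p):
    """Signed area term: positive when p lies to the left of edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(tri, width, height):
    """Yield (x, y) for every pixel whose center lies inside the triangle."""
    a, b, c = tri
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)
            w0, w1, w2 = edge(b, c, p), edge(c, a, p), edge(a, b, p)
            # Accept either winding order.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
               (w0 <= 0 and w1 <= 0 and w2 <= 0):
                yield x, y

def fragment_stage(x, y):
    """Stand-in fragment program: return a constant RGB color."""
    return (255, 128, 0)

def render(vertices, width=8, height=8):
    """Run the three stages and return a row-major image of RGB tuples."""
    image = [[(0, 0, 0)] * width for _ in range(height)]
    screen_tri = vertex_stage(vertices, width, height)
    for x, y in rasterize(screen_tri, width, height):
        image[y][x] = fragment_stage(x, y)
    return image

image = render([(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)])
covered = sum(pixel != (0, 0, 0) for row in image for pixel in row)
```

Each stage only consumes the previous stage's output, which is why the vertex and fragment stages can be programmed independently of the fixed rasterization step between them.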
How does a rendering application use the GPU to simulate the appearance of materials? Figure 2 shows a progression of scene drawing techniques.
Figure 1. Evolution of image quality in PC games. In the 1990 game Marooned, the PC state of the art was two-dimensional sprites, and the graphics card was little more than a CRT controller (a). Simple three-dimensional graphics appeared in 1991, as shown in Hovertank (b). The Doom series introduced texture mapping of simple characters in 1993 (c). Quake, in 1996, brought greater quality, texture filtering, and more characters (d). Today, with Doom3, we see correct shadows, accurate lighting models, and high-quality filtering (e). (Images used by permission of Id Software Inc. Wolfenstein 3D, DOOM, QUAKE, and DOOM 3 are either registered trademarks or trademarks of Id Software Inc. in the United States and/or other countries.)

The desire for increased realism has driven greater precision and functionality. A recent example is high-dynamic-range (HDR) rendering.1 In Figure 3, for example, the light through the window is hundreds of times brighter than the obelisks, but the obelisks are not solid black. The glow produces a more cinematic image.
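To see why range matters here, consider a minimal Python sketch that stores two brightness values first in an 8-bit target limited to the range 0.0 to 1.0, and then in floating point with compression applied only for display. The scene values (a window 600 times brighter than the obelisk), the helper names, and the simple x / (1 + x) tone curve are all assumptions chosen for illustration; the article does not say which tone-mapping operator a renderer would use.

```python
def to_8bit(value):
    """Clamp to [0, 1] and quantize to an 8-bit code (0..255)."""
    clamped = min(max(value, 0.0), 1.0)
    return int(clamped * 255 + 0.5)

def tone_map(value):
    """Assumed compressive curve mapping [0, inf) into [0, 1)."""
    return value / (1.0 + value)

# Hypothetical relative brightness values.
obelisk = 1.0
window = 600.0

# Fixed-range pipeline: scale the scene so the window just fits in [0, 1].
# The obelisk then falls below the smallest nonzero 8-bit step.
ldr_obelisk = to_8bit(obelisk / window)
ldr_window = to_8bit(window / window)

# Floating-point pipeline: keep the full range during calculation and
# compress only when mapping to the display.
hdr_obelisk = to_8bit(tone_map(obelisk))
hdr_window = to_8bit(tone_map(window))
```

In this sketch the fixed-range path quantizes the obelisk to code 0 (solid black), while the floating-point path keeps it at a mid-gray code and still shows the window near full brightness.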
Until recently, in interactive systems, GPUs represented final colors with fractions between 0.0 and 1.0, at 8-bit precision. The GPU’s fragment processor also clamped calculations to this limited range. Along with limited precision, this resulted in cheaper hardware. The first evolutionary step was support for increased range and precision during calculation. Today, a GPU performs calculations and stores integer and floating-point results at up to 32-bit precision.

Figure 2. Basic primitives used in rendering. The renderer approximates objects using triangles, defined by vertices (a) and lines connecting the vertices to make triangles (b). The simplest form of lighting assumes a perfectly diffuse surface (c). In simple texture mapping, the GPU samples and filters images to determine pixel fragment color (d). Given the eye location and surface orientation at the fragment, the GPU can look up a reflected color (e) in a texture called an environment, or cube, map to perform reflection mapping. The GPU can also simulate bumpiness (bump mapping) by perturbing local surface orientation (f).

Three-phase rendering
A typical cinematic renderer divides the work for

MARCH–APRIL 2005 3
HOT CHIPS 16

each frame into three phases: prerendering, main rendering, and postprocessing. First, the renderer computes the data it will need for the main rendering phase. These data consist of shadow maps or shadow volumes for each light source, along with environment maps. In the main phase, the renderer draws the scene from the camera’s viewpoint; this is what we usually think of as computer graphics rendering. For every light source, the renderer accumulates light energy contributions to each pixel. In the postprocessing phase, the renderer uses image processing, for example, to simulate lens flare and map HDR color values to the display device’s limited gamut.

Figure 3. High-dynamic-range rendering.

Shadows increase geometric complexity and provide important visual cues. In games, they set a mood: for example, creating fear when you see an enemy’s shadow in a corridor. Shadows have long been one of the most challenging problems in interactive computer graphics. A renderer can handle most image synthesis through local calculations, but rendering shadows correctly requires considering whether each triangle obscures light to any other triangle.

One pertinent and well-known algorithm is stencil shadow volumes.2,3 First, the renderer creates the depth buffer for the scene, writing only depth values. During the second stage, the renderer makes a preprocessing pass for each light source. It draws triangles (but not color or depth) into the stencil buffers, counting entry and exit to compute the regions of space in which some object casts a shadow or obscures the light source. Specifically, front-facing triangles increment the stencil value at a pixel, and back-facing triangles decrement this value. After the renderer has drawn all the triangles, the stencil value at each sample indicates whether the light source illuminates that sample or whether it is in shadow. A nonzero stencil value indicates shadow.

Figure 4 illustrates shadow volume generation. The silhouette quadrilaterals combined with the facets of the model facing away from the light define the shadow volumes. Because the renderer must compute the shadow volume separately for each light, the number of shadow triangles drawn can be very large. Although the pixels are computationally simple during stencil shadow-volume generation, it’s not uncommon for the shadow volume prepass to consume about two-thirds of total rendering time.

Figure 4. Steps in stencil shadow-volume calculation. For a subject character (a), the renderer computes the object’s silhouette edges, shown here highlighted in white, with respect to the light (b). The renderer draws quadrilaterals (triangle pairs) starting at each silhouette edge, extruded away from the light source, and updates the stencil buffer (c). This process yields the rendered scene with shadows (d).

Architectural drivers
The GeForce 6800 architecture has three major drivers.

• Programmability. Programmable elements, evolved from configurable logic, afford much greater algorithmic flexibility. Programmability also lets content developers add value with their proprietary algorithms.
• Parallelism. The rendering problem has a great deal of data parallelism. The scenes comprise objects defined by vertices, which the GPU can process independently. The renderer expresses the result of its calculations as millions of independent pixels. These high levels of parallelism permit the efficient deployment of broadly and deeply parallel computational resources.
• Memory. The memory subsystem is the most precious resource in any graphics system, and its characteristics heavily influence the GPU’s design. Designers must fit the GPU architecture to the memory subsystem’s bandwidth and latency characteristics.

Programmability
The GeForce 6800’s programming model enables parallelized acceleration. There are two separate programs: The application executes a vertex program independently on every vertex; similarly, the GPU applies a fragment program independently to every pixel fragment. For every vertex received in the command stream, the GPU launches a thread executing the vertex program. For every rasterized pixel fragment, the machine dispatches one thread of the fragment program. Each thread has its own unique inputs available in read-only registers. Supporting hardware loads these inputs before thread launch. Each thread also has write-only output registers, whose content the machine forwards to the next processing stage. In addition to these inputs and outputs, each thread has private temporary registers, read-only program parameters, and access to filtered and resampled texture map images.

Nvidia introduced the first programmable GPU, the GeForce3, in 2001. The GeForce3 supported a programmable vertex processor.4 In 2002, the original GeForce FX series introduced programmable vertex and fragment processors. Now, the GeForce 6800 has unified these capabilities and made them orthogonal. The fragment processor supports dynamic flow control, as the vertex processor did in the GeForce FX. In addition, the vertex program can access the texture subsystem, previously available only to fragment programs. The FX had introduced floating-point textures and frame buffers; the GeForce 6800 adds the ability to blend and filter in floating-point. Finally, from a language and API perspective, the GeForce 6800 supports both Direct3D and OpenGL with just-in-time compiled machine-independent assembler as well as higher-level C-like programming languages.

Parallelism
Contrasting CPUs and GPUs makes it easier to understand the motivation behind the GPU architecture. The GPU workload offers more independent calculations than a typical CPU workload; the programmer’s view is single threaded, while the machine is actually deeply multithreaded. The GPU can afford larger amounts of floating-point computational power because the control overhead per operation is lower than that for a CPU, and a GPU can effectively execute extensive floating-point computations. The simple programming model and large amount of independent calculation result in deep and wide parallelism for the GeForce 6800 to exploit.

Another interesting difference between CPUs and GPUs is the use of dedicated mode-controlled functional units for specialized performance-critical tasks. In addition to the programmable vertex and fragment processors, there are specialized units for data fetch, rasterization (conversion from triangles to pixel fragments), and texture filtering. We determined the processor instruction set by analyzing the graphics workload. For example, because of their importance to graphics algorithms, the GeForce 6800 includes fast and accurate transcendental functions and inner-product instructions.

Memory
The memory bandwidth demands of GPU systems have always been insatiable, largely because there are so many concurrently active threads. CPUs have dealt with memory limitations by using ever-larger caches, but graphics working-set sizes have grown at least as fast as transistor density, and it remains prohibitive to implement an on-chip cache large enough to achieve 99 percent hit rates. Caches as part of the memory hierarchy cannot affordably support long-term reuse. Therefore, our GPU cache designs assume a 90 percent hit rate with many misses in flight. Stated another way, we implement caches that support effective streaming with local reuse of fetched data.

Because of bandwidth limitations, we aim for 100 percent memory bandwidth utilization, which forces the internal processors and fixed-function units to be latency tolerant and to respect page locality. We also schedule DRAM cycles to minimize idle data-bus time caused by read-write direction changes. GPUs improve page locality by mapping two- and three-dimensional spatial locality to corresponding locality at the granularity of a one-dimensional DRAM page.

The GeForce 6800 memory subsystem comprises four independent 64-pin partition controllers. Because of fluctuations in DRAM supply, it’s important that the GeForce 6800 maintain plenty of flexibility with respect to the specific memory used. The memory controller supports double-data-rate (DDR2) and its graphics-oriented counterpart GDDR3 signaling and protocols at various clock frequencies with widely programmable memory cycle timings. The memory controller also maps linear addresses to pages and individual partitions. For efficiency, the controllers arbitrate among a dozen sources of read and write traffic, and they balance bus utilization with latency. To further increase effective bandwidth, the controller uses lossless compression and decompression, which is transparent to clients.

GeForce 6800 statistics
The GeForce 6800 has high-throughput programmable floating-point processors, efficient special-purpose engines, and a flexible memory subsystem that supports a wide range of DRAM types, from the commodity to the exotic. Its notable statistics include

• 222 million transistors,
• 303-mm² area,
• 550-MHz double-data-rate memory clock,
• 400+ MHz core clock,
• 400 million vertices per second, and
• 120+ Gflops peak (equal to six 5-GHz Pentium 4 processors).

Performance regimes
The GPU application space is extremely multimodal: No single performance mode characterizes any given application. For example, stencil shadow volumes can consume two-thirds of a frame’s rendering time without writing any color or depth values. Different applications, and different millisecond time slices within a single application, have different characteristics. In designing for these regimes, we sought optimal use of the most expensive resource, sizing key memory clients to saturate all available memory bandwidth. Dozens of rendering regimes require “speed of light” performance limited only by memory bandwidth.

We already mentioned stencil shadow-volume rendering as a specialized non-color-updating phase of rendering. This rendering step has two highly specialized modes of operation: one mode renders only depth values, and the other updates only the stencil value. Because we designed the GPU to saturate DRAM bandwidth at 16 pixels per clock cycle when the renderer is updating both color and depth, the processor must deliver an even higher pixel rate to saturate memory when performing only depth or stencil work.

A tour of the GeForce 6800
Figure 5 is a top-level diagram of the GeForce 6800. Work flows from top to bottom, starting with the six identical programmable vertex processors. Because all vertices are independent of each other, the data fetcher assigns incoming work to any idle processor, and the parallel utilization is nearly perfect. The “GeForce 6800 statistics” sidebar provides more specifics.

Results from the vertex stage are reassembled in the original application-specified order to feed the triangle setup and rasterization units. For each primitive, the rasterizer identifies constituent pixel fragments and sends them to a fragment processor. Sixteen programmable fragment processors operate on the workload in parallel. Each thread receives the (x, y) addresses and interpolated inputs from the rasterizer. Because fragments are independent of one another, the processors approach 100 percent utilization.

Finally, a crossbar distributes color and depth results from the fragment processors to 16 fixed-function pixel-blending units, which perform frame buffer operations such as color blending, antialiasing, and stencil test and update. It’s possible to feed the result from any fragment processor to any frame buffer location.

Vertex processor
The vertex processor executes very large instruction words. The instruction load unit forms a 123-bit internal instruction from either of two driver-visible instruction set architectures (ISAs); Nvidia supports two ISA

generations to aid in streamlining initial product and driver development. As Figure 5 shows, there are six vector floating-point processors. Each processor’s data path comprises a vector multiply-add unit, a scalar special-function unit, and a texture unit, as shown in Figure 6. The vector unit can perform four IEEE single-precision multiply, add, or multiply-add operations, as well as inner products, max, min, and so on. The special-function unit performs transcendental operations such as sine, cosine, log, and exponential to within one unit in the last place (ULP) of IEEE single-precision accuracy for operands in the nominal range.

Figure 5. GeForce 6800 block diagram. (Blocks, top to bottom: command and data fetch; vertex processors; triangle setup and rasterizer; Z-cull; shader thread dispatch; fragment processors with Level 2 texture cache; fragment crossbar; pixel-blending units; four memory partitions.)

Figure 6. Vertex processor block diagram. (Blocks: input registers, 16 × 128 bits; constant RAM, 512 × 128 bits; instruction RAM, 512 × 123 bits; vertex texture unit with Level 2 texture cache; multiply, add, and special-function units; temporary registers, 32 × 128 bits; output registers, 16 × 128 bits. Legend: memory, texture related, computation unit.)

The computational units fetch operands from a 512 × 128-bit constant RAM, from temporary registers up to 32 × 128 bits, and from 16 × 128-bit input registers. The processor feeds computed results back into the temporary registers or out to one of the 16 × 128-bit output registers. The vertex processor reads instructions from a 512-entry instruction RAM.

To preserve a simple implementation-independent programming model, the vertex processor uses threads to make the data path appear to have unity latency, and it uses scoreboarding to hide texture fetch latency. The implementation is fully multiple instruction, multiple data; therefore, data-dependent branches are free of the penalty normally accompanying single-instruction, multiple-data implementations. Finally, the processor can issue instructions to both vector and scalar data paths at every clock cycle.

Primitive setup and rasterizer
The APIs define the various activities occurring between the vertex and fragment stages with unique precision requirements. Therefore, these activities don’t require programmability and are implemented efficiently in fixed-function units.

The primitive assembly unit assembles primitives such as lines or triangles from transformed vertices. Vertex positions arrive as 4-vectors of homogeneous coordinates, the standard method for handling perspective foreshortening.5 Although we divide through by the fourth component, we check to see whether the assembled primitive is outside the view frustum. If so, the primitive is culled; otherwise, after perspective division, we apply the viewport scale and offset to obtain screen-space x, y, and z (depth). Next, the setup unit computes coefficients describing the primitive’s edges. Finally, the rasterizer converts the primitive into pixel fragments for input to the array of fragment processors. The rasterizer traverses the primitive in a DRAM-page-friendly order like that shown in Figure 7.

Figure 7. Page-friendly rasterization.

Fragment processor
The GPU forwards attributes, specified at the triangle’s vertices, from the vertex processor to the fragment processor. The fragment processor smoothly interpolates these attributes across the triangle’s face. Using these interpolated input attributes, a fragment program computes output colors, using math and texture lookup instructions. The GeForce 6800 fragment processor can perform operations with 16- or 32-bit floating-point precision (FP16 and FP32). The inputs to the fragment processor are position, color, depth, fog, and 10 generic 4 × FP32 attributes. The processor sends its outputs to as many as four render target buffers. Like the vertex processor, the fragment processor is general purpose, and it has constants, temporary register resources, and branching capabilities similar to those of the vertex processor.

Fragment processor detail
As Figure 8 shows, each of the 16 fragment processors includes an interpolation block for input attributes, two vector math units, a special-function/normalize unit, and a texture unit. Both computation blocks can perform 4-vector floating-point operations. The lower block can do a multiply-add operation. Combined, the two blocks can sustain 12 floating-point operations per pixel per clock cycle. The lower block also supports the same transcendental functions supported in the vertex processor’s special-function unit. To hide the latency of texture lookups that fetch from external memory, each fragment processor maintains state for hundreds of in-flight threads.
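The figures quoted so far allow a back-of-the-envelope estimate of the fragment array's arithmetic throughput. The short Python sketch below simply multiplies them out. It assumes a core clock of exactly 400 MHz (the statistics sidebar says "400+ MHz") and that each fragment processor works on one pixel per clock; it counts only the 12 operations per clock attributed to the two vector blocks, so it is a partial estimate and not a derivation of the 120+ Gflops peak figure.

```python
FRAGMENT_PROCESSORS = 16        # from the text
FLOPS_PER_PIXEL_PER_CLOCK = 12  # two vector blocks combined, from the text
CORE_CLOCK_HZ = 400e6           # assumed: the lower bound of "400+ MHz"

def fragment_gflops(processors, flops_per_clock, clock_hz):
    """Floating-point rate of the fragment array's vector blocks, in Gflops."""
    return processors * flops_per_clock * clock_hz / 1e9

estimate = fragment_gflops(FRAGMENT_PROCESSORS,
                           FLOPS_PER_PIXEL_PER_CLOCK,
                           CORE_CLOCK_HZ)
print(f"{estimate:.1f} Gflops from the fragment vector blocks")
```

Under these assumptions the vector blocks alone account for roughly 77 Gflops; the remainder of the quoted peak would have to come from units this sketch does not count.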

Superscalar instruction issue
Microsoft’s DirectX 9 graphics API supports a vector-oriented instruction set. The assembler has instructions that perform most operations on 4-vectors of FP32 data. However, many fragment processing algorithms treat alpha, the transparency component, separately from the three color components. As a result, the assembler has provisions to indicate a pairing of instructions: that is, an instruction operating on a 3-vector, usually RGB, paired with an instruction operating on a scalar, usually alpha. This mechanism permits dual issue of source-level instructions. The GeForce 6800’s fragment processor supports fully general 4-vector split operations (4-vector, 3/1-vector, and 2/2-vector operations), as Figure 9 illustrates.

Figure 8. Fragment processor block diagram. (Blocks: attribute interpolation; vector unit; fragment texture unit with Level 1 and Level 2 texture caches; vector and special-function unit; temporary registers; output. Legend: memory, texture related, computation unit.)

The two computation stages can exploit this dual issue of instructions to perform two distinct operations on different subsets of the 4-vector. Together with texture and special functions, each fragment processor can execute up to six DirectX 9 instructions per pixel per clock cycle. Figure 10 is an example of six-issue code.
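One way to picture these issue options is as a partition of the four components between at most two instructions. The Python sketch below is a toy model only. The write-mask representation, the function name, and the rule that two instructions may pair when their masks are disjoint and split the vector 3/1 or 2/2 are assumptions for illustration; the article does not spell out the actual pairing rules used by the hardware or compiler.

```python
def issue_mode(mask_a, mask_b=""):
    """Classify how one or two instructions share a 4-vector.

    Masks are strings over "rgba" naming the components each
    instruction writes. Returns a label, or None if the pair
    cannot share one issue slot in this toy model.
    """
    a, b = set(mask_a), set(mask_b)
    if not a <= set("rgba") or not b <= set("rgba"):
        raise ValueError("masks may only contain r, g, b, a")
    if not b:
        return "single issue" if a else None
    if a & b:
        return None  # overlapping components cannot be split
    sizes = sorted((len(a), len(b)), reverse=True)
    if sizes == [3, 1]:
        return "3/1 dual issue"
    if sizes == [2, 2]:
        return "2/2 dual issue"
    return None  # other splits are not modeled here

pair_rgb_a = issue_mode("rgb", "a")
pair_rg_ba = issue_mode("rg", "ba")
```

For example, an RGB multiply and an alpha multiply classify as a 3/1 dual issue, while two instructions that both write the blue component cannot pair in this model.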
Figure 9. Vector issue options: 4-vector single issue, 3/1-vector dual issue, and 2/2-vector dual issue of the R, G, B, and A components.

    ps_2_0
    def c1, 2.0, -1.0, 0.0, 0.0
    dcl t0.rg
    dcl t1
    dcl t4.rgb
    dcl v0
    dcl_2d s0
    dcl_2d s1
    dcl_cube s2
    dcl_2d s3

    # clock 1
    texld r0, t0, s0;          # tex fetch
    madr r0,r0,c1.r,c1.g       # _bx2 in tex
    nrm r1.rgb, t4             # nrm in shdr0
    dp3 r1.r,r1,r0             # 3D dot in shdr1
    mul r0.a,r0,r0             # dual issue in shdr1

    # clock 2
    mul r1.a,r0.a,c2.a         # dual issue in shdr0
    mul r0.rgb,r1.r,r0         # dual issue in shdr0
    add r0.a,r1.r,r1.r         # fx2 in shdr0
    mad r0.rg,r0.a,c1,c1.a     # mad in shdr1
    mul r1.ba,r1.a,r0.a,c2     # dual issue in shdr1

    # clock 3
    rcp r0.a,r0.a              # recip in shdr0
    mul r0.rg,r0,r0.a          # div in shdr0
    mul r0.a,r0.a,r1.a         # dual issue in shdr0
    texld r2,r0, s1            # texture fetch
    mad r2.rgb,r0.a,r2,c5      # mad in shdr1
    abs r0.a,r0.a              # abs in shdr1
    log r0.a,r0.a              # log in shdr1

    << etc >>

    mov oC0, r0                # output color

Figure 10. Annotations in this DirectX 9 program code show how the compiler schedules instruction sequences for the GeForce 6800 fragment processor.

Texture unit
The literature provides a good overview of texture mapping.6-8 A texture map is an array of data in one, two, or three dimensions. The simplest uses of texture in rendering involve mapping a decal image onto some object built from a collection of geometric primitives. Figure 11 provides an example. Because each pixel maps to a region of the image, filtering is necessary to eliminate image frequency content above the sampling rate implied by the pixel footprint in texture space. Instead of the fragment footprint, which includes a coverage mask, the texture unit uses the fully covered pixel footprint to determine filtering.

With arbitrary programs, a texture is more generally a way to express a function of one, two, or three variables as a table. We can think of the function value as a color 4-tuple (red, green, blue, and alpha) or more generally as an n-tuple of arbitrary values. As with simple image mapping, the fixed-function texture unit’s job is to return a properly sampled result, given the input address vector. Proper sampling is a weighted average of a collection of samples near the ideal sample location, with minimal aliasing, and it shouldn’t introduce too much blurring.

The texture unit operates with a deeply pipelined cache. Typically, the cache has many hits and misses in flight. To reduce memory traffic, the application can use compressed-texture formats. To facilitate fine-grained access and random addressability, these formats use small-grained fixed-ratio schemes, with a fixed compression ratio of 4:1. Because the ratio is fixed, it is also a lossy scheme.

The texture subsystem must filter results before returning them to the requesting fragment processor. The GeForce 6800 supports four types of filtering: point-, bilinear-, and trilinear-sampled, and anisotropic. A point-sampled request simply returns the texel (texture element, or pixel) nearest to the address the requester provided. When performing bilinear-sampled filtering, the texture unit takes the weighted average of four texels. Trilinear-sampled filtering uses prefiltered versions of the texture, which form a hierarchy, or stack, of textures called a mip-map,9 illustrated in Figure 12. In trilinear-sampled mode, the filtering operation blends eight texels; that is, the operation linearly blends two bilinearly filtered levels.

Figure 12. Mip-map hierarchy.

In Figure 11, a circle in screen space (Figure 11b) maps to an ellipse in texture space (Figure 11a). This means the texels needed to obtain one pixel’s color value occupy an elliptical footprint in texture memory. The degree of anisotropy is the ratio of the ellipse’s major and minor axes. Larger anisotropy ratios require more texels to be read and evaluated when performing an anisotropic filtering operation. The GeForce 6800 supports up to 16:1 anisotropic filtering, and it processes texture lookup requests at four FP16 texels per clock cycle per texture unit.
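The bilinear and trilinear cases can be sketched in a few lines of Python. This is a minimal software model under stated assumptions: single-channel textures stored as lists of rows, texel centers at half-integer coordinates, clamp-to-edge addressing, and a mip chain in which each level halves the resolution of the previous one. The function names are hypothetical, and the hardware's exact weighting and addressing rules are not given in the article.

```python
import math

def texel(tex, x, y):
    """Fetch one texel with clamp-to-edge addressing."""
    h, w = len(tex), len(tex[0])
    return tex[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

def bilinear(tex, u, v):
    """Weighted average of the four texels surrounding (u, v)."""
    x, y = u - 0.5, v - 0.5          # texel i has its center at i + 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * texel(tex, x0, y0) + fx * texel(tex, x0 + 1, y0)
    bottom = (1 - fx) * texel(tex, x0, y0 + 1) + fx * texel(tex, x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom

def trilinear(mips, u, v, lod):
    """Blend bilinear samples from two adjacent mip levels (eight texels)."""
    lod = min(max(lod, 0.0), len(mips) - 1)
    lo = int(math.floor(lod))
    hi = min(lo + 1, len(mips) - 1)
    t = lod - lo
    # (u, v) are in level-0 texel units; each coarser level halves them.
    a = bilinear(mips[lo], u / 2 ** lo, v / 2 ** lo)
    b = bilinear(mips[hi], u / 2 ** hi, v / 2 ** hi)
    return (1 - t) * a + t * b

level0 = [[0.0, 1.0], [2.0, 3.0]]
level1 = [[1.5]]                     # average of the four level-0 texels
mips = [level0, level1]

center = bilinear(level0, 1.0, 1.0)          # midway between all four texels
blended = trilinear(mips, 0.5, 0.5, 0.5)     # halfway between the two levels
```

Sampling at the point shared by all four level-0 texels returns their average, and a level of detail of 0.5 returns the midpoint of the two per-level results, which is the "linearly blends two bilinearly filtered levels" behavior described above.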

Pixel engines
The GeForce 6800 contains 16 pixel engines. These fixed-function units perform depth and stencil test and update, as well as color blending, at 16 pixels per clock cycle. If no color destination is active, depth and stencil test can run at 32 pixels per clock cycle; fast depth and stencil update accelerates shadow volume rendering. Blending of 16-bit floating-point frame buffer values has proved to be one of the GeForce 6800’s most important new features because it directly accelerates HDR rendering and light accumulation. The memory controller uses lossless color and depth compression to reduce bandwidth demands. Finally, the pixel engines support high-quality antialiasing (filtering).

Figure 11. Texture and perspective view: texture with elliptical footprint (a), perspective image with circular footprint in screen space (b), texture close-up (c), and resampled image (d).

Pixel pipeline detail. Each pixel engine connects to a specific memory partition (see Figure 5). The pixel engines expand the depth and color of each fragment into multiple samples when the renderer enables antialiasing. When possible, the engines losslessly compress depth and color, indicated by depth compression and color compression in Figure 13. The depth and color units then read and write to the local memory partition to carry out the depth and stencil, and color-blend operations.
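The stencil update these units perform is the core of the stencil shadow-volume pass described earlier in the article. As a minimal Python sketch of that counting rule for a single pixel: shadow-volume faces that pass the depth test increment the stencil value if front-facing and decrement it if back-facing, and a nonzero result marks the sample as shadowed. The data layout (a list of depth and facing pairs along one view ray), the function names, and the choice to count only faces nearer than the visible surface are assumptions for illustration; the cited shadow-volume references cover robust variants.

```python
def stencil_count(scene_depth, volume_faces):
    """Return the stencil value for one pixel.

    scene_depth  -- depth of the visible surface from the depth-only pass
    volume_faces -- (depth, facing) pairs for shadow-volume faces covering
                    this pixel; facing is "front" or "back"
    """
    stencil = 0
    for depth, facing in volume_faces:
        if depth >= scene_depth:
            continue  # fails the depth test: behind the visible surface
        stencil += 1 if facing == "front" else -1
    return stencil

def in_shadow(scene_depth, volume_faces):
    """A nonzero stencil value indicates shadow."""
    return stencil_count(scene_depth, volume_faces) != 0

# One shadow volume entered at depth 2.0 and exited at depth 5.0
# along the view ray through this pixel.
volume = [(2.0, "front"), (5.0, "back")]

inside = in_shadow(3.0, volume)    # surface lies inside the volume
before = in_shadow(1.0, volume)    # surface lies in front of the volume
beyond = in_shadow(6.0, volume)    # ray enters and leaves before the surface
```

No color is written in this pass, which is why the text singles out depth-only and stencil-only modes as performance regimes of their own.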

Antialiasing. The GeForce 6800 supports various antialiasing options, which trade image quality for performance. The two primary algorithms are multisampling and supersampling. Both involve generating two, four, or eight samples for each displayed pixel, then taking a weighted average of all samples to produce the pixel’s displayed color. Multisampling executes the fragment program once per pixel fragment and reuses the resulting color value for all its samples. Supersampling reruns the fragment program to generate a unique color for every sample. In both cases, we evaluate the depth correctly and uniquely at each pixel subsample location. This frequency of evaluation is necessary to avoid image artifacts and to achieve smooth edges at silhouettes and object interpenetrations. Multisampling imposes a significantly smaller fragment processor load while antialiasing edges and interpenetrations. Supersampling multiplies the fragment processor load by the sample count to provide additional antialiasing of each fragment’s resulting color.

Figure 13. Pixel engine block diagram. (Data flow: data from fragment processor; pixel X-bar interconnect; multisample antialiasing; depth compression and color compression; depth raster operation and color raster operation; frame buffer partition; memory.)

The GeForce 6800, the flagship of an architectural line targeted at a large and diverse market, supports interactive rendering approaching cinematic quality. The architecture is tailored to its highly parallel task and can also scale down to low-power, low-cost devices. The GeForce 6800 is one of the most complex logic designs shipping in high volume today.

References
1. J. Cohen et al., “Real-time High Dynamic Range Texture Mapping,” Proc. 12th Eurographics Rendering Workshop, European Assoc. for Computer Graphics, 2001, pp. 313-320.
2. F. Crow, “Shadow Algorithms for Computer Graphics,” Proc. 4th Ann. Conf. Computer Graphics and Interactive Techniques (Siggraph 77), ACM Press, 1977, pp. 242-248.
3. C. Everitt and M.J. Kilgard, “Practical and Robust Shadow Volumes for Hardware-Accelerated Rendering,” Mar. 2002; http://developer.nvidia.com/object/robust_shadow_volumes.html.
4. E. Lindholm, M. Kilgard, and H. Moreton, “A User-Programmable Vertex Engine,” Proc. 28th Ann. Conf. Computer Graphics and Interactive Techniques (Siggraph 01), ACM Press, 2001, pp. 149-158.
5. J.D. Foley et al., Computer Graphics: Principles and Practice, 2nd ed., Addison-Wesley, 1990.
6. P.S. Heckbert and H.P. Moreton, “Interpolation for Polygon Texture Mapping and Shading,” State of the Art in Computer Graphics: Visualization and Modeling, Springer-Verlag, 1991, pp. 101-111.
7. P.S. Heckbert, “Survey of Texture Mapping,” IEEE Computer Graphics and Applications, vol. 6, no. 6, Nov. 1986, pp. 56-67.
8. T. Huettner and W. Strasser, “Fast Footprint MIPmapping,” Proc. Eurographics/SIGGRAPH Workshop Graphics Hardware, ACM Press, 1999, pp. 35-44.
9. L. Williams, “Pyramidal Parametrics,” Proc. 10th Ann. Conf. Computer Graphics and Interactive Techniques (Siggraph 83), ACM Press, 1983, pp. 1-11.

John Montrym is the chief architect at Nvidia, where he has influenced the development of the architecture, hardware design, and design methodologies of 12 GPU products. He has a BS in electrical engineering from the Massachusetts Institute of Technology.

Henry Moreton is a member of the architecture group at Nvidia. His research interests include GPU programming models and architecture. Moreton has a PhD in computer science from the University of California, Berkeley.

Direct questions and comments about this article to John Montrym or Henry Moreton at Nvidia, 2701 San Tomas Expressway, Santa Clara, CA 95050; montrym@nvidia.com or moreton@nvidia.com.

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.
