IEEE Micro GeForce 6800
net/publication/3215459
2 authors, including:
Henry P. Moreton
NVIDIA
All content following this page was uploaded by Henry P. Moreton on 03 January 2017.
MARCH–APRIL 2005 3
HOT CHIPS 16
4 IEEE MICRO
tices, which the GPU can process independently. The renderer expresses the result of its calculations as millions of independent pixels. These high levels of parallelism permit the efficient deployment of broadly and deeply parallel computational resources.
• Memory. The memory subsystem is the most precious resource in any graphics system, and its characteristics heavily influence the GPU’s design. Designers must fit the GPU architecture to the memory subsystem’s bandwidth and latency characteristics.
Programmability
The GeForce 6800’s programming model enables parallelized acceleration. There are two separate programs: The application executes a vertex program independently on every vertex; similarly, the GPU applies a fragment program independently to every pixel fragment.
For every vertex received in the command stream, the GPU launches a thread executing the vertex program. For every rasterized pixel fragment, the machine dispatches one thread of the fragment program. Each thread has its own unique inputs available in read-only registers. Supporting hardware loads these inputs before thread launch. Each thread also has write-only output registers, whose content the machine forwards to the next processing stage. In addition to these inputs and outputs, each thread has private temporary registers, read-only program parameters, and access to filtered and resampled texture map images.
Nvidia introduced the first programmable GPU, the GeForce3, in 2001. The GeForce3 supported a programmable vertex processor.4 In 2002, the original GeForce FX series introduced programmable vertex and fragment processors. Now, the GeForce 6800 has unified these capabilities and made them orthogonal. The fragment processor supports dynamic flow control, as the vertex processor did in the GeForce FX. In addition, the vertex program can access the texture subsystem, previously available only to fragment programs. The FX had introduced floating-point textures and frame buffers; the GeForce 6800 adds the ability to blend and filter in floating-point. Finally, from a language and API perspective, the GeForce 6800 supports both Direct3D and OpenGL with just-in-time compiled machine-independent assembler as well as higher-level C-like programming languages.
Parallelism
Contrasting CPUs and GPUs makes it easier to understand the motivation behind the GPU architecture. The GPU workload offers more independent calculations than a typical CPU workload; the programmer’s view is single threaded, while the machine is actually deeply multithreaded. The GPU can afford larger amounts of floating-point computational power because the control overhead per operation is lower than that for a CPU, and a GPU can effectively execute extensive floating-point computations. The simple programming model and large amount of independent calculation result in deep and wide parallelism for the GeForce 6800 to exploit.
Another interesting difference between CPUs and GPUs is the use of dedicated mode-controlled functional units for specialized performance-critical tasks. In addition to the programmable vertex and fragment processors, there are specialized units for data fetch, rasterization (conversion from triangles to pixel fragments), and texture filtering. We determined the processor instruction set by analyzing the graphics workload. For example, because of their importance to graphics algorithms, the GeForce 6800 includes fast and accurate transcendental functions and inner-product instructions.
Memory
The memory bandwidth demands of GPU systems have always been insatiable, largely because there are so many concurrently active threads. CPUs have dealt with memory limitations by using ever-larger caches, but graphics working-set sizes have grown at least as fast as transistor density, and it remains prohibitive to implement an on-chip cache large enough to achieve 99 percent hit rates. Caches as part of the memory hierarchy cannot affordably support long-term reuse. Therefore, our GPU cache designs assume a 90 percent hit rate with many misses in flight. Stated another way, we implement caches that support effective streaming with local reuse of fetched data.
Because of bandwidth limitations, we aim for 100 percent memory bandwidth utiliza-
[Figure: block diagram. Recoverable labels: command and data fetch; vertex processors; fragment crossbar; pixel-blending units.]
Figure 7. Page-friendly rasterization.
single-precision accuracy for operands in the nominal range.
The computational units fetch operands from a 512 × 128-bit constant RAM, from temporary registers up to 32 × 128 bits, and from 16 × 128-bit input registers. The processor feeds computed results back into the temporary registers or out to one of the 16 × 128-bit output registers. The vertex processor reads instructions from a 512-entry instruction RAM.
To preserve a simple implementation-independent programming model, the vertex processor uses threads to make the data path appear to have unity latency, and it uses scoreboarding to hide texture fetch latency. The implementation is fully multiple instruction, multiple data; therefore, data-dependent branches are free of the penalty normally accompanying single-instruction, multiple-data implementations. Finally, the processor can issue instructions to both vector and scalar data paths at every clock cycle.
Primitive setup and rasterizer
The APIs define the various activities occurring between the vertex and fragment stages with unique precision requirements. Therefore, these activities don’t require programmability and are implemented efficiently in fixed-function units.
The primitive assembly unit assembles primitives such as lines or triangles from transformed vertices. Vertex positions arrive as 4-
Fragment processor
The GPU forwards attributes, specified at the triangle’s vertices, from the vertex processor to the fragment processor. The fragment processor smoothly interpolates these attributes across the triangle’s face. Using these interpolated input attributes, a fragment program computes output colors, using math and texture lookup instructions. The GeForce 6800 fragment processor can perform operations with 16- or 32-bit floating-point precision (FP16 and FP32). The inputs to the fragment processor are position, color, depth, fog, and 10 generic 4 × FP32 attributes. The processor sends its outputs to as many as four render target buffers. Like the vertex processor, the fragment processor is general purpose, and it has constants, temporary register resources, and branching capabilities similar to those of the vertex processor.
Fragment processor detail
As Figure 8 shows, each of the 16 fragment processors includes an interpolation block for input attributes, two vector math units, a special-function/normalize unit, and a texture unit. Both computation blocks can perform 4-vector floating-point operations. The lower block can do a multiply-add operation. Combined, the two blocks can sustain 12 floating-point operations per pixel per clock cycle. The lower block also supports the same transcendental functions supported in the vertex processor’s special-function unit. To hide the latency of texture lookups that fetch from external memory, each fragment processor maintains state for hundreds of in-flight threads.
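The throughput and latency-hiding figures quoted for the fragment processor can be worked through with a little arithmetic. The Python sketch below is illustrative only. The 4 + 8 breakdown of the 12 operations per clock is one reading of the text (one 4-wide operation in one block, plus one 4-wide multiply-add counted as two operations per component in the lower block). The clock rate, memory latency, and all function names are hypothetical placeholders, not figures taken from this article.

```python
import math

def flops_per_pixel_per_clock(vector_width=4):
    """One reading of the 12-operation figure: 4 ops + 4 multiply-adds (2 ops each)."""
    plain_ops = vector_width * 1   # one operation per component
    mad_ops = vector_width * 2     # multiply-add counted as two operations
    return plain_ops + mad_ops

def peak_gflops(num_processors, flops_per_clock, clock_mhz):
    """Peak rate in GFLOPS for an assumed (hypothetical) clock in MHz."""
    return num_processors * flops_per_clock * clock_mhz / 1000.0

def threads_to_hide_latency(latency_cycles, work_cycles_per_thread):
    """Threads needed so the others cover one thread's fetch latency.

    While one thread waits latency_cycles, the remaining N - 1 threads must
    each supply work_cycles_per_thread of useful work:
    (N - 1) * work >= latency.
    """
    return math.ceil(latency_cycles / work_cycles_per_thread) + 1

if __name__ == "__main__":
    ops = flops_per_pixel_per_clock()
    print(ops)                               # operations per pixel per clock
    print(peak_gflops(16, ops, 400))         # 400 MHz is an assumed clock
    print(threads_to_hide_latency(200, 1))   # 200-cycle latency is assumed
```

With an assumed latency of a few hundred cycles and about one cycle of work per thread between fetches, the thread count lands in the hundreds, which is consistent in spirit with the statement above.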
Superscalar instruction issue
Microsoft’s DirectX 9 graphics API supports a vector-oriented instruction set. The assembler has instructions that perform most operations on 4-vectors of FP32 data. However, many fragment processing algorithms treat alpha, the transparency component, separately from the three color components. As a result, the assembler has provisions to indicate a pairing of instructions: an instruction operating on a 3-vector, usually RGB, paired with an instruction operating on a scalar, usually alpha. This mechanism permits dual issue of source-level instructions. The GeForce 6800’s fragment processor supports fully general 4-vector split operations (4-vector, 3/1-vector, and 2/2-vector operations), as Figure 9 illustrates.
The two computation stages can exploit this dual issue of instructions to perform two distinct operations on different subsets of the 4-vector. Together with texture and special functions, each fragment processor can execute up to six DirectX 9 instructions per pixel per clock cycle. Figure 10 is an example of six-issue code.
Figure 8. Fragment processor block diagram. [Recoverable labels: attribute interpolation; vector unit; fragment texture unit; level 1 texture cache; level 2 texture cache; vector and special-function unit; temporary registers; memory; texture related; computation unit; output.]
[Figure 9 panel labels, each over components R G B A: 4-vector single issue; 3/1-vector dual issue; 2/2-vector dual issue.]
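The vector split options described in this section can be sketched in a few lines of Python. This is a minimal model of the idea only, not of the hardware: the function name `co_issue`, its parameters, and the example operations are hypothetical. A split of 4 models single issue, 3 models the 3/1 case (RGB plus alpha), and 2 models the 2/2 case.

```python
def co_issue(vec, split, op_first, op_second=None):
    """Apply op_first to the first `split` components and op_second to the rest.

    vec       -- a 4-component list, e.g. [r, g, b, a]
    split     -- 4 (single issue), 3 (3/1 dual issue) or 2 (2/2 dual issue)
    op_first  -- function applied to each component of the first group
    op_second -- function applied to each component of the second group
    """
    if len(vec) != 4:
        raise ValueError("expected a 4-vector")
    if split not in (2, 3, 4):
        raise ValueError("split must be 2, 3 or 4")
    if split < 4 and op_second is None:
        raise ValueError("dual issue needs a second operation")
    first = [op_first(x) for x in vec[:split]]
    second = [op_second(x) for x in vec[split:]]  # empty when split == 4
    return first + second

if __name__ == "__main__":
    rgba = [0.25, 0.5, 0.75, 0.5]
    # 3/1 dual issue: scale the colour, square the alpha
    print(co_issue(rgba, 3, lambda c: c * 2.0, lambda a: a * a))
    # 2/2 dual issue: two independent operations on two component pairs
    print(co_issue(rgba, 2, lambda x: x + 1.0, lambda x: x * 0.5))
```

The point of the sketch is that both groups are processed in the same step, so two source-level instructions occupy one 4-wide issue slot.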
Figure 9. Vector issue options.
Texture unit
The literature provides a good overview of texture mapping.6-8 A texture map is an array of data in one, two, or three dimensions. The simplest uses of texture in rendering involve mapping a decal image onto some object built from a collection of geometric primitives. Figure 11 provides an example. Because each pixel maps to a region of the image, filtering is necessary to eliminate image frequency content above the sampling rate implied by the pixel footprint in texture space. Instead of the fragment footprint, which includes a coverage mask, the texture unit uses the fully covered pixel footprint to determine filtering.
With arbitrary programs, a texture is more generally a way to express a function of one, two, or three variables as a table. We can think of the function value as a color 4-tuple (red, green, blue, and alpha) or more generally as an n-tuple of arbitrary values. As with simple image mapping, the fixed-function texture unit’s job is to return a properly sampled result, given the input address vector. Proper sampling is a weighted average of a collection of samples near the ideal sample location, with minimal aliasing, and it shouldn’t introduce too much blurring.
The texture unit operates with a deeply pipelined cache. Typically, the cache has many hits and misses in flight. To reduce memory traffic, the application can use compressed-texture formats. To facilitate fine-grained access and random addressability, these formats use small-grained fixed-ratio schemes, with a fixed compression ratio of 4:1. Because the ratio is fixed, it is also a lossy scheme.
The texture subsystem must filter results before returning them to the requesting fragment processor. The GeForce 6800 supports four types of filtering: point-, bilinear-, and trilinear-sampled, and anisotropic. A point-sampled request simply returns the texel (texture element, or pixel) nearest to the address
Figure 12. Mip-map hierarchy.
ps_2_0
def c1, 2.0, -1.0, 0.0, 0.0
dcl t0.rg
dcl t1
dcl t4.rgb
dcl v0
dcl_2d s0
dcl_2d s1
dcl_cube s2
dcl_2d s3
# clock 1
texld r0, t0, s0;      # tex fetch
madr r0,r0,c1.r,c1.g   # _bx2 in tex
nrm r1.rgb, t4         # nrm in shdr0
dp3 r1.r,r1,r0         # 3D dot in shdr1
mul r0.a,r0,r0         # dual issue in shdr1
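To make the arithmetic in the listing's first clock easier to follow, here is a rough Python emulation of those five instructions. It rests on several assumptions: `madr` is read as a multiply-add (r0 * c1.r + c1.g, which is the _bx2 expansion x * 2 - 1 given c1 = (2.0, -1.0, 0.0, 0.0)), the texture fetch is replaced by a texel passed in by the caller, and the helper names `bx2`, `nrm3`, `dp3`, and `clock1` are hypothetical.

```python
import math

def bx2(v):
    """Expand [0, 1] values to [-1, 1]: x * 2.0 - 1.0 (the mad with c1.r, c1.g)."""
    return [x * 2.0 - 1.0 for x in v]

def nrm3(v):
    """Normalize the first three components (nrm). Undefined for a zero vector."""
    length = math.sqrt(sum(x * x for x in v[:3]))
    return [x / length for x in v[:3]]

def dp3(a, b):
    """Three-component dot product (dp3)."""
    return sum(x * y for x, y in zip(a[:3], b[:3]))

def clock1(texel, t4):
    """Emulate the five instructions issued in the listing's first clock.

    texel -- 4-component result standing in for 'texld r0, t0, s0'
    t4    -- 3-component interpolated attribute
    Returns (r1.r, r0.a) after the dp3 and the dual-issued mul.
    """
    r0 = bx2(texel)           # mad: _bx2 expansion of the fetched texel
    r1 = nrm3(t4)             # nrm r1.rgb, t4
    r1_r = dp3(r1, r0)        # dp3 r1.r, r1, r0
    r0_a = r0[3] * r0[3]      # mul r0.a, r0, r0 (alpha lane only)
    return r1_r, r0_a

if __name__ == "__main__":
    print(clock1([0.5, 0.5, 1.0, 1.0], [0.0, 0.0, 2.0]))
```

The dp3 writes a single color lane while the mul writes only alpha, which is the kind of pairing the listing's "dual issue" comment refers to.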
Figure 11. Texture and perspective view: texture with elliptical footprint (a), perspective image with circular footprint in screen space (b), texture close-up (c), and resampled image (d).
Pixel engines
The GeForce 6800 contains 16 pixel engines. These fixed-function units perform depth and stencil test and update, as well as color blending, at 16 pixels per clock cycle. If no color destination is active, depth and stencil test can run at 32 pixels per clock cycle; fast depth and stencil update accelerates shadow volume rendering. Blending of 16-bit floating-point frame buffer values has proved to be one of the GeForce 6800’s most important new features because it directly accelerates HDR rendering and light accumulation. The memory controller uses lossless color and depth compression to reduce bandwidth demands. Finally, the pixel engines support high-quality antialiasing (filtering).
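The value of floating-point blending for light accumulation can be illustrated with a small comparison. The Python sketch below is a simplified model under stated assumptions: it contrasts additive blending into an 8-bit normalized buffer, which clamps at 1.0, with additive blending into a floating-point buffer, which does not. It does not model FP16 rounding or any actual hardware behavior, and the function names are hypothetical.

```python
def accumulate_unorm8(light_values):
    """Additively blend contributions into an 8-bit normalized channel.

    Each contribution is clamped to [0, 1] and quantized to 0..255, and the
    running sum saturates at 255, as a fixed-point frame buffer would.
    """
    total = 0
    for value in light_values:
        clamped = min(max(value, 0.0), 1.0)
        quantized = int(clamped * 255 + 0.5)
        total = min(total + quantized, 255)
    return total / 255.0

def accumulate_float(light_values):
    """Additively blend contributions into a floating-point channel (no clamp)."""
    total = 0.0
    for value in light_values:
        total += value
    return total

if __name__ == "__main__":
    lights = [0.5, 0.5, 0.5]          # three hypothetical light contributions
    print(accumulate_unorm8(lights))  # saturates: detail above 1.0 is lost
    print(accumulate_float(lights))   # keeps the full range for later tone mapping
```

Keeping the sum above 1.0 is what lets a later pass map the accumulated high-dynamic-range value to the display instead of working from an already saturated result.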