
The Graphics Card as a Stream Computer

Suresh Venkatasubramanian
AT&T Labs – Research

1 Introduction

Massive data sets have radically changed our understanding of how to design efficient algorithms; the streaming paradigm, whether in terms of the number of passes of an external-memory algorithm or the single pass and limited memory of a stream algorithm, appears to be the dominant method for coping with large data.

A very different kind of massive computation has had the same effect at the level of the CPU. It has long been observed [Backus 1977] that the traditional von Neumann-style architecture creates memory bottlenecks between the CPU and main memory, and much of the chip design of recent years has focused on methods to alleviate this bottleneck, by way of fast memory, caches, prefetching strategies and the like. However, all of this has made the memory bottleneck itself the focus of chip optimization efforts, and this is reflected in the amount of real estate on a chip devoted to caching and memory-access circuitry, as compared to the ALU itself [Duca et al. 2003].

For compute-intensive operations, this is an unacceptable tradeoff. The most prominent example is that of the computations performed by a graphics card. The operations themselves are very simple and require very little memory, but they must be performed extremely fast and in parallel to whatever degree possible. Inspired in part by dataflow architectures and systolic arrays, the development of graphics chips focused on high computational throughput while sacrificing (to a degree) the generality of a CPU.

What has resulted is a stream processor that is highly optimized for stream computations. Today's GPUs (graphics processing units) can process over 50 million triangles and 4 billion pixels in one second. Their "Moore's Law" is faster than that for CPUs, owing primarily to their stream architecture, which allows all additional transistors to be devoted directly to increasing computational power. An intriguing side effect of this is the growing use of the graphics card as a general-purpose stream processing engine. In an ever-increasing array of applications, researchers are discovering that performing a computation on a graphics card is far faster than performing it on a CPU, and so are using the GPU as a stream co-processor. Another feature that makes the graphics pipeline attractive (and distinguishes it from other stream architectures) is the spatial parallelism it provides: conceptually, each pixel on the screen can be viewed as a stream processor, potentially giving a large degree of parallelism.
2 A Model Of A Graphics Card

A detailed description of the architecture of a graphics card is beyond the scope of this note; what follows is a distillation of the key components. The stream elements that a card processes are points and, at a later stage, fragments, which are essentially pixels with color and depth information. A program issues requests to the card by specifying points, lines, or facets (and their attributes) and issuing a rendering request. The vertices go through several processing phases (called a vertex program) and are finally scan-converted into fragments by a rasterizer that takes a three-dimensional scene and breaks it into pixels. Each fragment is then processed by another stream computation (called a fragment program) and finally ends up as a pixel on the screen. Alternatively, the pixels and their attributes can be captured in internal storage and their contents retrieved to the CPU. Spatial parallelism occurs at the level of a fragment: each fragment program can be viewed as running in parallel at each pixel, while the vertex programs are executed on each point. Texture memory provides a form of limited local storage for maintaining state, though access to it is restricted.
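To make the dataflow concrete, the following is a minimal CPU-side sketch in C++ of the pipeline just described. The types Vertex, Fragment and Pixel and the function render are illustrative placeholders rather than the interface of any actual card; the point is only that the same vertex program is applied to every vertex and the same fragment program to every fragment.

    #include <functional>
    #include <vector>

    // Illustrative element types: a vertex, a fragment (a pixel location with
    // interpolated attributes), and a final pixel (color and depth).
    struct Vertex   { float x, y, z; float attr[4]; };
    struct Fragment { int px, py; float depth; float attr[4]; };
    struct Pixel    { float r, g, b; float depth; };

    // The pipeline applies the same vertex program to every vertex and the same
    // fragment program to every fragment: the uniform-streaming discipline.
    std::vector<Pixel> render(
        const std::vector<Vertex>& input,
        const std::function<Vertex(const Vertex&)>& vertexProgram,
        const std::function<std::vector<Fragment>(const std::vector<Vertex>&)>& rasterize,
        const std::function<Pixel(const Fragment&)>& fragmentProgram)
    {
        // Stage 1: vertex program, one invocation per vertex.
        std::vector<Vertex> transformed;
        transformed.reserve(input.size());
        for (const Vertex& v : input)
            transformed.push_back(vertexProgram(v));

        // Stage 2: rasterization, turning primitives into fragments.
        std::vector<Fragment> fragments = rasterize(transformed);

        // Stage 3: fragment program, one invocation per fragment; on the card
        // these invocations run in parallel, conceptually one per pixel.
        std::vector<Pixel> framebuffer;
        framebuffer.reserve(fragments.size());
        for (const Fragment& f : fragments)
            framebuffer.push_back(fragmentProgram(f));

        return framebuffer;
    }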
2.1 A Computational Model

From the above description, it is clear that a GPU executes a stream program. There are some key differences between it and a general-purpose stream program. The most essential one is that every object in the stream is processed by the same function.

Recall that in a general stream computation [Feigenbaum et al. 1999] the data streams by an arbitrary Turing machine that has access to limited memory. In general there is no restriction on what is computed; the restriction that the hardware enforces is thus significant. Further, the computations themselves are limited to branching programs; no looping or jumping is allowed (there are some exceptions, but they are not significant).

Another important (and related) difference is the access to memory: a stream program on a graphics card has access to much less memory than a general stream program. Although graphics cards carry a great deal of memory (256 MB on the best cards available today), each processing unit really has access to only a few fast registers, unlike the much larger memory limits of a standard streaming process.

Finally, the typical computation performed on the GPU is multi-pass; a constant or even ω(1) number of passes may be required for a particular computation.
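The distinction drawn above can be phrased as two small interfaces, sketched here in C++ (the names uniformStream and generalStream are illustrative, not taken from any existing library): in the uniform model the same stateless kernel is applied independently to every element, while a general stream algorithm may carry a small amount of working memory from one element to the next.

    #include <functional>
    #include <utility>
    #include <vector>

    // Uniform streaming (GPU-style): the same kernel runs on every element and
    // no state is carried between elements, only a few per-element registers.
    template <typename In, typename Out>
    std::vector<Out> uniformStream(const std::vector<In>& data,
                                   const std::function<Out(const In&)>& kernel)
    {
        std::vector<Out> out;
        out.reserve(data.size());
        for (const In& x : data)
            out.push_back(kernel(x));   // independent, hence parallelizable
        return out;
    }

    // General streaming: a single pass, but with an arbitrary update of a small
    // working memory (the "limited memory" of a stream algorithm).
    template <typename In, typename State>
    State generalStream(const std::vector<In>& data,
                        State memory,
                        const std::function<State(State, const In&)>& update)
    {
        for (const In& x : data)
            memory = update(std::move(memory), x);  // state flows across elements
        return memory;
    }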
2.2 Less Is More

Although it may seem that the above restrictions are artificial and limiting, it turns out that they are at the heart of the performance gains achieved by graphics cards. Because each item in the stream is processed in the same way, microcode for the computation can be loaded directly onto the GPU, and no expensive memory accesses are needed to fetch the next instruction. Using branching programs enforces the pipeline discipline and ensures that items are processed without stalls. The lack of memory access allows small caches to be used and prevents the memory bottleneck from swamping the chip. It is also important to observe that other streaming architectures that have been designed and proposed ([Kapasi et al. 2002; Bove and Watlington 1995]) share little in common with graphics cards but do share the above uniformity assumption. Thus, uniform streaming appears to be the common underlying computational metaphor for stream architectures in general.

Limitations On Computing Power

As far back as 1988, Fournier and Fussell [1988] presented a view of the GPU as a stream computing engine, and gave various upper and lower bounds for simple problems. They show, for example, that sorting is a hard problem in this model, and investigate how the power of the pipeline changes as more and more registers and operations are added to the model. More recently, Guha et al. [2002] showed that computing the median of n numbers takes n passes in the "standard" graphics pipeline model. This contrasts with the O(log n) passes required to compute the median in a stream model. Interestingly, the addition of an extra operator (an interval test of the form "Given z, is a ≤ z ≤ b?") reduces the number of passes needed (with randomization) to O(log n).
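To illustrate how an interval test enables multi-pass selection, here is a simplified CPU sketch in C++. Each pass applies the same test lo ≤ z ≤ hi to every element, and the only value that survives a pass is a single counter, much as an occlusion-style query would report. The sketch binary-searches the value domain of integer keys, so it uses O(log U) passes for keys drawn from a range of size U; it is a simplification meant to convey the idea, not the randomized O(log n)-pass algorithm of Guha et al. [2002].

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One pass of uniform streaming with an interval test: the same test
    // lo <= z <= hi is applied to every element, and the only information that
    // crosses elements is a single counter.
    static std::size_t countInInterval(const std::vector<int64_t>& stream,
                                       int64_t lo, int64_t hi)
    {
        std::size_t count = 0;
        for (int64_t z : stream)
            if (lo <= z && z <= hi)
                ++count;
        return count;
    }

    // Select the element of rank k (0-based) from a stream of integers known to
    // lie in [minVal, maxVal], using one counting pass per iteration of a binary
    // search on the value domain.  The median is the case k = stream.size() / 2.
    int64_t selectByRank(const std::vector<int64_t>& stream, std::size_t k,
                         int64_t minVal, int64_t maxVal)
    {
        int64_t lo = minVal, hi = maxVal;
        while (lo < hi) {
            int64_t mid = lo + (hi - lo) / 2;
            // One pass: how many elements are at most mid?
            std::size_t atMost = countInInterval(stream, minVal, mid);
            if (atMost > k)
                hi = mid;      // the rank-k element is at most mid
            else
                lo = mid + 1;  // the rank-k element is larger than mid
        }
        return lo;
    }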
3 Examples

There are numerous examples of places where the GPU has been used to perform general stream computations. A somewhat incomplete list follows:

• General distance field calculations: Given a distance function and a set of objects, compute for each object its (nearest/farthest/median) neighbour. This yields algorithms for Voronoi diagrams [Hoff III et al. 1999], general shape fitting [Agarwal et al. 2002], spatial convolution, outlier detection, and other problems (see the sketch after this list).

• Object intersections: Given two objects, determine whether they intersect and by how much. This yields algorithms for spatial joins [Sun et al. 2003], penetration depth [Agarwal et al. 2002], and range searching [Krishnan et al. 2002].

• Scientific computing: Solve partial differential equations via finite element methods [Diewald et al. 2001].

• Physical simulation: Simulate the motion of objects in a physically realistic way. Useful for game programming and dynamical-system modelling.
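As a sketch of the first item, the following C++ code shows what the per-pixel computation of a discrete Voronoi diagram (equivalently, a distance field with nearest-neighbour labels) conceptually does: every pixel runs the same kernel, which finds the nearest of a set of sites. On the card, the per-pixel loop is the implicit fragment parallelism, and the minimum is typically taken by the depth test over rendered distance cones [Hoff III et al. 1999] rather than by the explicit loop over sites shown here.

    #include <cstddef>
    #include <limits>
    #include <vector>

    struct Site { float x, y; int id; };

    // The per-pixel kernel: given a pixel centre, return the id of the nearest
    // site.  Every pixel runs this same function, independently of the others.
    static int nearestSite(float px, float py, const std::vector<Site>& sites)
    {
        int best = -1;
        float bestDist = std::numeric_limits<float>::max();
        for (const Site& s : sites) {
            float dx = px - s.x, dy = py - s.y;
            float d = dx * dx + dy * dy;           // squared Euclidean distance
            if (d < bestDist) { bestDist = d; best = s.id; }
        }
        return best;
    }

    // Discrete Voronoi diagram on a width x height grid: label each pixel with
    // the id of its nearest site.
    std::vector<int> discreteVoronoi(std::size_t width, std::size_t height,
                                     const std::vector<Site>& sites)
    {
        std::vector<int> labels(width * height, -1);
        for (std::size_t y = 0; y < height; ++y)
            for (std::size_t x = 0; x < width; ++x)
                labels[y * width + x] = nearestSite(x + 0.5f, y + 0.5f, sites);
        return labels;
    }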
It is worth noting that none of the above applications make use of the extensive features of vertex and fragment programs, and in fact only use a very primitive subset of operations.

4 Directions

There are a number of challenging directions to pursue. There is already work on software and language constructs to express general-purpose stream programs in terms of low-level graphics hardware commands (so that the programmer need know nothing about graphics) [Buck and Hanrahan 2003]. There is work in compilers on optimizing the translation of high-level stream constructs [Chan et al. 2002]. Other work seeks to extend the uniform streaming model to allow limited caching of state [Duca et al. 2003].
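As a rough indication of the flavour of such constructs, here is an illustrative C++ sketch of a stream type with a map operation; it is a strawman of this note's, not the interface of [Buck and Hanrahan 2003] or any other system. The programmer writes an ordinary kernel and maps it over a typed stream, and a GPU backend would be free to compile the kernel to a fragment program and execute the map as one or more rendering passes.

    #include <utility>
    #include <vector>

    // An illustrative stream wrapper: a typed, ordered collection that a
    // runtime could choose to place in texture memory on the card.
    template <typename T>
    class Stream {
    public:
        explicit Stream(std::vector<T> data) : data_(std::move(data)) {}

        // Map the same kernel over every element: the uniform-streaming
        // primitive.  A GPU backend could compile the kernel to a fragment
        // program and execute the map as a rendering pass.
        template <typename Kernel>
        auto map(Kernel kernel) const {
            using Out = decltype(kernel(std::declval<T>()));
            std::vector<Out> out;
            out.reserve(data_.size());
            for (const T& x : data_)
                out.push_back(kernel(x));
            return Stream<Out>(std::move(out));
        }

        const std::vector<T>& data() const { return data_; }

    private:
        std::vector<T> data_;
    };

    // Example use: a saxpy-like kernel mapped over a stream of floats.
    //   Stream<float> xs(std::vector<float>{1.0f, 2.0f, 3.0f});
    //   auto ys = xs.map([](float x) { return 2.0f * x + 1.0f; });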
From a computational standpoint, all of this leads to a fundamental question: what is the power of uniform streaming and its variants? The answer will be invaluable not only from a theoretical point of view, but also as a way of indicating what problems are likely to be amenable to efficient stream computation.

References

Agarwal, P., Krishnan, S., Mustafa, N., and Venkatasubramanian, S. 2002. Streaming geometric computations using graphics hardware. Tech. rep., AT&T Labs–Research.

Backus, J. 1977. Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. ACM Turing Award Lecture.

Bove, V., and Watlington, J. 1995. Cheops: A reconfigurable data-flow system for video processing.

Buck, I., and Hanrahan, P. 2003. Data parallel computation on graphics hardware. Manuscript.

Chan, E., Ng, R., Sen, P., Proudfoot, K., and Hanrahan, P. 2002. Efficient partitioning of fragment shaders for multipass rendering on programmable graphics hardware. In Proc. SIGGRAPH/Eurographics Workshop on Graphics Hardware.

Diewald, U., Preusser, T., Rumpf, M., and Strzodka, R. 2001. Diffusion models and their accelerated solution in computer vision applications. Acta Mathematica Universitatis Comenianae (AMUC).

Duca, N., Cohen, J., and Kirchner, P. 2003. Stream caching: A mechanism to support multi-record computations within stream processing architectures. DIMACS Working Group on Streaming Analysis II, March.

Feigenbaum, J., Kannan, S., Strauss, M., and Viswanathan, M. 1999. An approximate L1-difference algorithm for massive data streams. In Proc. 40th Symposium on Foundations of Computer Science, IEEE.

Fournier, A., and Fussell, D. 1988. On the power of the frame buffer. ACM Transactions on Graphics, 103–128.

Guha, S., Krishnan, S., Munagala, K., and Venkatasubramanian, S. 2002. The power of a two-sided depth test and its application to CSG rendering and depth extraction. Tech. Rep. TD-5FDS6Q, AT&T.

Hoff III, K. E., Keyser, J., Lin, M., Manocha, D., and Culver, T. 1999. Fast computation of generalized Voronoi diagrams using graphics hardware. Computer Graphics 33, Annual Conference Series, 277–286.

Kapasi, U. J., Dally, W. J., Rixner, S., Owens, J. D., and Khailany, B. 2002. The Imagine stream processor. In Proc. IEEE International Conference on Computer Design, 282–288.

Krishnan, S., Mustafa, N., Muthukrishnan, S., and Venkatasubramanian, S. 2002. Extended intersection queries on a geometric SIMD machine model. Tech. rep., AT&T.

Sun, C., Agrawal, D., and Abbadi, A. E. 2003. Hardware acceleration for spatial selections and joins. In SIGMOD.
