Group6 Simplegpu
Abstract—A Graphics Processing Unit (GPU) is important for many user interface based systems. By offloading a specific set of mathematically intense image generation operations from the Central Processing Unit (CPU), it frees up processing time for more general operations. In this project we developed a subset of these operations to run on the ZYNQ PL hardware and interfaced this newly created Simple GPU with the ZYNQ’s built in ARM microprocessor. The primary and secondary goals were achieved successfully; the Simple GPU was capable of projecting CPU-designated three dimensional coordinates onto a display and optionally drawing lines between them. Due to time constraints, the system lacks stability in several cases, and future work could center on correcting these design flaws. The system could also be extended with more advanced functionality, such as color interpolation, triangle filling, and UV mapping.

I. INTRODUCTION

An increasingly common requirement for many embedded systems is the necessity of providing a visually appealing interface for a user. Since the CPU must perform the primary function of a system, older interfaces were typically limited to fixed function displays or basic text displays. Newer systems overcome this limitation by adding a coprocessor, the GPU, which handles the generation of images for user feedback.

A GPU consists of two primary functions. First, as a vector processor, it performs the four dimensional math necessary to project three dimensional points onto a plane. Second, as a pixel processor, it directly accesses memory in order to store correctly colored pixels onto a frame buffer that can be displayed on a screen without CPU involvement. A more feature rich system would provide more specific functionality under these two categories, but the focus of this project was to produce a system with the minimum number of components necessary to provide the basic functionality of a GPU.

The Simple GPU consists of a multi-stage pipeline of abstract mathematical functions and graphics focused algorithms not covered in class: namely, a four dimensional matrix multiply, a divider, a transform from normalized camera space to screen dimensions, a memory address calculator, and Bresenham’s line drawing algorithm. This system was developed in VHDL for the purpose of being placed on the programmable logic of the ZYNQ chip, a CPU-FPGA hybrid. Because it was designed for an FPGA and consisted of many mathematical operations, the components were written to operate on fixed point numbers, as the LUT requirements for floating point support on an FPGA are quite large.

The inputs to the GPU are the memory address of the current frame buffer, the Model View Projection (MVP) matrix, and the vertices to draw. The GPU transforms the vertices into the camera’s view and then calculates the address at which to store the output. The output of the GPU is stored in a frame buffer in the DDR memory onboard the Zybo. This frame buffer is then read by the display controller to show on the screen. The display controller was created using IP cores from Xilinx [2] and Digilent [3-4].

To send data to the GPU from the processor, two different interfaces are used. An AXI Lite interface provides easy configuration of parameters that do not change often, such as the MVP matrix. A Slave AXI Full interface allows vertices and their colors to be sent quickly to the GPU. Finally, a Master AXI Full interface allows pixel data to be sent straight to DDR memory through the High-Performance AXI ports of the Zynq PS.

II. METHODOLOGY

Each major component of the system is pictured in Figure 1.

A. Matrix Multiplication

A matrix multiplication is a binary operation that consists of multiplying each row of one matrix with each column of the other element wise, then summing the products.
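
As an illustration of the row-by-column products described above, the following is a minimal Python sketch of a fixed point multiply of a 4x4 matrix by a four-component vertex. It is not the VHDL used in the project: the Q16.16 format, and the names `FRAC_BITS`, `to_fixed`, `fixed_mul`, and `mat_vec_mul`, are assumptions made only for this example.

```python
FRAC_BITS = 16  # assumed Q16.16 format; the project's actual word widths are not stated


def to_fixed(value):
    """Convert a Python number to the assumed fixed point integer format."""
    return int(round(value * (1 << FRAC_BITS)))


def to_float(fixed):
    """Convert a fixed point integer back to a float for inspection."""
    return fixed / (1 << FRAC_BITS)


def fixed_mul(a, b):
    """Multiply two fixed point values and rescale the double-width product."""
    return (a * b) >> FRAC_BITS


def mat_vec_mul(matrix, vertex):
    """Multiply a 4x4 fixed point matrix by a 4-component fixed point vertex.

    Each output component is one row of the matrix multiplied element wise
    with the vertex, with the four products summed.
    """
    return [
        sum(fixed_mul(matrix[row][col], vertex[col]) for col in range(4))
        for row in range(4)
    ]


# Hypothetical example: a translation by (2, 3, 4) applied to the point (1, 1, 1).
translate = [[to_fixed(v) for v in row] for row in [
    [1, 0, 0, 2],
    [0, 1, 0, 3],
    [0, 0, 1, 4],
    [0, 0, 0, 1],
]]
point = [to_fixed(v) for v in [1, 1, 1, 1]]
result = [to_float(v) for v in mat_vec_mul(translate, point)]
```

In hardware the sixteen products can be computed in parallel or shared across pipeline stages; the sketch only fixes the arithmetic, not the schedule.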
B. Divider
The divider is the pipelined version provided by Llamocca, wrapped with some additional logic to handle negative numbers. The divider is needed because the results of the matrix multiplication, while in view space, are not normalized to the view edges [6]. Their scale can change dramatically depending on the distance between the camera and the vertex in question. Since the w component was known to be one, it was used
Figure 4. Screen clip detection datapath.
E. Screen Space to Memory
The final step is to convert the screen coordinates of the vertex into an offset into an array. The y coordinate is multiplied by the screen width to bring the index to the correct row, and the x coordinate is added to bring the index to the correct column. This is a pixel index, but AXI expects a byte address. Since a pixel consists of 4 bytes, two least significant zeros are appended to the address.
Figure 5. Screen space to framebuffer datapath.
F. Line Drawing
Line drawing takes two screen space vertices and draws a single pixel width line between them. The full algorithm is presented in the Figure 6 pseudocode. “DIF” replaces the fractional error term of the usual formulation, allowing the algorithm to use only integer math. On every iteration of the loop, x is incremented and twice the difference between the y coordinates is added to DIF. When DIF becomes positive, y is also incremented and twice the difference between the x coordinates is subtracted from DIF. A simple example shows that if the x distance is three times the y distance, DIF will become positive every three increments of x. Thus a line with the proper slope of ⅓ will be drawn.
plotLine(x0,y0, x1,y1)
  dx=x1-x0
  dy=y1-y0

  DIF = 4*dy - dx
  y=y0

  for x from x0 to x1
    plot(x,y)
    DIF = DIF + (2*dy)
    if DIF > 0
      y = y+1
      DIF = DIF - (2*dx)
Figure 6. Bresenham's line drawing algorithm pseudocode.
Figure 7. Line drawing state machine and datapath.
This has a major limitation. It only increments x and y, limiting the direction in which a line can be drawn to the first quadrant of the coordinate space. Also, since y only increments when x increments, the greatest slope is limited to 1, so lines can only be drawn within the first 45° of the coordinate space, as shown in Figure 8. Thus, the blue line cannot be drawn.
Figure 8. Line drawing limitation.
To overcome this limitation, the rest of the coordinate space needs to be folded into that 45° space. The output then needs to be unfolded to the correct quadrant for each pixel of the line. The block diagram is shown in Figure 9. The important part is the three comparison operators, which detect the eighth of the coordinate space that contains the line.
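
The pixel addressing of Section E and the line drawing of Section F can be sketched together in Python as follows. This is an illustrative model under stated assumptions, not the project's VHDL. The names `pixel_address` and `plot_line` are hypothetical. The folding shown here (sign of dx, sign of dy, and whether dy exceeds dx) is one common way to cover all eight octants and may differ from the hardware's three comparisons. The error term uses the textbook initial value 2*dy - dx and is tested before it is updated, which differs from the constant and ordering printed in Figure 6.

```python
def pixel_address(base, x, y, width):
    """Byte address of pixel (x, y) in a row-major frame buffer of 4-byte pixels."""
    index = y * width + x        # pixel index: the row offset plus the column
    return base + (index << 2)   # append two zero bits to get a byte offset


def plot_line(x0, y0, x1, y1):
    """Integer Bresenham line covering all eight octants by folding and unfolding."""
    dx, dy = x1 - x0, y1 - y0
    # Three comparisons select the octant that contains the line.
    step_x = -1 if dx < 0 else 1
    step_y = -1 if dy < 0 else 1
    dx, dy = abs(dx), abs(dy)
    steep = dy > dx
    if steep:
        dx, dy = dy, dx          # fold: the longer axis becomes the major axis

    dif = 2 * dy - dx            # integer error term (textbook initial value)
    minor = 0
    pixels = []
    for major in range(dx + 1):
        # Unfold each point back into the original octant.
        if steep:
            pixels.append((x0 + step_x * minor, y0 + step_y * major))
        else:
            pixels.append((x0 + step_x * major, y0 + step_y * minor))
        if dif > 0:
            minor += 1
            dif -= 2 * dx
        dif += 2 * dy
    return pixels


# Hypothetical usage: byte addresses for a slope 1/3 line on a 640-pixel-wide buffer.
FRAME_BASE = 0x10000000  # assumed frame buffer base address, for illustration only
addresses = [pixel_address(FRAME_BASE, x, y, 640) for x, y in plot_line(0, 0, 3, 1)]
```

Folding before the loop keeps the inner loop identical to the first-octant case, so only the mapping at the output changes per octant.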
III. EXPERIMENTAL SETUP
Each non-trivial component of the GPU that we created has an associated testbench. These include the 4x4 matrix multiply, the line-drawing component, and a testbench that exercises the entire GPU with several representative values. Testing the entire GPU including its AXI interfaces, however, requires more work. A basic AXI Lite master testbench model from [7] provided a good start, but the Simple GPU has a slave full and a master full interface as well. The slave full AXI interface, which requires a master full interface testbench model to test, is easy to simulate because all of the extra inputs of the full interface over the lite interface can be set to constants. The master full AXI interface is trickier. For that, a slave AXI full model had to be created, which was based on the example master AXI interface generated by Vivado. Using all three AXI interface testbench models, it is possible to simulate the entire GPU system from start to end. Using this testbench, the configuration registers were set up and several vertices were written to the GPU. The entire system was observed, as well as the final pixel output being written into the slave representing the HP ports. Below is shown an AXI write of a pixel from the GPU.
In addition to the simulations that were run, debugging of the hardware while it was running was used to verify the functionality of the AXI communication with the processor. Figure 12 shows the result of an AXI write on the Master AXI bus created in the Simple GPU component. In this case, the processor’s High-Performance AXI port was the slave. The slave can be seen setting AWREADY, WREADY, and BVALID, indicating a successful write.
Debugging of running hardware was achieved using Chipscope Pro debug cores, now also known as Vivado ILA. This method places a debug core on-chip, which samples the desired signals every clock cycle. The sampled signals are put into a FIFO and can be sent to the PC for viewing. Since the bandwidth between the debug core and the PC isn’t typically high enough to send all of the signal data, a trigger is used to tell the core when to send data. For our purposes, we set the trigger to the AWVALID signal so that we would see an AXI transaction occurring.
Figure 12. Master AXI write viewed on chip with Vivado ILA.
IV. RESULTS
The resulting system was made up of two GPUs: one capable of dot-drawing and one capable of line-drawing. As can be seen in the video of the project, available at [1], several long lines and many dots can be drawn on the screen via the GPUs. Figure 13 shows a picture of the running system as well. The line-drawing GPU was used to form a cube, while one hundred dots were drawn with the dot-drawing GPU to create a sparkler effect. If the number of dots drawn increased beyond 200, the colors became offset by a couple of dots, so a dot that was supposed to be red might show up green, for example. Line drawing also showed issues in two cases. If lines with different colors were drawn, the colors would bleed into the next line. Also, if lines were to go off the edge of the screen, a flickering effect was observed, with lines disappearing and sometimes connecting the wrong endpoints. These issues most probably result from design issues with the FIFOs. If the two separate streams of data containing color and memory addresses were to get out of sync, these effects would be expected.
Figure 13. On-screen drawing of cube and sparkler.
CONCLUSIONS
The next step for the GPU would be to allow for entire
triangle drawing. Just as dot drawing was upgraded to
line drawing, line drawing can be upgraded to triangle
drawing. Three vertices would be input instead of two,
and the pixel processor would have to fill in the area
between the vertices. This is a big step up in complexity
and in the time required to process the vertex stream.
Another improvement for the GPU system as a whole
would be to add multiple pipelines of vertex calculations
and pixel processing. For the single dot mode, this is not
necessary as the CPU cannot send vertices fast enough to
overflow a single GPU pipeline. But in the line drawing
or proposed triangle drawing modes, the pixel processor
can overflow its FIFOs. Adding more GPU pipelines
would allow sharing of the workload between the multiple
pipelines.
REFERENCES
[1] Video of project: https://drive.google.com/open?id=0B4z_15QSfhUSOWNWWjhEZ24xTEE
[2] Xilinx AXI VDMA datasheet http://www.xilinx.com/support/documentation/ip_documentation/axi_vdma/v6_2/pg020_axi_vdma.pdf
[3] Digilent Zybo Base System Design http://www.digilentinc.com/Data/Products/ZYBO/zybo_base_system.zip
[4] Digilent Vivado Library https://github.com/DigilentInc/vivado-library/tree/master/ip
[5] OpenGL Projection Matrix http://www.songho.ca/opengl/gl_projectionmatrix.html
[6] OpenGL Transformation http://www.songho.ca/opengl/gl_transform.html
[7] AXI Testbenches https://github.com/Architech-Silica/Designing-a-Custom-AXI-Master-using-BFMs/tree/master/HDL_sources/Testbench_Sources