ps3 Tutorial 002
Lesson 2
Playstation 3 Development
Introduction to Playstation 3 Programming
Sam Serrels and Benjamin Kenwright¹
Abstract
An introduction to the system architecture of the Sony PlayStation 3 (PS3). The features of the Cell Broadband Engine (the main processor used by the PS3) and the Nvidia RSX ‘Reality Synthesizer’ graphics processor will be explained. A starting guide to programming on the PS3 is also included, which details some essential knowledge needed when writing code for this architecture.
Keywords
Sony, PS3, PlayStation, Setup, Windows, Target Manager, ELF, PPU, SPU, Programming, ProDG, Visual Studio, Memory alignment
¹ Edinburgh Napier University, School of Computer Science, United Kingdom: b.kenwright@napier.ac.uk
Figure 1. PS3 System Diagram - A high level overview of the main system components
… ‘Reality Synthesizer’, which can produce resolutions from 480i/576i SD up to 1080p HD.

The PlayStation 3 has 256 MB of XDR DRAM main memory and 256 MB of GDDR3 video memory for the RSX.

All PS3 models have user-upgradeable 2.5” SATA hard drives and come installed with drives of various sizes up to 500 GB.

The system has Bluetooth 2.0 (with support for up to 7 Bluetooth devices), gigabit Ethernet, a 2x speed Blu-ray Disc drive, USB 2.0 and HDMI 1.4 built in on all currently shipping models.

Wi-Fi networking and a flash card reader (compatible with Memory Stick, SD/MMC and CompactFlash/Microdrive media) are built in on most models.

3. The Cell processor

The PS3 uses the Cell microprocessor, which is made up of one 3.2 GHz PowerPC-based “Power Processing Element” (PPE) and six accessible Synergistic Processing Elements (SPEs). A seventh runs in a special mode and is dedicated to aspects of the OS and security, and an eighth is a spare to improve production yields. PlayStation 3’s Cell CPU achieves a theoretical maximum of 230.4 GFLOPS in single precision floating point operations and up to 100 GFLOPS double precision using iterative refinement for the solution of linear equations. The PS3 has 256 MB of Rambus XDR DRAM, clocked at CPU die speed.

Cell is a multi-core microprocessor microarchitecture which can have a number of different configurations. The basic configuration is a multi-core chip composed of one “Power Processor Element” (PPE) (sometimes called “Processing Element”, or PE) and multiple “Synergistic Processing Elements” (SPEs). The PPE and SPEs are linked together by an internal high speed bus dubbed the “Element Interconnect Bus” (EIB).

3.1 The PPE

The PPE is the Power Architecture based, two-way multithreaded core acting as the controller for the eight SPEs, which handle most of the computational workload. The PPE will work with conventional operating systems due to its similarity to other 64-bit PowerPC processors, while the SPEs are designed for vectorized floating point code execution. The PPE contains a 64 KiB level 1 cache (32 KiB instruction and 32 KiB data) and a 512 KiB level 2 cache.

3.2 The SPE

Each SPE is composed of a “Synergistic Processing Unit” (SPU) and a “Memory Flow Controller” (MFC).

The SPU runs a specially developed instruction set (ISA) with 128-bit SIMD organization for single and double precision instructions. Each SPE contains 256 KB of embedded SRAM for instructions and data, called “Local Storage”, which is visible to the PPE and can be addressed directly by software. (Not to be mistaken for “Local Memory”, which is VRAM on the RSX.) The local store does not operate like a conventional CPU cache since it is neither transparent to software nor does it contain hardware structures that predict which data to load.

Note that the SPU cannot directly access system memory; the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space. In one typical usage scenario, the system will load the SPEs with small programs (similar to threads), chaining the SPEs together to handle each step in a complex operation.

An SPE can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers in a single clock cycle, as well as a memory operation. At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance. For double-precision floating point operations, as sometimes used in personal computers and often used in scientific computing, Cell performance drops by an order of magnitude, but still reaches 20.8 GFLOPS (1.8 GFLOPS per SPE, 6.4 GFLOPS per PPE).

Compared to desktop processors at the time of release, the relatively high overall floating point performance of a Cell processor seemingly dwarfs the abilities of the SIMD unit in CPUs like the Pentium 4 and the Athlon 64.

However, comparing only the floating point abilities of a system is a one-dimensional and application-specific metric. Unlike a Cell processor, such desktop CPUs are more suited to the general purpose software usually run on personal computers. As is to be expected, modern desktop processors have caught up with and overtaken the PS3 Cell processor in almost all of its strengths, thanks to advances in multi-core and multi-threaded optimisation and software design.

A further difference from desktop processors is that the SPU has no branch prediction; features in the compiler are used to compensate for this. Code analysis at compile time is used to add prepare-to-branch ‘hints’ to the code.

4. The RSX

The RSX ‘Reality Synthesizer’ is a proprietary graphics processing unit (GPU) co-developed by Nvidia and Sony for the PlayStation 3 game console. It is a GPU based on the Nvidia 7800GTX graphics processor and, according to Nvidia, is a G70/G71 hybrid architecture with some modifications. The RSX has separate vertex and pixel shader pipelines. The GPU makes use of 256 MB of GDDR3 RAM clocked at 650 MHz; this is referred to as “Local Memory” in the Sony documentation.

Specifications
• 500 MHz on 90 nm process (shrunk to 65 nm in 2008 and to 40 nm in 2010)
• 256 MB of GDDR3 memory running at 700 MHz
• Multi-way parallel FP shader pipelines
• Independent vertex/pixel shaders
• Programmable shading processors – 136 shader operations per cycle
• 128-bit pixel precision
• Support for PSGL (OpenGL ES 1.1 + Nvidia Cg)
• Support for S3TC texture compression

Comparisons. The accompanying table sets the RSX against some other graphics chips.

4.1 VRAM

Although the RSX has 256 MB of GDDR3 RAM, not all of it is usable. The last 4 MB is reserved for keeping track of the RSX internal state and issued commands.

Because the Cell reads from VRAM very slowly, it is more efficient for the Cell to work in XDR and then have the RSX pull data from XDR and write to GDDR3 for output to the HDMI display. This is why extra texture lookup instructions were included in the RSX to allow loading data from XDR memory (as opposed to just the local memory).

4.2 GCM and PSGL

Developing with the official SDK leaves you with two rendering APIs to choose from: GCM and PSGL (PlayStation OpenGL). GCM is specific to the hardware and is as low level as it gets. As a result, what you make with it will (or should) perform somewhat better.

However, PSGL is also popular because it follows the OpenGL convention (OpenGL ES 1.0) and is therefore simple to understand and implement. The sample engine framework developed by Sony, PhyreEngine, uses PSGL as its rendering framework for simplicity.

This is covered in greater detail in ‘Tutorial 1-4 Basic Graphics’.

5. Writing code for the PS3

stdio and stdlib. All of the standard C/C++ libraries have been ported across to the PS3, so it is very easy to port basic C/C++ code to the PS3 (e.g. sprintf, fopen, fwrite, puts).

5.1 Memory alignment

When transferring data to and from the SPUs/RSX, the data being transferred has certain restrictions placed upon it. The primary restriction is the size of the data; the other is the alignment.
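The size and alignment restrictions can be expressed as a couple of small helpers. The following is a minimal sketch in portable C++, assuming a 16-byte transfer granularity (the figure this tutorial uses for SPU DMA). The names `kDmaAlign`, `padded_size`, `is_aligned`, `can_transfer` and `Particle` are hypothetical and are not part of the Sony SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed transfer granularity in bytes (hypothetical constant, not an SDK name).
constexpr std::size_t kDmaAlign = 16;

// Round a byte count up to the next multiple of `align`.
// `align` must be a power of two for the bit mask to be valid.
constexpr std::size_t padded_size(std::size_t bytes, std::size_t align = kDmaAlign) {
    return (bytes + align - 1) & ~(align - 1);
}

// True when the address held by `ptr` is a multiple of `align`.
inline bool is_aligned(const void* ptr, std::size_t align = kDmaAlign) {
    return reinterpret_cast<std::uintptr_t>(ptr) % align == 0;
}

// alignas asks the compiler to place every Particle on a 16-byte boundary
// and to pad sizeof(Particle) up to a multiple of 16.
struct alignas(16) Particle {
    float position[3];
    float velocity[3];   // 24 bytes of payload in total (assuming 4-byte float)
};

static_assert(sizeof(Particle) == 32, "24 bytes of payload padded to 32");
static_assert(padded_size(24) == 32, "24 rounds up to 32");

// A transfer passes the check only when both the start address
// and the size are multiples of the granularity.
inline bool can_transfer(const void* ptr, std::size_t bytes) {
    return is_aligned(ptr) && bytes % kDmaAlign == 0;
}
```

Letting `alignas` handle both the address and the padding keeps the two restrictions in one place; `padded_size` is still useful for raw buffers whose length is only known at run time.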
Attribute       RSX        Xbox 360 Xenos           7800GTX    GTX 780     PS4 APU        Xbox One APU
Date            2005       2005                     2005       2013        2013           2013
Core clock      500 MHz    500 MHz                  550 MHz    863 MHz     800 MHz        853 MHz
Mem bus         128-bit    128-bit                  256-bit    384-bit     256-bit        256-bit
Mem clock       700 MHz    1400 MHz                 850 MHz    6000 MHz    5000 MHz       2132 MHz
Mem bandwidth   22.4 GB/s  22.4 GB/s                54.4 GB/s  288.4 GB/s  176 GB/s       68.2 GB/s
RAM             256 MB     10 MB + 512 MB (shared)  512 MB     3 GB        8 GB (shared)  5 GB (shared)
ROPs¹           8          8                        16         48          32             16
TMUs²           24         16                       24         192         80             48
Technology      40 nm      45 nm                    110 nm     28 nm       28 nm          28 nm
¹ Raster operation units. ² Texture mapping units.
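The memory bandwidth row of the table follows from the bus width and the memory data rate. As a worked check, assuming GDDR3 transfers data twice per clock, so that the RSX's 700 MHz memory clock corresponds to 1400 million transfers per second:

```latex
\[
\text{bandwidth} = \frac{\text{bus width (bits)}}{8} \times \text{transfers per second}
\]
\[
\text{RSX:}\quad \frac{128}{8}\ \text{B} \times 1400 \times 10^{6}\ \text{s}^{-1} = 22.4\ \text{GB/s}
\]
```

The table mixes conventions: the RSX and 7800GTX columns list the base memory clock, while the Xenos column lists the doubled effective rate, so the same formula applies once the clock is expressed as transfers per second.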
For example, when transferring data to an SPU via DMA, the data must be 16-byte aligned. This means that the total size AND the start address of the data must be divisible by 16. So if you need to transfer 24 bytes of data, you must pad it with an extra 8 bytes to push it up to 32 bytes, which is divisible by 16.

Almost all of the standard datatypes are evenly aligned (1, 4, 8, 16 bytes), but when you join them up in structs or arrays you can get odd sizes which need to be padded. Do not forget that it isn’t just the size that matters but also the starting address, which makes things much more complicated and can lead to memory fragmentation.

Figure 2. Memory structure -

Malloc. malloc(N) is an old C function that allocates a block of N bytes of memory, returning a pointer to the beginning of the block. In modern C++ code, malloc() is almost never used. It was replaced by the C++ new operator, which allocates memory for a specified class, instantiates it and calls the constructor. Objects created with new are placed on the heap and have to be released manually with delete. malloc() is similar in this regard: memory blocks reserved by malloc() must be passed to free() to release the memory.

Memalign. When we need a piece of data aligned to a specific boundary (16 bytes etc.), we need to call this ancient malloc() function to give us a chunk of memory to do the alignment in. Fortunately, in the Sony stdlib library there is a function that does this and more for us.

    void *memalign(size_t boundary, size_t size)

The function allocates size bytes and returns a pointer to the allocated memory. The memory block returned will be aligned on a multiple of boundary.

5.2 Debugging

Sony provides a large set of tools and libraries for debugging applications and measuring performance. With specialized hardware like the PlayStation 3, optimisation plays a huge part in game development; getting code to run as efficiently as possible split across 6 SPUs, 1 PPU and a GPU, while using the minimum possible amount of RAM, takes a massive amount of work. Measuring everything, literally every operation, is the key to performance; without it, unexpected bottlenecks will not be found, which is why good debugging libraries are paramount in this type of development.
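As a small illustration of measuring a block of code, the sketch below times a scope with the standard C++ `<chrono>` clock. It is a portable stand-in rather than Sony's tooling: the SDK's own timers and profilers are not shown here, and `ScopedTimer`, `busy_work` and `time_busy_work` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Measures the wall-clock time between construction and destruction
// and writes the result, in microseconds, to the variable it was given.
class ScopedTimer {
public:
    explicit ScopedTimer(long long& out_us)
        : out_us_(out_us), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const auto end = std::chrono::steady_clock::now();
        out_us_ = std::chrono::duration_cast<std::chrono::microseconds>(end - start_).count();
    }

private:
    long long& out_us_;
    std::chrono::steady_clock::time_point start_;
};

// Hypothetical workload standing in for a real game system update.
// Returns the sum 0 + 1 + ... + (n - 1).
inline long long busy_work(int n) {
    volatile long long sum = 0;   // volatile stops the loop being optimised away
    for (int i = 0; i < n; ++i) sum = sum + i;
    return sum;
}

// Times one call to busy_work, prints the result and returns it.
inline long long time_busy_work(int n) {
    long long elapsed_us = -1;
    {
        ScopedTimer timer(elapsed_us);   // timing starts here
        busy_work(n);
    }                                    // destructor stores the elapsed time
    std::printf("busy_work(%d) took %lld us\n", n, elapsed_us);
    return elapsed_us;
}
```

Tying the measurement to a scope means a timer cannot be left running by an early return, which makes it cheap to wrap around many small sections of code.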
Of course this only applies once the code is actually working. Debugging in its literal meaning and traditional sense is finding and removing bugs, and doing that on a weird and wonderful device over the network is a large step up from debugging local Win32 applications in Visual Studio.

With the tools provided, and the knowledge of how to use them, debugging PS3 applications is not as daunting as it would seem. The local debugger is well featured, and the libraries that run on the console side are robust and battle-tested. Debugging on the PS3 is not hard, it just has a steeper learning curve, and you will be a better software engineer at the end of it as the skills are transferable to any software project.

Break-Points. If your only experience with breakpoints is clicking on a line of code in Visual Studio and letting it do all the work, then this segment will introduce you to some low-level assembly magic. Breakpoints can be manually inserted into code via special assembly commands. As assembly is specific to a platform, the commands differ for different hardware, compilers and debuggers.

Listing 1. Halts on different platforms

    // IA-32 (Intel Architecture, 32-bit)
    __asm { int 3 }
    // x86/XBOX/Win32 (basically a robust wrapper for int 3)
    // Only supported in Visual Studio
    __debugbreak();
    // Halts a program running on PPC32 or PPC64 (e.g. PS3).
    // Also works for ARM and in GCC/XCode
    __asm__ volatile("trap");

Break-Point Macros. If you need to stop the code at one specific location to quickly take a peek at the internal workings of the code, then manually inserting a breakpoint there is an O.K. solution. As with almost all code design, this becomes infeasible as it scales. Wrapping a breakpoint in an IF statement is a quick way of having conditional breakpoints that only fire when something goes wrong, but now you could have breakpoints sprinkled all through your code. What if you need to disable them all for a release build? They are embedded into the code, so it’s not just a case of telling the debugger to not pay attention. You could wrap them all in an additional IF, or comment them all out, but doing something in code more than once means there is almost certainly a better and quicker way. There is: macros.

    #define MYCOOLMACRO "my cool macro"

When the preprocessor encounters this directive, it replaces any occurrence of MYCOOLMACRO in the rest of the code with "my cool macro". This replacement can be an expression, a statement, a block or simply anything, e.g. a breakpoint command.

    // In your implementation you would do something like this:
    #if defined(PS3)
        #define HALT __asm__ volatile("trap")
    #elif defined(XBOX)
        #define HALT __debugbreak()
    #elif defined(PC)
        #define HALT __asm { int 3 }
    #else
        #error "unknown platform"
    #endif

So now, assuming one of PS3, XBOX or PC is defined somewhere before this (an easy thing to do), we can call HALT anywhere in the code and it will call the correct version for the platform. Now what about disabling all halts? Easy:

    // There are better ways to do this, but this is super simple:
    #if DEBUG
        #if defined(PS3)
            #define HALT......
        #endif
    #endif

Great, we can change platform by defining one macro, and toggle breakpoints by defining DEBUG as 1 or 0. Is there anything else macros can do for us here? What about conditional breakpoints, can we simplify them? Yes:

    // Calls HALT on assert fail
    #define ASSERT(exp) { if (!(exp)) { HALT; } }
    // Prints the supplied string on assert fail, then calls HALT
    #define ASSERT_M(exp, msg) { if (!(exp)) { puts(msg); HALT; } }
    // Calls the supplied function on assert fail, then calls HALT
    #define ASSERT_F(exp, func) { if (!(exp)) { func; HALT; } }
    // Now we can do
    int a = 0;
    ASSERT(a > 1);
    ASSERT_M((a > 1), "A is not greater than 1!");
    ASSERT_F((a > 1), printf("Error : %i\n", a));

So what would all this look like in full?

Listing 2. Over engineered Macro sample

    // -- Take a guess at the current platform

    // The PS3 compiler defines either of these
    #if defined(__PPU__) || defined(__SPU__)
        #define PS3

    // This probably doesn't exist, but it's here for completeness
    #elif defined(__XBOX360__)
        #define XBOX360

    // Windows has a lot of variants.
    // In production you would want to split this up,
    // as win64 might have issues with win32 stuff
    #elif defined(_WIN32) || defined(_WIN64) || defined(WIN32) || \
          defined(__CYGWIN__) || defined(__MINGW32__)
        #define WINDOWS

    // This works, but it could be either an OSX or iOS device
    #elif defined(__APPLE__)
        #define MAC

    // This works, but isn't a complete solution, google unix
    #elif defined(__linux__)
        #define LINUX

    #else
        #error "unknown platform"

    #endif

    #define DEBUG 1

    // -- Platform specific halts
    #if DEBUG
        #if defined(PS3)
            #define HALT __asm__ volatile("trap")
        // ... the other platforms follow the pattern shown earlier
        #endif
    #endif
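One common hardening of assert macros like these is to wrap the body in `do { ... } while (0)`, so that each macro behaves as a single statement even inside an unbraced if/else. The sketch below shows the idea in portable C++. To keep it runnable without a debugger attached, HALT is replaced by a hypothetical stand-in that counts how often it fires; `g_halt_count` and `check_positive` are illustrative names and not part of the tutorial's code.

```cpp
#include <cassert>
#include <cstdio>

// Stand-in for a real breakpoint: counts hits instead of trapping.
// On a console build this would be the platform-specific halt instead.
static int g_halt_count = 0;
#define HALT (++g_halt_count)

// do { } while (0) turns the macro into exactly one statement,
// so "if (x) ASSERT(y); else ..." parses as intended.
#define ASSERT(exp) \
    do { if (!(exp)) { HALT; } } while (0)

// Prints the supplied message before halting.
#define ASSERT_M(exp, msg) \
    do { if (!(exp)) { std::puts(msg); HALT; } } while (0)

// Returns true for positive input; fires an assert otherwise.
static bool check_positive(int value) {
    if (value > 0)
        ASSERT(value < 1000000);   // safe inside an unbraced if
    else
        ASSERT_M(value > 0, "value is not positive");
    return value > 0;
}
```

With the plain brace-block form, the semicolon after `ASSERT(...)` would end the `if` statement early and the following `else` would not compile; the loop wrapper avoids that without changing what the macro does.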
Recommended Reading
Programming the Cell Processor: For Games, Graphics, and Computation, Matthew Scarpino, ISBN: 978-0136008866
Vector Game Math Processors (Wordware Game Math Library), James Leiterman, ISBN: 978-1556229213
Clean Code: A Handbook of Agile Software Craftsmanship, Robert C. Martin, ISBN: 978-0132350884
www.napier.ac.uk/games/