CUDA
Overview
- Basic Introduction
- Intro to the Operational Model
- Simple Example
  - Memory Allocation and Transfer
  - GPU-Function Launch
- Grids of Blocks of Threads
- GPU Programming Issues
- Performance Issues/Hints
- All current and future display drivers from NVIDIA will include CUDA support
  - See http://www.nvidia.com/object/cuda_learn_products.html
Yesterday's Announcement
NVIDIA recently held their annual developer conference and released info on the next generation of GPUs ... Fermi
- 3B transistors, 40nm process
- 512 compute elements
- 8x increase in DP performance (~700 GFLOPS)
- GDDR5 memory (230 GB/sec)
- ECC memory
- L1 and L2 cache memory (configurable?)
Operational Model
CUDA assumes a heterogeneous architecture -- both CPUs and GPUs -- with separate memory pools
- CPUs are masters and GPUs are the workers
  - CPUs launch computations onto the GPU
  - CPUs can be used for other computations as well
  - GPUs have limited communication back to the CPU
- Synchronous Xfer -- the CPU waits for the xfer to complete
- Async Xfer -- the CPU continues with other work, and can check if the xfer is complete
[Diagram: CPU and its memory, GPU and its memory, connected by HyperTransport at 20.8 GB/s]
A basic program flow:
- Transfer the input data out to the GPU
- Run the code on the GPU
  - Simultaneously run code on the CPU (??)
  - Can run multiple GPU-code-blocks on the GPU sequentially
- Transfer the output data back to the CPU

An iterative flow (see the sketch below):
- Transfer data to GPU
- Loop:
  - Run the code on the GPU
  - Compute error on the GPU
  - Transfer error to CPU
  - If error > tol, continue
- Transfer data to CPU

A time-stepping flow:
- Transfer data to GPU
- For t=1 to 1000000:
  - Run the code on the GPU
  - If (t%100)==0, transfer data to CPU
  - Print/save data on CPU
- Transfer data to CPU
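As a concrete illustration of the iterative flow above, here is a minimal host-side sketch. The kernel names (update_func, error_func), the dev_u/dev_err pointers, and the tolerance are hypothetical placeholders; only the transfer/launch/check structure is the point.

/* assumes host_u, dev_u, dev_err already allocated, and nbytes, n, nblk, nthr set */
float err = 1.0e30f, tol = 1.0e-6f;

cudaMemcpy( dev_u, host_u, nbytes, cudaMemcpyHostToDevice );
while( err > tol ) {
    update_func<<<nblk,nthr>>>( n, dev_u );        /* run the code on the GPU  */
    error_func<<<1,nthr>>>( n, dev_u, dev_err );   /* compute error on the GPU */
    cudaMemcpy( &err, dev_err, sizeof(float),      /* transfer error to CPU    */
                cudaMemcpyDeviceToHost );
}
cudaMemcpy( host_u, dev_u, nbytes, cudaMemcpyDeviceToHost );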
Simple Example
#include <stdlib.h>
#include <math.h>
#include <cuda.h>

__global__ void vcos( int n, float* x, float* y )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    y[ix] = cos( x[ix] );
}

int main()
{
    float *host_x, *host_y;
    float *dev_x, *dev_y;
    int n = 1024, bk;

    host_x = (float*)malloc( n*sizeof(float) );
    host_y = (float*)malloc( n*sizeof(float) );
    cudaMalloc( &dev_x, n*sizeof(float) );
    cudaMalloc( &dev_y, n*sizeof(float) );

    /* TODO: fill host_x[i] with data here */

    cudaMemcpy( dev_x, host_x, n*sizeof(float), cudaMemcpyHostToDevice );

    /* launch 1 thread per vector-element, 256 threads per block */
    bk = (int)( n / 256 );
    vcos<<<bk,256>>>( n, dev_x, dev_y );

    cudaMemcpy( host_y, dev_y, n*sizeof(float), cudaMemcpyDeviceToHost );
    /* host_y now contains cos(x) data */

    cudaFree( dev_x );  cudaFree( dev_y );
    free( host_x );     free( host_y );
    return( 0 );
}
This allocates memory for the data:
- C-standard malloc for host (CPU) memory
- cudaMalloc for GPU memory
  - DON'T use a CPU pointer in a GPU function
  - DON'T use a GPU pointer in a CPU function
  - And note that CUDA cannot tell the difference; YOU have to keep all the pointers straight!!!
This copies the data between CPU and GPU:
- Again, be sure to keep your pointers and direction (CPU-to-GPU or GPU-to-CPU) consistent!
- CUDA cannot tell the difference, so it is up to YOU to keep the pointers/directions in the right order
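For reference, the two copy directions from the example above: the destination pointer comes first, and the direction flag must match it.

/* CPU -> GPU : destination is the GPU pointer */
cudaMemcpy( dev_x, host_x, n*sizeof(float), cudaMemcpyHostToDevice );

/* GPU -> CPU : destination is the CPU pointer */
cudaMemcpy( host_y, dev_y, n*sizeof(float), cudaMemcpyDeviceToHost );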
Stream Computing
GPUs are multi-threaded computational engines
- They can execute hundreds of threads simultaneously, and can keep track of thousands of pending threads
  - Note that GPU-threads are expected to be short-lived; you should not program them to run for hours continuously
- We usually restrict each thread to be doing more or less the same thing as all the other threads ... SIMD programming
- Each element in a stream of data is processed with the same kernel function, producing an element-wise stream of output data
  - Previous GPUs had stronger restrictions on data access patterns, but with CUDA these limitations are gone (though performance issues may still remain)
[Figure: a kernel function applied element-wise to an input stream, producing an element-wise output stream]
Parallel (4-way) computation ... 2 clock-ticks ... NVIDIA G100 has 240-way parallelism !!
- Grid dims <= 65536
- Block dims <= 512 threads; total <= 768 threads per GPU-core (NVIDIA G100: 1024)
The GPU program (each thread) must know how to configure itself

[Figure: a 256-element vector (elements 0-255) split across 4 blocks of 64 threads each; the blocks handle elements 0-63, 64-127, 128-191, and 192-255]
[Figure: a 2x2 grid of 4x4 thread-blocks; a Block Index such as (1,1,0) plus a Thread Index such as (2,1,0) identifies one particular thread]
Two ways to decompose a 300x300x300 domain (see the launch sketch below):
- Grid: 3x3x1, Block: 3x3x1 ... each block handles 100x100x300, each thread handles ~33x33x300
- Grid: 3x3x1, Block: 1x1x3 ... each block handles 100x100x300, each thread handles 100x100x100
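A hedged sketch of how the first decomposition above might be launched; vol_func and dev_v are hypothetical names, and the loop over each thread's sub-volume would live inside the kernel.

dim3 grd( 3, 3, 1 );    /* 3x3x1 grid of blocks        */
dim3 blk( 3, 3, 1 );    /* 3x3x1 threads in each block */

/* each block handles a 100x100x300 sub-volume,
   each thread loops over its ~33x33x300 piece   */
vol_func<<<grd,blk>>>( 300, 300, 300, dev_v );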
The vcos<<<m,n>>> syntax is what launches ALL of the GPU threads to execute the vcos GPU-function
- Launches m grid blocks, each of size n threads
  - Total of m*n GPU-threads are created
  - Each thread has a unique {blockIdx.x, threadIdx.x}
  - Also available: {blockDim.x, gridDim.x}

int ix is the global index number for this thread's calculations
- We compute it from the built-in, thread-specific variables (set by the run-time environment)
- So each GPU-thread will also have a unique ix value
  - It is up to YOU to make sure that all data is processed (i.e. that all valid ix values are hit)
/* e.g. each thread handles 64 consecutive elements */
__global__ void vcos( int n, float* x, float* y )
{
    int i;
    int ix0 = 64*( blockIdx.x*blockDim.x + threadIdx.x );
    for( i=0; i<64; i++ ) {
        y[i+ix0] = cos( x[i+ix0] );
    }
}
The launch creates m*n threads ... or, for multi-dimensional launches, grid.x*grid.y*block.x*block.y*block.z threads
- This may not match up with how much data you actually need to process
- You can turn threads (and blocks) off by letting them exit the GPU-function early:

__global__ void vcos( int n, float* x, float* y )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    if( ix < n ) {
        y[ix] = cos( x[ix] );
    }
}

__global__ void image_proc( int wd, int ht, float* x, float* y )
{
    if( ((blockIdx.x*blockDim.x+threadIdx.x) < wd)
     && ((blockIdx.y*blockDim.y+threadIdx.y) < ht) ) {
        . . .
    }
}
__global__ Functions
Note that __global__ functions must return type void ... that is, they do not return a value
- If your function encounters an error, you must provide that error/return value some other way
- There are ways of detecting if a function could not be launched, or other CUDA errors -- but user-defined errors must be sent back through some other means
- And note that you can't send a __global__ function a CPU-pointer!
  - So you have to save the error/return code to GPU-memory, then do a mem-copy (see the sketch below)
  - Watch out for race conditions if all threads write to the same error/return code area
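One possible way to do this (a sketch, not the only approach): reserve an int in GPU memory for the error flag, have failing threads set it, and mem-copy it back after the launch. The kernel and the negative-input check are hypothetical.

__global__ void func( int n, float* x, int* err_flag )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    if( ix < n ) {
        if( x[ix] < 0.0f ) {
            *err_flag = 1;       /* all failing threads write the same value,
                                    so this particular race is harmless       */
        } else {
            x[ix] = sqrtf( x[ix] );
        }
    }
}
...
int host_err = 0;
cudaMemset( dev_err, 0, sizeof(int) );            /* clear the flag on the GPU */
func<<<nblk,256>>>( n, dev_x, dev_err );
cudaMemcpy( &host_err, dev_err, sizeof(int), cudaMemcpyDeviceToHost );
if( host_err != 0 ) { /* some thread saw bad input */ }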
Compilation
% nvcc -o simple simple.cu
The compilation process is handled by the nvcc wrapper
- It splits out the CPU and GPU parts
- The CPU parts are compiled with gcc
- The GPU parts are compiled with ptxas (NV assembler)
- The parts are stitched back together into one big object or executable file
- Usual options also work:
  - -I/include/path
  - -L/lib/path
  - -O
[Diagram: myprog.cu is split by nvcc into C code (compiled with gcc) and GPU code (compiled by nvcc/ptxas), then linked into myprog.exe]
-Xptxas=-v
- verbose output from the NV assembler
- Gives register usage, shared-mem usage, etc.
- The run-time (cudart) pushes all the GPU-code out to the GPU
- The run-time/display-driver control the mem-copy timing and sync
- The run-time/display-driver tell the GPU to execute the GPU-code
Error Handling
All CUDA functions return a cudaError_t value
- This is a typedef'd enum in C ... #include <cuda.h>

cudaError_t err;
err = cudaMemcpy( dev_x, host_x, nbytes, cudaMemcpyHostToDevice );
if( err != cudaSuccess ) {
    /* something bad happened */
    printf( "Error: %s\n", cudaGetErrorString(err) );
}
Function launches do not directly report an error, but you can use:
cudaError_t err;
func_name<<<grd,blk>>>( arguments );
err = cudaGetLastError();
if( err != cudaSuccess ) {
    /* something bad happened during launch */
}
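Since this check pattern gets repetitive, many people wrap it in a small helper macro; the one below is just a sketch, not part of CUDA itself.

#define CUDA_CHECK(call) do { \
        cudaError_t e = (call); \
        if( e != cudaSuccess ) { \
            printf( "CUDA error '%s' at %s:%d\n", \
                    cudaGetErrorString(e), __FILE__, __LINE__ ); \
        } \
    } while(0)

CUDA_CHECK( cudaMemcpy( dev_x, host_x, nbytes, cudaMemcpyHostToDevice ) );
func_name<<<grd,blk>>>( arguments );
CUDA_CHECK( cudaGetLastError() );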
Possible launch errors:
- Using features that your GPU does not support (double-precision?)
- Too many blocks or threads
- No CUDA-capable GPU found (pre-G80?)
But some bad things cannot be caught until AFTER the launch:
- Array overruns don't happen until the code actually executes; so the launch may be good, but the function crashes later
- Division-by-zero, NaN, Inf, etc.

func_name<<<grd,blk>>>( arguments );
err1  = cudaGetLastError();
err1b = cudaThreadSynchronize();
...
err2  = cudaMemcpy( host_x, dev_x, nbytes, cudaMemcpyDeviceToHost );

- In this example, err2 could report an error from running func_name (e.g. an array-bounds overrun) as well as any memcpy errors
NOTE: there are no signaling NaNs on the GPU
- E.g. divide-by-zero in a GPU-thread is not an error that will halt the program; it just produces an Inf in the output, and you have to detect that separately
  - number / 0 => Inf
  - Inf + number => Inf
  - Inf - Inf => NaN
  - 0 * Inf => NaN
  - 0/0 or Inf/Inf => NaN
  - NaN + anything => NaN
- Inf/NaN values tend to persist and propagate until all your data is screwed up
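There is no single built-in detector for this; one simple (if blunt) option is to scan the results on the CPU after copying them back. isnan/isinf come from the standard C99 <math.h>.

cudaMemcpy( host_y, dev_y, n*sizeof(float), cudaMemcpyDeviceToHost );
for( i=0; i<n; i++ ) {
    if( isnan(host_y[i]) || isinf(host_y[i]) ) {
        printf( "bad value (Inf/NaN) at element %d\n", i );
    }
}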
"
DONT DESPAIR !!
Performance tuning on GPUs is definitely a black art
- Grid size, block size, GPU size, registers per thread, occupancy, computational density, loop overheads, if/then statements, memory access pattern, shared memory, texture references ... ALL impact performance
- Some of these are low-order bits and can often be ignored
- Keep in mind that you're starting with 1 TFLOPS of performance
  - If you hit 50% efficiency, that's still not too bad
Register Usage
[Diagram: a GPU-core's register file (R1-R12) partitioned among threads Thr#1, Thr#2, ...]
If your algorithm is too complex, it may require additional registers for each thread
- But that can reduce the number of threads that a given GPU-core can handle

Real GPU-cores have 8192 (now 16384) registers as well as 768 (now 1024) thread place-holders
- So you can be working on 768 threads simultaneously
- But only if you can fit 768 threads in the register set
  - G80 ... 8K registers
  - G100 ... 16K registers
  - Fermi ... 32K registers
"
A block size of 48x16, 32x24, 128x6, 256x3 would fill a GPU-core # Note: only 512 threads per dimension, so 768x1 is not possible ... but a block size of 16x16 would potentially use only 33% of the computational units
! However, you can launch more than one block onto a GPU-core
"
In fact, the GPU will automatically launch 3 16x16 blocks, simultaneously, onto each GPU-core # 100% of the computational units would be used # 18x18 .. 324 threads/block .. 2 per GPU .. 84% utilization Note: you cannot run 1 block across 2 GPU-cores
This is BEST CASE utilization
- E.g. with 256 threads per block and 12 registers per thread, each block requires 3072 registers ... so only 2 blocks fit per GPU-core
  - But 2 blocks is only 512 running threads ... we COULD do 768 ... 67% occupancy
- 128 threads per block ... 5 blocks per GPU-core ... 83%
- 2560 registers per block (10 registers/thread) ... 3 blocks per GPU-core ... 100% utilization
Compile with -Xptxas=-v to see your register usage
Occupancy, cont'd
GeForce-8 has 8192 registers, and 768 simultaneous threads
Varying Register Use (GF8, GF9):
  10 reg .. 128 th/bk .. 6 bk/core .. 768 th/core .. 100%
  10 reg .. 256 th/bk .. 3 bk/core .. 768 th/core .. 100%
  12 reg .. 128 th/bk .. 5 bk/core .. 640 th/core ..  83%
  16 reg .. 128 th/bk .. 4 bk/core .. 512 th/core ..  67%
  16 reg .. 256 th/bk .. 2 bk/core .. 512 th/core ..  67%
  20 reg .. 128 th/bk .. 3 bk/core .. 384 th/core ..  50%
  32 reg .. 128 th/bk .. 2 bk/core .. 256 th/core ..  33%
  32 reg .. 256 th/bk .. 1 bk/core .. 256 th/core ..  33%
Grid Size
The general guidance is that you want lots of grid-blocks
- Lots of blocks per grid means lots of independent parallel work
- Helps to future-proof your code, since future GPUs will be able to handle more grid-blocks simultaneously
- GeForce-8 has up to 16 GPU-cores
  - E.g. 10 reg/thread, 256 thr/blk, 3 blk/core ... minimum of 48 blocks to fill the GPU
  - E.g. 10 reg/thread, 256 thr/blk, 8 blk/core ... minimum of 240 blocks!
- Fermi has up to 512 GPU-cores
- Making each block do less work increases the number of blocks needed to do the work ... but you also increase the number of blocks-per-core
However, if the per-thread work gets too small, then there can be other basic performance limiters
- To read an array entry, we first read the pointer x, then calculate x+ix*4 (1 mult and 1 add), then we can finally read x[ix]
  - Once we've done all that, we can easily read x[ix+1] by just adding 4 to the new pointer
- Future GPUs are likely to have more GPU-cores
- Future GPUs are likely to have more threads per core
- Err on the side of more blocks per grid, with a reasonable number of threads per block (128 minimum, 256 is better)
- GPUs are rapidly evolving, so while future-proofing your code is nice, it might not be worth spending too much time and effort on
  - CUDA is only on v2.3, and yet it supports 4 versions of GPUs and dozens of graphics products

Compute capability (major.minor):
- major=1 ... CUDA-capable GPU
- minor=0 ... GeForce-8 ... 768 threads per core
- minor=1 ... GeForce-9 ... 768 threads per core, atomic ops
- minor=3 ... G100 ... 1024 threads per core, double-precision
- GeForce 8600GT ... GF8 chip with 4 cores
- GeForce 8800GTX ... GF8 chip with 16 cores
- GeForce 8800GT ... GF9 chip with 14 cores
- See CUDA Programming Guide, Appendix A
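You can query the compute capability at run-time with cudaGetDeviceProperties (part of the CUDA run-time API); a minimal example:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );      /* device 0 */
    printf( "%s: compute capability %d.%d, %d GPU-cores\n",
            prop.name, prop.major, prop.minor, prop.multiProcessorCount );
    return( 0 );
}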
Some Examples
/* simplest version: one element per thread */
__global__ void func( int n, float* x )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    x[ix] = 0.0f;
}
...
nblk = size/256;                 /* be careful with integer division! */
func<<<nblk,256>>>( size, x );
#define BLK_SZ (256)

/* each thread handles 4 elements, strided by BLK_SZ */
__global__ void func( int n, float* x )
{
    int ix = 4*blockIdx.x*BLK_SZ + threadIdx.x;
    x[ix]          = 0.0f;
    x[ix+BLK_SZ]   = 0.0f;
    x[ix+2*BLK_SZ] = 0.0f;
    x[ix+3*BLK_SZ] = 0.0f;
}
...
nblk = size/(4*BLK_SZ);          /* be careful with integer division! */
func<<<nblk,BLK_SZ>>>( size, x );
/* grid-stride loop: a fixed number of threads covers any size n */
__global__ void func( int n, float* x )
{
    int i, ix = blockIdx.x*blockDim.x + threadIdx.x;
    for( i=ix; i<n; i+=blockDim.x*gridDim.x ) {
        x[i] = 0.0f;
    }
}
...
func<<<48,256>>>( size, x );
#define GRD_SZ (48)
#define BLK_SZ (256)

/* same as above, but with hard-coded grid/block sizes */
__global__ void func( int n, float* x )
{
    int i, ix = blockIdx.x*BLK_SZ + threadIdx.x;
    for( i=ix; i<n; i+=BLK_SZ*GRD_SZ ) {
        x[i] = 0.0f;
    }
}
...
func<<<GRD_SZ,BLK_SZ>>>( size, x );
Turning on the profiler will produce a log file with all the GPU-function launches and memory transfers recorded in it
- Note that if a GPU function is called inside an inner loop, you'll get lots and lots of output!
- Also reports GPU occupancy for GPU-function launches
- There is now a visual CUDA Profiler as well
Performance Issues
Hard-coding your grid/block sizes can help reduce register usage
- E.g. #define BLK_SZ (256) ... the constant becomes part of the instruction stream, not stored in a register

Choosing the number of grid-blocks based on problem size can essentially unroll your outer loop ... which can improve efficiency and reduce register count
- E.g. nblks = (size/nthreads)
- You may want each thread to handle more work, e.g. 4 data elements per thread, for better thread-level efficiency (less loop overhead)
Consider writing several different variations of the function, where each variation handles a different range of sizes and hard-codes a different grid/block/launch configuration (a dispatch sketch follows below)
- E.g. small, medium, large problem sizes
  - small ... (size/256) blocks of 256 threads ... maybe not-so-efficient, but for small problems it's good enough
  - medium ... 48 blocks of 256 threads
  - large ... 48 blocks of 256 threads with 4 data elements per thread
There is some amazing C-macro programming in the CUBLAS, take a look at the (open-)source code!
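A hedged sketch of the small/medium/large dispatch idea; func_small, func_med, func_big, and the size thresholds are all hypothetical.

if( size < 64*1024 ) {
    /* small: simple 1-element-per-thread version */
    func_small<<< size/256, 256 >>>( size, dev_x );
} else if( size < 4*1024*1024 ) {
    /* medium: fixed 48-block grid, grid-stride loop */
    func_med<<< 48, 256 >>>( size, dev_x );
} else {
    /* large: fixed grid, 4 data elements per thread */
    func_big<<< 48, 256 >>>( size, dev_x );
}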
Technically, main memory is shared by all grids/blocks/threads
- BUT: main memory is _not_ guaranteed to be consistent (at least not right away)
- BUT: main memory writes may not complete in-order
- Newer GPUs (GF9 or G100) can do atomic operations on main memory ... but they essentially lock out all other threads while they do their atomic operation (could be bad for performance)
[Diagram: the GPU connected to several main-memory banks (Mem-1 through Mem-4)]
- Don't have thread-0 touch x[0], x[1], x[2], ..., while thread-1 works through a far-away region; have adjacent threads touch adjacent elements so the hardware can read a whole contiguous chunk of memory at once (see the sketch below)
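A sketch of the two access patterns: the interleaved kernel lets adjacent threads touch adjacent elements (friendly to the hardware), while the chunked kernel has each thread walk its own contiguous region (the pattern warned about above). CHUNK is a hypothetical per-thread work size.

#define CHUNK (64)

/* good: thread-0 touches x[0], x[256], ...; thread-1 touches x[1], x[257], ... */
__global__ void zero_interleaved( int n, float* x )
{
    int i, ix = blockIdx.x*blockDim.x + threadIdx.x;
    for( i=ix; i<n; i+=blockDim.x*gridDim.x ) {
        x[i] = 0.0f;
    }
}

/* bad: thread-0 touches x[0..63] while thread-1 touches x[64..127], ... */
__global__ void zero_chunked( int n, float* x )
{
    int i, ix0 = CHUNK*( blockIdx.x*blockDim.x + threadIdx.x );
    for( i=0; i<CHUNK; i++ ) {
        if( ix0+i < n ) x[ix0+i] = 0.0f;
    }
}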
Block-Shared Memory
CUDA assumes a GPU with block-shared as well as program-shared memory
- Threads in the same block can communicate through this shared memory
  - E.g. all threads in Block (1,0,0) see the same data, but cannot see Block (1,1,0)'s data

[Diagram: a GPU-core with per-thread registers (R1-R12) plus a block-shared memory (M1-M12) visible to Thr#1, Thr#2, Thr#3]

Block-shared memory is not immediately synchronized after every read or write
- E.g. if Thread-1 writes data and Thread-2 reads it ... still not guaranteed to be the same data
- Be careful that you don't overrun the __shared__ array bounds
- -Xptxas=-v will also show your block-shared memory usage
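A minimal block-shared memory sketch (the kernel name and block size are hypothetical): each thread loads one element into the __shared__ array, and __syncthreads() is the barrier that makes every thread's write visible to the rest of the block before anyone reads it.

#define BLK_SZ (256)

__global__ void reverse_within_block( float* x )
{
    __shared__ float tmp[BLK_SZ];
    int ix = blockIdx.x*BLK_SZ + threadIdx.x;

    tmp[threadIdx.x] = x[ix];              /* each thread writes one element         */
    __syncthreads();                       /* wait for the whole block to load       */
    x[ix] = tmp[BLK_SZ-1-threadIdx.x];     /* now safely read another thread's value */
}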
Texture References
Texrefs are used to map a 2-D skin onto a 3-D polygonal model
- In games, this allows a low-res (fast) game object to appear to have more complexity
This is done VERY OFTEN in games, so there is extra hardware in the GPU to make it VERY FAST
Multi-GPU Programming
If one is good, four must be better!!
- The S870 system packs 4 GF8s into an external box (external power)
- You must select the device (cudaSetDevice) before any mem-copy, launch, thread-sync, etc.
  - If you forget what GPU you're talking to ... BAD!!
  - Each GPU has its own memory pool ... need to keep pointers straight
- Note that CUDA will time-share any GPU, so if you don't explicitly set the device, the program will still run (on GPU#0) ... slowly
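A minimal device-selection sketch: cudaGetDeviceCount and cudaSetDevice are part of the CUDA run-time API, and my_gpu is a hypothetical per-CPU-thread index (with this generation of CUDA you would typically dedicate one CPU thread to each GPU).

int ngpu;
cudaGetDeviceCount( &ngpu );          /* how many CUDA devices are present       */

cudaSetDevice( my_gpu );              /* do this BEFORE any mem-copy/launch/sync */
cudaMalloc( &dev_x, nbytes );         /* this allocation lives on GPU 'my_gpu'   */
func<<<nblk,256>>>( n, dev_x );       /* and so does this launch                 */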
Conditionals in CUDA

Generally, conditionals on some F(threadIdx) are bad for performance
- Some threads will be idle (not doing work) some of the time
- All threads, even those that fail the conditional, walk through the instructions
- Unless you can guarantee that the conditional keeps Warps together
- Presently a warp is a set of 32 threads: 0-31, 32-63, etc.
Asynchronous Launches
When your program executes vcos<<<m,n>>>, it launches the GPU-threads and then IMMEDIATELY returns to your (CPU) program
- So you can have the CPU do other work WHILE the GPU is computing vcos
- If you want to wait for the GPU to complete before doing any other work on the CPU, you need to explicitly synchronize the two:

vcos<<<m,n>>>( n, dev_x, dev_y );
/* CPU can do work here */
cudaThreadSynchronize();
/* GPU is now done, CPU is sync'd */
Note that cudaMemcpy automatically does a synchronization, so you do NOT have to worry about copying back bad data
Synchronizing all of this gets complicated ! See cudaEvent and cudaStream functions
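For reference, a hedged sketch using one stream and one event to overlap CPU work with the vcos launch from earlier; all of the cudaStream*/cudaEvent* calls are part of the CUDA run-time API.

cudaStream_t strm;
cudaEvent_t  done;
cudaStreamCreate( &strm );
cudaEventCreate( &done );

vcos<<<bk,256,0,strm>>>( n, dev_x, dev_y );   /* launch into the stream        */
cudaEventRecord( done, strm );                /* mark the end of the GPU work  */

/* ... CPU does other, unrelated work here ... */

cudaEventSynchronize( done );                 /* block only when we need the result */
cudaMemcpy( host_y, dev_y, n*sizeof(float), cudaMemcpyDeviceToHost );

cudaStreamDestroy( strm );
cudaEventDestroy( done );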