rCUDA Guide
Antonio J. Peña
Grupo de Arquitecturas Paralelas
Departamento de Informática de Sistemas y Computadores
Universitat Politècnica de València
Camino de Vera, s/n, 46022 Valencia, Spain
Email: apenya@gap.upv.es

October 19, 2011
Contents

1 Introduction
2 Usage
  2.1 Client Side
  2.2 Server Side
3 Current limitations
4 Further Information
5 Credits
  5.1 Management
  5.2 Development
Chapter 1
Introduction
The rCUDA framework enables the concurrent remote usage of CUDA-compatible devices. To enable remote GPU-based acceleration, this framework creates virtual CUDA-compatible devices on those machines without a local GPU. These virtual devices represent physical GPUs located in a remote host offering GPGPU services. rCUDA employs the sockets API for the communications between clients and servers. Thus, it can be useful in three different environments:
Clusters. To reduce the number of GPUs installed in high performance clusters. This leads to increased GPU utilization and energy savings, as well as other related savings in acquisition costs, maintenance, space, cooling, etc.

Academia. In commodity networks, to offer concurrent access to a few high performance GPUs to several students.

Virtual Machines. To enable access to the CUDA facilities of the physical machine.
The current version of rCUDA (v3.1) implements all functions in the CUDA Runtime API version 4.0, excluding those related to graphics interoperability. rCUDA 3.1 targets the Linux OS (for 32- and 64-bit architectures) on both client and server sides.
Chapter 2
Usage
2.1 Client Side
The client side middleware is distributed in two files: libcudart.so.4.0 and libcublas.so.4.0. These shared libraries should be placed on the machine(s) accessing remote GPGPU services. Set the LD_LIBRARY_PATH environment variable according to the final location of these files (typically $HOME/rCUDA/framework/rCUDAl or /usr/local/cuda/lib64).

In order to properly execute applications using the rCUDA library, set the RCUDA environment variable to a list of <server>[@<port>] pairs separated by the colon character (e.g., by inserting the line export RCUDA=192.168.0.1 in the .bashrc file of the home directory). The library will try to connect to each server listed in that variable until it succeeds. The default port is 8308.

To compile applications with the rCUDA framework, follow these steps:
Figure 2.1: rCUDA architecture.

1. Install CUDA Toolkit >= 4.0, in order to have the CUDA header files available.

2. Rewrite the application avoiding the use of the CUDA C extensions, that is, using the plain C API.

3. Separate host and device code into different files. Host code files must be compiled with the native C/C++ compiler (e.g., GNU gcc). Device code files must be compiled with the NVIDIA compiler driver nvcc. Use the nvcc option -fatbin (see NVIDIA's documentation) in order to generate a fat binary object, also called fatbin, which is a collection of different cubin and/or PTX files, all representing the same device code but compiled and optimized for different architectures. The resulting file has to be named after the binary plus the extension .fatbin. Note that only one file will be used, so this file must contain all the GPU code. This can be accomplished by manually concatenating the different fatbin files generated.

For further information, see the makefiles of the examples provided with the rCUDA package or those included in the rCUDA SDK.
2.2 Server Side
The rCUDA daemon (rCUDAd) should be run on the machine(s) offering remote GPGPU services.
This daemon offers the following command-line options:

-d <device> : Select device (first working device by default).
-i : Do not daemonize. Instead, run in interactive mode.
-l : Local mode using AF_UNIX sockets.
-n <number> : Number of concurrent servers allowed. 0 stands for unlimited (default).
-p <port> : Specify the port to listen to (default: 8308).
-v : Verbose mode.
-h : Print usage information.
Chapter 3
Current limitations
The current implementation of rCUDA features the following limitations:

- Graphics interoperability is not implemented. Missing modules: OpenGL Interoperability, Direct3D 9 Interoperability, Direct3D 10 Interoperability, Direct3D 11 Interoperability, VDPAU Interoperability, Graphics Interoperability.

- The daemon has to be compiled with CUDA Toolkit >= 4.0.

- rCUDA targets the Linux OS (32- and 64-bit architectures) on both client and server sides, but these have to match.

- Virtualized devices do not offer zero-copy capabilities.

- The rCUDA library is not thread-safe yet. Thus, multiple devices have to be managed from different processes.

- Device and host code have to be kept in separate files. Host code is compiled with a native compiler (e.g., gcc), while device code is compiled with nvcc. Refer to Section 2.1.

- As the CUDA APIs do not explicitly provide a method to find and use embedded device code, rCUDA does not support this feature. Thus, device code has to be compiled using the option -fatbin of nvcc, and the use of precompiled CUDA libraries not explicitly supported (CUFFT, CUDPP, etc.) is not possible.

- Lack of support for the CUDA C extensions. The plain C API has to be used instead. For instance, a kernel call using the CUDA C extensions like:
kernel<<<blocks, threads>>>(a, b, c);
has to be rewritten using the plain C API as:

#define ALIGN_UP(offset, align) (offset) = \
        ((offset) + (align) - 1) & ~((align) - 1)

cudaConfigureCall(blocks, threads);

int offset = 0;

ALIGN_UP(offset, __alignof(a));
cudaSetupArgument(&a, sizeof(a), offset);
offset += sizeof(a);

ALIGN_UP(offset, __alignof(b));
cudaSetupArgument(&b, sizeof(b), offset);
offset += sizeof(b);

ALIGN_UP(offset, __alignof(c));
cudaSetupArgument(&c, sizeof(c), offset);

cudaLaunch("kernel");
However, the 3 lines of code introduced for each argument setup operation can be replaced by a single line calling the following function:
template<class T>
inline void setupArg(T arg, int *offst)
{
    ALIGN_UP(*offst, __alignof(arg));
    cudaSetupArgument(&arg, sizeof(arg), *offst);
    *offst += sizeof(arg);
}
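For illustration only (reusing the blocks, threads, a, b and c variables and the "kernel" entry name from the example above; this snippet is not part of the rCUDA package), the same launch can then be written as:

cudaConfigureCall(blocks, threads);

int offset = 0;
setupArg(a, &offset);
setupArg(b, &offset);
setupArg(c, &offset);

cudaLaunch("kernel");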
For convenience, a header file (rCUDA_util.h) defining the setupArg function and other facilities is included within the rCUDA package under the util directory.

Timing with the event management functions might be inaccurate, since these timings will discard network delays. Using standard POSIX timing procedures such as clock_gettime is recommended.
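The following is a minimal sketch of that recommendation (the timed_launch helper and the "kernel" entry name are placeholders, not part of the rCUDA package; on older glibc versions, linking with -lrt is required for clock_gettime):

#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

/* Wall-clock timing of a kernel launch; unlike CUDA events, this also
   accounts for the network round trips introduced by rCUDA. */
void timed_launch(dim3 blocks, dim3 threads)
{
    struct timespec start, end;
    double seconds;

    clock_gettime(CLOCK_MONOTONIC, &start);

    cudaConfigureCall(blocks, threads);
    /* cudaSetupArgument()/setupArg() calls for the kernel arguments go here. */
    cudaLaunch("kernel");

    /* Wait for the device to finish before reading the clock again. */
    cudaThreadSynchronize();

    clock_gettime(CLOCK_MONOTONIC, &end);
    seconds = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("Kernel time: %f s\n", seconds);
}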
Chapter 4
Further Information
Be careful with the kernel names to be passed to the cudaLaunch function. To avoid C++ name mangling, declare kernels as extern "C" (see the sketch at the end of this chapter). If this is not possible (e.g., if using templates), first compile the device code with the option -Xptxas=-v in order to obtain the real names of the kernels.

For further information, please refer to [1, 2, 3]. Also, do not hesitate to contact Antonio J. Peña (apenya@gap.upv.es) for any questions or bug reports (see the next chapter).
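As a minimal device-code sketch (the vec_add kernel and its signature are hypothetical, not taken from the rCUDA package), declaring a kernel with extern "C" keeps its unmangled symbol name in the fatbin file, so the same string can be passed to cudaLaunch:

/* The unmangled entry name of this kernel is simply "vec_add". */
extern "C" __global__ void vec_add(const float *a, const float *b,
                                   float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

On the host side, the corresponding launch would then call cudaLaunch("vec_add") after the usual cudaConfigureCall and cudaSetupArgument sequence.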
Chapter 5
Credits
5.1 Management
José Duato and Federico Silla
Grupo de Arquitecturas Paralelas
Departamento de Informática de Sistemas y Computadores
Universitat Politècnica de València
Camino de Vera, s/n, 46022 Valencia, Spain
Email: {jduato, fsilla}@disca.upv.es
Rafael Mayo and Enrique S. Quintana-Ortí
High Performance Computing and Architectures Group
Departamento de Ingeniería y Ciencia de los Computadores
Universidad Jaume I
Av. Vicente Sos Baynat, s/n, 12071 Castellón, Spain
Email: {mayo, quintana}@icc.uji.es
5.2
Development
Antonio J. Peña and Carlos Reaño
Grupo de Arquitecturas Paralelas
Departamento de Informática de Sistemas y Computadores
Universitat Politècnica de València
Camino de Vera, s/n, 46022 Valencia, Spain
Email: {apenya, carregon}@gap.upv.es
Adrián Castelló
High Performance Computing and Architectures Group
Departamento de Ingeniería y Ciencia de los Computadores
Universidad Jaume I
Av. Vicente Sos Baynat, s/n, 12071 Castellón, Spain
Email: adcastel@icc.uji.es
Acknowledgements
This work was supported by PROMETEO from Generalitat Valenciana (GVA) under Grant PROMETEO/2008/060, by the Spanish Ministry of Science and Innovation under Grant CONSOLIDER INGENIO CSD2006-00046, by the Spanish Ministry of Science and FEDER (contract no. TIN2008-06570-C04), and by the Fundación Caixa-Castelló/Bancaixa (contract no. P1-1B2009-35).
Bibliography
[1] José Duato, Francisco D. Igual, Rafael Mayo, Antonio J. Peña, Enrique S. Quintana-Ortí, and Federico Silla. An efficient implementation of GPU virtualization in high performance clusters. In Euro-Par 2009, Parallel Processing Workshops, volume 6043 of Lecture Notes in Computer Science, pages 385-394. Springer-Verlag, 2010.

[2] José Duato, Antonio J. Peña, Federico Silla, Rafael Mayo, and Enrique S. Quintana-Ortí. rCUDA: reducing the number of GPU-based accelerators in high performance clusters. In Proceedings of the 2010 International Conference on High Performance Computing and Simulation (HPCS 2010), pages 224-231, Caen, France, June 2010.

[3] José Duato, Antonio J. Peña, Federico Silla, Rafael Mayo, and Enrique S. Quintana-Ortí. Performance of CUDA virtualized remote GPUs in high performance clusters. In Proceedings of the 2011 International Conference on Parallel Processing (ICPP 2011), Taipei, Taiwan, September 2011.