rCUDA Guide
Antonio J. Peña
Grupo de Arquitecturas Paralelas
Departamento de Informática de Sistemas y Computadores
Universitat Politècnica de València
Camino de Vera, s/n, 46022 Valencia, Spain
Email: apenya@gap.upv.es

October 19, 2011
Contents

1 Introduction
2 Usage
  2.1 Client Side
  2.2 Server Side
3 Current limitations
4 Further Information
5 Credits
  5.1 Management
  5.2 Development
Chapter 1
Introduction
The rCUDA framework enables the concurrent remote usage of CUDA-compatible devices. To enable remote GPU-based acceleration, this framework creates virtual CUDA-compatible devices on those machines without a local GPU. These virtual devices represent physical GPUs located in a remote host offering GPGPU services. rCUDA employs the sockets API for the communications between clients and servers. Thus, it can be useful in three different environments:
Clusters. To reduce the number of GPUs installed in high performance clusters. This leads to increased GPU utilization and energy savings, as well as other related savings in acquisition costs, maintenance, space, cooling, etc.

Academia. In commodity networks, to offer concurrent access to a few high performance GPUs to several students.

Virtual Machines. To enable access to the CUDA facilities of the physical machine.
The current version of rCUDA (v3.1) implements all functions in the CUDA Runtime API version 4.0, excluding those related to graphics interoperability. rCUDA 3.1 targets the Linux OS (for 32- and 64-bit architectures) on both client and server sides.
Chapter 2
Usage
2.1 Client Side
The client side middleware is distributed in two files: libcudart.so.4.0 and libcublas.so.4.0. These shared libraries should be placed on the machine(s) accessing remote GPGPU services. Set the LD_LIBRARY_PATH environment variable according to the final location of these files (typically $HOME/rCUDA/framework/rCUDAl or /usr/local/cuda/lib64).

In order to properly execute applications using the rCUDA library, set the RCUDA environment variable to a list of <server>[@<port>] pairs separated by the colon character (e.g., by inserting the line export RCUDA=192.168.0.1 in the .bashrc file of the home directory). The library will try to connect to each server listed in that variable until it succeeds. The default port is 8308.

To compile applications with the rCUDA framework, follow these steps:
Figure 2.1: rCUDA architecture.

1. Install CUDA Toolkit >= 4.0, in order to have the CUDA header files available.

2. Rewrite the application avoiding the use of the CUDA C extensions, that is, using the plain C API.

3. Separate host and device code into different files. Host code files must be compiled with the native C/C++ compiler (e.g., GNU gcc). Device code files must be compiled with the NVIDIA compiler driver nvcc. Use the nvcc option -fatbin (see NVIDIA's documentation) in order to generate a fat binary object, also called fatbin, which is a collection of different cubin and/or PTX files, all representing the same device code but compiled and optimized for different architectures. The resulting file has to be named after the binary plus the extension .fatbin. Note that only one file will be used, so this file must contain all the GPU code. This can be accomplished by manually concatenating the different fatbin files generated.

For further information, see the makefiles of the examples provided with the rCUDA package or those included in the rCUDA SDK.
2.2 Server Side
The rCUDA daemon (rCUDAd) should be run on the machine(s) offering remote GPGPU services.
This daemon offers the following command-line options:

-d <device> : Select device (first working device by default).
-i : Do not daemonize. Instead, run in interactive mode.
-l : Local mode using AF_UNIX sockets.
-n <number> : Number of concurrent servers allowed. 0 stands for unlimited (default).
-p <port> : Specify the port to listen to (default: 8308).
-v : Verbose mode.
-h : Print usage information.
Chapter 3
Current limitations
The current implementation of rCUDA features the following limitations:

- Graphics interoperability is not implemented. Missing modules: OpenGL Interoperability, Direct3D 9 Interoperability, Direct3D 10 Interoperability, Direct3D 11 Interoperability, VDPAU Interoperability, Graphics Interoperability.

- The daemon has to be compiled with CUDA Toolkit >= 4.0.

- rCUDA targets the Linux OS (32- and 64-bit architectures) on both client and server sides, but these have to match.

- Virtualized devices do not offer zero-copy capabilities.

- The rCUDA library is not thread-safe yet. Thus, multiple devices have to be managed from different processes.

- Device and host code have to be kept in separate files. Host code is compiled with a native compiler (e.g., gcc), while device code is compiled with nvcc. Refer to Section 2.1.

- As the CUDA APIs do not explicitly provide a method to find and use embedded device code, rCUDA does not support this feature. Thus, device code has to be compiled using the option -fatbin of nvcc, and the use of precompiled CUDA libraries not explicitly supported (CUFFT, CUDPP, etc.) is not possible.

- Lack of support for the CUDA C extensions. The plain C API has to be used instead. For instance, a kernel call using the CUDA C extensions like:
kernel<<<blocks, threads>>>(a, b, c);
has to be rewritten using the plain C API as:

#define ALIGN_UP(offset, align) (offset) = \
        ((offset) + (align) - 1) & ~((align) - 1)

cudaConfigureCall(blocks, threads);

int offset = 0;

ALIGN_UP(offset, __alignof(a));
cudaSetupArgument(&a, sizeof(a), offset);
offset += sizeof(a);

ALIGN_UP(offset, __alignof(b));
cudaSetupArgument(&b, sizeof(b), offset);
offset += sizeof(b);

ALIGN_UP(offset, __alignof(c));
cudaSetupArgument(&c, sizeof(c), offset);

cudaLaunch("kernel");
However, the 3 lines of code introduced for each argument setup operation can be replaced by a single line calling the following function:
template<class T>
inline void setupArg(T arg, int *offst)
{
    ALIGN_UP(*offst, __alignof(arg));
    cudaSetupArgument(&arg, sizeof(arg), *offst);
    *offst += sizeof(arg);
}
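For illustration only (reusing the blocks, threads, a, b and c variables and the "kernel" entry name from the example above; this snippet is not part of the rCUDA package), the same launch can then be written as:

cudaConfigureCall(blocks, threads);

int offset = 0;
setupArg(a, &offset);
setupArg(b, &offset);
setupArg(c, &offset);

cudaLaunch("kernel");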
For convenience, a header file (rCUDA_util.h) defining the setupArg function and other facilities is included within the rCUDA package under the util directory.

Timing with the event management functions might be inaccurate, since these timings will discard network delays. Using standard POSIX timing procedures such as clock_gettime is recommended.
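The following is a minimal sketch of that recommendation (the timed_launch helper and the "kernel" entry name are placeholders, not part of the rCUDA package; on older glibc versions, linking with -lrt is required for clock_gettime):

#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

/* Wall-clock timing of a kernel launch; unlike CUDA events, this also
   accounts for the network round trips introduced by rCUDA. */
void timed_launch(dim3 blocks, dim3 threads)
{
    struct timespec start, end;
    double seconds;

    clock_gettime(CLOCK_MONOTONIC, &start);

    cudaConfigureCall(blocks, threads);
    /* cudaSetupArgument()/setupArg() calls for the kernel arguments go here. */
    cudaLaunch("kernel");

    /* Wait for the device to finish before reading the clock again. */
    cudaThreadSynchronize();

    clock_gettime(CLOCK_MONOTONIC, &end);
    seconds = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("Kernel time: %f s\n", seconds);
}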
Chapter 4
Further Information
Be careful with the kernel names to be passed to the cudaLaunch function. To avoid C++ name mangling, declare kernels as extern "C" (see the sketch at the end of this chapter). If this is not possible (e.g., if using templates), first compile the device code with the option -Xptxas=-v in order to obtain the real names of the kernels.

For further information, please refer to [1, 2, 3]. Also, do not hesitate to contact Antonio J. Peña (apenya@gap.upv.es) for any questions or bug reports (see the next chapter).
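As a minimal device-code sketch (the vec_add kernel and its signature are hypothetical, not taken from the rCUDA package), declaring a kernel with extern "C" keeps its unmangled symbol name in the fatbin file, so the same string can be passed to cudaLaunch:

/* The unmangled entry name of this kernel is simply "vec_add". */
extern "C" __global__ void vec_add(const float *a, const float *b,
                                   float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

On the host side, the corresponding launch would then call cudaLaunch("vec_add") after the usual cudaConfigureCall and cudaSetupArgument sequence.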
Chapter 5
Credits
5.1 Management
José Duato and Federico Silla
Grupo de Arquitecturas Paralelas
Departamento de Informática de Sistemas y Computadores
Universitat Politècnica de València
Camino de Vera, s/n, 46022 Valencia, Spain
Email: {jduato, fsilla}@disca.upv.es
Rafael Mayo and Enrique S. Quintana-Ortí
High Performance Computing and Architectures Group
Departamento de Ingeniería y Ciencia de los Computadores
Universidad Jaume I
Av. Vicente Sos Baynat, s/n, 12071 Castellón, Spain
Email: {mayo, quintana}@icc.uji.es
5.2
Development
Antonio J. Peña and Carlos Reaño
Grupo de Arquitecturas Paralelas
Departamento de Informática de Sistemas y Computadores
Universitat Politècnica de València
Camino de Vera, s/n, 46022 Valencia, Spain
Email: {apenya, carregon}@gap.upv.es
Adrián Castelló
High Performance Computing and Architectures Group
Departamento de Ingeniería y Ciencia de los Computadores
Universidad Jaume I
Av. Vicente Sos Baynat, s/n, 12071 Castellón, Spain
Email: adcastel@icc.uji.es
Acknowledgements
This work was supported by PROMETEO from Generalitat Valenciana (GVA) under Grant PROMETEO/2008/060, by the Spanish Ministry of Science and Innovation under Grant CONSOLIDER INGENIO CSD2006-00046, by the Spanish Ministry of Science and FEDER (contract no. TIN2008-06570-C04), and by the Fundación Caixa-Castelló/Bancaixa (contract no. P1-1B2009-35).
Bibliography
[1] José Duato, Francisco D. Igual, Rafael Mayo, Antonio J. Peña, Enrique S. Quintana-Ortí, and Federico Silla. An efficient implementation of GPU virtualization in high performance clusters. In Euro-Par 2009, Parallel Processing Workshops, volume 6043 of Lecture Notes in Computer Science, pages 385-394. Springer-Verlag, 2010.

[2] José Duato, Antonio J. Peña, Federico Silla, Rafael Mayo, and Enrique S. Quintana-Ortí. rCUDA: reducing the number of GPU-based accelerators in high performance clusters. In Proceedings of the 2010 International Conference on High Performance Computing and Simulation (HPCS 2010), pages 224-231, Caen, France, June 2010.

[3] José Duato, Antonio J. Peña, Federico Silla, Rafael Mayo, and Enrique S. Quintana-Ortí. Performance of CUDA virtualized remote GPUs in high performance clusters. In Proceedings of the 2011 International Conference on Parallel Processing (ICPP 2011), Taipei, Taiwan, September 2011.