

2010, Proceedings of the 3rd International Workshop on Multicore Software Engineering - IWMSE '10

JCudaMP: OpenMP/Java on CUDA

Georg Dotzler, Ronald Veldema, Michael Klemm
University of Erlangen-Nuremberg, Computer Science Department 2
Martensstr. 3, 91058 Erlangen, Germany
{georg.dotzler, veldema, klemm}@informatik.uni-erlangen.de

Categories and Subject Descriptors
D.3.3 [Programming Languages]: Language Constructs and Features—Concurrent programming structures; D.3.4 [Programming Languages]: Processors—Code generation, Run-time environments

ABSTRACT
We present an OpenMP framework for Java that can exploit an available graphics card as an application accelerator. Dynamic languages (Java, C#, etc.) pose a challenge here because of their write-once-run-everywhere approach: it is impossible to make compile-time assumptions about whether, and which type of, accelerator or graphics card will be available in the system at run-time. We present an execution model that dynamically analyzes the running environment to find out what hardware is attached. Based on the results, it rewrites the bytecode and generates the necessary gpGPU code on-the-fly. Furthermore, we solve two extra problems caused by the combination of Java and CUDA. First, CUDA-capable hardware usually has little memory compared to main memory; because Java is a pointer-free language, however, array data can be stored in main memory and buffered in GPU memory. Second, CUDA requires data to be copied to and from the graphics card's memory explicitly. As modern languages use many small objects, a naive implementation would need many copy operations. This is exacerbated because Java uses arrays-of-arrays to implement multi-dimensional arrays. A clever copying technique and two new array packages allow for more efficient use of CUDA.

1. INTRODUCTION
Currently, many computers already have a graphics card that is capable of doing generic computations using, for example, CUDA [1] (nVidia). Since their core counts will grow and their cores will support even more generic computations, general-purpose GPU (gpGPU) computing will become even more attractive. The common computing environment is therefore already heterogeneous, and we need a programming environment that transparently adapts to this heterogeneity.

In this situation one can learn from Java's or .NET's write-once-run-everywhere approach: the programmer no longer has to deal with various native executables or architectural details of machines. The same is needed for (heterogeneous) multi-core architectures that might or might not be equipped with gpGPUs or other accelerators. The programmer cannot be expected to explicitly write code for every chip combination. Instead, the programmer will expect that for such modern, dynamic, object-oriented languages, the runtime environment will make sure that execution exploits any gpGPU hardware when it is available, without significant changes to the source code. To do so, we need to annotate the bytecode (here Java bytecode) to mark parallel code so that it can be easily found at run-time. We also need to dynamically generate code for the current hardware architecture and hardware versions.

To circumvent the low-level parallel programming model of Java and to focus on SPMD-style parallelism that is suitable for graphics cards, we use JaMP [2], which provides OpenMP-like annotations for Java code. The JaMP compiler is an OpenMP compiler that translates OpenMP/Java code to multi-threaded Java bytecode. We extend the existing compiler to emit Java bytecode annotations.
A new Java class loader reads the annotations and performs run-time code rewriting to adapt the bytecode to whatever multi-/many-core hardware is available in the system. The new Java class loader detects the presence and the version of CUDA (each version has different capabilities and memory sizes). Other hardware that falls into the same category of parallel coprocessor hardware includes the Cell processor [3], Brook+ [4] (AMD's gpGPU platform), and Intel's future Larrabee [5] graphics/parallel processor.

Once we can detect and use CUDA dynamically, the two important problems are (1) to achieve good memory transfer performance and (2) to allocate huge data structures on the graphics card. We solve (1) by efficiently packing many objects into big chunks for bulk transfers and by providing specialized array classes that allocate their payload on the gpGPU (if available) and copy elements to the host only when needed. An OpenMP extension solves problem (2) by semi-automatically fragmenting an array into smaller tiles that are just big enough to fit onto the graphics card.

One could argue that a library-based approach would be sufficient. Such a library would contain a number of pre-programmed, hand-optimized compute kernels for a number of hardware types. There are two problems with such an approach. First, the preselected kernels will not fit all new problem domains or algorithms, and second, new hardware will probably require new interfaces and thus new libraries. In our approach, by contrast, the programmer can write his own parallel region and let a lower-level software layer optimize and adapt it to the available hardware. For new hardware, the programs themselves can remain untouched, with only the class loader/compiler changed.

The main contributions of this paper are a technique to dynamically invade an available gpGPU (Sec. 2), ways to minimize host-gpGPU object transfers (Sec. 3), and finally, a buffering scheme that semi-automatically hides the memory limits of attached graphics cards and enables the processing of large data structures (Sec. 4).

2. CODE GENERATION

2.1 Architecture
To optimize a parallel region for the platform's hardware, our framework must dynamically generate specialized code at run-time because the target hardware is not known at compile-time. This code specialization can occur in the VM (e.g. in the JIT compiler) or in an extended Java class loader. Changing a VM requires many low-level, system-dependent patches (including VM recompilation for every platform), whereas supplying a new class loader only requires a new JVM command line argument and is portable to all VMs. We therefore use a class-loader-based approach. Fig. 1 shows our compilation and execution pipeline.

[Figure 1: Execution model of OpenMP/Java programs in CUDA environments.]

First, a programmer creates Java code enriched with OpenMP/Java pragmas to tag code that is amenable to parallelization. For example:

    // Original code:
    class Demo {
        void work(int[] arr) {
            //#omp parallel for
            for (int i = S; i < E; i += K) {
                arr[i]++;
            }
        }
    }

    // Generated for Demo.work:
    class DemoWorker extends Worker {
        int arr[], S, E, K;
        ...
        void run() {
            Job w;
            JCudaMP.init(S, E, K);
            while ((w = JCudaMP.getWork()) != null) {
                for (int i = w.start; i < w.end; i += w.step) {
                    arr[i]++;
                }
            }
        }
    }

    // Patched Demo:
    class Demo {
        void work(int[] arr) {
            JCudaMP.start(new DemoWorker(S, E, K, arr));
        }
    }

The pragma tells the OpenMP compiler that the for loop can be executed in parallel, as each loop iteration is completely independent of the others. Without prescribing how to do the parallelization, this code is then translated by the OpenMP/Java compiler to bytecode, effectively by generating the multi-threaded code shown above. If the class loader later discovers CUDA-enabled hardware, it rewrites this bytecode on-the-fly. First, the loop's body is placed in a new method (here DemoWorker.run) and surrounded with logic that requests chunks of the iteration space from the runtime. A sequential loop then processes the chunk's iterations. The original parallel code is replaced by a call to JCudaMP.start with a DemoWorker instance as argument. A standard class loader will ignore the special bytecode section and will process the regular threaded implementation. In contrast, our JCudaMP class loader analyzes the original loop body in DemoWorker.run to see if it is CUDA-amenable. From the body of the inner loop in DemoWorker.run we generate CUDA code. The JCudaMP.start call is replaced by a call to DemoWorker.runCuda, which invokes the generated CUDA code (via JNI).

We currently do not allow CUDA execution if the code contains meta-object accesses, native calls, try-catch statements, general object allocations, or calls to methods that the class loader has not previously marked as CUDA-amenable. Meta-object access includes class casts, instance-of tests, and accesses to Class and its associates in the java.lang.reflect package. Object allocations are not allowed because memory is scarce on graphics cards. Arbitrary method calls are not allowed because CUDA threads have very limited (call) stack space. Only methods of the CudaRuntimeFunctions class can be used; this class contains functions such as sinf and cosf. In Java mode, Math.sin and Math.cos are used; in CUDA mode, the class loader replaces them with the sinf and cosf functions provided by CUDA. Unfortunately, depending on hardware precision, the results of these functions can differ from the Java versions. In addition, the OpenMP functions omp_get_num_threads and omp_get_thread_num are allowed, which is necessary for programs that expect certain thread IDs. If code inside the parallel region prevents execution on CUDA, the compiler and the class loader report a warning.

At run-time, we invoke the CUDA compiler to generate a shared library and load it into the running JVM. This compilation process imposes some overhead: it requires writing the CUDA source code to disk, invoking the (heavyweight) CUDA compiler to generate assembly code, passing that to the assembler and linker, and finally loading the result into the VM via the shared library loader. The generated C/CUDA code is then called from Java by an inserted JNI call that replaces the call to JCudaMP.start. To reduce this cost, the class loader uses a second thread to compile all the parallel regions of a class in one go, so that compilation, assembling, and linking are shared.

There is a variety of CUDA-capable cards, each with different capabilities (support for double, long, etc.) and memory sizes (from 256 MB to 4 GB). To deal with these differences, the class loader performs a test at start-up to determine the capabilities of the available hardware. The results of this test are used for the code generation decisions described later in this paper.
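As an illustration, such a start-up probe could look roughly like the following sketch. The class names (CudaProbe, CudaCapabilities), the native methods, and the library name are hypothetical stand-ins chosen for this example, not the actual JCudaMP API; the real runtime performs the equivalent query through its JNI layer.

    // Hypothetical sketch of the start-up capability test (illustrative names only).
    // The native methods would be backed by a small JNI library that queries the
    // CUDA driver/runtime API.
    final class CudaCapabilities {
        final boolean available;       // is any CUDA device present?
        final boolean supportsDouble;  // e.g. compute capability >= 1.3
        final long    globalMemory;    // bytes of device memory

        CudaCapabilities(boolean available, boolean supportsDouble, long globalMemory) {
            this.available = available;
            this.supportsDouble = supportsDouble;
            this.globalMemory = globalMemory;
        }
    }

    final class CudaProbe {
        // JNI entry points (assumed to exist in the runtime's native library).
        private static native int  deviceCount();
        private static native int  computeCapabilityMajor(int device);
        private static native int  computeCapabilityMinor(int device);
        private static native long globalMemoryBytes(int device);

        static CudaCapabilities probe() {
            try {
                System.loadLibrary("jcudamp");   // hypothetical native library name
                if (deviceCount() == 0) {
                    return new CudaCapabilities(false, false, 0);
                }
                int major = computeCapabilityMajor(0);
                int minor = computeCapabilityMinor(0);
                boolean doubles = major > 1 || (major == 1 && minor >= 3);
                return new CudaCapabilities(true, doubles, globalMemoryBytes(0));
            } catch (UnsatisfiedLinkError noCuda) {
                // No native library or no driver: fall back to the threaded code path.
                return new CudaCapabilities(false, false, 0);
            }
        }
    }

The class loader would consult such a record once and then decide, per parallel region, whether to emit CUDA code (and with which features, e.g. double support) or to keep the threaded bytecode.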
2.2 Introduction to CUDA
To better understand CUDA code generation, some CUDA details need to be known. CUDA extends C with a flavor of parallel for loop that invokes a function (called a kernel) a number of times in parallel on the graphics card. For example:

    foo<<<block_size, grid_size, smem_size>>>(a, b, c);

causes block_size × grid_size parallel invocations of foo with arguments a, b, c. A group of block_size parallel invocations operates in SPMD fashion and can access smem_size bytes of fast shared memory. These groups of parallel invocations are then effectively repeated grid_size times (although, with enough resources, some of these groups can also run in parallel with each other). Note that the host's memory is not directly accessible from CUDA. Before issuing a parallel call, we must therefore allocate graphics card memory with cudaMalloc() and copy data to the graphics card with cudaMemcpy(). After the parallel call it might be necessary to copy the data back to main memory.

2.3 CUDA code generation
For CUDA-amenable parallel regions our class loader generates CUDA code. Because bytecode instructions target a stack machine, and maintaining an actual stack for execution is expensive (due to the resulting memory accesses), we first translate the Java bytecode to a register-based representation. From the register representation we then generate C/CUDA code.

The generation of code for array accesses needs a more detailed discussion since (1) an array access can cause an index-out-of-bounds exception, (2) each array has its length associated with it (the array.length field), and (3) multi-dimensional arrays in Java are specified as arrays-of-arrays instead of true multi-dimensional arrays. We solve (1) by giving each CUDA kernel a pointer to a (CUDA-side) variable which is set to a nonzero value if an index violation occurs (it is initially zero). Upon return from the kernel, we inspect the flag and throw the actual exception on the host. We solve (2) by giving each array a companion data structure holding its length (or lengths if it is a multi-dimensional array). We chose not to prefix each array with an integer holding its size because the CUDA hardware prefers aligned array accesses. We do not solve (3) and use arrays-of-arrays on the CUDA side as well. This can cause the usual performance problems, as each extra dimension access causes a memory access to fetch the pointer to a sub-array. Taking it all together, for an array access to 'a[i][j]' we by default generate a CUDA kernel like this:

    __global__ void generated_kernel(int *exception_type,
                                     int ***a_data, int **a_length,
                                     unsigned int i, unsigned int j) {
        int boundary_check_num;
        if (i >= a_length[0][0] || j >= a_length[1][i]) {
            boundary_check_num = 1;
            goto exception_return;
        }
        int tmp = a_data[i][j];
        ...
        return;
    exception_return:
        *exception_type = boundary_check_num;
        return;
    }

The a_data array holds the contents and the a_length array holds the lengths of the 1D and 2D sub-arrays. The exception_type variable handles index-out-of-bounds exceptions: it is tested after return from the kernel, and its value is the index of the failing bounds check among all bounds checks in the kernel, which identifies the proper Java exception to throw on the host.
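On the host side, the wrapper around the JNI call can then turn the flag into a regular Java exception. The following is a minimal sketch of this idea; the stub class, the runCudaKernel signature, and the exception message are hypothetical and only illustrate the mechanism, not the code the class loader actually emits.

    // Hypothetical host-side wrapper (illustrative names only).
    final class DemoWorkerCudaStub {
        // JNI call into the generated kernel; returns 0 on success or the number
        // of the bounds check that failed, as written into *exception_type.
        private static native int runCudaKernel(int[] arr, int start, int end, int step);

        static void runCuda(int[] arr, int start, int end, int step) {
            int check = runCudaKernel(arr, start, end, step);
            if (check != 0) {
                // The check number identifies which array access failed, so the
                // host can throw the exception that Java semantics require.
                throw new ArrayIndexOutOfBoundsException(
                    "array index out of bounds in CUDA kernel (bounds check #" + check + ")");
            }
        }
    }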
If multiple index-out-of-bounds violations occur, one is reported arbitrarily; fully accounting for all of them would cause too much overhead after the kernel returns, at least for small kernels.

3. MEMORY MANAGEMENT
Since CUDA kernels can only access the local memory of a graphics card, explicit transfers between host and graphics card are needed. Specifically, GPU memory must first be allocated and released, and second, data must be copied to and from the GPU. Both operations are expensive as they involve context switches. It turned out to be more efficient to implement a wrapper around CUDA's malloc: the wrapper allocates a big chunk of GPU memory and then sub-allocates within that chunk on the host. The CUDA runtime system does not see this sub-allocation. Copying memory (via DMA) to and from the GPU over the PCIe bus involves expensive context switches that reduce the available bandwidth considerably; this is especially noticeable for smaller data structures due to the latency. In the next few subsections we therefore examine a number of ways to reduce the number of host-to-GPU transfers.

3.1 Avoid transfers with data-sharing clauses
As observed in [6], some copying can be avoided by properly tagging the data with OpenMP data-sharing clauses. For example, if the data is marked firstprivate, it only needs to be copied from the host to the card but not back again. If the data is lastprivate, it only needs to be copied out of the card, never into the card. Only if the data is marked shared does it require both a copy-in and a copy-out. Future versions could try to determine the variable scope automatically; this is also discussed in [7].

3.2 Efficiently creating a CUDA data image
Since CUDA kernels can access not only flat data structures but also data structures with pointers, we need to take care of pointer translation when performing data transfers to and from the GPU. Java's normal object serialization copies a graph of data objects into a buffer and makes sure to ship objects only once, even if they are referred to more than once. The recipient unpacks all the objects and revectors the pointers according to the objects' new addresses. In JCudaMP we reuse the idea of bulk-transferring a buffer of data. In contrast to normal object serialization, however, we create the complete CUDA image on the host, i.e., all object references in the serialization buffer are turned into CUDA-side pointers, so that the shipped objects can be used without any further processing on the CUDA side. In addition, JCudaMP's serialization exploits the fact that CUDA-amenable code may not use arbitrary classes; only instances of final subclasses of CudaObject are permitted. The current prototype does not allow a CudaObject subclass to contain references to other objects or arrays. This greatly simplifies (and speeds up) the serialization process, as no cycle detection is needed to see if an object has been serialized before. It also simplifies the compilation process, as we can easily determine beforehand which methods need to be compiled to CUDA so that they are available for calls from inside a compute kernel.

3.3 GPU-based array data
By default, regular data is allocated on the host, which makes it cheap to use by the CPU but expensive to use by CUDA (the data needs to be copied to and from the GPU).
Because we cannot be certain that the host will use the data after a kernel call, we have to conservatively copy all of it back to the host. For the next kernel call (perhaps the same call in a loop), we then often need to copy the same data back into the GPU again, even if the host did not use or change it at all. Some form of compile-time analysis or host-side read/write detection would be expensive, and even unnecessary if no CUDA code was generated for a kernel at all. Consider the following scenario: a fluid dynamics problem iteratively updates a large 2D array:

    void work(float[][] data) {
        for (int i = 0; i < NUM_ITERATIONS; i++) {
            update_data(data);
        }
    }

For a GPU kernel called in update_data, data needs to be copied to and from the GPU even though only code from the CUDA-amenable parallel region accesses it. Because we cannot know this a priori, we are forced to do the copying. Static analysis of the bytecode by the class loader is too expensive (as class loading happens at run-time). Static analysis of the Java code at compile time does not work either, as it would require a closed-world assumption that does not hold for Java. Also, the class loader cannot 'undo' transformations it has already applied to loaded code.

Our solution is an array package that keeps the data on the GPU at all times. Because we want both performance and ease of implementation, we supply an array class for each combination of primitive type and number of dimensions (1-4), e.g. Int2DArray, Float3DArray, etc. Each of these classes has a constructor that takes the required sizes, and each has get(index) and set(index, value) methods to access array elements. If host code accesses an element of an array managed by the package, the package temporarily caches a block of array elements. Array elements remain cached on the host until either the (fixed-size) cache is full or a CUDA call is made; cached elements are then copied to CUDA. The example from above, rewritten to use this array package, looks as follows (first the OpenMP version, then the corresponding CUDA kernel):

    //#omp parallel for
    for (int x = 0; x < WIDTH; x++) {
        for (int y = 0; y < HEIGHT; y++) {
            data.set(x, y, compute_updated_value());
        }
    }

    // generated CUDA/C code:
    void cuda_kernel() {
        int x = ...
        int y = ...
        CUDA_ARRAY_SET_xxx(data, x, y, compute_updated_value());
    }

With a standard class loader, or when no CUDA hardware is available, the code works as described above; the array class adapts. The JCudaMP class loader, however, generates CUDA code and uses a form of semantic inlining [8] to improve performance. The get/set methods, when invoked on the host, perform JNI calls that access GPU memory using cudaMemcpy. When the class loader sees a call to set/get while generating CUDA code, it replaces the call with a direct memory access to the array element. Since the classes of our array package are final, set/get calls are easily detected statically (and not hidden by polymorphism). At run-time, the serialization code that copies objects to and from GPU memory is also aware of the array package and will simply pass the array's internal pointer. Just as Java allows access to sub-arrays, our array classes provide sub-array access, since this can save many indirections and array bounds checks in inner loops. To give the user a choice, we also added true multi-dimensional versions of the array package.
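To make the interface concrete, a minimal rectangular 2D float array in the spirit of this package might look as follows. This is a sketch under assumptions: the class name, fields, and the plain host array standing in for the GPU allocation are illustrative, not the actual JCudaMP implementation.

    // Minimal sketch of a rectangular 2D array class (illustrative only).
    public final class Float2DArray {
        private final int width, height;
        // In the real package this storage lives in GPU memory and is reached via
        // JNI/cudaMemcpy; a host array is used here to keep the sketch runnable.
        private final float[] data;

        public Float2DArray(int width, int height) {
            this.width = width;
            this.height = height;
            this.data = new float[width * height];
        }

        // Address arithmetic: (x, y) maps to a single flat index, so an element
        // access needs no pointer chasing through sub-arrays.
        private int flat(int x, int y) {
            if (x < 0 || x >= width || y < 0 || y >= height)
                throw new ArrayIndexOutOfBoundsException(x + "," + y);
            return x * height + y;
        }

        public float get(int x, int y)          { return data[flat(x, y)]; }
        public void  set(int x, int y, float v) { data[flat(x, y)] = v; }

        public int width()  { return width; }
        public int height() { return height; }
    }

On the host, get/set would additionally go through the block cache described above and reach the GPU-side storage via JNI; in generated CUDA code the class loader replaces these calls with direct accesses to the flat device array.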
The multi-dimensional array classes use address arithmetic to map multi-dimensional accesses onto a one-dimensional array. These classes can be more efficient in certain applications and on certain hardware. Deallocation of the graphics card memory used by an array package is done via the object's finalizer: when the finalizer is called by the garbage collector, it frees the associated graphics memory. Normal arrays/objects are copied in and out of the card immediately for every kernel invocation (and their graphics card memory is freed afterwards), so garbage collection is not an issue there.

3.4 Hiding the Array Packages
The array packages provide a good way to avoid unnecessary copy operations to the GPU, but they require some effort from the programmer. A developer has to select the appropriate package for a given problem (e.g. to choose between the true multi-dimensional and the normal package). Furthermore, it is necessary to replace the array uses in the code with getter and setter methods. To solve these problems, we provide an annotation syntax that the programmer can use to tell the compiler that a given array is managed. A managed array is implemented with an array package: the compiler selects an array package and replaces the Java array with it. The programmer can optionally provide hints to aid the compiler in this selection, for example a hint that a rectangular array would be preferred. For example:

    class ArrayAnnotationDemo {
        void foo() {
            //#omp managed(a: rectangular)
            int[][] a = new int[16][16];
            //#omp parallel for
            for (int i = 0; i < 16; i++)
                for (int j = 0; j < 16; j++)
                    a[i][j] = 12345;
            Another.zoo(a);
        }
    }

    class Another {
        //#omp managed(a: rectangular)
        static void zoo(int[][] a) { ... }
    }

This annotation causes the compiler to rewrite the code by changing the type of 'a' to a multi-dimensional array package class. The a[i][j] access in the loop is rewritten to a setter call and, as described above, replaced by the class loader if necessary. Note that this annotation changes the type of the listed arrays, so all uses of the arrays outside the method must be annotated as well. For example, if 'a' in the example above is passed to another function, that function is required to carry the managed type annotation too. Using a type annotation instead of a new array syntax for managed arrays has the advantage that a standard Java compiler, unaware of our annotation, still accepts the program as normal Java.

4. TILING OF HUGE ARRAYS
Compared to main memory (up to 128 GB), current graphics cards have only a small memory (up to 4 GB) that is too small to hold the complete working set of many scientific problems (which often just barely fit into main memory). The data therefore has to be processed in a tiled fashion. Unfortunately, neither static analysis nor the class loader or JIT at run-time can determine the sizes of the various arrays in a program beforehand; this would require the insertion of dynamic tests, which incur run-time overhead. We therefore need the programmer's help to identify huge arrays at compile time. Our newly introduced tiled(X, ...) clause extends OpenMP to state that array X is huge and should be processed in a tiled fashion. This annotation causes the array to be partitioned into tiles and automatically creates an additional for loop that iterates over the tiles. The size of a tile of one array depends both on the number of other (huge) arrays processed in the parallel for loop and on the size of the available GPU memory. Unfortunately, we can no longer support random access to elements of arrays marked as tiled, i.e., it is not allowed to access array elements outside of the current tile. To relax this restriction a bit, tiles can overlap by a few elements on their sides.

    //#omp parallel for tiled(dst: {x: 0, 0; y: 0, 0}, src: {x: -1, 1; y: 0, 1})
    for (int x = 1; x < WIDTH - 1; x++) {
        for (int y = 1; y < HEIGHT - 1; y++) {
            float value1 = src[x - 1, y + 1];
            float value2 = src[x + 1, y];
            dst[x, y] = value1 + value2;
        }
    }

In the example, dst and src are two-dimensional, so two nested for loops must follow the tiled clause. The first argument of the annotation indicates that dst should be tiled and that all iterations access it only by the index expressions x and y. The second argument announces that src can be indexed by x+1, x, and x-1 in the first dimension and by y and y+1 in the second dimension. The code is automatically transformed into two outer loops that iterate over the tiles; two nested loops then (in parallel) run over the elements of each tile to do the actual work:

    BlockSizeCalculator calc = new BlockSizeCalculator();
    calc.add(dst, x, 0, 0);
    calc.add(dst, y, 0, 0);
    calc.add(src, x, -1, 1);
    calc.add(src, y, 0, 1);
    int tile_size_x = calc.compute_x_tile_size();
    int tile_size_y = calc.compute_y_tile_size();
    for (tmp_x = 1; tmp_x < WIDTH - 1; tmp_x += tile_size_x)
        for (tmp_y = 1; tmp_y < HEIGHT - 1; tmp_y += tile_size_y)
            process_tile(tmp_x, tmp_y);

    void process_tile(int tmp_x, int tmp_y) {
        // copy data to tile:
        float[][] tile_src = calc.copy_to_tile(src, tmp_x, tmp_y);
        float[][] tile_dst = calc.copy_to_tile(dst, tmp_x, tmp_y);
        int boundary_x = min(tmp_x + tile_size_x, WIDTH);
        int boundary_y = min(tmp_y + tile_size_y, HEIGHT);
        int dst_x_diff = calc.compute_diff_x(dst);
        int dst_y_diff = calc.compute_diff_y(dst);
        int src_x_diff = calc.compute_diff_x(src);
        int src_y_diff = calc.compute_diff_y(src);

        //#omp parallel for
        for (int x = tmp_x; x < boundary_x; x++) {
            for (int y = tmp_y; y < boundary_y; y++) {
                int new_x_src = x - src_x_diff;
                int new_y_src = y - src_y_diff;
                int new_x_dst = x - dst_x_diff;
                int new_y_dst = y - dst_y_diff;
                float value1 = tile_src[new_x_src - 1, new_y_src + 1];
                float value2 = tile_src[new_x_src + 1, new_y_src];
                tile_dst[new_x_dst, new_y_dst] = value1 + value2;
            }
        }
        calc.copy_from_tile(tile_src, src, tmp_x, tmp_y);
        calc.copy_from_tile(tile_dst, dst, tmp_x, tmp_y);
    }

Based on the arrays, the indexes, and their ranges, the BlockSizeCalculator object figures out the appropriate tile sizes. It divides the available CUDA memory (established during the initial probing of the hardware) among the fixed-size tiles of all arrays; the tiles of each array are made as large as possible in each dimension.
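A deliberately naive view of this budget arithmetic is sketched below: if the memory must be split among k tiled arrays and an array's tile needs (edge + 2*haloX) * (edge + 2*haloY) elements, the calculator can pick the largest edge that stays within the per-array budget. The class name, the even split, and the square-tile assumption are simplifications for this example; the paper does not spell out the actual BlockSizeCalculator heuristics.

    // Naive tile-size estimate (illustrative only): square tiles, the available
    // GPU memory split evenly among all tiled arrays, halo widths taken from the
    // index ranges given in the tiled(...) clause.
    final class NaiveTileSizer {
        // freeGpuBytes: GPU memory available for tiles; numTiledArrays: arrays
        // sharing it; bytesPerElement: e.g. 4 for float; haloX/haloY: extra border
        // elements needed per side. Returns an edge length usable in both dimensions.
        static int tileEdge(long freeGpuBytes, int numTiledArrays,
                            int bytesPerElement, int haloX, int haloY) {
            long budgetElems = freeGpuBytes / ((long) numTiledArrays * bytesPerElement);
            // Approximately solve (edge + 2*haloX) * (edge + 2*haloY) <= budgetElems.
            int edge = (int) Math.sqrt((double) budgetElems);
            edge -= (haloX + haloY);   // leave room for the overlap region
            return Math.max(edge, 1);
        }
    }

This only illustrates how the memory budget bounds the tile dimensions; the real calculator also takes the fixed loop bounds of each dimension into account and need not produce square tiles.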
Using large tiles reduces the number of DMA transfers to and from the GPU significantly. Note that this transformation occurs at compile time and not at run-time in the class loader. If a programmer writes the annotation, it is therefore used for both the threaded version and the generated CUDA version. In the threaded version, copy_to_tile simply returns the original array reference to eliminate any copying overhead. The BlockSizeCalculator can fail if the index ranges are chosen too large. For example, the programmer could have written the clause tiled(src: {x: -size, size; y: -size, size}). This forces the tile size to be at least (size × 2)^2, which can be larger than the available GPU memory. Also, all huge arrays should be mentioned in the tiled clause, as otherwise they would not be tiled and would unnecessarily occupy scarce GPU memory.

5. PERFORMANCE
To test performance we use a simple matrix multiplication code as a micro-benchmark and a number of bigger numerical applications. The test machine is an eight-core Intel Xeon (2.5 GHz, x86-64, 16 GB main memory) with a GeForce GTX 280 with 1 GB DDR3 memory (PCI Express 2.0 x16). The software environment is Linux kernel 2.6.24 with CUDA 2.2, GCC 4.2.4, and the Java HotSpot JIT in version 1.6.0_13.

For all benchmarks we use the same set of runs. We perform a run with only one Java thread and a run with 7 Java threads (leaving one core free for the operating system, the JVM, etc.) with CUDA disabled. We then enable CUDA code generation, once with background compilation disabled (the CUDA row in the performance tables that follow) and once with it enabled (the Async Comp. row). Note that the times in the Async Comp. row do not reflect the total compilation times; they only give the time spent actively waiting for CUDA compilation to finish. In addition, the final two rows use the arrays-of-arrays package (APkg row) and the true multi-dimensional array package (APkgRect row). Both allocate the data on the GPU to reduce copying; the rectangular version wins due to the reduced number of memory accesses. For each of the performance tables, the Wall column is the total runtime from end to end with all overheads included. The next column (compile-time overhead) shows how long the CUDA compiler took to generate a shared library for the kernel(s). Management overhead is the time required to allocate (and release) CUDA memory, to serialize and map the data to the GPU's address space, and to copy the data to the graphics card. All times are in seconds.

5.1 Matrix Multiplication Micro-Benchmark
Here we present the measurements for a few versions of matrix multiplication at two matrix sizes (Table 1). Going from one to seven regular threads achieves a good speedup: 6.7 on 7 cores. With the larger matrix, the speedup decreases to 3.3 because the CPU caches are no longer effective with 7 threads. When generating CUDA code, performance increases dramatically. Note the overhead of a dynamic run of the CUDA compiler; this cost can be reduced slightly by compiling in the background (while other classes are being loaded by the class loader). Allocating the array data on the GPU helps (APkg and APkgRect), as the array package versions reduce the number of DMA transfers. The multi-dimensional array package wins by more than a factor of two on the large matrix size. A manually optimized native CUDA version, however, is able to gain still more performance.
This is mainly because data alignment is optimized carefully and shared memory is used as a cache. Our implementation would require low-level CUDA optimizations to reach this kind of performance.

Let us consider two huge matrices (750 MB and 1.14 GB in total). In Table 2 the tiled clause is used in the CUDA versions. The benchmark is run once with 1 GB and once with 500 MB of available GPU memory. The pure Java version with 7 threads performs poorly (the sequential version even fails to finish within our CPU quota on this machine). On a GPU with 1 GB of memory, the 1.14 GB tiled version needs 10 tiles. This version delivers a good speedup over the plain threaded version. If we artificially reduce the amount of available GPU memory to 500 MB, 23 tiles are needed.

5.2 Benchmarks
BlackScholes [9] is a model that predicts the evolution of an option price by solving a partial differential equation. A single parallel for loop calculates the option prices. The BlackScholes application uses a single 1D array of floats with 4,000,000 elements. We run 512 iterations; the results are shown in Table 3. The table shows that the threaded version with 7 threads wins over the naive CUDA implementation because the array is copied many times to and from the GPU. Again, the array packages, which copy the data to the GPU only once at program start, yield the best performance.

2D Convolution is a straightforward convolution: in complex number space an M × M kernel is applied to an N × N data set. The outer loop of size N × N is parallelized; in each iteration one value of the result set is calculated by applying the kernel to the input data set. We use a 4096 × 4096 matrix with a 16 × 16 kernel. Again, we see that the copy overhead to and from the GPU is large, causing the threaded version to outperform the naive CUDA version (Table 3, second column group). With an array package added, CUDA again shows a small speedup over the threaded version.

LBM [10] uses cellular automata to numerically simulate almost incompressible fluids at low Mach numbers. Space and time are discretized and normalized. The benchmark computes 50 time steps over the 3D grid and operates on a 3D domain divided into N × N × N cells that hold a finite number of states (D3Q19 model). In one time step, the complete set of states is updated synchronously by deterministic, uniform update rules. For parallelization, we use the parallel for construct for the time-stepping loop: the loop over the x-axis of the 3D grid is augmented with the for directive (default scheduling). For a 64×128×256×19 grid and 100 iterations, we get the results in Table 3, third column group. Because the working set is far bigger than the CPU caches, the speedup of the plain threaded version is not that high (3.2 on 7 cores). With CUDA enabled, performance plummets, as the overhead of serializing 4,194,304 small arrays and then copying the result to and from the GPU 100 times (once per iteration) dominates. Using either array package to keep the data on the GPU removes this overhead completely.

Mersenne Twister [11] has two parts: (1) a pseudo-random number generator and (2) a transformation of the random number sequence into a normal distribution. Both parts process 210,000,000 random numbers divided into 4,096 key rings; in both cases the loop over the key rings is parallelized. The results are shown in Table 4. For both parts, the threaded version achieves good speedups (5.5 and 6.2, respectively), whereas the CUDA versions are slow because of the high compilation and copying overheads.
When using an array package, these costs are largely gone and CUDA shows good performance. If 1.5 × 10^9 random numbers are generated (see Table 5), tiling is necessary on CUDA. In the first phase of the application, the compute times of CUDA and of 7 Java threads are nearly equal; due to the higher overheads, the tiled version does not achieve better performance. This is different in the second phase, in which the compute time of the tiled CUDA version is lower than that of the multi-threaded Java version.

Table 1: Matrix multiplication (all times in seconds).

                       2048 × 2048                  4096 × 4096
                   Wall  Compile  Mngmt.        Wall  Compile  Mngmt.
                         Overh.   Overh.              Overh.   Overh.
  Java, 1 thread  105.0     -       -          855.0     -       -
  Java, 7 threads  15.5     -       -          254.0     -       -
  CUDA              5.2    3.7     0.1          17.4    3.6     0.5
  Async Comp.       5.0    3.3     0.1          16.5    2.7     0.5
  APkg              2.9    1.3     0.0          14.6    1.3     0.0
  APkgRect          1.9    1.3     0.0           5.8    1.3     0.0
  NVIDIA-manual     0.2    0.0     0.2           0.8    0.0     0.8

Table 2: Large matrix multiplication (all times in seconds).

                            7500 × 7500                 10000 × 10000
                        Wall  Compile  Mngmt.        Wall   Compile  Mngmt.
                              Overh.   Overh.               Overh.   Overh.
  Java, 7 threads     1019.0     -       -         2776.5      -       -
  CUDA tiled, 1000 MB   91.1    3.6     6.4         759.7     3.8    21.0
  CUDA tiled, 500 MB   130.0    3.5    26.4         778.3     3.8    31.9

Table 3: BlackScholes, Convolution, and LBM measurements (all times in seconds).

                       BlackScholes             Convolution                 LBM
                  Wall  Compile  Mngmt.    Wall  Compile  Mngmt.    Wall  Compile  Mngmt.
                        Overh.   Overh.          Overh.   Overh.          Overh.   Overh.
  Java, 1 thread  608.0    -       -       43.2     -       -       53.9     -       -
  Java, 7 threads  87.0    -       -        6.7     -       -       16.7     -       -
  CUDA            111.4   3.8    102.4     20.8    3.9    16.8     334.3    3.6    312.8
  Async Comp.     113.8   3.1    107.5     17.1    0.0    15.5     347.0    0.0    329.4
  APkg              7.3   1.3      0.0      4.1    1.5     0.0      12.0    1.6      0.6
  APkgRect          7.5   1.3      0.0      4.0    1.5     0.0      10.8    1.6      0.6

Table 4: Mersenne Twister measurements, small inputs (all times in seconds).

                   Mersenne, Part 1, small     Mersenne, Part 2, small
                    Wall  Compile  Mngmt.       Wall  Compile  Mngmt.
                          Overh.   Overh.             Overh.   Overh.
  Java, 1 thread    27.6     -       -          45.3     -       -
  Java, 7 threads    5.0     -       -           7.3     -       -
  CUDA               6.7    2.5     2.6          4.1    1.4     2.6
  Async Comp.        5.4    2.0     2.4          2.4    0.0     2.4
  APkg               1.9    1.4     0.0          1.5    1.4     0.0
  APkgRect           2.0    1.4     0.0          1.5    1.4     0.0

Table 5: Mersenne Twister measurements, large input set (all times in seconds).

                   Mersenne, Part 1, 1.5 × 10^9   Mersenne, Part 2, 1.5 × 10^9
                    Wall  Compile  Mngmt.           Wall  Compile  Mngmt.
                          Overh.   Overh.                 Overh.   Overh.
  Java, 1 thread   195.4     -       -             327.8     -       -
  Java, 7 threads   34.7     -       -              53.2     -       -
  CUDA tiled        71.1    4.0    17.7             43.3    1.8    17.1

6. RELATED WORK
The current prototype does not generate CUDA code for the following OpenMP constructs: sections, single, master, critical, atomic, ordered, flush, and barrier. OpenMP/C [6] makes inroads into supporting these features. In [6], compile-time loop transformations such as parallel loop-swap are used to increase performance. In this paper, we focus on how to manage dynamic translation and on problems specific to Java, such as working around type rigidity with array packages. Their work is therefore orthogonal to ours.

There are numerous ways to use CUDA from Java. The simplest is to use Java's native interface to call CUDA functions written in C; see, for example, [12]. Another way is to hide native GPU computing in libraries, for example by creating a CUDA-BLAS binding [13] for Java. The libraries then interact with the graphics hardware via JNI calls. The disadvantage of such approaches is, of course, the relative inflexibility of the libraries. With our dynamic code generation, application-specific code can be written and executed at full native speed. Our approach adapts to whatever GPU hardware is available on-the-fly and even works (and achieves speedups on a multi-core host) if no GPU is available. In the absence of a GPU, CUDA code could still be generated if a driver were provided to mimic CUDA's functionality on the host processor.
One such (optimizing) driver is MCUDA [14]; the performance of such a solution would, however, depend on the efficiency of the underlying MCUDA.

The code of a computational kernel can also be assembled on-the-fly by building an abstract syntax tree (AST) manually in the program and compiling it on-the-fly. For example, libSh [15] does so using C++. The created AST can then be used to generate CUDA code or normal threaded code. Our approach differs in that we do not build ASTs but instead analyze and translate annotated bytecode at class-load time, before the kernel code runs. CuPP [16] creates data structures that can have different representations on the host and on the GPU. Our array packages do the same, and the CuPP approach could be used to partially implement our array packages. CUDA-lite [17] is a CUDA code optimizer: it rewrites loops and allocates data in the memory region most suitable for the accesses involved. Our CUDA class loader could use CUDA-lite for post-optimization. OpenCL [18] is a standard for data parallelism on GPUs and SPMD processors, e.g., the Cell [3]. It will be straightforward to extend our framework to also generate OpenCL code. Since OpenCL proposes a unified standard for parallel processing, it does not expose a specific vendor's hardware features, which might be the key to a code's performance. By directly targeting a vendor's lower-level API we avoid such abstraction layers. Our system uses tiling to allow large data structures, whereas others mostly use tiling for parallelization and/or locality optimizations [19]. HTA [20] is a library that encapsulates arrays; our array packages could conceptually use HTA for managing non-GPU-allocated arrays.

7. CONCLUSIONS AND FUTURE WORK
We observe that using the Java class loader to dynamically adapt a parallel loop to exploit available accelerators is feasible and leads to excellent application speedups. However, when naively generating CUDA code from Java/OpenMP, we have observed three obstacles. First, there is the significant overhead of invoking the CUDA compiler at run-time, which can only be amortized by large compute times; fast JIT-style compilation of CUDA kernels would help here. Second, there are high overheads for copying data to and from the GPU; the presented array packages, which encapsulate locality information, provide a solution. Third, GPU memory is small; we hide this limit with the tiling annotation.

At the moment our runtime can use either the CPU or one GPU. In the future it should be possible to share work between the CPU and the GPU and to use more than one GPU at the same time. Another avenue for future work is load-balancing of data and code across a cluster of workstations, each equipped with OpenCL/CUDA-capable hardware.

8. REFERENCES
[1] Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable Parallel Programming with CUDA. Queue 6(2) (2008) 40–53
[2] Klemm, M., Bezold, M., Veldema, R., Philippsen, M.: JaMP: An Implementation of OpenMP for a Java DSM. Concurrency and Computation: Practice and Experience 18(19) (2007) 2333–2352
[3] Scarpino, M.: Programming the Cell Processor: For Games, Graphics, and Computation. Prentice Hall PTR, Upper Saddle River, NJ (2008)
[4] Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P.: Brook for GPUs: Stream Computing on Graphics Hardware. In: SIGGRAPH '04, Los Angeles, CA (2004) 777–786
[5] Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Dubey, P., Junkins, S., Lake, A., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Abrash, M., Sugerman, J., Hanrahan, P.: Larrabee: A Many-Core x86 Architecture for Visual Computing. IEEE Micro 29(1) (2009) 10–21
[6] Lee, S., Min, S.J., Eigenmann, R.: OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization. In: Symp. on Principles and Practice of Parallel Programming, Raleigh, NC (2008) 101–110
[7] Lin, Y., Terboven, C., an Mey, D., Copty, N.: Automatic Scoping of Variables in Parallel Regions of an OpenMP Program. In Chapman, B.M., ed.: WOMPAT. Volume 3349 of Lecture Notes in Computer Science, Springer (2004) 83–97
[8] Midkiff, S., Moreira, J., Snir, M.: Java for Numerically Intensive Computing: From Flops to Gigaflops. In: Symp. on the Frontiers of Massively Parallel Computation, Annapolis, MD (1999) 251–261
[9] Black, F., Scholes, M.: The Pricing of Options and Corporate Liabilities. Journal of Political Economy 81(3) (1973) 637–654
[10] Wolf-Gladrow, D.: Lattice-Gas Cellular Automata and Lattice Boltzmann Models. Number 1725 in Lecture Notes in Mathematics. Springer (2000)
[11] Matsumoto, M., Nishimura, T.: Mersenne Twister: A 623-dimensionally Equidistributed Uniform Pseudo-random Number Generator. ACM Trans. Model. Comput. Simul. 8(1) (1998) 3–30
[12] JCuda. http://www.jcuda.org/
[13] Barrachina, S., Castillo, M., Igual, F., Mayo, R., Quintana-Orti, E.: Evaluation and Tuning of the Level 3 CUBLAS for Graphics Processors. In: Intl. Parallel and Distributed Processing Symp., Miami, FL (2008) 1–8
[14] Stratton, J., Stone, S., Hwu, W.M.W.: MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs. In: Languages and Compilers for Parallel Computing, Edmonton, Canada (2008) 16–30
[15] McCool, M., Du Toit, S.: Metaprogramming GPUs with Sh. AK Peters Ltd (2004)
[16] Breitbart, J.: CuPP – A Framework for Easy CUDA Integration. In: HIPS: High-Level Parallel Programming Models and Supportive Environments, Rome, Italy (2009) 1–8
[17] Ueng, S.Z., Lathara, M., Baghsorkhi, S., Hwu, W.M.W.: CUDA-Lite: Reducing GPU Programming Complexity. In: Languages and Compilers for Parallel Computing, Edmonton, Canada (2008) 1–15
[18] Khronos: OpenCL. http://www.khronos.org/opencl/
[19] Wolfe, M.: More Iteration Space Tiling. In: Proc. of the 1989 ACM/IEEE Conference on Supercomputing, Reno, NV (1989) 655–664
[20] Guo, J., Bikshandi, G., Fraguela, B.B., Garzaran, M.J., Padua, D.: Programming with Tiles. In: PPoPP '08: Proc. of the 13th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, Salt Lake City, UT (2008) 111–122