Automatically Converting C/C++ to OpenCL/CUDA: Introduction by David Williams
C++ to OpenCL/CUDA
Introduction by David Williams
Overview
• This presentation provides an introduction to autoparallelisation, focusing on our GPSME toolkit.
• We will cover:
◦ What autoparallelisation is and why we want it.
◦ How the autoparallelisation process is performed.
◦ An introduction to using our toolkit.
◦ Benchmarking the toolkit and performance considerations.
◦ A demonstration of using the toolkit and frontend.
• The toolkit is available.
Who are we?
• The GPSME project is a collaboration between industry and academia.
◦ Multiple partners across Europe.
◦ All with different problems to solve.
• Our research project aims to make GPU computing more accessible.
◦ Reduce need for expert knowledge.
◦ Eliminate need for specialised languages.
◦ Avoid rewriting existing code.
Using the GPU
[Diagram: GPU programming approaches ordered from high flexibility/performance to high ease of use/speed of development – Assembly, OpenCL/CUDA, Libraries, Autoparallelisation, Drag-and-drop.]
Why autoparallelisation?
• Automatically converting C/C++ to OpenCL/CUDA has a number of advantages:
◦ Single codebase – simplifies the process of targeting machines both with and without GPUs.
◦ Reuse existing code.
◦ Target a wide range of hardware.
◦ Achieve independence from specific backend technologies.
◦ Avoid lengthy boilerplate code.
How autoparallelisation works
• At its heart, the GPSME toolkit converts C/C++ code into OpenCL/CUDA by following compiler #pragmas:
◦ Transfer the required data to the GPU.
◦ Copy the body of a loop into an OpenCL/CUDA program.
◦ Execute the program on each core simultaneously.
• This is built on a framework called ROSE, by extending a tool called Mint.
◦ See www.rosecompiler.org for more information.
How autoparallelisation works
A simple example
• Keep in mind that the GPU has two key architectural differences compared to the CPU:
◦ Multiple cores operating in parallel.
◦ Separate memory space.
A simple example
• The code below performs a simple low-pass filter (blur) from a source to a destination.
for (y = 1; y < imageHeight-1; y++)
{
for (x = 1; x < imageWidth-1; x++)
{
float sum = 0.0f;
for(offsetY = -1; offsetY <= 1; offsetY++)
{
for(offsetX = -1; offsetX <= 1; offsetX++)
{
int finalX = x + offsetX;
int finalY = y + offsetY;
sum += srcImage[finalY * imageWidth + finalX];
}
}
dstImage[y * imageWidth + x] = sum / 9.0f;
}
}
A simple example
• We can augment this with GPSME directives:
#pragma GPSME copy( srcImage, toDevice, imageWidth, imageHeight)
#pragma GPSME copy( dstImage, toDevice, imageWidth, imageHeight)
#pragma GPSME parallel
{
#pragma GPSME for nest(2) tile ( 16, 16 )
for (y = 1; y < imageHeight-1; y++)
{
for (x = 1; x < imageWidth-1; x++)
{
float sum = 0.0f;
for(offsetY = -1; offsetY <= 1; offsetY++)
{
for(offsetX = -1; offsetX <= 1; offsetX++)
{
//Removed code for brevity
}
}
dstImage[y * imageWidth + x] = sum / 9.0f;
}
}
}
#pragma GPSME copy( srcImage, fromDevice, imageWidth, imageHeight)
#pragma GPSME copy( dstImage, fromDevice, imageWidth, imageHeight)
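For intuition, the transformation described above roughly corresponds to the hand-written CUDA kernel below, in which each GPU thread computes one output pixel. This is an illustrative sketch, not the toolkit's actual generated code; the kernel name and launch configuration are assumptions:

// Illustrative CUDA kernel: the body of the annotated loop nest, one thread per (x, y).
__global__ void blurKernel(const float* srcImage, float* dstImage,
                           int imageWidth, int imageHeight)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= imageWidth - 1 || y < 1 || y >= imageHeight - 1)
        return; // Matches the original loop bounds (the one-pixel border is skipped).

    float sum = 0.0f;
    for (int offsetY = -1; offsetY <= 1; offsetY++)
        for (int offsetX = -1; offsetX <= 1; offsetX++)
            sum += srcImage[(y + offsetY) * imageWidth + (x + offsetX)];
    dstImage[y * imageWidth + x] = sum / 9.0f;
}

// Host side, corresponding to tile(16,16): one 16x16 thread block per tile.
// dim3 block(16, 16);
// dim3 grid((imageWidth + 15) / 16, (imageHeight + 15) / 16);
// blurKernel<<<grid, block>>>(d_srcImage, d_dstImage, imageWidth, imageHeight);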
A simple example
• The translator is a command-line tool which runs under Linux:
gpsme inputFile.cpp [options]
• Keep platform-specific code (such as Windows calls) out of the file that is passed to the translator; isolate the code you wish to parallelise in its own file:
//In main.cpp
#include <windows.h>
#include "parallelisable.h"
.
.
someWindowsFunction();
.
.
//Now call the parallelised function in parallelisable.cpp
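A minimal sketch of how the isolated file might look, following the pattern above (the function name runFilter and its signature are hypothetical):

// In parallelisable.h (hypothetical)
void runFilter(float* srcImage, float* dstImage, int imageWidth, int imageHeight);

// In parallelisable.cpp (hypothetical) – the file passed to the gpsme translator.
// It contains only portable C/C++ plus GPSME pragmas, with no Windows headers.
void runFilter(float* srcImage, float* dstImage, int imageWidth, int imageHeight)
{
    #pragma GPSME copy(srcImage, toDevice, imageWidth, imageHeight)
    #pragma GPSME parallel
    {
        #pragma GPSME for nest(2) tile(16,16)
        for (int y = 1; y < imageHeight - 1; y++)
        {
            for (int x = 1; x < imageWidth - 1; x++)
            {
                // Filter body as shown earlier.
            }
        }
    }
    #pragma GPSME copy(dstImage, fromDevice, imageWidth, imageHeight)
}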
Use of External Libraries
• A more difficult scenario is an external library call inside the loop to be parallelised:
#pragma GPSME for nest(2) tile(16,16)
for(int x = 0; x < 128; x++)
{
for(int y = 0; y < 128; y++)
{
.
.
// External function call
cvSomeFunction();
.
.
}
}
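The translated kernel cannot call into a host-side library, so the call usually has to be removed from the loop. One possible restructuring, assuming the library call does not depend on results computed inside the loop, is to precompute its output on the CPU first (the array name and the arguments passed to cvSomeFunction are made up for illustration):

// Hypothetical workaround: evaluate the external library call on the CPU up front,
// then parallelise a loop that only reads the precomputed results.
static float precomputed[128 * 128];
for (int x = 0; x < 128; x++)
    for (int y = 0; y < 128; y++)
        precomputed[x * 128 + y] = cvSomeFunction(x, y); // illustrative arguments

#pragma GPSME copy(precomputed, toDevice, 128, 128)
#pragma GPSME parallel
{
    #pragma GPSME for nest(2) tile(16,16)
    for (int x = 0; x < 128; x++)
    {
        for (int y = 0; y < 128; y++)
        {
            // Loop body now uses precomputed[x * 128 + y] instead of calling the library.
        }
    }
}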
Use of External Libraries
• When working through the webserver:
◦ Make sure the required dependencies are installed.
◦ Upload all project-specific headers which are needed.
#include "OpenCV.h"
#include "VTK.h"
.
.
#include "MyHeader1.h" // Upload this one
#include "MyHeader2.h" // Upload this one
.
.
int main(int argc, char** argv)
{
//Some code here
}
Now let’s see how this works on some harder problems…
Polybench benchmark suite
• Collection of micro-benchmarks
• Originally developed for the CPU
• CUDA/OpenCL versions were developed recently
• Linear Algebra:
◦ 2MM - 2 Matrix Multiplications (D=A*B; E=C*D)
◦ 3MM - 3 Matrix Multiplications (E=A*B; F=C*D; G=E*F)
◦ ATAX - Matrix Transpose and Vector Multiplication
◦ BICG - BiCG Sub Kernel of BiCGStab Linear Solver
◦ GEMM - Matrix Multiply (C = alpha*A*B + beta*C)
◦ GESUMMV - Scalar, Vector and Matrix Multiplication
◦ GRAMSCHMIDT - Gram-Schmidt decomposition
◦ MVT - Matrix Vector Product and Transpose
◦ SYR2K - Symmetric rank-2k operations
◦ SYRK - Symmetric rank-k operations
• Datamining:
◦ CORRELATION - Correlation Computation
◦ COVARIANCE - Covariance Computation
• Stencils:
◦ FDTD-2D - 2-D Finite Difference Time Domain Kernel
Open standards
• OpenMP
◦ Open standard for directive-based multi-core programming
◦ Most compilers support it by now
◦ Easy to harness shared-memory multi-core parallelism
• OpenACC
◦ Open standard for directive-based GPU computing
◦ Announced at SC11 [November 2011]
◦ CAPS, Cray, and PGI are currently providing OpenACC compilers
◦ Version 2.0 is to be released soon…
Polybench initial results
• Most tests benefit from speed-ups compared to the OpenMP version.
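For context, the OpenMP versions used as the baseline typically parallelise the same loop nest across CPU cores. A minimal sketch for the GEMM kernel shown on the next slides (an illustration, not the exact Polybench source):

// Illustrative OpenMP baseline for GEMM: parallelise the outer loop across CPU threads.
#pragma omp parallel for private(j, k)
for (i = 0; i < NI; i++) {
  for (j = 0; j < NJ; j++) {
    C[i*NJ + j] = 0.0;
    for (k = 0; k < NK; ++k) {
      C[i*NJ + j] += A[i*NK + k] * B[k*NJ + j];
    }
  }
}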
Example – GEMM OpenACC
#pragma acc data copyin(A[NI*NJ],B[NI*NJ]) copyout(C[NI*NJ]){
#pragma acc kernels loop independent vector(32)
for (i = 0; i < NI; i++) {
#pragma acc loop independent vector(32)
for (j = 0; j < NJ; j++) {
C[i*NJ + j] = 0.0;
for (k = 0; k < NK; ++k) {
C[i*NJ + j] += A[i*NK + k] * B[k*NJ + j];
}
}
}
}
Example – GEMM GPSME
#pragma GPSME copy(A,toDevice, NI, NJ)
#pragma GPSME copy(B,toDevice, NI, NJ)
#pragma GPSME parallel {
#pragma GPSME for nest(2) tile(32,32)
for (i = 0; i < NI; i++) {
for (j = 0; j < NJ; j++) {
C[i*NJ + j] = 0.0;
for (k = 0; k < NK; ++k) {
C[i*NJ + j] += A[i*NK + k] * B[k*NJ + j];
}
}
}
}
#pragma GPSME copy(C, fromDevice, NI,NJ)
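The tile(32,32) clause governs how iterations are grouped onto GPU threads: roughly, each 32x32 tile of the (i, j) iteration space becomes one thread block. A rough CUDA launch configuration for that mapping (illustrative only; gemmKernel and the device pointer names are hypothetical):

// One 32x32 thread block per tile, with enough blocks to cover the NI x NJ iteration space.
dim3 block(32, 32);
dim3 grid((NJ + 31) / 32, (NI + 31) / 32);
gemmKernel<<<grid, block>>>(d_A, d_B, d_C, NI, NJ, NK);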
Example – GRAMSCHMIDT
#pragma GPSME copy(A,toDevice, N, M)
#pragma GPSME copy(R,toDevice, N, M)
#pragma GPSME copy(Q,toDevice, N, M)
#pragma GPSME parallel{
#pragma GPSME for nest(1) tile(128)
for (k = 0; k < N; k++) {
nrm = 0; // Note: the reduction limits 2nd-level parallelization
for (i = 0; i < M; i++) {
nrm += A[i*N + k] * A[i*N + k];
}
R[k*N + k] = sqrt(nrm);
for (i = 0; i < M; i++) {
Q[i*N + k] = A[i*N + k] / R[k*N + k];
}
for (j = k + 1; j < N; j++) {
R[k*N + j] = 0;
for (i = 0; i < M; i++) {
R[k*N + j] += Q[i*N + k] * A[i*N + j];
}
for (i = 0; i < M; i++) {
A[i*N + j] = A[i*N + j] - Q[i*N + k] * R[k*N + j];
}
}
}
}
#pragma GPSME copy(A,fromDevice, N, M)
Example – GRAMSCHMIDT
• Restructured version: the reduction loop stays on the CPU; only the triangular loop nest is offloaded.
for (k = 0; k < N; k++) {
nrm = 0;
for (i = 0; i < M; i++) {
nrm += A[i*N + k] * A[i*N + k];
}
R[k*N + k] = sqrt(nrm);
for (i = 0; i < M; i++) {
Q[i*N + k] = A[i*N + k] / R[k*N + k];
}
}
#pragma GPSME copy(A,toDevice, N, M)
#pragma GPSME copy(R,toDevice, N, M)
#pragma GPSME copy(Q,toDevice, N, M)
#pragma GPSME parallel{
#pragma GPSME for nest(2) tile(16,16)
for (k = 0; k < N; k++) { // Note: the triangular loop limits 2nd-level parallelization
for (j = k + 1; j < N; j++) {
R[k*N + j] = 0;
for (i = 0; i < M; i++) {
R[k*N + j] += Q[i*N + k] * A[i*N + j];
}
for (i = 0; i < M; i++) {
A[i*N + j] = A[i*N + j] - Q[i*N + k] * R[k*N + j];
}
}
}
}
Triangular loop support
• Thread blocks can be:
◦ Full: all threads are part of the iteration space. Resources are not wasted.
◦ Empty: no thread is part of the iteration space. Resources are not wasted.
◦ Half-full: some threads execute and some do not, which creates divergent branch behavior (see the sketch below).
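A sketch of where the divergence comes from, assuming the generated kernel simply guards each thread with the triangular bound (illustrative, not the toolkit's actual output):

// Guard for the triangular (k, j) iteration space with j > k.
// Blocks entirely inside the region are full, blocks entirely outside are empty,
// and blocks straddling the j == k diagonal are half-full and diverge on this branch.
int k = blockIdx.y * blockDim.y + threadIdx.y;
int j = blockIdx.x * blockDim.x + threadIdx.x;
if (k < N && j < N && j > k)
{
    // Body of the triangular loop for this (k, j) pair.
}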
Polybench benchmark suite
[Results chart omitted; * 1024x1024 image, ** 12288x12288 image]
OpenACC vs. GPSME
• OpenACC advantages:
◦ It’s an open standard implemented by compiler vendors.
◦ Flexibility: synchronisation, memory and device management, caching.
◦ Ease of use (integrated into Visual Studio).
• GPSME advantages:
◦ Simplicity.
◦ Generates cleaner output code (CUDA as well as OpenCL).
◦ Doesn’t incur performance penalties for the above advantages.
◦ Full access to source code makes it easily extendable.
Conclusions
• The GPSME toolkit can deliver large performance gains for some classes of problems.
• It performs better than, or on par with, the PGI OpenACC compiler on Polybench.
• For real-world code, some revision is usually needed:
◦ Isolate the code you wish to parallelise.
◦ Try to eliminate library and loop dependencies.
◦ Consider memory transfers, especially inside loops.
◦ Use SoA (structure of arrays) instead of AoS (array of structures) – see the sketch below.
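As a quick illustration of the last point (the struct and field names are made up for the example):

// Array of Structures (AoS): fields of one element are adjacent, so GPU threads
// reading the same field across elements generate strided, poorly coalesced accesses.
struct ParticleAoS { float x, y, z, mass; };
ParticleAoS particlesAoS[1024];

// Structure of Arrays (SoA): each field is stored contiguously, so consecutive threads
// read consecutive addresses and memory accesses coalesce.
struct ParticlesSoA {
    float x[1024];
    float y[1024];
    float z[1024];
    float mass[1024];
};
ParticlesSoA particlesSoA;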