
3 Some Commonly Used CUDA API

3.1 Function Type Qualifiers

The three main function type qualifiers in CUDA are __device__, __global__, and __host__; a
small sketch of all three follows the list below.

1. __device__
Functions declared with the __device__ qualifier are executed on the device and are
callable from the device only.
2. __global__
Functions declared with the __global__ qualifier (kernels) are executed on the device but
are callable from the host only.
3. __host__
Functions declared with the __host__ qualifier are executed on the host and are callable
from the host only. When no qualifier is used, the function runs on the host; this is
equivalent to declaring it with the __host__ qualifier.
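
A minimal sketch of the three qualifiers is given below. The function names (addOne,
scaleKernel, runOnHost) are illustrative only and are not part of the CUDA API.

// __device__ function: runs on the device, callable only from device code
__device__ float addOne(float v) {
    return v + 1.0f;
}

// __global__ function (kernel): runs on the device, launched from the host
__global__ void scaleKernel(float* data) {
    int idx = threadIdx.x;
    data[idx] = addOne(data[idx]);
}

// __host__ function (or no qualifier): runs on the host, callable from the host
__host__ void runOnHost(float* ddata, int n) {
    scaleKernel<<<1, n>>>(ddata);    // kernel launch from host code
}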

3.2 Variable Type Qualifiers

The three main variable type qualifiers in CUDA are __device__, __constant__, and __shared__;
a small sketch of all three follows the list below.

1. __device__
Variables declared with __device__ reside on the device. Other type qualifiers may
optionally be used together with __device__. If a variable is declared with the
__device__ qualifier alone, it resides in global memory and has the lifetime of the
application. Because it resides in global memory, it is accessible from all threads
(within the grid) and from the host through the runtime library.
2. __constant__
This qualifier is used to allocate constants on the device. It may optionally be used
together with the __device__ qualifier. Such a constant resides in constant memory and
has the lifetime of the application. It is accessible from all threads (within the grid)
and from the host through the runtime library.
3. __shared__
This qualifier is used to allocate a shared variable. It may optionally be used together
with the __device__ qualifier. A shared variable resides in the shared memory of a thread
block and has the lifetime of that block. It is accessible only from threads within the
block.
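
The sketch below declares one variable of each kind and uses them in a kernel. The names
(gTable, cFactor, sTile, sumKernel) are illustrative only.

__device__ float gTable[256];      // global memory, lifetime of the application
__constant__ float cFactor;        // constant memory, set from the host with cudaMemcpyToSymbol

__global__ void sumKernel(float* out) {
    __shared__ float sTile[256];   // shared memory, lifetime of the thread block
    int idx = threadIdx.x;
    sTile[idx] = gTable[idx] * cFactor;
    __syncthreads();               // wait until every thread has written its element
    out[idx] = sTile[idx];
}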

3.3 Built-in Variables

The following is a list of some of the built-in variables in CUDA; a sketch of their typical
use follows the list:

1. gridDim: is of type dim3 and contains the dimensions of the grid.

2. blockIdx: is of type uint3 and contains the block index within the grid.

3. blockDim: is of type dim3 and contains the dimensions of the block.

4. threadIdx: is of type uint3 and contains the thread index within the block.

5. warpSize: is of type int and contains the warp size in threads.
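
As a sketch, these variables are usually combined to compute a unique global index for each
thread. The kernel name (fillIndex) is illustrative only.

__global__ void fillIndex(int* out) {
    // global 1D index of this thread within the whole grid
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = gid;
}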

3.4 Memory Management

1. Memory Allocation

float* darray;
cudaMalloc((void**)&darray, 1024 * sizeof(float));   // allocate space for 1024 floats on the device

2. Memory Deallocation

cudaFree(darray);   // release the device allocation
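
In practice the two calls are paired and their return values checked. A minimal sketch,
assuming an array of 1024 floats, is shown below.

float* darray = NULL;
cudaError_t err = cudaMalloc((void**)&darray, 1024 * sizeof(float));
if (err != cudaSuccess)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
/* ... launch kernels that use darray ... */
cudaFree(darray);   // always release the allocation when it is no longer needed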

3.5 Copying Host to Device

1. Copying host memory array to device memory:

cudaMemcpyToSymbol(const T& symbol, const void* src, size_t count)

Example:
float cpuArray[1024];
__device__ float dArray[1024];
cudaMemcpyToSymbol(dArray, cpuArray, sizeof(cpuArray));

2. Copying host memory array to device memory with cudaMemcpy:

Example:
float cpuArray[1024];
int size = sizeof(cpuArray);
float* dArray;
cudaMalloc((void**)&dArray, size);
cudaMemcpy(dArray, cpuArray, size, cudaMemcpyHostToDevice);

3. Copying host memory array to constant memory:

Example:
__constant__ float constArray[1024];
float cpuArray[1024];
cudaMemcpyToSymbol(constArray, cpuArray, sizeof(constArray));

3.6 Copying Device to Host

1. Copying device memory array to host memory:

cudaMemcpyFromSymbol(void* dst, const T& symbol, size_t count)

Example:
float cpuArray[1024];
__device__ float dArray[1024];
cudaMemcpyFromSymbol(cpuArray, dArray, sizeof(dArray));

2. Copying device memory array to host memory with cudaMemcpy:

Example:
float cpuArray[1024];
int size = sizeof(cpuArray);
float* dArray;
cudaMalloc((void**)&dArray, size);
cudaMemcpy(cpuArray, dArray, size, cudaMemcpyDeviceToHost);

3.7 Device Runtime Component


Device runtime components can only be used in device functions; many of them are prefixed
with a double underscore (__). The following is a short list of these functions; a usage
sketch follows the list:

1. Mathematical Functions:
(e.g. __sinf(x), __cosf(x), sqrt(x), etc.)

2. Synchronization Function:

void __syncthreads();

3. Atomic Functions:
(e.g. atomicAdd(), etc.)

4. Texture Functions:
(e.g. tex1Dfetch(), tex2D(), etc.)
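
The sketch below combines the synchronization function and an atomic function in one kernel.
The kernel name (blockSum) and the fixed tile size of 256 are illustrative assumptions, not
taken from the text above.

// Sums one tile per block in shared memory, then accumulates the block totals atomically.
// Assumes the kernel is launched with one-dimensional blocks of at most 256 threads.
__global__ void blockSum(const int* in, int* total) {
    __shared__ int tile[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[idx];
    __syncthreads();                       // wait until the whole tile is loaded

    if (threadIdx.x == 0) {                // one thread per block sums its tile
        int s = 0;
        for (int k = 0; k < blockDim.x; k++)
            s += tile[k];
        atomicAdd(total, s);               // safe concurrent update of the global sum
    }
}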

3.8 Device Emulation Mode

A device emulation mode is provided mainly for debugging purposes. The -deviceemu option is
used with the nvcc compile command. It only emulates the device; it is not a simulation.
Threads and thread blocks are created on the host. The host's native debugging tools (such as
Microsoft Visual Studio's) can be used to set breakpoints and inspect data. It is especially
helpful for input and output operations to files or to the screen, such as calls to printf(),
which cannot run on the device.
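
As a sketch, assuming the source file is named example.cu, the compile command would look
like the following:

nvcc -deviceemu example.cu -o example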

3.9 An Example
3.9.1 Sequential Code
A sequential program to calculate the distances from a specific point to all other points in
a 2D matrix of order N × N is given below:

#include <stdio.h>
#include <math.h>

const int N = 16;

int main(void) {
    int i, j, x, y;
    float n;                  // squared distance
    float hgrid[N][N];

    printf("\n\tEnter the x coordinate of node : "); scanf_s("%d", &x);
    printf("\n\tEnter the y coordinate of node : "); scanf_s("%d", &y);

    // Code to find the distance without using the device
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            n = ((i - x) * (i - x)) + ((j - y) * (j - y));   // squared distance
            hgrid[i][j] = sqrt(n);                           // distance formula
            printf("\t%.0lf", hgrid[i][j]);
        }
        printf("\n\n");
    }
    return 0;
}

3.9.2 Parallel Code – 1D Grid

Now the same program is converted to parallel code that runs on the device. A one-dimensional
grid with only one thread block is used. The thread block contains 16 × 16 threads (hence 256
threads in total) arranged in a two-dimensional form.

#include <stdio.h>

const int N = 16;

__device__ float dgrid[N][N];             // array in device (global) memory

// kernel executed on the device to calculate the distance
__global__ void findDistance(int x, int y) {
    int i = threadIdx.x;
    int j = threadIdx.y;

    float n = ((i - x) * (i - x)) + ((j - y) * (j - y));
    dgrid[i][j] = sqrt(n);
}

int main(void) {
    int i, j;
    float hgrid[N][N];

    dim3 dBlock(N, N);                    // thread block with 256 threads in total

    printf("\n\tEnter the x coordinate of node : "); scanf_s("%d", &i);
    printf("\n\tEnter the y coordinate of node : "); scanf_s("%d", &j);
    printf("\n\tDistance from a node!\n\n\n");

    findDistance<<<1, dBlock>>>(i, j);    // launch the kernel with a single block

    cudaMemcpyFromSymbol(&hgrid, dgrid, sizeof(dgrid));   // copy device memory to host

    printf("\n\n\tValues in hgrid!\n\n");

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++)
            printf("\t%.0lf", hgrid[i][j]);
        printf("\n\n");
    }
    return 0;
}
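
As a sketch of a common pattern (not part of the original listing), the kernel launch and the
copy back to the host can be checked for errors before printing, assuming the same kernel and
arrays as above:

findDistance<<<1, dBlock>>>(i, j);
cudaError_t err = cudaGetLastError();      // reports launch or configuration errors
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));

err = cudaMemcpyFromSymbol(&hgrid, dgrid, sizeof(dgrid));
if (err != cudaSuccess)
    printf("Copy back to host failed: %s\n", cudaGetErrorString(err));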

3.9.3 Parallel Code – 2D Grid (2 * 2)

Now the same program is converted to parallel code that runs on the device with a
two-dimensional grid (2 thread blocks in the x dimension and 2 in the y dimension). Each
thread block contains 16 × 16 threads (hence 256 threads in total) arranged in a
two-dimensional form. Hence a total of 1024 threads will run in parallel on the device.

#include <stdio.h>

const int N = 16;
const int D = 2;

__device__ float dgrid[N * D][N * D];     // array in device (global) memory

// kernel executed on the device to calculate the distance
__global__ void findDistance(int x, int y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    float n = ((i - x) * (i - x)) + ((j - y) * (j - y));
    dgrid[i][j] = sqrt(n);
}

int main(void) {
    int i, j;
    float hgrid[N * D][N * D];

    dim3 dGrid(D, D);                     // 2D grid with 4 thread blocks in total
    dim3 dBlock(N, N);                    // thread block with 256 threads in total

    printf("\n\tEnter the x coordinate of node : "); scanf_s("%d", &i);
    printf("\n\tEnter the y coordinate of node : "); scanf_s("%d", &j);
    printf("\n\tDistance from a node!\n\n\n");

    findDistance<<<dGrid, dBlock>>>(i, j);   // launch the kernel on a 2 x 2 grid of blocks

    cudaMemcpyFromSymbol(&hgrid, dgrid, sizeof(dgrid));   // copy device memory to host

    printf("\n\n\tValues in hgrid!\n\n");

    for (i = 0; i < N * D; i++) {
        for (j = 0; j < N * D; j++)
            printf("\t%.0lf", hgrid[i][j]);
        printf("\n\n");
    }
    return 0;
}

3.9.4 Parallel Code – 2D Grid (4 * 4)

The same program is converted to parallel code that runs on the device with a two-dimensional
grid (4 thread blocks in the x dimension and 4 in the y dimension). Each thread block
contains 8 × 8 threads (hence 64 threads in total) arranged in a two-dimensional form. Hence
a total of 1024 threads will run in parallel on the device.

#include <stdio.h>

const int N = 8;
const int D = 4;

__device__ float dgrid[N * D][N * D];     // array in device (global) memory

// kernel executed on the device to calculate the distance
__global__ void findDistance(int x, int y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    float n = ((i - x) * (i - x)) + ((j - y) * (j - y));
    dgrid[i][j] = sqrt(n);
}

int main(void) {
    int i, j;
    float hgrid[N * D][N * D];

    dim3 dGrid(D, D);                     // 2D grid with 16 thread blocks in total
    dim3 dBlock(N, N);                    // thread block with 64 threads in total

    printf("\n\tEnter the x coordinate of node : "); scanf_s("%d", &i);
    printf("\n\tEnter the y coordinate of node : "); scanf_s("%d", &j);
    printf("\n\tDistance from a node!\n\n\n");

    findDistance<<<dGrid, dBlock>>>(i, j);   // launch the kernel on a 4 x 4 grid of blocks

    cudaMemcpyFromSymbol(&hgrid, dgrid, sizeof(dgrid));   // copy device memory to host

    printf("\n\n\tValues in hgrid!\n\n");

    for (i = 0; i < N * D; i++) {
        for (j = 0; j < N * D; j++)
            printf("\t%.0lf", hgrid[i][j]);
        printf("\n\n");
    }
    return 0;
}
