3 Some Commonly Used CUDA API: 3.1 Function Type Qualifiers
The three main function type qualifiers in CUDA are __device__, __global__, and __host__.
1. __device__
Functions with the __device__ qualifier are executed on the device and are callable
from the device only.
2. __global__
Functions with the __global__ qualifier (kernels) are executed on the device but are
callable from the host only.
3. __host__
Functions with the __host__ qualifier are executed on the host and are callable from
the host only. When no qualifier is used, the function runs on the host; this is
equivalent to declaring it with the __host__ qualifier.
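As a minimal sketch of how the three qualifiers combine (the function names square, squareAll, and launch are made up for illustration):

```cuda
// __device__: runs on the device, callable from device code only.
__device__ float square(float x) {
    return x * x;
}

// __global__: a kernel; runs on the device, launched from the host.
__global__ void squareAll(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);   // device function called from device code
}

// __host__: runs on the host (the same as writing no qualifier at all).
__host__ void launch(float* ddata, int n) {
    squareAll<<<(n + 255) / 256, 256>>>(ddata, n);
}
```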
3.2 Variable Type Qualifiers
The three main variable type qualifiers in CUDA are __device__, __constant__, and __shared__.
1. __device__
Variables declared with __device__ reside on the device. Other type qualifiers may
optionally be used together with __device__. If a variable is declared with the
__device__ qualifier alone, it resides in global memory and has the lifetime of the
application. Since it resides in global memory, it is accessible from all threads
within the grid, and from the host through the runtime library.
2. __constant__
This qualifier is used to declare constants on the device. It may optionally be used
together with the __device__ qualifier. Such a constant resides in constant memory and
has the lifetime of the application. It is accessible from all threads within the grid,
and from the host through the runtime library.
3. __shared__
This qualifier is used to declare a shared variable. It may optionally be used together
with the __device__ qualifier. A shared variable resides in the shared memory of a
thread block and has the lifetime of that block. It is accessible only from the threads
within the block.
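A sketch showing all three variable qualifiers in one kernel (the names dScale, dOffset, transform, and the assumption of a 256-thread block are made up for illustration):

```cuda
__device__   float dScale  = 2.0f;   // global memory, lifetime of the application
__constant__ float dOffset = 1.0f;   // constant memory, lifetime of the application

__global__ void transform(float* data) {
    __shared__ float tile[256];      // shared memory, lifetime of the thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes grid covers the data
    tile[threadIdx.x] = data[i] * dScale + dOffset;
    __syncthreads();                 // make the tile visible to the whole block
    data[i] = tile[threadIdx.x];
}
```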
3.3 Built-in Variables
1. gridDim: is of type dim3 and contains the dimensions of the grid.
2. blockIdx: is of type uint3 and contains the block index within the grid.
3. blockDim: is of type dim3 and contains the dimensions of the block.
4. threadIdx: is of type uint3 and contains the thread index within the block.
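A common use of these built-ins is computing a thread's global position in the grid; a small sketch (the kernel name globalIndex is made up):

```cuda
__global__ void globalIndex(int* out) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x position within the grid
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y position within the grid
    int width = gridDim.x * blockDim.x;               // total number of threads in x
    out[row * width + col] = row * width + col;       // one unique index per thread
}
```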
3.4 Memory Management
1. Memory Allocation
float* darray;
cudaMalloc((void**)&darray, 1024 * sizeof(float));
2. Memory Deallocation
cudaFree(darray);
2. Another method
Example:
float cpuArray[1024];
int size = sizeof(cpuArray);
float* dArray;
cudaMalloc((void**)&dArray, size);
cudaMemcpy(dArray, cpuArray, size, cudaMemcpyHostToDevice);
Example:
__constant__ float constArray[1024];
float cpuArray[1024];
cudaMemcpyToSymbol(constArray, cpuArray, sizeof(constArray));
2. Another method
Example:
float cpuArray[1024];
int size = sizeof(cpuArray);
float* dArray;
cudaMalloc((void**)&dArray, size);
cudaMemcpy(cpuArray, dArray, size, cudaMemcpyDeviceToHost);
1. Mathematical Functions:
(e.g., __sinf(x), __cosf(x), sqrt(x), etc.)
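A sketch of their use inside a kernel (the kernel name wave is made up); the double-underscore versions such as __sinf() are faster but less accurate intrinsics, while sinf() and sqrtf() are the standard single-precision versions:

```cuda
__global__ void wave(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(i * 0.01f) + sqrtf((float)i);  // fast sine + exact sqrt
}
```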
2. Synchronization Function:
void __syncthreads();
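__syncthreads() is a barrier: every thread in the block must reach it before any thread continues. A sketch of why this matters (the kernel name reverseBlock and the fixed 256-thread block are assumptions):

```cuda
__global__ void reverseBlock(float* data) {
    __shared__ float tile[256];       // assumes blockDim.x == 256
    int i = threadIdx.x;
    tile[i] = data[i];                // phase 1: every thread fills one slot
    __syncthreads();                  // wait until the whole tile is written
    data[i] = tile[255 - i];          // phase 2: safely read another thread's slot
}
```

Without the barrier, a thread could read a tile element before the owning thread had written it.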
3. Atomic Functions:
(e.g., atomicAdd(), etc.)
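Atomic functions perform a read-modify-write that cannot be interleaved with other threads, so concurrent updates to the same address are not lost. A sketch (the kernel name countPositives is made up):

```cuda
__global__ void countPositives(const float* data, int n, int* counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)
        atomicAdd(counter, 1);   // safe even when many threads hit this line at once
}
```

A plain `(*counter)++` here would lose counts, because many threads would read the same old value before any of them wrote back.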
4. Texture Functions:
A device emulation mode is provided, primarily for debugging purposes. The -deviceemu
option is used with the nvcc compile command. It only emulates the device; it is not a
simulation. Threads and thread blocks are created on the host, so the host's native
debugging support (such as Microsoft Visual Studio's) can be used to set breakpoints
and inspect data. It is especially helpful for input or output operations to files or
to the screen, such as calls to printf(), which cannot run on the device.
3.9 An Example
3.9.1 Sequential Code
A sequential program to calculate the distances from a specific point to all other
points in a 2D matrix of order N × N is given below:
3.9.2 Parallel Code – 1D Grid
Now the same program is converted to parallel code to run on the device. A one-
dimensional grid with only one thread block is used. The thread block contains 16 * 16
threads (hence 256 threads in total) in a two-dimensional arrangement.
float n = ((i-x)*(i-x))+((j-y)*(j-y));
dgrid[i][j] = sqrt(n);
}
void main () {
int i, j;
float hgrid[N][N];
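Since the listing above is only a fragment, here is one possible complete version under the stated configuration; the reference point (0, 0), the array-of-rows parameter style, and N = 16 are assumptions, not the author's exact code:

```cuda
#define N 16

// One block of 16 x 16 threads; each thread handles one matrix element,
// indexed directly by its threadIdx coordinates.
__global__ void distance(float dgrid[N][N], int x, int y) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    float n = ((i - x) * (i - x)) + ((j - y) * (j - y));
    dgrid[i][j] = sqrtf(n);
}

int main() {
    float hgrid[N][N];
    float (*dgrid)[N];
    cudaMalloc((void**)&dgrid, sizeof(hgrid));

    dim3 block(N, N);                      // 256 threads in a 2D arrangement
    distance<<<1, block>>>(dgrid, 0, 0);   // distances from point (0, 0)

    cudaMemcpy(hgrid, dgrid, sizeof(hgrid), cudaMemcpyDeviceToHost);
    cudaFree(dgrid);
    return 0;
}
```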
3.9.3 Parallel Code – 2D Grid (2 * 2)
Now the same program is converted to parallel code to run on the device with a two-
dimensional grid (2 thread blocks in the x dimension and 2 in the y dimension). Each
thread block contains 16 * 16 threads (hence 256 threads in total) in a two-dimensional
arrangement. Hence a total of 1024 threads will run in parallel on the device.
float n = ((i-x)*(i-x))+((j-y)*(j-y));
dgrid[i][j] = sqrt(n);
}
void main () {
int i, j;
float hgrid[N*D][N*D];
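A hedged sketch of the full 2 * 2 grid version (N = 16, D = 2, and the reference point (0, 0) are assumptions): each thread now combines its block index and thread index to locate its element of the (N*D) x (N*D) matrix.

```cuda
#define N 16
#define D 2

__global__ void distance(float dgrid[N*D][N*D], int x, int y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global row index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // global column index
    float n = ((i - x) * (i - x)) + ((j - y) * (j - y));
    dgrid[i][j] = sqrtf(n);
}

int main() {
    float (*dgrid)[N*D];
    cudaMalloc((void**)&dgrid, N * D * N * D * sizeof(float));

    dim3 grid(D, D), block(N, N);           // 4 blocks * 256 threads = 1024
    distance<<<grid, block>>>(dgrid, 0, 0);

    cudaFree(dgrid);
    return 0;
}
```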
3.9.4 Parallel Code – 2D Grid (4 * 4)
The same program is converted to parallel code to run on the device with a two-
dimensional grid (4 thread blocks in the x dimension and 4 in the y dimension). Each
thread block contains 8 * 8 threads (hence 64 threads in total) in a two-dimensional
arrangement. Hence a total of 1024 threads will run in parallel on the device.
float n = ((i-x)*(i-x))+((j-y)*(j-y));
dgrid[i][j] = sqrt(n);
}
void main () {
int i, j;
float hgrid[N*D][N*D];
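Relative to the 2 * 2 version, only the launch configuration changes; a sketch of the launch parameters (the kernel name distance follows the earlier fragments):

```cuda
// 16 blocks of 64 threads still cover the same 32 x 32 matrix
// with 1024 threads in total.
dim3 grid(4, 4);     // 4 blocks in x, 4 blocks in y
dim3 block(8, 8);    // 8 * 8 = 64 threads per block
// distance<<<grid, block>>>(dgrid, x, y);
```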