3 Some Commonly Used CUDA API: 3.1 Function Type Qualifiers
The three main function type qualifiers in CUDA are __device__, __global__, and __host__.
1. __device__
Functions with the __device__ qualifier are executed on the device and are callable
from the device only.
2. __global__
Functions with the __global__ qualifier (kernels) are executed on the device but are
callable from the host only.
3. __host__
Functions with the __host__ qualifier are executed on the host and are callable from
the host only. When no qualifier is used, the function runs on the host; this is
equivalent to declaring it with the __host__ qualifier.
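As a minimal sketch of how the three qualifiers combine (the function names square, squareAll, and launch are made up for illustration):

```cuda
// __device__: runs on the device, callable from device code only.
__device__ float square(float x) {
    return x * x;
}

// __global__: a kernel; runs on the device, launched from the host.
__global__ void squareAll(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);   // device function called from device code
}

// __host__: runs on the host (the same as writing no qualifier at all).
__host__ void launch(float* ddata, int n) {
    squareAll<<<(n + 255) / 256, 256>>>(ddata, n);
}
```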
3.2 Variable Type Qualifiers
The three main variable type qualifiers in CUDA are __device__, __constant__, and __shared__.
1. __device__
Variables declared with __device__ reside on the device. Other type qualifiers may
optionally be used together with __device__. If a variable is declared with the
__device__ qualifier alone, it resides in global memory and has the lifetime of the
application. Since it resides in global memory, it is accessible from all threads
within the grid, and from the host through the runtime library.
2. __constant__
This qualifier is used to declare constants on the device. It may optionally be used
together with the __device__ qualifier. Such a constant resides in constant memory and
has the lifetime of the application. It is accessible from all threads within the grid,
and from the host through the runtime library.
3. __shared__
This qualifier is used to declare a shared variable. It may optionally be used together
with the __device__ qualifier. A shared variable resides in the shared memory of a
thread block and has the lifetime of that block. It is accessible only from the threads
within the block.
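A sketch showing all three variable qualifiers in one kernel (the names dScale, dOffset, transform, and the assumption of a 256-thread block are made up for illustration):

```cuda
__device__   float dScale  = 2.0f;   // global memory, lifetime of the application
__constant__ float dOffset = 1.0f;   // constant memory, lifetime of the application

__global__ void transform(float* data) {
    __shared__ float tile[256];      // shared memory, lifetime of the thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes grid covers the data
    tile[threadIdx.x] = data[i] * dScale + dOffset;
    __syncthreads();                 // make the tile visible to the whole block
    data[i] = tile[threadIdx.x];
}
```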
3.3 Built-in Variables
1. gridDim: is of type dim3 and contains the dimensions of the grid.
2. blockIdx: is of type uint3 and contains the block index within the grid.
3. blockDim: is of type dim3 and contains the dimensions of the block.
4. threadIdx: is of type uint3 and contains the thread index within the block.
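A common use of these built-ins is computing a thread's global position in the grid; a small sketch (the kernel name globalIndex is made up):

```cuda
__global__ void globalIndex(int* out) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x position within the grid
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y position within the grid
    int width = gridDim.x * blockDim.x;               // total number of threads in x
    out[row * width + col] = row * width + col;       // one unique index per thread
}
```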
3.4 Memory Management
1. Memory Allocation
float* darray;
cudaMalloc((void**)&darray, 1024 * sizeof(float));
2. Memory Deallocation
cudaFree(darray);
2. Another method
Example:
float cpuArray[1024];
int size = sizeof(cpuArray);
float* dArray;
cudaMalloc((void**)&dArray, size);
cudaMemcpy(dArray, cpuArray, size, cudaMemcpyHostToDevice);
Example:
__constant__ float constArray[1024];
float cpuArray[1024];
cudaMemcpyToSymbol(constArray, cpuArray, sizeof(constArray));
2. Another method
Example:
float cpuArray[1024];
int size = sizeof(cpuArray);
float* dArray;
cudaMalloc((void**)&dArray, size);
cudaMemcpy(cpuArray, dArray, size, cudaMemcpyDeviceToHost);
1. Mathematical Functions:
(e.g., __sinf(x), __cosf(x), sqrt(x), etc.)
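A sketch of their use inside a kernel (the kernel name wave is made up); the double-underscore versions such as __sinf() are faster but less accurate intrinsics, while sinf() and sqrtf() are the standard single-precision versions:

```cuda
__global__ void wave(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(i * 0.01f) + sqrtf((float)i);  // fast sine + exact sqrt
}
```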
2. Synchronization Function:
void __syncthreads();
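__syncthreads() is a barrier: every thread in the block must reach it before any thread continues. A sketch of why this matters (the kernel name reverseBlock and the fixed 256-thread block are assumptions):

```cuda
__global__ void reverseBlock(float* data) {
    __shared__ float tile[256];       // assumes blockDim.x == 256
    int i = threadIdx.x;
    tile[i] = data[i];                // phase 1: every thread fills one slot
    __syncthreads();                  // wait until the whole tile is written
    data[i] = tile[255 - i];          // phase 2: safely read another thread's slot
}
```

Without the barrier, a thread could read a tile element before the owning thread had written it.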
3. Atomic Functions:
(e.g., atomicAdd(), etc.)
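Atomic functions perform a read-modify-write that cannot be interleaved with other threads, so concurrent updates to the same address are not lost. A sketch (the kernel name countPositives is made up):

```cuda
__global__ void countPositives(const float* data, int n, int* counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)
        atomicAdd(counter, 1);   // safe even when many threads hit this line at once
}
```

A plain `(*counter)++` here would lose counts, because many threads would read the same old value before any of them wrote back.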
4. Texture Functions:
A device emulation mode is provided, primarily for debugging purposes. The -deviceemu
option is used with the nvcc compile command. It only emulates the device; it is not a
simulation. Threads and thread blocks are created on the host, so the host's native
debugging support (such as Microsoft Visual Studio's) can be used to set breakpoints
and inspect data. It is especially helpful for input or output operations to files or
to the screen, such as calls to printf(), which cannot run on the device.
3.9 An Example
3.9.1 Sequential Code
A sequential program to calculate the distances from a specific point to all other
points in a 2D matrix of order N × N is given below:
3.9.2 Parallel Code – 1D Grid
Now the same program is converted to parallel code to run on the device. A one-
dimensional grid with only one thread block is used. The thread block contains 16 * 16
threads (hence 256 threads in total) in a two-dimensional arrangement.
float n = ((i-x)*(i-x))+((j-y)*(j-y));
dgrid[i][j] = sqrt(n);
}
void main () {
int i, j;
float hgrid[N][N];
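Since the listing above is only a fragment, here is one possible complete version under the stated configuration; the reference point (0, 0), the array-of-rows parameter style, and N = 16 are assumptions, not the author's exact code:

```cuda
#define N 16

// One block of 16 x 16 threads; each thread handles one matrix element,
// indexed directly by its threadIdx coordinates.
__global__ void distance(float dgrid[N][N], int x, int y) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    float n = ((i - x) * (i - x)) + ((j - y) * (j - y));
    dgrid[i][j] = sqrtf(n);
}

int main() {
    float hgrid[N][N];
    float (*dgrid)[N];
    cudaMalloc((void**)&dgrid, sizeof(hgrid));

    dim3 block(N, N);                      // 256 threads in a 2D arrangement
    distance<<<1, block>>>(dgrid, 0, 0);   // distances from point (0, 0)

    cudaMemcpy(hgrid, dgrid, sizeof(hgrid), cudaMemcpyDeviceToHost);
    cudaFree(dgrid);
    return 0;
}
```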
3.9.3 Parallel Code – 2D Grid (2 * 2)
Now the same program is converted to parallel code to run on the device with a two-
dimensional grid (2 thread blocks in the x dimension and 2 in the y dimension). Each
thread block contains 16 * 16 threads (hence 256 threads in total) in a two-dimensional
arrangement. Hence a total of 1024 threads will run in parallel on the device.
float n = ((i-x)*(i-x))+((j-y)*(j-y));
dgrid[i][j] = sqrt(n);
}
void main () {
int i, j;
float hgrid[N*D][N*D];
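A hedged sketch of the full 2 * 2 grid version (N = 16, D = 2, and the reference point (0, 0) are assumptions): each thread now combines its block index and thread index to locate its element of the (N*D) x (N*D) matrix.

```cuda
#define N 16
#define D 2

__global__ void distance(float dgrid[N*D][N*D], int x, int y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global row index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // global column index
    float n = ((i - x) * (i - x)) + ((j - y) * (j - y));
    dgrid[i][j] = sqrtf(n);
}

int main() {
    float (*dgrid)[N*D];
    cudaMalloc((void**)&dgrid, N * D * N * D * sizeof(float));

    dim3 grid(D, D), block(N, N);           // 4 blocks * 256 threads = 1024
    distance<<<grid, block>>>(dgrid, 0, 0);

    cudaFree(dgrid);
    return 0;
}
```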
3.9.4 Parallel Code – 2D Grid (4 * 4)
The same program is converted to parallel code to run on the device with a two-
dimensional grid (4 thread blocks in the x dimension and 4 in the y dimension). Each
thread block contains 8 * 8 threads (hence 64 threads in total) in a two-dimensional
arrangement. Hence a total of 1024 threads will run in parallel on the device.
float n = ((i-x)*(i-x))+((j-y)*(j-y));
dgrid[i][j] = sqrt(n);
}
void main () {
int i, j;
float hgrid[N*D][N*D];
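Relative to the 2 * 2 version, only the launch configuration changes; a sketch of the launch parameters (the kernel name distance follows the earlier fragments):

```cuda
// 16 blocks of 64 threads still cover the same 32 x 32 matrix
// with 1024 threads in total.
dim3 grid(4, 4);     // 4 blocks in x, 4 blocks in y
dim3 block(8, 8);    // 8 * 8 = 64 threads per block
// distance<<<grid, block>>>(dgrid, x, y);
```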