Hi,
I would like to propose adding support for no-loop kernels to OpenMP Flang.
What is a no-loop kernel?
To answer this question, let's analyze a simple test case:
!$omp target teams distribute parallel do
do i=1,N
c(i) = a(i) + b(i)
enddo
Currently, the OpenMP offload runtime limits the number of teams and threads that it launches. That is why the execution of an OpenMP GPU kernel looks like this:
Ty KernelIteration = NumBlocks * NumThreads;
// Start index in the normalized space.
Ty IV = BId * NumThreads + TId;
// Cover the entire iteration space.
if (IV < LoopTripCount) {
  do {
    // Execute the loop body.
    LoopBody(IV, Arg);
    // Go to the next iteration.
    IV += KernelIteration;
  } while (IV < LoopTripCount);
}
If we can assume that there are enough threads and teams, i.e. num_teams * num_threads >= LoopTripCount,
we can simplify the OpenMP kernel to a no-loop kernel:
// Start index in the normalized space.
Ty IV = BId * NumThreads + TId;
// Cover the entire iteration space.
if (IV < LoopTripCount) {
  // The inner do-loop can be skipped.
  // Execute the loop body.
  LoopBody(IV, Arg);
}
Current status:
Currently, Flang has partial support for no-loop OpenMP kernels. There are two flags that can be passed to GPU code generation: -fopenmp-assume-teams-oversubscription and -fopenmp-assume-threads-oversubscription. These flags are mapped to the config::getAssumeTeamsOversubscription() and config::getAssumeThreadsOversubscription() DeviceRTL values. If these flags are set, the LLVM optimizer can eliminate the do-loop and generate a no-loop GPU kernel. For more details, see offload/DeviceRTL/src/Workshare.cpp in the llvm-project repository.
Unfortunately, the current implementation does not provide a mechanism to consistently launch kernels with a sufficient number of teams and threads. If a developer uses the OpenMP offload library without assertions, there are no checks to indicate that the loop iteration space is not fully covered, which can lead to incorrect results.
Additionally, not all kernels will benefit from the no-loop mode. Results may vary across different GPUs; however, reduction kernels appear not to benefit from the no-loop mode. The current implementation relies on global module flags and automatically transforms all kernels in a given LLVM IR module into no-loop kernels, even if it’s beneficial to transform only some of them into no-loop kernels.
Proposed implementation of no-loop kernels:
This proposal aims to add no-loop support for kernels of the form:
!$omp target teams distribute parallel do ! no reduction clause
do-loop
when the user passes the oversubscription flags.
The target teams distribute parallel do kernel is the most popular type of GPU OpenMP kernel, and it appears to benefit the most from the no-loop mode. We can implement this in four steps:
- Modification of the DeviceRTL workshare loop functions. We should pass information about the no-loop mode as one of the parameters of the __kmpc_distribute_for_static_loop function. This parameter will be hardcoded by the OpenMPIRBuilder during the lowering process from MLIR to LLVM IR. The optimizer will use this information to generate no-loop kernels only for selected kernels.
- Adding an SPMD_NO_LOOP mode to the OpenMP runtime. If a kernel is marked as no-loop, the OpenMP runtime should provide a sufficient number of teams and threads to fully cover the loop iteration space.
- Changes in MLIR. We should add information about no-loop kernels to the MLIR code to facilitate the lowering process to LLVM IR. There are two key areas where information about no-loop kernels is required:
a) omp.target lowering: the host code must know whether a given kernel is a no-loop kernel. If it is, it needs to launch a sufficient number of teams and threads.
b) omp.wsloop lowering: the OpenMPIRBuilder needs to know the value of the additional parameter that determines the no-loop mode for the __kmpc_distribute_for_static_loop function.
- Changes in the Flang frontend. Currently, information about oversubscription is passed only to the device LLVM IR. We need this information for both the host and device code to fully support launching no-loop kernels.
Code re-use for Clang:
This approach can easily be re-used by Clang if we switch Clang's LLVM IR code generation to the OpenMPIRBuilder.
I would like to hear your thoughts on the proposed idea.