Hi,
I would like to propose adding support for no-loop kernels in OpenMP Flang.

What is a no-loop kernel?
To answer this question, let’s analyze a simple test case:

  !$omp target teams distribute parallel do
  do i=1,N
     c(i) = a(i) + b(i)
  enddo

Currently, the OpenMP offload runtime limits the number of teams and threads it launches. That’s why the execution of an OpenMP GPU kernel looks like this:

    Ty KernelIteration = NumBlocks * NumThreads;

    // Start index in the normalized space.
    Ty IV = BId * NumThreads + TId;

    // Cover the entire iteration space
    if (IV < LoopTripCount) {
      do {

        // Execute the loop body.
        LoopBody(IV, Arg);

        // Go to the next iteration
        IV += KernelIteration;

      } while (IV < LoopTripCount);
    }

If we can assume that there are enough teams and threads, so that num_teams * num_threads >= LoopTripCount, we can simplify the OpenMP kernel to a no-loop kernel:

    // Start index in the normalized space.
    Ty IV = BId * NumThreads + TId;

    // Cover the entire iteration space
    if (IV < LoopTripCount) {
        // the inner do-loop can be skipped
        // Execute the loop body.
        LoopBody(IV, Arg);
    }

Current status:

Currently, Flang has partial support for no-loop OpenMP kernels. There are two flags that can be passed to GPU code generation: -fopenmp-assume-teams-oversubscription and -fopenmp-assume-threads-oversubscription. These flags are mapped to the config::getAssumeTeamsOversubscription() and config::getAssumeThreadsOversubscription() DeviceRTL values. If these flags are set, the LLVM optimizer can eliminate the do-loop and generate a no-loop GPU kernel. For more details, see offload/DeviceRTL/src/Workshare.cpp in the llvm-project repository.

Unfortunately, the current implementation does not provide a mechanism to consistently launch kernels with a sufficient number of teams and threads. If a developer uses the OpenMP offload library without assertions, there are no checks to indicate that the loop iteration space is not fully covered, which can lead to incorrect results.

Additionally, not all kernels will benefit from the no-loop mode. Results may vary across different GPUs; however, reduction kernels appear not to benefit from the no-loop mode. The current implementation relies on global module flags and automatically transforms all kernels in a given LLVM IR module into no-loop kernels, even if it’s beneficial to transform only some of them into no-loop kernels.

Proposed implementation of no-loop kernels:
This proposal aims to add no-loop support for kernels of the form:

!$omp target teams distribute parallel do  ! no reduction clause
   do-loop

if the user passes the oversubscription flags.

The target teams distribute parallel do kernel is the most popular type of GPU OpenMP kernel, and it appears to be the most beneficial one to launch in no-loop mode. We can implement this in four steps:

  1. Modification of DeviceRTL workshare loop functions. We should pass information about the no-loop mode as one of the parameters of the __kmpc_distribute_for_static_loop function. This parameter will be hardcoded by the OpenMPIRBuilder during the lowering process from MLIR to LLVM IR. The optimizer will use this information to generate no-loop kernels only for selected kernels.
  2. Adding SPMD_NO_LOOP mode to the OpenMP runtime. If a kernel is marked as no-loop, then the OpenMP runtime should provide a sufficient number of teams and threads to fully cover the loop iteration space.
  3. Changes in MLIR. We should add information about no-loop kernels to MLIR code to facilitate the lowering process to LLVM IR. There are two key areas where information about no-loop kernels is required:
    a) for omp.target lowering: The host code must be aware if a given kernel is a no-loop kernel. If it is, it needs to launch a sufficient number of teams and threads.
    b) for omp.wsloop lowering: The OpenMPIRBuilder needs to know the value of the additional parameter that determines the no-loop mode for the __kmpc_distribute_for_static_loop function.
  4. Changes in Flang frontend. Currently, information about oversubscription is passed to the device LLVM IR. We need this information for both the host and device code to fully support launching no-loop kernels.

Code re-use for Clang:
This approach can easily be re-used by Clang if Clang’s LLVM IR code generation is switched to the OpenMPIRBuilder.

I would like to hear your thoughts on the proposed idea.

IIRC OpenMPOpt can already optimize out the inner loop if certain information is passed in the IR properly. There is also other existing related work (not sure if that’s ongoing or a PoC) that rewrites the distribution interface to avoid using an inner loop.

Check -fopenmp-assume-teams-oversubscription and -fopenmp-assume-threads-oversubscription.

CC @jdoerfert

@shiltian Thanks for the comment. I made a typo in the original RFC, in the current status section: I mentioned -assume-teams-oversubscription and -assume-threads-oversubscription instead of -fopenmp-assume-teams-oversubscription and -fopenmp-assume-threads-oversubscription. I will correct the original RFC.

Having said that, these flags are mapped to:

    @__omp_rtl_assume_teams_oversubscription = weak_odr hidden local_unnamed_addr addrspace(1) constant i32 1
    @__omp_rtl_assume_threads_oversubscription = weak_odr hidden local_unnamed_addr addrspace(1) constant i32 1

in the GPU LLVM IR. If these flags are set in upstream Flang, the optimizer will generate no-loop GPU code for all kernels. Additionally, the host side will not be aware of this optimization, which may lead it to launch kernels with a number of teams/threads that is too small.

Additionally, the host side will not be aware of this optimization, which may lead it to launch kernels with a number of teams/threads that is too small.

If both flags are set, that means the user understands what they imply and is responsible for choosing right launch parameters. IMHO, I don’t see this as a problem.

If these flags are set in upstream Flang, the optimizer will generate no-loop GPU code for all kernels.

This is one of the drawbacks of relying solely on compiler flags since they lack fine-grained control. But I don’t think the proposal in this post offers a clear improvement in that regard.

Unless we introduce a slightly different construct, such as an additional clause to explicitly indicate that a given target region doesn’t require an inner loop, I don’t see much of a difference between this proposal and the current optimization.

That said, there could be room for an optimization that adds the two assumptions when the compiler can confirm that oversubscription won’t occur, based on the number of teams and team size. However, this doesn’t fully conform to the OpenMP standard, since both values can be changed at runtime using environment variables. In the end, you still need to tell the compiler to assume those values will remain fixed during runtime.

The problem with a compiler flag is that its effect is global: the entire program must be written for the new semantics. It is also not always possible for the implementation to provide a sufficient number of threads. For one, when not using a composite construct, the number of loop iterations is not known when launching the kernel. Second, the number of iterations may be larger than the hardware allows.

Instead, I propose a new schedule mode, none. That is, loop iterations and threads are mapped 1:1. If there are more threads than iterations, then the excess threads are masked out. If there are more iterations than threads, then the behavior is undefined.

!$omp target teams distribute parallel do schedule(none) num_threads(n)
   do i = 1, n
      ...
   enddo

I proposed this in the OpenMP language committee, but it has not been picked up so far.

I can analyze the MLIR code and easily implement fine-grained control for OpenMP kernels. I can detect kernels for which the loop trip count is not defined at compilation time. I can also check if num_teams or num_threads clauses are present. Last but not least, I can detect reduction kernels and will not promote them to no-loop mode for performance reasons. None of these features is possible with the current approach, a global flag in DeviceRTL.

The AMD team has done similar work for Clang. They have demonstrated that it’s beneficial to generate no-loop kernels for non-reduction kernels: https://dl.acm.org/doi/abs/10.1145/3624062.3624605 . I would like to follow their steps for Flang and launch only selected kernels in no-loop mode. I want to mark these kernels as SPMD_NO_LOOP kernels. The OpenMP host runtime will launch these kernels with enough teams and threads to fully cover the loop iteration space. Other kernels will be launched without modification.

There are some differences between Flang and Clang approaches:

  1. Flang requires only a small modification of LLVM IR code generation.
  2. Flang can analyze kernels in MLIR.

These differences make it easier to implement the no-loop optimization for Flang than for the downstream Clang optimization.

I would like to introduce the flag -fopenmp-target-ignore-env-vars, which will ignore OpenMP environment variables. If this flag is set, Flang is allowed to generate no-loop kernels. See the ROCm documentation for reference. We can also add @Meinersbur’s schedule(none) clause to the next OpenMP standard for users who want to be fully compliant with the standard.

I think Dominik’s proposal enables an end-to-end implementation of NoLoop kernel upstream. Currently, though the hooks and the 2 compiler options exist, I don’t know that we have a working implementation. Are there existing tests that exercise those 2 options?

As @DominikAdamski said, the 2 compiler options can be thought of as the user providing some guarantees to the implementation that allow for correct generation and execution of NoLoop kernels. The compiler/runtime is not required to generate NoLoop for all kernels, it can just pick the ones it deems appropriate. I do not know that we need another compiler option (e.g. -fopenmp-target-ignore-env-vars), I think the usage of the over-subscription options can convey the restrictions about environment variables. Note that the blocksize can still be changed at runtime using environment variables. Changing the number of teams may not be possible if it does not satisfy the NoLoop constraints. We can document this restriction as part of the compiler options.

I still don’t quite follow how this proposal improves things over what we already have.

On one hand, the post points out that using compiler flags lacks fine-grained control because it is a module level configuration. I completely agree with that. On the other hand, the proposal still relies on the same two flags. So how is this actually better? How does continuing to use these flags allow control over just a subset of kernels, or even a single kernel?

I think @Meinersbur’s proposal, combined with the SPMD_NO_LOOP mode suggested here, is heading in the right direction. You could even introduce something like an ompx_no_loop clause and still apply the mechanism proposed here. That way, the two flags wouldn’t be needed, and we’d also have well-defined behavior at runtime (i.e., environment variables would be ignored).

What I meant by not fully conforming to the OpenMP standard is that the compiler would be making those two assumptions automatically, based solely on the number of teams and team size inferred at compile time.

@shiltian If you wish, I can replace both flags with the option -fopenmp-target-fast. If this option is set, then the compiler is allowed to cut some OpenMP corners and generate no-loop kernels only for selected GPU OpenMP code. Is that OK with you?

Revising my earlier statement a bit: if we can guarantee correctness, this would rather be a compiler optimization than a language dialect. The hardware limit on the number of teams/total number of threads is in the range of $2^{31}-1$ for NVIDIA and AMD, which seems sufficiently high considering that Clang only uses 32-bit integers to describe loop iterations by default anyway.
So if the compiler can

  • identify kernels for which the transformation is semantically correct. That can be whenever the number of teams is implementation-defined or no functionality is used that relies on the number of teams.
  • reasonably ensure that there are only minor and no major regressions (e.g. do not optimize in the presence of reductions, as stated)

Then I don’t see a reason not to do as proposed.

I think you are referring to ICV modifiers such as omp_set_num_teams, since environment variables remain fixed during execution. However, omp_set_num_teams only applies to the host device. I don’t think you can modify the number of teams/threads while a target kernel is running. If there are circumstances where we cannot determine at compile time whether the no-loop constraints are fulfilled, the compiler can emit two versions of the kernel.

The upside of using compiler options is that the NoLoop optimization can be applied to existing code. Agreed that the user has to guarantee the constraints by asserting a module-level option. But if the user is able to do that, existing code can benefit. This approach does not prevent future addition of constructs like ompx_no_loop, etc.

Regarding choosing the number of teams and team sizes, they are not inferred at compile time. They are still inferred at run time before kernel launch. The only constraint is that the size of the execution grid must be at least as large as the trip count, which is typically known only at runtime. So even runtime environment variables can be supported as long as they satisfy the above constraint. However, user-specified numbers of teams and team sizes may not satisfy the above constraint in general, so it may not be possible to honor them for NoLoop kernels.

PR https://github.com/llvm/llvm-project/pull/151959 introduced changes in the OpenMP runtime that are required for no-loop OpenMP kernels in Fortran. This PR also sparked a discussion on how to enable the generation of no-loop kernels only for selected GPU kernels. As a follow-up, I would like to describe my proposal in detail. I would like to introduce a new flag: -fopenmp-target-fast, which will cut some corners introduced by the OpenMP standard. If this flag is set, the compiler will generate no-loop kernels only for selected OpenMP offload kernels. This flag will imply the following:

  1. Environment variables like OMP_NUM_TEAMS will be discarded during program execution. I propose to add a separate flag: -fopenmp-target-ignore-env-vars to control ignoring environment variables. If the user specifies -fopenmp-target-fast, then the flag -fopenmp-target-ignore-env-vars will be automatically added. We should discard environment variables to ensure that we create enough teams and threads for the kernel’s trip count.
  2. -fopenmp-assume-no-thread-state: We can add the assumption that no thread in a parallel region modifies an Internal Control Variable (ICV), potentially reducing device runtime code execution.
  3. -fopenmp-assume-no-nested-parallelism: We can add the assumption that no thread in a parallel region encounters another parallel region, potentially reducing device runtime code execution.
  4. We will automatically add the -O3 flag if the user doesn’t specify any -O* flag.

I also propose that we retain the global flags (-fopenmp-assume-teams/threads-oversubscription) solely for experimental purposes.

I’m looking forward to your feedback.

Do we need additional flags beyond -fopenmp-assume-teams/threads-oversubscription?

Currently, the user can have No-Loop functionality by specifying the above options and adding appropriate num_teams and num_threads clauses. Under this model, runtime envars may not be honored if they don’t meet the constraints of No-Loop mode execution. So the current solution already needs to relax some constraints of the OpenMP spec.

This RFC extends the current No-Loop functionality under which the user is required to add the above options but not required to add the appropriate num_teams/num_threads clauses (these clauses are optional now). The implementation will transparently choose the appropriate launch bounds. Additionally, the implementation will analyze performance and may decide not to use No-Loop mode for a given kernel even if these options are specified. In this model, envars may not be honored if they don’t meet the constraints of No-Loop mode execution. But this is no different from the current model, so I don’t see why we need additional options. But we do need to add documentation describing this behavior of envars when these options are used.

In my opinion, the flags -fopenmp-assume-teams/threads-oversubscription are not intuitive for the user if we want to implement automatic no-loop optimization. Before we go further, let me describe the current status. Let’s analyze three options for the current Flang implementation:

  1. Only -fopenmp-assume-threads-oversubscription is set. We generate no-loop code only for loops:

    !$omp target
    !$omp parallel do
       ! do-loop

  2. Only -fopenmp-assume-teams-oversubscription is set. We generate no-loop code only for loops:

    !$omp target teams distribute
       ! do-loop

  3. Both flags are set. We force no-loop code generation for all types of worksharing GPU loops.

The new approach described in the RFC will make only option 3 (setting both flags) meaningful. Setting only one of these two flags will have no effect. In my opinion, this is not intuitive for the end user. That’s why I propose adding the flag -fopenmp-target-fast, which clearly indicates that the user wants to bypass some OpenMP standard restrictions (such as skipping environment variables) to get faster code.

My opinion is different. I think the options -fopenmp-assume-teams/threads-oversubscription are quite intuitive for No-Loop mode. The option names are self-explanatory, indicating that the user is providing some additional guarantees regarding launch parameters. Of course, we need to document these options carefully, outlining what exactly they mean.

Note that the use of -fopenmp-assume-teams/threads-oversubscription does not mean ignoring or skipping environment variables. If the user specifies appropriate number of teams/threads that conform to the No-Loop constraints, the implementation can and should honor those envars. The argument I am making is that this is no different from regular compilation mode. Even today, the implementation may not be able to honor the number of teams/threads specified by envars to assure correct program execution. That’s why I think no other additional option is required for No-Loop mode.

I think that the option -fopenmp-target-fast is not appropriate for No-Loop because it does not imply anything about launch parameters. It is just an umbrella option. In addition, -Ofast for host compilation was deprecated recently, so I don’t know that -fopenmp-target-fast is a good option for offloading either.