Skip to content

Conversation

spavloff
Copy link
Collaborator

@spavloff spavloff commented Aug 11, 2025

Selection DAG has a more sophisticated execution order representation than the simple sequence used in IR, so building the DAG can take into account specific properties of the nodes to better express possible parallelism. The existing implementation does this for constrained function calls, some of them are considered as independent, which can potentially improve the generated code. However this mechanism incorrectly implies that the calls with exception behavior 'ebIgnore' cannot raise floating-point exception. The purpose of this change is to fix the implementation.

In the current implementation, constrained function calls don't immediately update the DAG root. Instead, the DAG builder collects their output chains and flushes them when the root is required. Constrained function calls cannot be moved across calls of external functions and intrinsics that access floating-point environment, they work as barriers. Between the barriers, constrained function calls can be reordered, they may be considered independent from viewpoint of raising exceptions. For strictfp functions this is possible only if floating-point trapping is disabled.

This change introduces a new restriction - the calls with default exception handling cannot not be moved between strictfp function calls. Otherwise the exceptions raised by such call can disturb the expected exception sequence. It means that constrained function calls with strict exception behavior act as barriers for the calls with non-strict behavior and vice versa. Effectively it means that the entire sequence of constrained calls in IR is split into "strict" and "non-strict" regions, in which restrictions on the order of constrained calls are relaxed, but move from one region to another is not allowed. It agrees with the representation of strictfp code in high-level languages. For example, C/C++ strictfp code correspond to blocks where pragma STDC FENV_ACCESS ON is in effect, this restriction should help preserving the intended semantics.

When floating-point exception trapping is enabled, constrained intrinsics with 'ebStrict' cannot be reordered, their sequence must be identical to the original source order. The current implementation does not distinguish between strictfp modes with trapping and without it. This change make assumption that the trapping is disabled. It is not correct in the general case, but is compatible with the existing implementation.

Two test cases added in 21056a4 have been removed. These tests verified correct merging of constrained function calls that differed only in their exception handling:

%div = call double @llvm.experimental.constrained.fdiv.f64(double %a, double %b, metadata !"round.dynamic", metadata !"fpexcept.strict") #0
%div2 = call double @llvm.experimental.constrained.fdiv.f64(double %a, double %b, metadata !"round.dynamic", metadata !"fpexcept.ignore") #0

Previously one of such calls was eliminated by CSE during DAG construction. It was possible because the calls had the same income chain, as they were in the same group made by TokenFactor. Now these calls are no longer merged by CSE, and the test cases become unnecessary. Removal of one of such calls is still possible and profitable, but should be made elsewhere, probably at IR level.

Selection DAG has a more sophisticated execution order representation
than the simple sequence used in IR, so building the DAG can take into
account specific properties of the nodes to better express possible
parallelism. The existing implementation does this for constrained
function calls, some of them are considered as independent, which can
potentially improve the generated code. However this mechanism
incorrectly implies that the calls with exception behavior 'ebIgnore'
cannot raise floating-point exception. The purpose of this change is to
fix the implementation.

In the current implementation, constrained function calls don't
immediately update the DAG root. Instead, the DAG builder collects their
output chains and flushes them when the root is required. Constrained
function calls cannot be moved across calls of external functions and
intrinsics that access floating-point environment, they work as
barriers. Between the barriers, constrained function calls can be
reordered, they may be considered independent from viewpoint of raising
exceptions. For strictfp functions this is possible only if
floating-point trapping is disabled.

This change introduces a new restriction - the calls with default
exception handling cannot not be moved between strictfp function calls.
Otherwise the exceptions raised by such call can disturb the expected
exception sequence. It means that constrained function calls with strict
exception behavior act as barriers for the calls with non-strict
behavior and vice versa. Effectively it means that the entire sequence
of constrained calls in IR is split into "strict" and "non-strict"
regions, in which restrictions on the order of constrained calls are
relaxed, but move from one region to another is not allowed. It agrees
with the representation of strictfp code in high-level languages. For
example, C/C++ strictfp code correspond to blocks where pragma
`STDC FENV_ACCESS ON` is in effect, this restriction should help
preserving the intended semantics.

When floating-point exception trapping is enabled, constrained
intrinsics with 'ebStrinc' cannot be reordered, their sequence must be
identical to the original source order. The current implementation does
not distinguish between strictfp modes with trapping and without it.
This change make assumption that the trapping is disabled. It is not
correct in the general case, but is compatible with the existing
implementation, which makes this change NFC.
@spavloff spavloff self-assigned this Aug 11, 2025
@llvmbot llvmbot added the llvm:SelectionDAG SelectionDAGISel as well label Aug 11, 2025
@llvmbot
Copy link
Member

llvmbot commented Aug 11, 2025

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-selectiondag

Author: Serge Pavlov (spavloff)

Changes

Selection DAG has a more sophisticated execution order representation than the simple sequence used in IR, so building the DAG can take into account specific properties of the nodes to better express possible parallelism. The existing implementation does this for constrained function calls, some of them are considered as independent, which can potentially improve the generated code. However this mechanism incorrectly implies that the calls with exception behavior 'ebIgnore' cannot raise floating-point exception. The purpose of this change is to fix the implementation.

In the current implementation, constrained function calls don't immediately update the DAG root. Instead, the DAG builder collects their output chains and flushes them when the root is required. Constrained function calls cannot be moved across calls of external functions and intrinsics that access floating-point environment, they work as barriers. Between the barriers, constrained function calls can be reordered, they may be considered independent from viewpoint of raising exceptions. For strictfp functions this is possible only if floating-point trapping is disabled.

This change introduces a new restriction - the calls with default exception handling cannot not be moved between strictfp function calls. Otherwise the exceptions raised by such call can disturb the expected exception sequence. It means that constrained function calls with strict exception behavior act as barriers for the calls with non-strict behavior and vice versa. Effectively it means that the entire sequence of constrained calls in IR is split into "strict" and "non-strict" regions, in which restrictions on the order of constrained calls are relaxed, but move from one region to another is not allowed. It agrees with the representation of strictfp code in high-level languages. For example, C/C++ strictfp code correspond to blocks where pragma STDC FENV_ACCESS ON is in effect, this restriction should help preserving the intended semantics.

When floating-point exception trapping is enabled, constrained intrinsics with 'ebStrinc' cannot be reordered, their sequence must be identical to the original source order. The current implementation does not distinguish between strictfp modes with trapping and without it. This change make assumption that the trapping is disabled. It is not correct in the general case, but is compatible with the existing implementation, which makes this change NFC.


Full diff: https://github.com/llvm/llvm-project/pull/153029.diff

2 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+50-28)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h (+5)
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 7aa1fadd10dfc..c2c5d3caad30a 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8280,6 +8280,54 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
   }
 }
 
+void SelectionDAGBuilder::pushFPOpOutChain(SDValue Result,
+                                           fp::ExceptionBehavior EB) {
+  assert(Result.getNode()->getNumValues() == 2);
+  SDValue OutChain = Result.getValue(1);
+  assert(OutChain.getValueType() == MVT::Other);
+
+  // Instead of updating the root immediately, push the produced chain to the
+  // appropriate list, deferring the update until the root is requested. In this
+  // case, the nodes from the lists are chained using TokenFactor, indicating
+  // that the operations are independent.
+  //
+  // In particular, the root is updated before any call that might access the
+  // floating-point environment, except for constrained intrinsics.
+  switch (EB) {
+  case fp::ExceptionBehavior::ebMayTrap:
+  case fp::ExceptionBehavior::ebIgnore:
+    // Floating-point exceptions produced by such operations are not intended
+    // to be observed, so the sequence of these operations does not need to be
+    // preserved.
+    //
+    // They however must not be mixed with the instructions that have strict
+    // exception behavior. Placing an operation with 'ebIgnore' behavior between
+    // 'ebStrict' operations could distort the observed exception behavior.
+    if (!PendingConstrainedFPStrict.empty()) {
+      assert(PendingConstrainedFP.empty());
+      updateRoot(PendingConstrainedFPStrict);
+    }
+    PendingConstrainedFP.push_back(OutChain);
+    break;
+  case fp::ExceptionBehavior::ebStrict:
+    // Floating-point exception produced by these operations may be observed, so
+    // they must be correctly chained. If trapping on FP exceptions is
+    // disabled, the exceptions can be observed only by functions that read
+    // exception flags, like 'llvm.get_fpenv' or 'fetestexcept'. It means that
+    // the order of operations is not significant between barriers.
+    //
+    // If trapping is enabled, each operation becomes an implicit observation
+    // point, so the operations must be sequenced according their original
+    // source order.
+    if (!PendingConstrainedFP.empty()) {
+      assert(PendingConstrainedFPStrict.empty());
+      updateRoot(PendingConstrainedFP);
+    }
+    PendingConstrainedFPStrict.push_back(OutChain);
+    // TODO: Add support for trapping-enabled scenarios.
+  }
+}
+
 void SelectionDAGBuilder::visitConstrainedFPIntrinsic(
     const ConstrainedFPIntrinsic &FPI) {
   SDLoc sdl = getCurSDLoc();
@@ -8293,32 +8341,6 @@ void SelectionDAGBuilder::visitConstrainedFPIntrinsic(
   for (unsigned I = 0, E = FPI.getNonMetadataArgCount(); I != E; ++I)
     Opers.push_back(getValue(FPI.getArgOperand(I)));
 
-  auto pushOutChain = [this](SDValue Result, fp::ExceptionBehavior EB) {
-    assert(Result.getNode()->getNumValues() == 2);
-
-    // Push node to the appropriate list so that future instructions can be
-    // chained up correctly.
-    SDValue OutChain = Result.getValue(1);
-    switch (EB) {
-    case fp::ExceptionBehavior::ebIgnore:
-      // The only reason why ebIgnore nodes still need to be chained is that
-      // they might depend on the current rounding mode, and therefore must
-      // not be moved across instruction that may change that mode.
-      [[fallthrough]];
-    case fp::ExceptionBehavior::ebMayTrap:
-      // These must not be moved across calls or instructions that may change
-      // floating-point exception masks.
-      PendingConstrainedFP.push_back(OutChain);
-      break;
-    case fp::ExceptionBehavior::ebStrict:
-      // These must not be moved across calls or instructions that may change
-      // floating-point exception masks or read floating-point exception flags.
-      // In addition, they cannot be optimized out even if unused.
-      PendingConstrainedFPStrict.push_back(OutChain);
-      break;
-    }
-  };
-
   const TargetLowering &TLI = DAG.getTargetLoweringInfo();
   EVT VT = TLI.getValueType(DAG.getDataLayout(), FPI.getType());
   SDVTList VTs = DAG.getVTList(VT, MVT::Other);
@@ -8346,7 +8368,7 @@ void SelectionDAGBuilder::visitConstrainedFPIntrinsic(
         !TLI.isFMAFasterThanFMulAndFAdd(DAG.getMachineFunction(), VT)) {
       Opers.pop_back();
       SDValue Mul = DAG.getNode(ISD::STRICT_FMUL, sdl, VTs, Opers, Flags);
-      pushOutChain(Mul, EB);
+      pushFPOpOutChain(Mul, EB);
       Opcode = ISD::STRICT_FADD;
       Opers.clear();
       Opers.push_back(Mul.getValue(1));
@@ -8377,7 +8399,7 @@ void SelectionDAGBuilder::visitConstrainedFPIntrinsic(
   }
 
   SDValue Result = DAG.getNode(Opcode, sdl, VTs, Opers, Flags);
-  pushOutChain(Result, EB);
+  pushFPOpOutChain(Result, EB);
 
   SDValue FPResult = Result.getValue(0);
   setValue(&FPI, FPResult);
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
index e0835e6310357..e1d7804ce12cd 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
@@ -195,6 +195,11 @@ class SelectionDAGBuilder {
   /// Update root to include all chains from the Pending list.
   SDValue updateRoot(SmallVectorImpl<SDValue> &Pending);
 
+  /// Given a node representing a floating-point operation and its specified
+  /// exception behavior, this either updates the root or stores the node in
+  /// a list to be added to chains latter.
+  void pushFPOpOutChain(SDValue Result, fp::ExceptionBehavior EB);
+
   /// A unique monotonically increasing number used to order the SDNodes we
   /// create.
   unsigned SDNodeOrder;

@kpneal
Copy link
Member

kpneal commented Aug 12, 2025

No tests?

@kpneal
Copy link
Member

kpneal commented Aug 12, 2025

How are we handling rounding mode? When we're ignoring exceptions we might still be changing the rounding mode so we probably shouldn't be reordering constrained intrinsics without checking it as well.

@spavloff
Copy link
Collaborator Author

Trying to prepare a test I realized that pending calls must be flushed prior to node creation, the PR is updated accordingly.

However I could not create a test. I wanted to use code like:

declare float@join(float, float, float, float, float, float);
define float @dag(float %a, float %b) strictfp {
entry:
  %v0 = call float @llvm.experimental.constrained.fadd.f32(float %a, float %b, metadata !"round.dynamic", metadata !"fpexcept.ignore")
  %v1 = call float @llvm.experimental.constrained.fadd.f32(float %b, float %b, metadata !"round.dynamic", metadata !"fpexcept.ignore")
  %v2 = call float @llvm.experimental.constrained.fmul.f32(float %a, float %b, metadata !"round.dynamic", metadata !"fpexcept.strict")
  %v3 = call float @llvm.experimental.constrained.fmul.f32(float %a, float %a, metadata !"round.dynamic", metadata !"fpexcept.strict")
  %v4 = call float @llvm.experimental.constrained.fsub.f32(float %a, float %b, metadata !"round.dynamic", metadata !"fpexcept.strict")
  %v5 = call float @llvm.experimental.constrained.fsub.f32(float %b, float %a, metadata !"round.dynamic", metadata !"fpexcept.strict")

  %ret = call float @join(float %v0, float %v1, float %v2, float %v3, float %v4, float %v5)
  ret float %ret
}

This code should producee three TokenFactor groups, which would differ from how the DAG is constructed now. The DAG indeed has structure as expected but the produced code is not. Machine Instruction Scheduler reorders the code breaking the expected order. In the machine representation the code for "strict" calls looks like:

  %4:fr32 = MULSSrr %0:fr32(tied-def 0), %1:fr32, implicit $mxcsr
  %5:fr32 = MULSSrr %0:fr32(tied-def 0), %0:fr32, implicit $mxcsr

The instructions have implicit use of MXCSR, which is sufficient for dynamic rounding, but correct exception handling require also implicit def of MXCSR. I checked X86, AArch64, SystemZ, PPC and RISCV, - none have implicit def of a status register. Correct code generation for exception-aware code is not possible now.

How are we handling rounding mode? When we're ignoring exceptions we might still be changing the rounding mode so we probably shouldn't be reordering constrained intrinsics without checking it as well.

The is no problem with rounding mode. It can be changed only by some functions, which are already barriers for constrained function. Exception handling is special because the exceptions can be raised by any FP instructions.

store double %result2, ptr %y
ret void
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the issue with this test? I'd hate for us to lose test coverage because a bug was discovered. Wouldn't a FIXME and a good comment be better?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put back the tests and slightly modified them to check changes introduced by this patch. Previously the two operation in the test were merged because the nodes were identical, due to the same value of chain argument. If the instructions are sequenced, which this patch implements, the chain arguments become different and the nodes are not merged.

It is not an error that the nodes were merged previously, but it was made using CSE mechanism. With sequenced node it does not work. Merging such nodes require another mechanism, probably at IR level.

assert(PendingConstrainedFPStrict.empty());
updateRoot(PendingConstrainedFP);
}
// TODO: Add support for trapping-enabled scenarios.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we 100% certain that this change won't create cases where instructions expecting traps to be off will be moved into a chunk of instructions that run with FP traps on?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If some part of a function runs with traps enabled and the other - with traps disabled, somewhere between them must be a function call, that changes FP control register. Such function is either an external function or an intrinsic with flag IntrInaccessibleMemOnly. Both are barriers for FP instructions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If some part of a function runs with traps enabled and the other - with traps disabled, somewhere between them must be a function call, that changes FP control register. Such function is either an external function or an intrinsic with flag IntrInaccessibleMemOnly. Both are barriers for FP instructions.

Is this specific to SelectionDAG? I thought we didn't have barriers for FP instructions. And since we can have multiple function calls in a basic block, it's possible for traps to be enabled and disabled in different parts of the same basic block. Thus my question about certainty around movement of instructions.

@kpneal
Copy link
Member

kpneal commented Aug 18, 2025

Rounding mode can be changed by an arbitrary function called from inside a basic block. Thus, we can't model it perfectly and need to take it into account when reordering.

And I still think tests are worthwhile even if all they do is demonstrate a bug. It seems like a good comment on the issue including a FIXME and perhaps a link to a bug report would be helpful. Do we have a stated policy on this?

@spavloff
Copy link
Collaborator Author

Rounding mode can be changed by an arbitrary function called from inside a basic block. Thus, we can't model it perfectly and need to take it into account when reordering.

What's wrong with the current way FP operations are modeled? Each operation that may depend on dynamic rounding has some memory effect, which indicates that this operation reads FP control register. Functions that changes the control register must have memory effect, that this function writes FP control register. Any instruction movement must honor this relations, just as for real memory access.

@kpneal
Copy link
Member

kpneal commented Aug 29, 2025

Rounding mode can be changed by an arbitrary function called from inside a basic block. Thus, we can't model it perfectly and need to take it into account when reordering.

What's wrong with the current way FP operations are modeled? Each operation that may depend on dynamic rounding has some memory effect, which indicates that this operation reads FP control register. Functions that changes the control register must have memory effect, that this function writes FP control register. Any instruction movement must honor this relations, just as for real memory access.

How about a function where rounding mode matters and is different in different parts of the function calling a function that calls a function that changes the rounding mode? I don't think we can model that, especially if the functions are in different compilations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 llvm:SelectionDAG SelectionDAGISel as well
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants