
Conversation

AlexMaclean
Member

No description provided.

@AlexMaclean AlexMaclean self-assigned this Aug 19, 2025
@llvmbot llvmbot added the backend:NVPTX and llvm:SelectionDAG labels Aug 19, 2025
@llvmbot
Member

llvmbot commented Aug 19, 2025

@llvm/pr-subscribers-backend-mips

@llvm/pr-subscribers-llvm-selectiondag

Author: Alex MacLean (AlexMaclean)

Changes

Patch is 130.65 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154251.diff

9 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (+1-1)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (+64-100)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.h (-2)
  • (modified) llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll (+5-5)
  • (modified) llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll (+180-198)
  • (modified) llvm/test/CodeGen/NVPTX/cmpxchg-sm70.ll (+180-198)
  • (modified) llvm/test/CodeGen/NVPTX/cmpxchg-sm90.ll (+180-198)
  • (modified) llvm/test/CodeGen/NVPTX/cmpxchg.ll (+100-110)
  • (modified) llvm/test/CodeGen/NVPTX/mulwide.ll (+4-6)
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index c16ccaf926bc7..e63568bc3fb63 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -15122,7 +15122,7 @@ SDValue DAGCombiner::visitANY_EXTEND(SDNode *N) {
       return foldedExt;
   } else if (ISD::isNON_EXTLoad(N0.getNode()) &&
              ISD::isUNINDEXEDLoad(N0.getNode()) &&
-             TLI.isLoadExtLegal(ISD::EXTLOAD, VT, N0.getValueType())) {
+             TLI.isLoadExtLegalOrCustom(ISD::EXTLOAD, VT, N0.getValueType())) {
     bool DoXform = true;
     SmallVector<SDNode *, 4> SetCCs;
     if (!N0.hasOneUse())
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 74e6c139c610d..6933d21c5d8a9 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -746,57 +746,56 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
   // intrinsics.
   setOperationAction(ISD::INTRINSIC_W_CHAIN, MVT::Other, Custom);
 
-  // Turn FP extload into load/fpextend
-  setLoadExtAction(ISD::EXTLOAD, MVT::f32, MVT::f16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::f64, MVT::f16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::f32, MVT::bf16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::f64, MVT::bf16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::f64, MVT::f32, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v2f32, MVT::v2f16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v2f64, MVT::v2f16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v2f32, MVT::v2bf16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v2f64, MVT::v2bf16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v2f64, MVT::v2f32, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v4f32, MVT::v4f16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v4f64, MVT::v4f16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v4f32, MVT::v4bf16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v4f64, MVT::v4bf16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v4f64, MVT::v4f32, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v8f32, MVT::v8f16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v8f64, MVT::v8f16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v8f32, MVT::v8bf16, Expand);
-  setLoadExtAction(ISD::EXTLOAD, MVT::v8f64, MVT::v8bf16, Expand);
-  // Turn FP truncstore into trunc + store.
-  // FIXME: vector types should also be expanded
-  setTruncStoreAction(MVT::f32, MVT::f16, Expand);
-  setTruncStoreAction(MVT::f64, MVT::f16, Expand);
-  setTruncStoreAction(MVT::f32, MVT::bf16, Expand);
-  setTruncStoreAction(MVT::f64, MVT::bf16, Expand);
-  setTruncStoreAction(MVT::f64, MVT::f32, Expand);
-  setTruncStoreAction(MVT::v2f32, MVT::v2f16, Expand);
-  setTruncStoreAction(MVT::v2f32, MVT::v2bf16, Expand);
+  // FP extload/truncstore is not legal in PTX. We need to expand all these.
+  for (auto FloatVTs :
+       {MVT::fp_valuetypes(), MVT::fp_fixedlen_vector_valuetypes()}) {
+    for (MVT ValVT : FloatVTs) {
+      for (MVT MemVT : FloatVTs) {
+        setLoadExtAction(ISD::EXTLOAD, ValVT, MemVT, Expand);
+        setTruncStoreAction(ValVT, MemVT, Expand);
+      }
+    }
+  }
 
-  // PTX does not support load / store predicate registers
-  setOperationAction(ISD::LOAD, MVT::i1, Custom);
-  setOperationAction(ISD::STORE, MVT::i1, Custom);
+  // To improve CodeGen we'll legalize any-extend loads to zext loads. This is
+  // how they'll be lowered in ISel anyway, and by doing this a little earlier
+  // we allow for more DAG combine opportunities.
+  for (auto IntVTs :
+       {MVT::integer_valuetypes(), MVT::integer_fixedlen_vector_valuetypes()})
+    for (MVT ValVT : IntVTs)
+      for (MVT MemVT : IntVTs)
+        if (isTypeLegal(ValVT))
+          setLoadExtAction(ISD::EXTLOAD, ValVT, MemVT, Custom);
 
+  // PTX does not support load / store predicate registers
+  setOperationAction({ISD::LOAD, ISD::STORE}, MVT::i1, Custom);
   for (MVT VT : MVT::integer_valuetypes()) {
-    setLoadExtAction(ISD::SEXTLOAD, VT, MVT::i1, Promote);
-    setLoadExtAction(ISD::ZEXTLOAD, VT, MVT::i1, Promote);
-    setLoadExtAction(ISD::EXTLOAD, VT, MVT::i1, Promote);
+    setLoadExtAction({ISD::SEXTLOAD, ISD::ZEXTLOAD, ISD::EXTLOAD}, VT, MVT::i1,
+                     Promote);
     setTruncStoreAction(VT, MVT::i1, Expand);
   }
 
+  // Register custom handling for illegal type loads/stores. We'll try to custom
+  // lower almost all illegal types and logic in the lowering will discard cases
+  // we can't handle.
+  setOperationAction({ISD::LOAD, ISD::STORE}, {MVT::i128, MVT::f128}, Custom);
+  for (MVT VT : MVT::fixedlen_vector_valuetypes())
+    if (!isTypeLegal(VT) && VT.getStoreSizeInBits() <= 256)
+      setOperationAction({ISD::STORE, ISD::LOAD}, VT, Custom);
+
+  // Custom legalization for LDU intrinsics.
+  // TODO: The logic to lower these is not very robust and we should rewrite it.
+  //       Perhaps LDU should not be represented as an intrinsic at all.
+  setOperationAction(ISD::INTRINSIC_W_CHAIN, MVT::i8, Custom);
+  for (MVT VT : MVT::fixedlen_vector_valuetypes())
+    if (IsPTXVectorType(VT))
+      setOperationAction(ISD::INTRINSIC_W_CHAIN, VT, Custom);
+
   setCondCodeAction({ISD::SETNE, ISD::SETEQ, ISD::SETUGE, ISD::SETULE,
                      ISD::SETUGT, ISD::SETULT, ISD::SETGT, ISD::SETLT,
                      ISD::SETGE, ISD::SETLE},
                     MVT::i1, Expand);
 
-  // expand extload of vector of integers.
-  setLoadExtAction({ISD::EXTLOAD, ISD::SEXTLOAD, ISD::ZEXTLOAD}, MVT::v2i16,
-                   MVT::v2i8, Expand);
-  setTruncStoreAction(MVT::v2i16, MVT::v2i8, Expand);
-
   // This is legal in NVPTX
   setOperationAction(ISD::ConstantFP, MVT::f64, Legal);
   setOperationAction(ISD::ConstantFP, MVT::f32, Legal);
@@ -811,24 +810,12 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
   // DEBUGTRAP can be lowered to PTX brkpt
   setOperationAction(ISD::DEBUGTRAP, MVT::Other, Legal);
 
-  // Register custom handling for vector loads/stores
-  for (MVT VT : MVT::fixedlen_vector_valuetypes())
-    if (IsPTXVectorType(VT))
-      setOperationAction({ISD::LOAD, ISD::STORE, ISD::INTRINSIC_W_CHAIN}, VT,
-                         Custom);
-
-  setOperationAction({ISD::LOAD, ISD::STORE, ISD::INTRINSIC_W_CHAIN},
-                     {MVT::i128, MVT::f128}, Custom);
-
   // Support varargs.
   setOperationAction(ISD::VASTART, MVT::Other, Custom);
   setOperationAction(ISD::VAARG, MVT::Other, Custom);
   setOperationAction(ISD::VACOPY, MVT::Other, Expand);
   setOperationAction(ISD::VAEND, MVT::Other, Expand);
 
-  // Custom handling for i8 intrinsics
-  setOperationAction(ISD::INTRINSIC_W_CHAIN, MVT::i8, Custom);
-
   setOperationAction({ISD::ABS, ISD::SMIN, ISD::SMAX, ISD::UMIN, ISD::UMAX},
                      {MVT::i16, MVT::i32, MVT::i64}, Legal);
 
@@ -3135,39 +3122,14 @@ static void replaceLoadVector(SDNode *N, SelectionDAG &DAG,
                               SmallVectorImpl<SDValue> &Results,
                               const NVPTXSubtarget &STI);
 
-SDValue NVPTXTargetLowering::LowerLOAD(SDValue Op, SelectionDAG &DAG) const {
-  if (Op.getValueType() == MVT::i1)
-    return LowerLOADi1(Op, DAG);
-
-  EVT VT = Op.getValueType();
-
-  if (NVPTX::isPackedVectorTy(VT)) {
-    // v2f32/v2f16/v2bf16/v2i16/v4i8 are legal, so we can't rely on legalizer to
-    // handle unaligned loads and have to handle it here.
-    LoadSDNode *Load = cast<LoadSDNode>(Op);
-    EVT MemVT = Load->getMemoryVT();
-    if (!allowsMemoryAccessForAlignment(*DAG.getContext(), DAG.getDataLayout(),
-                                        MemVT, *Load->getMemOperand())) {
-      SDValue Ops[2];
-      std::tie(Ops[0], Ops[1]) = expandUnalignedLoad(Load, DAG);
-      return DAG.getMergeValues(Ops, SDLoc(Op));
-    }
-  }
-
-  return SDValue();
-}
-
 // v = ld i1* addr
 //   =>
 // v1 = ld i8* addr (-> i16)
 // v = trunc i16 to i1
-SDValue NVPTXTargetLowering::LowerLOADi1(SDValue Op, SelectionDAG &DAG) const {
-  SDNode *Node = Op.getNode();
-  LoadSDNode *LD = cast<LoadSDNode>(Node);
-  SDLoc dl(Node);
+static SDValue lowerLOADi1(LoadSDNode *LD, SelectionDAG &DAG) {
+  SDLoc dl(LD);
   assert(LD->getExtensionType() == ISD::NON_EXTLOAD);
-  assert(Node->getValueType(0) == MVT::i1 &&
-         "Custom lowering for i1 load only");
+  assert(LD->getValueType(0) == MVT::i1 && "Custom lowering for i1 load only");
   SDValue newLD = DAG.getExtLoad(ISD::ZEXTLOAD, dl, MVT::i16, LD->getChain(),
                                  LD->getBasePtr(), LD->getPointerInfo(),
                                  MVT::i8, LD->getAlign(),
@@ -3176,8 +3138,27 @@ SDValue NVPTXTargetLowering::LowerLOADi1(SDValue Op, SelectionDAG &DAG) const {
   // The legalizer (the caller) is expecting two values from the legalized
   // load, so we build a MergeValues node for it. See ExpandUnalignedLoad()
   // in LegalizeDAG.cpp which also uses MergeValues.
-  SDValue Ops[] = { result, LD->getChain() };
-  return DAG.getMergeValues(Ops, dl);
+  return DAG.getMergeValues({result, LD->getChain()}, dl);
+}
+
+SDValue NVPTXTargetLowering::LowerLOAD(SDValue Op, SelectionDAG &DAG) const {
+  LoadSDNode *LD = cast<LoadSDNode>(Op);
+
+  if (Op.getValueType() == MVT::i1)
+    return lowerLOADi1(LD, DAG);
+
+  // To improve CodeGen we'll legalize any-extend loads to zext loads. This is
+  // how they'll be lowered in ISel anyway, and by doing this a little earlier
+  // we allow for more DAG combine opportunities.
+  if (LD->getExtensionType() == ISD::EXTLOAD) {
+    assert(LD->getValueType(0).isInteger() && LD->getMemoryVT().isInteger() &&
+           "Unexpected fpext-load");
+    return DAG.getExtLoad(ISD::ZEXTLOAD, SDLoc(Op), Op.getValueType(),
+                          LD->getChain(), LD->getBasePtr(), LD->getMemoryVT(),
+                          LD->getMemOperand());
+  }
+
+  llvm_unreachable("Unexpected custom lowering for load");
 }
 
 SDValue NVPTXTargetLowering::LowerSTORE(SDValue Op, SelectionDAG &DAG) const {
@@ -3187,17 +3168,6 @@ SDValue NVPTXTargetLowering::LowerSTORE(SDValue Op, SelectionDAG &DAG) const {
   if (VT == MVT::i1)
     return LowerSTOREi1(Op, DAG);
 
-  // v2f32/v2f16/v2bf16/v2i16/v4i8 are legal, so we can't rely on legalizer to
-  // handle unaligned stores and have to handle it here.
-  if (NVPTX::isPackedVectorTy(VT) &&
-      !allowsMemoryAccessForAlignment(*DAG.getContext(), DAG.getDataLayout(),
-                                      VT, *Store->getMemOperand()))
-    return expandUnalignedStore(Store, DAG);
-
-  // v2f16/v2bf16/v2i16 don't need special handling.
-  if (NVPTX::isPackedVectorTy(VT) && VT.is32BitVector())
-    return SDValue();
-
   // Lower store of any other vector type, including v2f32 as we want to break
   // it apart since this is not a widely-supported type.
   return LowerSTOREVector(Op, DAG);
@@ -4051,14 +4021,8 @@ bool NVPTXTargetLowering::getTgtMemIntrinsic(
   case Intrinsic::nvvm_ldu_global_i:
   case Intrinsic::nvvm_ldu_global_f:
   case Intrinsic::nvvm_ldu_global_p: {
-    auto &DL = I.getDataLayout();
     Info.opc = ISD::INTRINSIC_W_CHAIN;
-    if (Intrinsic == Intrinsic::nvvm_ldu_global_i)
-      Info.memVT = getValueType(DL, I.getType());
-    else if(Intrinsic == Intrinsic::nvvm_ldu_global_p)
-      Info.memVT = getPointerTy(DL);
-    else
-      Info.memVT = getValueType(DL, I.getType());
+    Info.memVT = getValueType(I.getDataLayout(), I.getType());
     Info.ptrVal = I.getArgOperand(0);
     Info.offset = 0;
     Info.flags = MachineMemOperand::MOLoad;
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index 27f099e220976..e7f1a4b4c98c4 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -309,8 +309,6 @@ class NVPTXTargetLowering : public TargetLowering {
   SDValue LowerFP_EXTEND(SDValue Op, SelectionDAG &DAG) const;
 
   SDValue LowerLOAD(SDValue Op, SelectionDAG &DAG) const;
-  SDValue LowerLOADi1(SDValue Op, SelectionDAG &DAG) const;
-
   SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerSTOREi1(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerSTOREVector(SDValue Op, SelectionDAG &DAG) const;
diff --git a/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll b/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
index b4641d01eb927..4fb294761e347 100644
--- a/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
@@ -711,11 +711,11 @@ define <2 x bfloat> @test_copysign(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; CHECK-NEXT:    .reg .b32 %r<6>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
-; CHECK-NEXT:    ld.param.b32 %r1, [test_copysign_param_0];
-; CHECK-NEXT:    ld.param.b32 %r2, [test_copysign_param_1];
-; CHECK-NEXT:    and.b32 %r3, %r2, -2147450880;
-; CHECK-NEXT:    and.b32 %r4, %r1, 2147450879;
-; CHECK-NEXT:    or.b32 %r5, %r4, %r3;
+; CHECK-NEXT:    ld.param.b32 %r1, [test_copysign_param_1];
+; CHECK-NEXT:    and.b32 %r2, %r1, -2147450880;
+; CHECK-NEXT:    ld.param.b32 %r3, [test_copysign_param_0];
+; CHECK-NEXT:    and.b32 %r4, %r3, 2147450879;
+; CHECK-NEXT:    or.b32 %r5, %r4, %r2;
 ; CHECK-NEXT:    st.param.b32 [func_retval0], %r5;
 ; CHECK-NEXT:    ret;
   %r = call <2 x bfloat> @llvm.copysign.f16(<2 x bfloat> %a, <2 x bfloat> %b)
diff --git a/llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll b/llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll
index 63c389c36e87e..5e7d1a1cd8af1 100644
--- a/llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll
+++ b/llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll
@@ -7,7 +7,7 @@ define i8 @monotonic_monotonic_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8
 ; SM60:       {
 ; SM60-NEXT:    .reg .pred %p<3>;
 ; SM60-NEXT:    .reg .b16 %rs<2>;
-; SM60-NEXT:    .reg .b32 %r<21>;
+; SM60-NEXT:    .reg .b32 %r<20>;
 ; SM60-NEXT:    .reg .b64 %rd<3>;
 ; SM60-EMPTY:
 ; SM60-NEXT:  // %bb.0:
@@ -22,23 +22,22 @@ define i8 @monotonic_monotonic_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8
 ; SM60-NEXT:    shl.b32 %r13, %r12, %r1;
 ; SM60-NEXT:    not.b32 %r2, %r13;
 ; SM60-NEXT:    cvt.u32.u16 %r14, %rs1;
-; SM60-NEXT:    and.b32 %r15, %r14, 255;
-; SM60-NEXT:    shl.b32 %r3, %r15, %r1;
+; SM60-NEXT:    shl.b32 %r3, %r14, %r1;
 ; SM60-NEXT:    shl.b32 %r4, %r9, %r1;
-; SM60-NEXT:    ld.global.b32 %r16, [%rd1];
-; SM60-NEXT:    and.b32 %r20, %r16, %r2;
+; SM60-NEXT:    ld.global.b32 %r15, [%rd1];
+; SM60-NEXT:    and.b32 %r19, %r15, %r2;
 ; SM60-NEXT:  $L__BB0_1: // %partword.cmpxchg.loop
 ; SM60-NEXT:    // =>This Inner Loop Header: Depth=1
-; SM60-NEXT:    or.b32 %r17, %r20, %r3;
-; SM60-NEXT:    or.b32 %r18, %r20, %r4;
-; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r18, %r17;
-; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r18;
+; SM60-NEXT:    or.b32 %r16, %r19, %r3;
+; SM60-NEXT:    or.b32 %r17, %r19, %r4;
+; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r17, %r16;
+; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r17;
 ; SM60-NEXT:    @%p1 bra $L__BB0_3;
 ; SM60-NEXT:  // %bb.2: // %partword.cmpxchg.failure
 ; SM60-NEXT:    // in Loop: Header=BB0_1 Depth=1
 ; SM60-NEXT:    and.b32 %r8, %r7, %r2;
-; SM60-NEXT:    setp.ne.b32 %p2, %r20, %r8;
-; SM60-NEXT:    mov.b32 %r20, %r8;
+; SM60-NEXT:    setp.ne.b32 %p2, %r19, %r8;
+; SM60-NEXT:    mov.b32 %r19, %r8;
 ; SM60-NEXT:    @%p2 bra $L__BB0_1;
 ; SM60-NEXT:  $L__BB0_3: // %partword.cmpxchg.end
 ; SM60-NEXT:    st.param.b32 [func_retval0], %r14;
@@ -52,7 +51,7 @@ define i8 @monotonic_acquire_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
 ; SM60:       {
 ; SM60-NEXT:    .reg .pred %p<3>;
 ; SM60-NEXT:    .reg .b16 %rs<2>;
-; SM60-NEXT:    .reg .b32 %r<21>;
+; SM60-NEXT:    .reg .b32 %r<20>;
 ; SM60-NEXT:    .reg .b64 %rd<3>;
 ; SM60-EMPTY:
 ; SM60-NEXT:  // %bb.0:
@@ -67,23 +66,22 @@ define i8 @monotonic_acquire_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
 ; SM60-NEXT:    shl.b32 %r13, %r12, %r1;
 ; SM60-NEXT:    not.b32 %r2, %r13;
 ; SM60-NEXT:    cvt.u32.u16 %r14, %rs1;
-; SM60-NEXT:    and.b32 %r15, %r14, 255;
-; SM60-NEXT:    shl.b32 %r3, %r15, %r1;
+; SM60-NEXT:    shl.b32 %r3, %r14, %r1;
 ; SM60-NEXT:    shl.b32 %r4, %r9, %r1;
-; SM60-NEXT:    ld.global.b32 %r16, [%rd1];
-; SM60-NEXT:    and.b32 %r20, %r16, %r2;
+; SM60-NEXT:    ld.global.b32 %r15, [%rd1];
+; SM60-NEXT:    and.b32 %r19, %r15, %r2;
 ; SM60-NEXT:  $L__BB1_1: // %partword.cmpxchg.loop
 ; SM60-NEXT:    // =>This Inner Loop Header: Depth=1
-; SM60-NEXT:    or.b32 %r17, %r20, %r3;
-; SM60-NEXT:    or.b32 %r18, %r20, %r4;
-; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r18, %r17;
-; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r18;
+; SM60-NEXT:    or.b32 %r16, %r19, %r3;
+; SM60-NEXT:    or.b32 %r17, %r19, %r4;
+; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r17, %r16;
+; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r17;
 ; SM60-NEXT:    @%p1 bra $L__BB1_3;
 ; SM60-NEXT:  // %bb.2: // %partword.cmpxchg.failure
 ; SM60-NEXT:    // in Loop: Header=BB1_1 Depth=1
 ; SM60-NEXT:    and.b32 %r8, %r7, %r2;
-; SM60-NEXT:    setp.ne.b32 %p2, %r20, %r8;
-; SM60-NEXT:    mov.b32 %r20, %r8;
+; SM60-NEXT:    setp.ne.b32 %p2, %r19, %r8;
+; SM60-NEXT:    mov.b32 %r19, %r8;
 ; SM60-NEXT:    @%p2 bra $L__BB1_1;
 ; SM60-NEXT:  $L__BB1_3: // %partword.cmpxchg.end
 ; SM60-NEXT:    membar.cta;
@@ -98,7 +96,7 @@ define i8 @monotonic_seq_cst_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
 ; SM60:       {
 ; SM60-NEXT:    .reg .pred %p<3>;
 ; SM60-NEXT:    .reg .b16 %rs<2>;
-; SM60-NEXT:    .reg .b32 %r<21>;
+; SM60-NEXT:    .reg .b32 %r<20>;
 ; SM60-NEXT:    .reg .b64 %rd<3>;
 ; SM60-EMPTY:
 ; SM60-NEXT:  // %bb.0:
@@ -114,23 +112,22 @@ define i8 @monotonic_seq_cst_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
 ; SM60-NEXT:    shl.b32 %r13, %r12, %r1;
 ; SM60-NEXT:    not.b32 %r2, %r13;
 ; SM60-NEXT:    cvt.u32.u16 %r14, %rs1;
-; SM60-NEXT:    and.b32 %r15, %r14, 255;
-; SM60-NEXT:    shl.b32 %r3, %r15, %r1;
+; SM60-NEXT:    shl.b32 %r3, %r14, %r1;
 ; SM60-NEXT:    shl.b32 %r4, %r9, %r1;
-; SM60-NEXT:    ld.global.b32 %r16, [%rd1];
-; SM60-NEXT:    and.b32 %r20, %r16, %r2;
+; SM60-NEXT:    ld.global.b32 %r15, [%rd1];
+; SM60-NEXT:    and.b32 %r19, %r15, %r2;
 ; SM60-NEXT:  $L__BB2_1: // %partword.cmpxchg.loop
 ; SM60-NEXT:    // =>This Inner Loop Header: Depth=1
-; SM60-NEXT:    or.b32 %r17, %r20, %r3;
-; SM60-NEXT:    or.b32 %r18, %r20, %r4;
-; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r18, %r17;
-; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r18;
+; SM60-NEXT:    or.b32 %r16, %r19, %r3;
+; SM60-NEXT:    or.b32 %r17, %r19, %r4;
+; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r17, %r16;
+; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r17;
 ; SM60-NEXT:    @%p1 bra $L__BB2_3;
 ; SM60-NEXT:  // %bb.2: // %partword.cmpxchg.failure
 ; SM60-NEXT:    // in Loop: Header=BB2_1 Depth=1
 ; SM60-NEXT:    and.b32 %r8, %r7, %r2;
-; SM60-NEXT:    setp.ne.b32 %p2, %r20, %r8;
-; SM60-NEXT:    mov.b32 %r20, %r8;
+; SM60-NEXT:    setp.ne.b32 %p2, %r19, %r8;
+; SM60-NEXT:    mov.b32 %r19, %r8;
 ; SM60-NEXT:    @%p2 bra $L__BB2_1;
 ; SM60-NEXT:  $L__BB2_3: // %partword.cmpxchg.end
 ; SM60-NEXT:    membar.cta;
@@ -145,7 +142,7 @@ define i8 @acquire_monotonic_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
 ; SM60:       {
 ; SM60-NEXT:    .reg .pred %p<3>;
 ; SM60-NEXT:    .reg .b16 %rs<2>;
-; SM60-NEXT:    .reg .b32 %r<21>;
+; SM60-NEXT:    .reg .b32 %r<20>;
 ; SM60-NEXT:    .reg .b64 %rd<3>;
 ; SM60-EMPTY:
 ; SM60-NEXT:  // %bb.0:
@@ -160,23 +157,22 @@ define i8 @acquire_monotonic_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
 ; SM60-NEXT:    shl.b32 %r13, %r12, %r1;
 ; SM60-NEXT:    not.b32 %r2, %r13;
 ; SM60-NEXT:    cvt.u32.u16 %r14, %rs1;
-; SM60-NEXT:    and.b32 %r15, %r14, 255;
-; SM60-NEXT:    shl.b32 %r3, %r15, %r1;
+; SM60-NEXT:    shl.b32 %r3, %r14, %r1;
 ; SM60-NEXT:    shl.b32 %r4, %r9, %r1;
-; SM60-NEXT:    ld.global.b32 %r16, [%rd1];
-; SM60-NEXT:    and.b32 %r20, %r16, %r2;
+; SM60-NEXT:    ld.global.b32 %r15, [%rd1];
+; SM60-NEXT:    and.b32 %r19, %r15, %r2;
 ; SM60-NEXT:  $L__BB3_1: // %partword.cmpxchg.loop
 ; SM60-NEXT:    // =>This Inner Loop Header: Depth=1
-; SM60-NEXT:    or.b32 %r17, %r20, %r3;
-; SM60-NEXT:    or.b32 %r18, %r20, %r4;
-; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r18, %r17;
-; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r18;
+; SM60-NEXT:    or.b32 %r16, %r19, %r3;
+; SM60-NEXT:    or.b32 %r17,...
[truncated]

@llvmbot
Member

llvmbot commented Aug 19, 2025

@llvm/pr-subscribers-backend-nvptx

     Info.offset = 0;
     Info.flags = MachineMemOperand::MOLoad;
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index 27f099e220976..e7f1a4b4c98c4 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -309,8 +309,6 @@ class NVPTXTargetLowering : public TargetLowering {
   SDValue LowerFP_EXTEND(SDValue Op, SelectionDAG &DAG) const;
 
   SDValue LowerLOAD(SDValue Op, SelectionDAG &DAG) const;
-  SDValue LowerLOADi1(SDValue Op, SelectionDAG &DAG) const;
-
   SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerSTOREi1(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerSTOREVector(SDValue Op, SelectionDAG &DAG) const;
diff --git a/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll b/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
index b4641d01eb927..4fb294761e347 100644
--- a/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
@@ -711,11 +711,11 @@ define <2 x bfloat> @test_copysign(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; CHECK-NEXT:    .reg .b32 %r<6>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
-; CHECK-NEXT:    ld.param.b32 %r1, [test_copysign_param_0];
-; CHECK-NEXT:    ld.param.b32 %r2, [test_copysign_param_1];
-; CHECK-NEXT:    and.b32 %r3, %r2, -2147450880;
-; CHECK-NEXT:    and.b32 %r4, %r1, 2147450879;
-; CHECK-NEXT:    or.b32 %r5, %r4, %r3;
+; CHECK-NEXT:    ld.param.b32 %r1, [test_copysign_param_1];
+; CHECK-NEXT:    and.b32 %r2, %r1, -2147450880;
+; CHECK-NEXT:    ld.param.b32 %r3, [test_copysign_param_0];
+; CHECK-NEXT:    and.b32 %r4, %r3, 2147450879;
+; CHECK-NEXT:    or.b32 %r5, %r4, %r2;
 ; CHECK-NEXT:    st.param.b32 [func_retval0], %r5;
 ; CHECK-NEXT:    ret;
   %r = call <2 x bfloat> @llvm.copysign.f16(<2 x bfloat> %a, <2 x bfloat> %b)
diff --git a/llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll b/llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll
index 63c389c36e87e..5e7d1a1cd8af1 100644
--- a/llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll
+++ b/llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll
@@ -7,7 +7,7 @@ define i8 @monotonic_monotonic_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8
 ; SM60:       {
 ; SM60-NEXT:    .reg .pred %p<3>;
 ; SM60-NEXT:    .reg .b16 %rs<2>;
-; SM60-NEXT:    .reg .b32 %r<21>;
+; SM60-NEXT:    .reg .b32 %r<20>;
 ; SM60-NEXT:    .reg .b64 %rd<3>;
 ; SM60-EMPTY:
 ; SM60-NEXT:  // %bb.0:
@@ -22,23 +22,22 @@ define i8 @monotonic_monotonic_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8
 ; SM60-NEXT:    shl.b32 %r13, %r12, %r1;
 ; SM60-NEXT:    not.b32 %r2, %r13;
 ; SM60-NEXT:    cvt.u32.u16 %r14, %rs1;
-; SM60-NEXT:    and.b32 %r15, %r14, 255;
-; SM60-NEXT:    shl.b32 %r3, %r15, %r1;
+; SM60-NEXT:    shl.b32 %r3, %r14, %r1;
 ; SM60-NEXT:    shl.b32 %r4, %r9, %r1;
-; SM60-NEXT:    ld.global.b32 %r16, [%rd1];
-; SM60-NEXT:    and.b32 %r20, %r16, %r2;
+; SM60-NEXT:    ld.global.b32 %r15, [%rd1];
+; SM60-NEXT:    and.b32 %r19, %r15, %r2;
 ; SM60-NEXT:  $L__BB0_1: // %partword.cmpxchg.loop
 ; SM60-NEXT:    // =>This Inner Loop Header: Depth=1
-; SM60-NEXT:    or.b32 %r17, %r20, %r3;
-; SM60-NEXT:    or.b32 %r18, %r20, %r4;
-; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r18, %r17;
-; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r18;
+; SM60-NEXT:    or.b32 %r16, %r19, %r3;
+; SM60-NEXT:    or.b32 %r17, %r19, %r4;
+; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r17, %r16;
+; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r17;
 ; SM60-NEXT:    @%p1 bra $L__BB0_3;
 ; SM60-NEXT:  // %bb.2: // %partword.cmpxchg.failure
 ; SM60-NEXT:    // in Loop: Header=BB0_1 Depth=1
 ; SM60-NEXT:    and.b32 %r8, %r7, %r2;
-; SM60-NEXT:    setp.ne.b32 %p2, %r20, %r8;
-; SM60-NEXT:    mov.b32 %r20, %r8;
+; SM60-NEXT:    setp.ne.b32 %p2, %r19, %r8;
+; SM60-NEXT:    mov.b32 %r19, %r8;
 ; SM60-NEXT:    @%p2 bra $L__BB0_1;
 ; SM60-NEXT:  $L__BB0_3: // %partword.cmpxchg.end
 ; SM60-NEXT:    st.param.b32 [func_retval0], %r14;
@@ -52,7 +51,7 @@ define i8 @monotonic_acquire_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
 ; SM60:       {
 ; SM60-NEXT:    .reg .pred %p<3>;
 ; SM60-NEXT:    .reg .b16 %rs<2>;
-; SM60-NEXT:    .reg .b32 %r<21>;
+; SM60-NEXT:    .reg .b32 %r<20>;
 ; SM60-NEXT:    .reg .b64 %rd<3>;
 ; SM60-EMPTY:
 ; SM60-NEXT:  // %bb.0:
@@ -67,23 +66,22 @@ define i8 @monotonic_acquire_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
 ; SM60-NEXT:    shl.b32 %r13, %r12, %r1;
 ; SM60-NEXT:    not.b32 %r2, %r13;
 ; SM60-NEXT:    cvt.u32.u16 %r14, %rs1;
-; SM60-NEXT:    and.b32 %r15, %r14, 255;
-; SM60-NEXT:    shl.b32 %r3, %r15, %r1;
+; SM60-NEXT:    shl.b32 %r3, %r14, %r1;
 ; SM60-NEXT:    shl.b32 %r4, %r9, %r1;
-; SM60-NEXT:    ld.global.b32 %r16, [%rd1];
-; SM60-NEXT:    and.b32 %r20, %r16, %r2;
+; SM60-NEXT:    ld.global.b32 %r15, [%rd1];
+; SM60-NEXT:    and.b32 %r19, %r15, %r2;
 ; SM60-NEXT:  $L__BB1_1: // %partword.cmpxchg.loop
 ; SM60-NEXT:    // =>This Inner Loop Header: Depth=1
-; SM60-NEXT:    or.b32 %r17, %r20, %r3;
-; SM60-NEXT:    or.b32 %r18, %r20, %r4;
-; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r18, %r17;
-; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r18;
+; SM60-NEXT:    or.b32 %r16, %r19, %r3;
+; SM60-NEXT:    or.b32 %r17, %r19, %r4;
+; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r17, %r16;
+; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r17;
 ; SM60-NEXT:    @%p1 bra $L__BB1_3;
 ; SM60-NEXT:  // %bb.2: // %partword.cmpxchg.failure
 ; SM60-NEXT:    // in Loop: Header=BB1_1 Depth=1
 ; SM60-NEXT:    and.b32 %r8, %r7, %r2;
-; SM60-NEXT:    setp.ne.b32 %p2, %r20, %r8;
-; SM60-NEXT:    mov.b32 %r20, %r8;
+; SM60-NEXT:    setp.ne.b32 %p2, %r19, %r8;
+; SM60-NEXT:    mov.b32 %r19, %r8;
 ; SM60-NEXT:    @%p2 bra $L__BB1_1;
 ; SM60-NEXT:  $L__BB1_3: // %partword.cmpxchg.end
 ; SM60-NEXT:    membar.cta;
@@ -98,7 +96,7 @@ define i8 @monotonic_seq_cst_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
 ; SM60:       {
 ; SM60-NEXT:    .reg .pred %p<3>;
 ; SM60-NEXT:    .reg .b16 %rs<2>;
-; SM60-NEXT:    .reg .b32 %r<21>;
+; SM60-NEXT:    .reg .b32 %r<20>;
 ; SM60-NEXT:    .reg .b64 %rd<3>;
 ; SM60-EMPTY:
 ; SM60-NEXT:  // %bb.0:
@@ -114,23 +112,22 @@ define i8 @monotonic_seq_cst_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
 ; SM60-NEXT:    shl.b32 %r13, %r12, %r1;
 ; SM60-NEXT:    not.b32 %r2, %r13;
 ; SM60-NEXT:    cvt.u32.u16 %r14, %rs1;
-; SM60-NEXT:    and.b32 %r15, %r14, 255;
-; SM60-NEXT:    shl.b32 %r3, %r15, %r1;
+; SM60-NEXT:    shl.b32 %r3, %r14, %r1;
 ; SM60-NEXT:    shl.b32 %r4, %r9, %r1;
-; SM60-NEXT:    ld.global.b32 %r16, [%rd1];
-; SM60-NEXT:    and.b32 %r20, %r16, %r2;
+; SM60-NEXT:    ld.global.b32 %r15, [%rd1];
+; SM60-NEXT:    and.b32 %r19, %r15, %r2;
 ; SM60-NEXT:  $L__BB2_1: // %partword.cmpxchg.loop
 ; SM60-NEXT:    // =>This Inner Loop Header: Depth=1
-; SM60-NEXT:    or.b32 %r17, %r20, %r3;
-; SM60-NEXT:    or.b32 %r18, %r20, %r4;
-; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r18, %r17;
-; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r18;
+; SM60-NEXT:    or.b32 %r16, %r19, %r3;
+; SM60-NEXT:    or.b32 %r17, %r19, %r4;
+; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r17, %r16;
+; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r17;
 ; SM60-NEXT:    @%p1 bra $L__BB2_3;
 ; SM60-NEXT:  // %bb.2: // %partword.cmpxchg.failure
 ; SM60-NEXT:    // in Loop: Header=BB2_1 Depth=1
 ; SM60-NEXT:    and.b32 %r8, %r7, %r2;
-; SM60-NEXT:    setp.ne.b32 %p2, %r20, %r8;
-; SM60-NEXT:    mov.b32 %r20, %r8;
+; SM60-NEXT:    setp.ne.b32 %p2, %r19, %r8;
+; SM60-NEXT:    mov.b32 %r19, %r8;
 ; SM60-NEXT:    @%p2 bra $L__BB2_1;
 ; SM60-NEXT:  $L__BB2_3: // %partword.cmpxchg.end
 ; SM60-NEXT:    membar.cta;
@@ -145,7 +142,7 @@ define i8 @acquire_monotonic_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
 ; SM60:       {
 ; SM60-NEXT:    .reg .pred %p<3>;
 ; SM60-NEXT:    .reg .b16 %rs<2>;
-; SM60-NEXT:    .reg .b32 %r<21>;
+; SM60-NEXT:    .reg .b32 %r<20>;
 ; SM60-NEXT:    .reg .b64 %rd<3>;
 ; SM60-EMPTY:
 ; SM60-NEXT:  // %bb.0:
@@ -160,23 +157,22 @@ define i8 @acquire_monotonic_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
 ; SM60-NEXT:    shl.b32 %r13, %r12, %r1;
 ; SM60-NEXT:    not.b32 %r2, %r13;
 ; SM60-NEXT:    cvt.u32.u16 %r14, %rs1;
-; SM60-NEXT:    and.b32 %r15, %r14, 255;
-; SM60-NEXT:    shl.b32 %r3, %r15, %r1;
+; SM60-NEXT:    shl.b32 %r3, %r14, %r1;
 ; SM60-NEXT:    shl.b32 %r4, %r9, %r1;
-; SM60-NEXT:    ld.global.b32 %r16, [%rd1];
-; SM60-NEXT:    and.b32 %r20, %r16, %r2;
+; SM60-NEXT:    ld.global.b32 %r15, [%rd1];
+; SM60-NEXT:    and.b32 %r19, %r15, %r2;
 ; SM60-NEXT:  $L__BB3_1: // %partword.cmpxchg.loop
 ; SM60-NEXT:    // =>This Inner Loop Header: Depth=1
-; SM60-NEXT:    or.b32 %r17, %r20, %r3;
-; SM60-NEXT:    or.b32 %r18, %r20, %r4;
-; SM60-NEXT:    atom.cta.global.cas.b32 %r7, [%rd1], %r18, %r17;
-; SM60-NEXT:    setp.eq.b32 %p1, %r7, %r18;
+; SM60-NEXT:    or.b32 %r16, %r19, %r3;
+; SM60-NEXT:    or.b32 %r17,...
[truncated]

Member

@Artem-B Artem-B left a comment


LGTM.

So, in practice, the only observable effect on the tests is on loading i8 scalars that we store in i16 registers.

@AlexMaclean AlexMaclean requested a review from topperc August 20, 2025 16:38
@AlexMaclean
Member Author

@topperc Do these changes to Mips tests look alright to you?

@AlexMaclean AlexMaclean force-pushed the dev/amaclean/ld-work branch from 6e5e4d4 to c359e0d Compare August 20, 2025 16:52
Member

Artem-B commented Aug 20, 2025

My hazy memory of MIPS is that signed/unsigned load only matters for byte/short variants lb/lh, so lw/lwu changes should be benign.

However, the changes that mention # implicit-def: $a0_64 may need a closer look from someone familiar with MIPS. MIPS assembly and ABIs are funky, and those extra register moves may have been there for a reason. It's probably OK, too. AFAICT, the code just loads function arguments into a0/a1/a2 (AKA $4/$5/$6) before the function call. If anything the old code looks somewhat questionable, as $1 is typically reserved for the assembler use on MIPS.

@topperc
Collaborator

topperc commented Aug 20, 2025

My hazy memory of MIPS is that signed/unsigned load only matters for byte/short variants lb/lh, so lw/lwu changes should be benign.

However, the changes that mention # implicit-def: $a0_64 may need a closer look from someone familiar with MIPS. MIPS assembly and ABIs are funky, and those extra register moves may have been there for a reason. It's probably OK, too. AFAICT, the code just loads function arguments into a0/a1/a2 (AKA $4/$5/$6) before the function call. If anything the old code looks somewhat questionable, as $1 is typically reserved for the assembler use on MIPS.

I think the Mips test with the extra moves was compiled with -O0 so some DAGCombine optimizations might have been disabled. This change may have caused some optimization on a different path.

@AlexMaclean AlexMaclean force-pushed the dev/amaclean/ld-work branch from 322a43f to 245886f Compare August 21, 2025 21:41
@AlexMaclean AlexMaclean merged commit a3ed96b into llvm:main Aug 21, 2025
9 checks passed
@llvm-ci
Collaborator

llvm-ci commented Aug 21, 2025

LLVM Buildbot has detected a new failure on builder clangd-ubuntu-tsan running on clangd-ubuntu-clang while building llvm at step 6 "test-build-clangd-clangd-index-server-clangd-in...".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/134/builds/24898

Here is the relevant piece of the build log for the reference
Step 6 (test-build-clangd-clangd-index-server-clangd-in...) failure: test (failure)
******************** TEST 'Clangd :: target_info.test' FAILED ********************
Exit Code: 66

Command Output (stderr):
--
rm -rf /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.dir && mkdir -p /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.dir # RUN: at line 5
+ rm -rf /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.dir
+ mkdir -p /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.dir
echo '[{"directory": "/vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.dir", "command": "/vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.dir/armv7-clang -x c++ the-file.cpp -v", "file": "the-file.cpp"}]' > /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.dir/compile_commands.json # RUN: at line 7
+ echo '[{"directory": "/vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.dir", "command": "/vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.dir/armv7-clang -x c++ the-file.cpp -v", "file": "the-file.cpp"}]'
sed -e "s|INPUT_DIR|/vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.dir|g" /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/llvm-project/clang-tools-extra/clangd/test/target_info.test > /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.test.1 # RUN: at line 9
+ sed -e 's|INPUT_DIR|/vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.dir|g' /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/llvm-project/clang-tools-extra/clangd/test/target_info.test
sed -E -e 's|"file://([A-Z]):/|"file:///\1:/|g' /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.test.1 > /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.test # RUN: at line 12
+ sed -E -e 's|"file://([A-Z]):/|"file:///\1:/|g' /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.test.1
clangd -lit-test < /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.test 2>&1 | /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/bin/FileCheck -strict-whitespace /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.test # RUN: at line 14
+ clangd -lit-test
+ /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/bin/FileCheck -strict-whitespace /vol/worker/clangd-ubuntu-clang/clangd-ubuntu-tsan/build/tools/clang/tools/extra/clangd/test/Output/target_info.test.tmp.test

--

********************


@jhuber6
Contributor

jhuber6 commented Aug 22, 2025

I bisected some new libc failures to this PR. After reverting it and rebuilding the tests all pass https://lab.llvm.org/buildbot/#/builders/69/builds/26327/steps/12/logs/stdio. Should we go ahead and revert this for now?

@AlexMaclean
Member Author

I bisected some new libc failures to this PR. After reverting it and rebuilding the tests all pass https://lab.llvm.org/buildbot/#/builders/69/builds/26327/steps/12/logs/stdio. Should we go ahead and revert this for now?

@jhuber6 would you be able to attach any more information such as LLVM IR or PTX before/after? or reproduction instructions? If this is possible to reproduce I'm happy to take a look right now. If you think investigation will take a long time I'd be happy to revert as well.

@jhuber6
Contributor

jhuber6 commented Aug 22, 2025

I bisected some new libc failures to this PR. After reverting it and rebuilding the tests all pass https://lab.llvm.org/buildbot/#/builders/69/builds/26327/steps/12/logs/stdio. Should we go ahead and revert this for now?

@jhuber6 would you be able to attach any more information such as LLVM IR or PTX before/after? or reproduction instructions? If this is possible to reproduce I'm happy to take a look right now. If you think investigation will take a long time I'd be happy to revert as well.

It caused a handful of tests to fail; you should be able to build and run the libc tests locally if you have a working CUDA installation + GPU. I spent over forty minutes trying and failing to get a reproducer because the NVPTX toolchain is so weird and I can't spend all day doing this.

The documentation for building is at https://libc.llvm.org/gpu/building.html#cmake-options and it would probably help to run these tests locally when you make changes since it covers a lot of uncommon use-cases.

@jhuber6
Contributor

jhuber6 commented Aug 22, 2025

Here: https://godbolt.org/z/Eqqjoon7b. It's a ton of LLVM IR, but the difference I see in the assembly is only a small change: the one on the left with three local stores works; the other one causes the error.

+ +--506 lines: // Generated by LLVM NVPTX Back-End··································································│+ +--506 lines: // Generated by LLVM NVPTX Back-End·································································
  // %bb.0:                               // %entry                                                                  │  // %bb.0:                               // %entry      
    mov.b64   %SPL, __local_depot9;                                                                                  │    mov.b64   %SPL, __local_depot9;      
    ld.param.b64  %rd158, [_ZN33LlvmLibcIntegerToStringTest_INT323RunEv_param_0];                                    │    ld.param.b64  %rd158, [_ZN33LlvmLibcIntegerToStringTest_INT323RunEv_param_0];      
    add.u64   %rd1, %SPL, 0;                                                                                         │    add.u64   %rd1, %SPL, 0;      
    add.u64   %rd159, %SPL, 24;                                                                                      │    add.u64   %rd159, %SPL, 24;      
    add.s64   %rd2, %rd159, 9;                                                                                       │    add.s64   %rd2, %rd159, 9;      
    st.local.b8   [%rd159+10], 49;                                                                                   │    st.local.b8   [%rd159+12], 0;      
  -------------------------------------------------------------------------------------------------------------------│    st.local.b8   [%rd159+11], 49;      
  -------------------------------------------------------------------------------------------------------------------│    st.local.b8   [%rd159+10], 0;      
    st.local.b8   [%rd159+9], 45;                                                                                    │    st.local.b8   [%rd159+9], 45;      
    st.local.b64  [%rd159+16], 2;                                                                                    │    st.local.b64  [%rd159+16], 2;      
    ld.b64  %rd3, [%rd158+16];                                                                                       │    ld.b64  %rd3, [%rd158+16];      
    mov.b64   %rd978, 0;                                                                                             │    mov.b64   %rd978, 0;      
    mov.b64   %rd161, _$_str;                                                                                        │    mov.b64   %rd161, _$_str;      
  $L__BB9_1:                              // %for.body.i.i.i.i                                                       │  $L__BB9_1:                              // %for.body.i.i.i.i      
+ +--17287 lines: =>This Inner Loop Header: Depth=1··································································│+ +--17287 lines: =>This Inner Loop Header: Depth=1·································································

@rupprecht
Collaborator

We also see some JAX test failures that bisect here, specifically testSort and testSortAgainstNumpy: https://github.com/jax-ml/jax/blob/3d52f5f87095189c3c619fb9daff3259a519aaad/tests/lax_test.py#L2528. I don't have any repro instructions or public buildbots to point to however.

jhuber6 added a commit that referenced this pull request Aug 22, 2025
@jhuber6
Contributor

jhuber6 commented Aug 22, 2025

Went ahead and reverted it since the bot has been red for over a day.

AlexMaclean added a commit to AlexMaclean/llvm-project that referenced this pull request Aug 23, 2025
@AlexMaclean
Member Author

Looks like this change accidentally removed b6e19b3. I've added this bit back with tests to make sure this doesn't happen again. I'd appreciate both of your review on #155063 when you have a minute!
