[NVPTX] Legalize aext-load to zext-load to expose more DAG combines #154251
Conversation
@llvm/pr-subscribers-backend-mips @llvm/pr-subscribers-llvm-selectiondag @llvm/pr-subscribers-backend-nvptx

Author: Alex MacLean (AlexMaclean)

Changes

Patch is 130.65 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154251.diff

9 Files Affected:
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index c16ccaf926bc7..e63568bc3fb63 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -15122,7 +15122,7 @@ SDValue DAGCombiner::visitANY_EXTEND(SDNode *N) {
return foldedExt;
} else if (ISD::isNON_EXTLoad(N0.getNode()) &&
ISD::isUNINDEXEDLoad(N0.getNode()) &&
- TLI.isLoadExtLegal(ISD::EXTLOAD, VT, N0.getValueType())) {
+ TLI.isLoadExtLegalOrCustom(ISD::EXTLOAD, VT, N0.getValueType())) {
bool DoXform = true;
SmallVector<SDNode *, 4> SetCCs;
if (!N0.hasOneUse())
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 74e6c139c610d..6933d21c5d8a9 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -746,57 +746,56 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
// intrinsics.
setOperationAction(ISD::INTRINSIC_W_CHAIN, MVT::Other, Custom);
- // Turn FP extload into load/fpextend
- setLoadExtAction(ISD::EXTLOAD, MVT::f32, MVT::f16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::f64, MVT::f16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::f32, MVT::bf16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::f64, MVT::bf16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::f64, MVT::f32, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v2f32, MVT::v2f16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v2f64, MVT::v2f16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v2f32, MVT::v2bf16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v2f64, MVT::v2bf16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v2f64, MVT::v2f32, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v4f32, MVT::v4f16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v4f64, MVT::v4f16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v4f32, MVT::v4bf16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v4f64, MVT::v4bf16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v4f64, MVT::v4f32, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v8f32, MVT::v8f16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v8f64, MVT::v8f16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v8f32, MVT::v8bf16, Expand);
- setLoadExtAction(ISD::EXTLOAD, MVT::v8f64, MVT::v8bf16, Expand);
- // Turn FP truncstore into trunc + store.
- // FIXME: vector types should also be expanded
- setTruncStoreAction(MVT::f32, MVT::f16, Expand);
- setTruncStoreAction(MVT::f64, MVT::f16, Expand);
- setTruncStoreAction(MVT::f32, MVT::bf16, Expand);
- setTruncStoreAction(MVT::f64, MVT::bf16, Expand);
- setTruncStoreAction(MVT::f64, MVT::f32, Expand);
- setTruncStoreAction(MVT::v2f32, MVT::v2f16, Expand);
- setTruncStoreAction(MVT::v2f32, MVT::v2bf16, Expand);
+ // FP extload/truncstore is not legal in PTX. We need to expand all these.
+ for (auto FloatVTs :
+ {MVT::fp_valuetypes(), MVT::fp_fixedlen_vector_valuetypes()}) {
+ for (MVT ValVT : FloatVTs) {
+ for (MVT MemVT : FloatVTs) {
+ setLoadExtAction(ISD::EXTLOAD, ValVT, MemVT, Expand);
+ setTruncStoreAction(ValVT, MemVT, Expand);
+ }
+ }
+ }
- // PTX does not support load / store predicate registers
- setOperationAction(ISD::LOAD, MVT::i1, Custom);
- setOperationAction(ISD::STORE, MVT::i1, Custom);
+ // To improve CodeGen we'll legalize any-extend loads to zext loads. This is
+ // how they'll be lowered in ISel anyway, and by doing this a little earlier
+ // we allow for more DAG combine opportunities.
+ for (auto IntVTs :
+ {MVT::integer_valuetypes(), MVT::integer_fixedlen_vector_valuetypes()})
+ for (MVT ValVT : IntVTs)
+ for (MVT MemVT : IntVTs)
+ if (isTypeLegal(ValVT))
+ setLoadExtAction(ISD::EXTLOAD, ValVT, MemVT, Custom);
+ // PTX does not support load / store predicate registers
+ setOperationAction({ISD::LOAD, ISD::STORE}, MVT::i1, Custom);
for (MVT VT : MVT::integer_valuetypes()) {
- setLoadExtAction(ISD::SEXTLOAD, VT, MVT::i1, Promote);
- setLoadExtAction(ISD::ZEXTLOAD, VT, MVT::i1, Promote);
- setLoadExtAction(ISD::EXTLOAD, VT, MVT::i1, Promote);
+ setLoadExtAction({ISD::SEXTLOAD, ISD::ZEXTLOAD, ISD::EXTLOAD}, VT, MVT::i1,
+ Promote);
setTruncStoreAction(VT, MVT::i1, Expand);
}
+ // Register custom handling for illegal type loads/stores. We'll try to custom
+ // lower almost all illegal types and logic in the lowering will discard cases
+ // we can't handle.
+ setOperationAction({ISD::LOAD, ISD::STORE}, {MVT::i128, MVT::f128}, Custom);
+ for (MVT VT : MVT::fixedlen_vector_valuetypes())
+ if (!isTypeLegal(VT) && VT.getStoreSizeInBits() <= 256)
+ setOperationAction({ISD::STORE, ISD::LOAD}, VT, Custom);
+
+ // Custom legalization for LDU intrinsics.
+ // TODO: The logic to lower these is not very robust and we should rewrite it.
+ // Perhaps LDU should not be represented as an intrinsic at all.
+ setOperationAction(ISD::INTRINSIC_W_CHAIN, MVT::i8, Custom);
+ for (MVT VT : MVT::fixedlen_vector_valuetypes())
+ if (IsPTXVectorType(VT))
+ setOperationAction(ISD::INTRINSIC_W_CHAIN, VT, Custom);
+
setCondCodeAction({ISD::SETNE, ISD::SETEQ, ISD::SETUGE, ISD::SETULE,
ISD::SETUGT, ISD::SETULT, ISD::SETGT, ISD::SETLT,
ISD::SETGE, ISD::SETLE},
MVT::i1, Expand);
- // expand extload of vector of integers.
- setLoadExtAction({ISD::EXTLOAD, ISD::SEXTLOAD, ISD::ZEXTLOAD}, MVT::v2i16,
- MVT::v2i8, Expand);
- setTruncStoreAction(MVT::v2i16, MVT::v2i8, Expand);
-
// This is legal in NVPTX
setOperationAction(ISD::ConstantFP, MVT::f64, Legal);
setOperationAction(ISD::ConstantFP, MVT::f32, Legal);
@@ -811,24 +810,12 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
// DEBUGTRAP can be lowered to PTX brkpt
setOperationAction(ISD::DEBUGTRAP, MVT::Other, Legal);
- // Register custom handling for vector loads/stores
- for (MVT VT : MVT::fixedlen_vector_valuetypes())
- if (IsPTXVectorType(VT))
- setOperationAction({ISD::LOAD, ISD::STORE, ISD::INTRINSIC_W_CHAIN}, VT,
- Custom);
-
- setOperationAction({ISD::LOAD, ISD::STORE, ISD::INTRINSIC_W_CHAIN},
- {MVT::i128, MVT::f128}, Custom);
-
// Support varargs.
setOperationAction(ISD::VASTART, MVT::Other, Custom);
setOperationAction(ISD::VAARG, MVT::Other, Custom);
setOperationAction(ISD::VACOPY, MVT::Other, Expand);
setOperationAction(ISD::VAEND, MVT::Other, Expand);
- // Custom handling for i8 intrinsics
- setOperationAction(ISD::INTRINSIC_W_CHAIN, MVT::i8, Custom);
-
setOperationAction({ISD::ABS, ISD::SMIN, ISD::SMAX, ISD::UMIN, ISD::UMAX},
{MVT::i16, MVT::i32, MVT::i64}, Legal);
@@ -3135,39 +3122,14 @@ static void replaceLoadVector(SDNode *N, SelectionDAG &DAG,
SmallVectorImpl<SDValue> &Results,
const NVPTXSubtarget &STI);
-SDValue NVPTXTargetLowering::LowerLOAD(SDValue Op, SelectionDAG &DAG) const {
- if (Op.getValueType() == MVT::i1)
- return LowerLOADi1(Op, DAG);
-
- EVT VT = Op.getValueType();
-
- if (NVPTX::isPackedVectorTy(VT)) {
- // v2f32/v2f16/v2bf16/v2i16/v4i8 are legal, so we can't rely on legalizer to
- // handle unaligned loads and have to handle it here.
- LoadSDNode *Load = cast<LoadSDNode>(Op);
- EVT MemVT = Load->getMemoryVT();
- if (!allowsMemoryAccessForAlignment(*DAG.getContext(), DAG.getDataLayout(),
- MemVT, *Load->getMemOperand())) {
- SDValue Ops[2];
- std::tie(Ops[0], Ops[1]) = expandUnalignedLoad(Load, DAG);
- return DAG.getMergeValues(Ops, SDLoc(Op));
- }
- }
-
- return SDValue();
-}
-
// v = ld i1* addr
// =>
// v1 = ld i8* addr (-> i16)
// v = trunc i16 to i1
-SDValue NVPTXTargetLowering::LowerLOADi1(SDValue Op, SelectionDAG &DAG) const {
- SDNode *Node = Op.getNode();
- LoadSDNode *LD = cast<LoadSDNode>(Node);
- SDLoc dl(Node);
+static SDValue lowerLOADi1(LoadSDNode *LD, SelectionDAG &DAG) {
+ SDLoc dl(LD);
assert(LD->getExtensionType() == ISD::NON_EXTLOAD);
- assert(Node->getValueType(0) == MVT::i1 &&
- "Custom lowering for i1 load only");
+ assert(LD->getValueType(0) == MVT::i1 && "Custom lowering for i1 load only");
SDValue newLD = DAG.getExtLoad(ISD::ZEXTLOAD, dl, MVT::i16, LD->getChain(),
LD->getBasePtr(), LD->getPointerInfo(),
MVT::i8, LD->getAlign(),
@@ -3176,8 +3138,27 @@ SDValue NVPTXTargetLowering::LowerLOADi1(SDValue Op, SelectionDAG &DAG) const {
// The legalizer (the caller) is expecting two values from the legalized
// load, so we build a MergeValues node for it. See ExpandUnalignedLoad()
// in LegalizeDAG.cpp which also uses MergeValues.
- SDValue Ops[] = { result, LD->getChain() };
- return DAG.getMergeValues(Ops, dl);
+ return DAG.getMergeValues({result, LD->getChain()}, dl);
+}
+
+SDValue NVPTXTargetLowering::LowerLOAD(SDValue Op, SelectionDAG &DAG) const {
+ LoadSDNode *LD = cast<LoadSDNode>(Op);
+
+ if (Op.getValueType() == MVT::i1)
+ return lowerLOADi1(LD, DAG);
+
+ // To improve CodeGen we'll legalize any-extend loads to zext loads. This is
+ // how they'll be lowered in ISel anyway, and by doing this a little earlier
+ // we allow for more DAG combine opportunities.
+ if (LD->getExtensionType() == ISD::EXTLOAD) {
+ assert(LD->getValueType(0).isInteger() && LD->getMemoryVT().isInteger() &&
+ "Unexpected fpext-load");
+ return DAG.getExtLoad(ISD::ZEXTLOAD, SDLoc(Op), Op.getValueType(),
+ LD->getChain(), LD->getBasePtr(), LD->getMemoryVT(),
+ LD->getMemOperand());
+ }
+
+ llvm_unreachable("Unexpected custom lowering for load");
}
SDValue NVPTXTargetLowering::LowerSTORE(SDValue Op, SelectionDAG &DAG) const {
@@ -3187,17 +3168,6 @@ SDValue NVPTXTargetLowering::LowerSTORE(SDValue Op, SelectionDAG &DAG) const {
if (VT == MVT::i1)
return LowerSTOREi1(Op, DAG);
- // v2f32/v2f16/v2bf16/v2i16/v4i8 are legal, so we can't rely on legalizer to
- // handle unaligned stores and have to handle it here.
- if (NVPTX::isPackedVectorTy(VT) &&
- !allowsMemoryAccessForAlignment(*DAG.getContext(), DAG.getDataLayout(),
- VT, *Store->getMemOperand()))
- return expandUnalignedStore(Store, DAG);
-
- // v2f16/v2bf16/v2i16 don't need special handling.
- if (NVPTX::isPackedVectorTy(VT) && VT.is32BitVector())
- return SDValue();
-
// Lower store of any other vector type, including v2f32 as we want to break
// it apart since this is not a widely-supported type.
return LowerSTOREVector(Op, DAG);
@@ -4051,14 +4021,8 @@ bool NVPTXTargetLowering::getTgtMemIntrinsic(
case Intrinsic::nvvm_ldu_global_i:
case Intrinsic::nvvm_ldu_global_f:
case Intrinsic::nvvm_ldu_global_p: {
- auto &DL = I.getDataLayout();
Info.opc = ISD::INTRINSIC_W_CHAIN;
- if (Intrinsic == Intrinsic::nvvm_ldu_global_i)
- Info.memVT = getValueType(DL, I.getType());
- else if(Intrinsic == Intrinsic::nvvm_ldu_global_p)
- Info.memVT = getPointerTy(DL);
- else
- Info.memVT = getValueType(DL, I.getType());
+ Info.memVT = getValueType(I.getDataLayout(), I.getType());
Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;
Info.flags = MachineMemOperand::MOLoad;
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index 27f099e220976..e7f1a4b4c98c4 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -309,8 +309,6 @@ class NVPTXTargetLowering : public TargetLowering {
SDValue LowerFP_EXTEND(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerLOAD(SDValue Op, SelectionDAG &DAG) const;
- SDValue LowerLOADi1(SDValue Op, SelectionDAG &DAG) const;
-
SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerSTOREi1(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerSTOREVector(SDValue Op, SelectionDAG &DAG) const;
diff --git a/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll b/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
index b4641d01eb927..4fb294761e347 100644
--- a/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
@@ -711,11 +711,11 @@ define <2 x bfloat> @test_copysign(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
; CHECK-NEXT: .reg .b32 %r<6>;
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
-; CHECK-NEXT: ld.param.b32 %r1, [test_copysign_param_0];
-; CHECK-NEXT: ld.param.b32 %r2, [test_copysign_param_1];
-; CHECK-NEXT: and.b32 %r3, %r2, -2147450880;
-; CHECK-NEXT: and.b32 %r4, %r1, 2147450879;
-; CHECK-NEXT: or.b32 %r5, %r4, %r3;
+; CHECK-NEXT: ld.param.b32 %r1, [test_copysign_param_1];
+; CHECK-NEXT: and.b32 %r2, %r1, -2147450880;
+; CHECK-NEXT: ld.param.b32 %r3, [test_copysign_param_0];
+; CHECK-NEXT: and.b32 %r4, %r3, 2147450879;
+; CHECK-NEXT: or.b32 %r5, %r4, %r2;
; CHECK-NEXT: st.param.b32 [func_retval0], %r5;
; CHECK-NEXT: ret;
%r = call <2 x bfloat> @llvm.copysign.f16(<2 x bfloat> %a, <2 x bfloat> %b)
diff --git a/llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll b/llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll
index 63c389c36e87e..5e7d1a1cd8af1 100644
--- a/llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll
+++ b/llvm/test/CodeGen/NVPTX/cmpxchg-sm60.ll
@@ -7,7 +7,7 @@ define i8 @monotonic_monotonic_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8
; SM60: {
; SM60-NEXT: .reg .pred %p<3>;
; SM60-NEXT: .reg .b16 %rs<2>;
-; SM60-NEXT: .reg .b32 %r<21>;
+; SM60-NEXT: .reg .b32 %r<20>;
; SM60-NEXT: .reg .b64 %rd<3>;
; SM60-EMPTY:
; SM60-NEXT: // %bb.0:
@@ -22,23 +22,22 @@ define i8 @monotonic_monotonic_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8
; SM60-NEXT: shl.b32 %r13, %r12, %r1;
; SM60-NEXT: not.b32 %r2, %r13;
; SM60-NEXT: cvt.u32.u16 %r14, %rs1;
-; SM60-NEXT: and.b32 %r15, %r14, 255;
-; SM60-NEXT: shl.b32 %r3, %r15, %r1;
+; SM60-NEXT: shl.b32 %r3, %r14, %r1;
; SM60-NEXT: shl.b32 %r4, %r9, %r1;
-; SM60-NEXT: ld.global.b32 %r16, [%rd1];
-; SM60-NEXT: and.b32 %r20, %r16, %r2;
+; SM60-NEXT: ld.global.b32 %r15, [%rd1];
+; SM60-NEXT: and.b32 %r19, %r15, %r2;
; SM60-NEXT: $L__BB0_1: // %partword.cmpxchg.loop
; SM60-NEXT: // =>This Inner Loop Header: Depth=1
-; SM60-NEXT: or.b32 %r17, %r20, %r3;
-; SM60-NEXT: or.b32 %r18, %r20, %r4;
-; SM60-NEXT: atom.cta.global.cas.b32 %r7, [%rd1], %r18, %r17;
-; SM60-NEXT: setp.eq.b32 %p1, %r7, %r18;
+; SM60-NEXT: or.b32 %r16, %r19, %r3;
+; SM60-NEXT: or.b32 %r17, %r19, %r4;
+; SM60-NEXT: atom.cta.global.cas.b32 %r7, [%rd1], %r17, %r16;
+; SM60-NEXT: setp.eq.b32 %p1, %r7, %r17;
; SM60-NEXT: @%p1 bra $L__BB0_3;
; SM60-NEXT: // %bb.2: // %partword.cmpxchg.failure
; SM60-NEXT: // in Loop: Header=BB0_1 Depth=1
; SM60-NEXT: and.b32 %r8, %r7, %r2;
-; SM60-NEXT: setp.ne.b32 %p2, %r20, %r8;
-; SM60-NEXT: mov.b32 %r20, %r8;
+; SM60-NEXT: setp.ne.b32 %p2, %r19, %r8;
+; SM60-NEXT: mov.b32 %r19, %r8;
; SM60-NEXT: @%p2 bra $L__BB0_1;
; SM60-NEXT: $L__BB0_3: // %partword.cmpxchg.end
; SM60-NEXT: st.param.b32 [func_retval0], %r14;
@@ -52,7 +51,7 @@ define i8 @monotonic_acquire_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
; SM60: {
; SM60-NEXT: .reg .pred %p<3>;
; SM60-NEXT: .reg .b16 %rs<2>;
-; SM60-NEXT: .reg .b32 %r<21>;
+; SM60-NEXT: .reg .b32 %r<20>;
; SM60-NEXT: .reg .b64 %rd<3>;
; SM60-EMPTY:
; SM60-NEXT: // %bb.0:
@@ -67,23 +66,22 @@ define i8 @monotonic_acquire_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
; SM60-NEXT: shl.b32 %r13, %r12, %r1;
; SM60-NEXT: not.b32 %r2, %r13;
; SM60-NEXT: cvt.u32.u16 %r14, %rs1;
-; SM60-NEXT: and.b32 %r15, %r14, 255;
-; SM60-NEXT: shl.b32 %r3, %r15, %r1;
+; SM60-NEXT: shl.b32 %r3, %r14, %r1;
; SM60-NEXT: shl.b32 %r4, %r9, %r1;
-; SM60-NEXT: ld.global.b32 %r16, [%rd1];
-; SM60-NEXT: and.b32 %r20, %r16, %r2;
+; SM60-NEXT: ld.global.b32 %r15, [%rd1];
+; SM60-NEXT: and.b32 %r19, %r15, %r2;
; SM60-NEXT: $L__BB1_1: // %partword.cmpxchg.loop
; SM60-NEXT: // =>This Inner Loop Header: Depth=1
-; SM60-NEXT: or.b32 %r17, %r20, %r3;
-; SM60-NEXT: or.b32 %r18, %r20, %r4;
-; SM60-NEXT: atom.cta.global.cas.b32 %r7, [%rd1], %r18, %r17;
-; SM60-NEXT: setp.eq.b32 %p1, %r7, %r18;
+; SM60-NEXT: or.b32 %r16, %r19, %r3;
+; SM60-NEXT: or.b32 %r17, %r19, %r4;
+; SM60-NEXT: atom.cta.global.cas.b32 %r7, [%rd1], %r17, %r16;
+; SM60-NEXT: setp.eq.b32 %p1, %r7, %r17;
; SM60-NEXT: @%p1 bra $L__BB1_3;
; SM60-NEXT: // %bb.2: // %partword.cmpxchg.failure
; SM60-NEXT: // in Loop: Header=BB1_1 Depth=1
; SM60-NEXT: and.b32 %r8, %r7, %r2;
-; SM60-NEXT: setp.ne.b32 %p2, %r20, %r8;
-; SM60-NEXT: mov.b32 %r20, %r8;
+; SM60-NEXT: setp.ne.b32 %p2, %r19, %r8;
+; SM60-NEXT: mov.b32 %r19, %r8;
; SM60-NEXT: @%p2 bra $L__BB1_1;
; SM60-NEXT: $L__BB1_3: // %partword.cmpxchg.end
; SM60-NEXT: membar.cta;
@@ -98,7 +96,7 @@ define i8 @monotonic_seq_cst_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
; SM60: {
; SM60-NEXT: .reg .pred %p<3>;
; SM60-NEXT: .reg .b16 %rs<2>;
-; SM60-NEXT: .reg .b32 %r<21>;
+; SM60-NEXT: .reg .b32 %r<20>;
; SM60-NEXT: .reg .b64 %rd<3>;
; SM60-EMPTY:
; SM60-NEXT: // %bb.0:
@@ -114,23 +112,22 @@ define i8 @monotonic_seq_cst_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
; SM60-NEXT: shl.b32 %r13, %r12, %r1;
; SM60-NEXT: not.b32 %r2, %r13;
; SM60-NEXT: cvt.u32.u16 %r14, %rs1;
-; SM60-NEXT: and.b32 %r15, %r14, 255;
-; SM60-NEXT: shl.b32 %r3, %r15, %r1;
+; SM60-NEXT: shl.b32 %r3, %r14, %r1;
; SM60-NEXT: shl.b32 %r4, %r9, %r1;
-; SM60-NEXT: ld.global.b32 %r16, [%rd1];
-; SM60-NEXT: and.b32 %r20, %r16, %r2;
+; SM60-NEXT: ld.global.b32 %r15, [%rd1];
+; SM60-NEXT: and.b32 %r19, %r15, %r2;
; SM60-NEXT: $L__BB2_1: // %partword.cmpxchg.loop
; SM60-NEXT: // =>This Inner Loop Header: Depth=1
-; SM60-NEXT: or.b32 %r17, %r20, %r3;
-; SM60-NEXT: or.b32 %r18, %r20, %r4;
-; SM60-NEXT: atom.cta.global.cas.b32 %r7, [%rd1], %r18, %r17;
-; SM60-NEXT: setp.eq.b32 %p1, %r7, %r18;
+; SM60-NEXT: or.b32 %r16, %r19, %r3;
+; SM60-NEXT: or.b32 %r17, %r19, %r4;
+; SM60-NEXT: atom.cta.global.cas.b32 %r7, [%rd1], %r17, %r16;
+; SM60-NEXT: setp.eq.b32 %p1, %r7, %r17;
; SM60-NEXT: @%p1 bra $L__BB2_3;
; SM60-NEXT: // %bb.2: // %partword.cmpxchg.failure
; SM60-NEXT: // in Loop: Header=BB2_1 Depth=1
; SM60-NEXT: and.b32 %r8, %r7, %r2;
-; SM60-NEXT: setp.ne.b32 %p2, %r20, %r8;
-; SM60-NEXT: mov.b32 %r20, %r8;
+; SM60-NEXT: setp.ne.b32 %p2, %r19, %r8;
+; SM60-NEXT: mov.b32 %r19, %r8;
; SM60-NEXT: @%p2 bra $L__BB2_1;
; SM60-NEXT: $L__BB2_3: // %partword.cmpxchg.end
; SM60-NEXT: membar.cta;
@@ -145,7 +142,7 @@ define i8 @acquire_monotonic_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
; SM60: {
; SM60-NEXT: .reg .pred %p<3>;
; SM60-NEXT: .reg .b16 %rs<2>;
-; SM60-NEXT: .reg .b32 %r<21>;
+; SM60-NEXT: .reg .b32 %r<20>;
; SM60-NEXT: .reg .b64 %rd<3>;
; SM60-EMPTY:
; SM60-NEXT: // %bb.0:
@@ -160,23 +157,22 @@ define i8 @acquire_monotonic_i8_global_cta(ptr addrspace(1) %addr, i8 %cmp, i8 %
; SM60-NEXT: shl.b32 %r13, %r12, %r1;
; SM60-NEXT: not.b32 %r2, %r13;
; SM60-NEXT: cvt.u32.u16 %r14, %rs1;
-; SM60-NEXT: and.b32 %r15, %r14, 255;
-; SM60-NEXT: shl.b32 %r3, %r15, %r1;
+; SM60-NEXT: shl.b32 %r3, %r14, %r1;
; SM60-NEXT: shl.b32 %r4, %r9, %r1;
-; SM60-NEXT: ld.global.b32 %r16, [%rd1];
-; SM60-NEXT: and.b32 %r20, %r16, %r2;
+; SM60-NEXT: ld.global.b32 %r15, [%rd1];
+; SM60-NEXT: and.b32 %r19, %r15, %r2;
; SM60-NEXT: $L__BB3_1: // %partword.cmpxchg.loop
; SM60-NEXT: // =>This Inner Loop Header: Depth=1
-; SM60-NEXT: or.b32 %r17, %r20, %r3;
-; SM60-NEXT: or.b32 %r18, %r20, %r4;
-; SM60-NEXT: atom.cta.global.cas.b32 %r7, [%rd1], %r18, %r17;
-; SM60-NEXT: setp.eq.b32 %p1, %r7, %r18;
+; SM60-NEXT: or.b32 %r16, %r19, %r3;
+; SM60-NEXT: or.b32 %r17,...
[truncated]
LGTM.
So, in practice, the only observable effect on the tests is on loading i8 scalars that we store in i16 registers.
@topperc Do these changes to Mips tests look alright to you?
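For readers less familiar with the terminology, here is a small, hypothetical LLVM IR sketch of where an any-extend load comes from on NVPTX (this is not one of the PR's tests; the function and names are invented). i8 is not a legal register type in NVPTX, so byte values live in 16-bit or wider registers and the type legalizer turns i8 loads into extending loads. Before this patch such a load stayed an any-extend load, whose high bits are undefined, until instruction selection; with the patch it is legalized to a zext-load earlier, so combines that rely on known-zero high bits (for example, dropping the and.b32 ..., 255 masks visible in the cmpxchg test diffs above) get a chance to fire.

; Hypothetical example, not from the PR's test suite.
; Try something like: llc -mtriple=nvptx64 -O2 aext.ll -o -
define i8 @add_bytes(ptr %p, ptr %q) {
  ; The i8 add is promoted to a wider type during legalization, so these
  ; loads are represented as extending loads whose high bits do not matter
  ; (ISD::EXTLOAD, i.e. any-extend loads).
  %a = load i8, ptr %p, align 1
  %b = load i8, ptr %q, align 1
  %s = add i8 %a, %b
  ret i8 %s
}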
Force-pushed from 6e5e4d4 to c359e0d
My hazy memory of MIPS is that signed/unsigned load only matters for byte/short variants. However, the changes that mention
I think the Mips test with the extra moves was compiled with -O0, so some DAGCombine optimizations might have been disabled. This change may have caused some optimization on a different path.
Force-pushed from 322a43f to 245886f
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/134/builds/24898

Here is the relevant piece of the build log for the reference:
I bisected some new
@jhuber6 would you be able to attach any more information, such as LLVM IR or PTX before/after, or reproduction instructions? If this is possible to reproduce, I'm happy to take a look right now. If you think investigation will take a long time, I'd be happy to revert as well.
It caused a handful of tests to fail; you should be able to build and run them. The documentation for building is at https://libc.llvm.org/gpu/building.html#cmake-options, and it would probably help to run these tests locally when you make changes, since this suite covers a lot of uncommon use-cases.
Here: https://godbolt.org/z/Eqqjoon7b. It's a ton of LLVM IR, but the difference I see in the assembly is only a small change: the one on the left, with three local stores, works; the other one causes the error.
We also see some JAX test failures that bisect here, specifically testSort and testSortAgainstNumpy: https://github.com/jax-ml/jax/blob/3d52f5f87095189c3c619fb9daff3259a519aaad/tests/lax_test.py#L2528. I don't have any repro instructions or public buildbots to point to, however.
Revert "[NVPTX] Legalize aext-load to zext-load to expose more DAG combines (#154251)"

Causes failures in the LLVM libc test suite: https://lab.llvm.org/buildbot/#/builders/69/builds/26327/steps/12/logs/stdio.

This reverts commit a3ed96b.
Went ahead and reverted it since the bot has been red for over a day.
No description provided.