[VPlan] Model cost of safe-divisor in VPlan #154468
In llvm#154103 we encountered a legacy cost-model mismatch with EVL tail folding, because the select used for a safe divisor, which was previously free, was optimized to a vp.merge, which is now costed. The VPWidenRecipe for the div uses the legacy cost-model cost, which included the select cost, so we ended up with the cost of both a select and a vp.merge in the VPlan cost model.

This fixes it by modelling the cost of safe divisors in VPlan, so that the select is costed in the VPInstruction select and the division in the VPWidenRecipe.

As far as I can tell we only need to replicate the SafeDivisorCost part of getDivRemSpeculationCost, because if it was scalarized due to isScalarWithPredication then we wouldn't have ended up with a VPWidenRecipe anyway.

I've also restricted the cost of a VPInstruction with Instruction::Select to those with an underlying value so we don't change the cost of anything else.

I tested this with AArch64 + RISC-V on the test-suite and SPEC CPU 2017 and didn't hit any more assertions.

Fixes llvm#154103.
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-risc-v Author: Luke Lau (lukel97)

Full diff: https://github.com/llvm/llvm-project/pull/154468.diff 3 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 675a230bd2c94..af5273bd210d3 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7949,6 +7949,7 @@ VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
VPValue *One =
Plan.getOrAddLiveIn(ConstantInt::get(I->getType(), 1u, false));
auto *SafeRHS = Builder.createSelect(Mask, Ops[1], One, I->getDebugLoc());
+ SafeRHS->setUnderlyingValue(I);
Ops[1] = SafeRHS;
return new VPWidenRecipe(*I, Ops);
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index fa62547d374cd..da11f922b5362 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -972,6 +972,19 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
return Ctx.TTI.getVectorInstrCost(Instruction::ExtractElement, VecTy,
Ctx.CostKind);
}
+ case Instruction::Select: {
+ // TODO: Compute cost for VPInstructions without underlying values once
+ // the legacy cost model has been retired.
+ if (!getUnderlyingValue())
+ return 0;
+ Type *ResTy = Ctx.Types.inferScalarType(this);
+ if (!vputils::onlyFirstLaneUsed(this))
+ ResTy = toVectorTy(ResTy, VF);
+ return Ctx.TTI.getCmpSelInstrCost(
+ Instruction::Select, ResTy,
+ ResTy->getWithNewType(Type::getInt1Ty(ResTy->getContext())),
+ CmpInst::BAD_ICMP_PREDICATE, Ctx.CostKind);
+ }
case VPInstruction::AnyOf: {
auto *VecTy = toVectorTy(Ctx.Types.inferScalarType(this), VF);
return Ctx.TTI.getArithmeticReductionCost(
@@ -2037,8 +2050,6 @@ InstructionCost VPWidenRecipe::computeCost(ElementCount VF,
case Instruction::SDiv:
case Instruction::SRem:
case Instruction::URem:
- // More complex computation, let the legacy cost-model handle this for now.
- return Ctx.getLegacyCost(cast<Instruction>(getUnderlyingValue()), VF);
case Instruction::Add:
case Instruction::FAdd:
case Instruction::Sub:
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/pr154103.ll b/llvm/test/Transforms/LoopVectorize/RISCV/pr154103.ll
new file mode 100644
index 0000000000000..d4b3cfffa4501
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/pr154103.ll
@@ -0,0 +1,108 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 5
+; RUN: opt -p loop-vectorize -mtriple riscv64 -mattr=+v < %s -S | FileCheck %s
+
+define void @pr154103(ptr noalias %a, ptr noalias %b, ptr noalias %c, ptr noalias %d) {
+; CHECK-LABEL: define void @pr154103(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]], ptr noalias [[C:%.*]], ptr noalias [[D:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x ptr> poison, ptr [[B]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x ptr> [[BROADCAST_SPLATINSERT]], <vscale x 4 x ptr> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 4 x ptr> poison, ptr [[C]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 4 x ptr> [[BROADCAST_SPLATINSERT1]], <vscale x 4 x ptr> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP0:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; CHECK-NEXT: [[TMP1:%.*]] = mul <vscale x 4 x i64> [[TMP0]], splat (i64 7)
+; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> splat (i64 1), [[TMP1]]
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[EVL_BASED_IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[AVL:%.*]] = phi i64 [ -7905747460161236406, %[[VECTOR_PH]] ], [ [[AVL_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP2:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[AVL]], i32 4, i1 true)
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT5:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP2]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT6:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT5]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP3:%.*]] = zext i32 [[TMP2]] to i64
+; CHECK-NEXT: [[TMP4:%.*]] = mul i64 7, [[TMP3]]
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP4]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 4 x i32> @llvm.stepvector.nxv4i32()
+; CHECK-NEXT: [[TMP6:%.*]] = icmp ult <vscale x 4 x i32> [[TMP5]], [[BROADCAST_SPLAT6]]
+; CHECK-NEXT: [[TMP7:%.*]] = getelementptr i8, ptr [[A]], <vscale x 4 x i64> [[VEC_IND]]
+; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i8> @llvm.vp.gather.nxv4i8.nxv4p0(<vscale x 4 x ptr> align 1 [[TMP7]], <vscale x 4 x i1> splat (i1 true), i32 [[TMP2]])
+; CHECK-NEXT: [[TMP8:%.*]] = zext <vscale x 4 x i8> [[WIDE_MASKED_GATHER]] to <vscale x 4 x i64>
+; CHECK-NEXT: [[TMP9:%.*]] = call <vscale x 4 x i64> @llvm.vp.merge.nxv4i64(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i64> [[TMP8]], <vscale x 4 x i64> splat (i64 1), i32 [[TMP2]])
+; CHECK-NEXT: [[TMP10:%.*]] = sdiv <vscale x 4 x i64> zeroinitializer, [[TMP9]]
+; CHECK-NEXT: [[TMP11:%.*]] = icmp sgt <vscale x 4 x i64> [[TMP10]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = select <vscale x 4 x i1> [[TMP6]], <vscale x 4 x i1> [[TMP11]], <vscale x 4 x i1> zeroinitializer
+; CHECK-NEXT: [[WIDE_MASKED_GATHER7:%.*]] = call <vscale x 4 x i8> @llvm.vp.gather.nxv4i8.nxv4p0(<vscale x 4 x ptr> align 1 [[BROADCAST_SPLAT]], <vscale x 4 x i1> [[TMP11]], i32 [[TMP2]])
+; CHECK-NEXT: [[TMP13:%.*]] = zext <vscale x 4 x i8> [[WIDE_MASKED_GATHER7]] to <vscale x 4 x i64>
+; CHECK-NEXT: [[TMP14:%.*]] = xor <vscale x 4 x i64> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[PREDPHI:%.*]] = select <vscale x 4 x i1> [[TMP12]], <vscale x 4 x i64> [[TMP14]], <vscale x 4 x i64> zeroinitializer
+; CHECK-NEXT: [[TMP15:%.*]] = trunc <vscale x 4 x i64> [[PREDPHI]] to <vscale x 4 x i16>
+; CHECK-NEXT: call void @llvm.vp.scatter.nxv4i16.nxv4p0(<vscale x 4 x i16> [[TMP15]], <vscale x 4 x ptr> align 2 [[BROADCAST_SPLAT2]], <vscale x 4 x i1> splat (i1 true), i32 [[TMP2]])
+; CHECK-NEXT: store i32 0, ptr [[D]], align 4
+; CHECK-NEXT: [[TMP16:%.*]] = zext i32 [[TMP2]] to i64
+; CHECK-NEXT: [[INDEX_EVL_NEXT]] = add i64 [[TMP16]], [[EVL_BASED_IV]]
+; CHECK-NEXT: [[AVL_NEXT]] = sub nuw i64 [[AVL]], [[TMP16]]
+; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[BROADCAST_SPLAT4]]
+; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_EVL_NEXT]], -7905747460161236406
+; CHECK-NEXT: br i1 [[TMP17]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br label %[[EXIT:.*]]
+; CHECK: [[SCALAR_PH]]:
+; CHECK-NEXT: br label %[[LOOP:.*]]
+; CHECK: [[LOOP]]:
+; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 1, %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LATCH:.*]] ]
+; CHECK-NEXT: [[GEP:%.*]] = getelementptr i8, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT: [[X:%.*]] = load i8, ptr [[GEP]], align 1
+; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[X]] to i64
+; CHECK-NEXT: [[DIV:%.*]] = sdiv i64 0, [[CONV]]
+; CHECK-NEXT: [[CMP:%.*]] = icmp sgt i64 [[DIV]], 0
+; CHECK-NEXT: br i1 [[CMP]], label %[[THEN:.*]], label %[[LATCH]]
+; CHECK: [[THEN]]:
+; CHECK-NEXT: [[Y:%.*]] = load i8, ptr [[B]], align 1
+; CHECK-NEXT: [[ZEXT:%.*]] = zext i8 [[Y]] to i64
+; CHECK-NEXT: [[NOT:%.*]] = xor i64 [[ZEXT]], 0
+; CHECK-NEXT: br label %[[LATCH]]
+; CHECK: [[LATCH]]:
+; CHECK-NEXT: [[COND:%.*]] = phi i64 [ [[NOT]], %[[THEN]] ], [ 0, %[[LOOP]] ]
+; CHECK-NEXT: [[TRUNC:%.*]] = trunc i64 [[COND]] to i16
+; CHECK-NEXT: store i16 [[TRUNC]], ptr [[C]], align 2
+; CHECK-NEXT: store i32 0, ptr [[D]], align 4
+; CHECK-NEXT: [[IV_NEXT]] = add i64 [[IV]], 7
+; CHECK-NEXT: [[DONE:%.*]] = icmp eq i64 [[IV]], 0
+; CHECK-NEXT: br i1 [[DONE]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret void
+;
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i64 [ 1, %entry ], [ %iv.next, %latch ]
+ %gep = getelementptr i8, ptr %a, i64 %iv
+ %x = load i8, ptr %gep, align 1
+ %conv = zext i8 %x to i64
+ %div = sdiv i64 0, %conv
+ %cmp = icmp sgt i64 %div, 0
+ br i1 %cmp, label %then, label %latch
+
+then:
+ %y = load i8, ptr %b
+ %zext = zext i8 %y to i64
+ %not = xor i64 %zext, 0
+ br label %latch
+
+latch:
+ %cond = phi i64 [ %not, %then ], [ 0, %loop ]
+ %trunc = trunc i64 %cond to i16
+ store i16 %trunc, ptr %c
+ store i32 0, ptr %d
+ %iv.next = add i64 %iv, 7
+ %done = icmp eq i64 %iv, 0
+ br i1 %done, label %exit, label %loop
+
+exit:
+ ret void
+}
Related work #152707?
Oh whoops yes, I didn't see that PR. I will close this PR then and leave some comments on the other one.
Is it worth still pre-committing this test?
It causes a crash without the cost model changes so I don't think we can precommit it. But feel free to add it into #152707
@@ -7949,6 +7949,7 @@ VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
     VPValue *One =
         Plan.getOrAddLiveIn(ConstantInt::get(I->getType(), 1u, false));
     auto *SafeRHS = Builder.createSelect(Mask, Ops[1], One, I->getDebugLoc());
+    SafeRHS->setUnderlyingValue(I);
Will this confuse VPSlotTracker::assignName, since there will be two VPValues with the same underlying value? I guess it also means that the select would be ignored along with the div in both calculateRegisterUsageForPlan and VPRecipeBase::cost, which perhaps isn't a bad thing?
I think your approach of adding the cost of the select to the legacy cost model in #152707 actually makes more sense, since it's one less legacy bailout we need to have in the VPlan cost model.
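As a toy illustration of the double-counting this patch addresses (constants and function names here are invented, not LLVM's real cost API): the legacy div/rem cost folds the safe-divisor select in, so if the VPWidenRecipe for the div falls back to the legacy cost while the select (now a vp.merge) is also costed as its own recipe, the select is paid for twice.

```cpp
#include <cassert>

// Hypothetical unit costs for illustration only.
constexpr int DivCost = 4;
constexpr int SelectCost = 1;

// Legacy model: the safe-divisor select is folded into the div's cost.
int legacyDivRemCost() { return DivCost + SelectCost; }

// VPlan model: the select is a separate recipe with its own cost.
int vplanSelectCost() { return SelectCost; }

// Before the patch: the div recipe reused the legacy cost, so the plan
// paid for the select twice.
int planCostBefore() { return legacyDivRemCost() + vplanSelectCost(); }

// After the patch: the div recipe costs only the division itself, and the
// plan total matches the legacy total again.
int planCostAfter() { return DivCost + vplanSelectCost(); }
```

The legacy/VPlan mismatch assertion fires exactly when these two totals diverge, which is what the reported crash boiled down to.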