Skip to content

Conversation

lukel97
Copy link
Contributor

@lukel97 lukel97 commented Aug 20, 2025

In #154103 we encountered a legacy cost-model mismatch with EVL tail folding, because the select used for a safe-divisor that was previously free was optimized to a vp.merge, which is now costed.

The VPWidenRecipe for the div uses the legacy cost model cost which included the select cost, so now we ended up with the cost of both a select + vp.merge in the VPlan cost model.

This fixes it by modelling the cost of safe divisors in VPlan, so that the select is costed in the VPInstruction select and the division in the VPWidenRecipe.

As far as I can tell we only need to replicate the SafeDivisorCost part of getDivRemSpeculationCost, because if it was scalarized due to isScalarWithPredication then we wouldn't have ended up with a VPWidenRecipe anyway.

I've also restricted the cost of VPInstruction w/ Instruction::Select to those with an underlying value so we don't change the cost of anything else.

I tested this with AArch64 + RISC-V on the test-suite and SPEC CPU 2017 and didn't hit any more assertions.

Fixes #154103.

In llvm#154103 we encountered a legacy cost-model mismatch with EVL tail folding, because the select used for a safe-divisor that was previously free was optimized to a vp.merge, which is now costed.

The VPWidenRecipe for the div uses the legacy cost model cost which included the select cost, so now we ended up with the cost of both a select + vp.merge in the VPlan cost model.

This fixes it by modelling the cost of safe divisors in VPlan, so that the select is costed in the VPInstruction select and the division in the VPWidenRecipe.

As far as I can tell we only need to replicate the SafeDivisorCost part of getDivRemSpeculationCost, because if it was scalarized due to isScalarWithPredication then we wouldn't have ended up with a VPWidenRecipe anyway.

I've also restricted the cost of VPInstruction w/ Instruction::Select to those with an underlying value so we don't change the cost of anything else.

I tested this with AArch64 + RISC-V on the test-suite and SPEC CPU 2017 and didn't hit any more assertions.

Fixes llvm#154103.
@llvmbot
Copy link
Member

llvmbot commented Aug 20, 2025

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

In #154103 we encountered a legacy cost-model mismatch with EVL tail folding, because the select used for a safe-divisor that was previously free was optimized to a vp.merge, which is now costed.

The VPWidenRecipe for the div uses the legacy cost model cost which included the select cost, so now we ended up with the cost of both a select + vp.merge in the VPlan cost model.

This fixes it by modelling the cost of safe divisors in VPlan, so that the select is costed in the VPInstruction select and the division in the VPWidenRecipe.

As far as I can tell we only need to replicate the SafeDivisorCost part of getDivRemSpeculationCost, because if it was scalarized due to isScalarWithPredication then we wouldn't have ended up with a VPWidenRecipe anyway.

I've also restricted the cost of VPInstruction w/ Instruction::Select to those with an underlying value so we don't change the cost of anything else.

I tested this with AArch64 + RISC-V on the test-suite and SPEC CPU 2017 and didn't hit any more assertions.

Fixes #154103.


Full diff: https://github.com/llvm/llvm-project/pull/154468.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+13-2)
  • (added) llvm/test/Transforms/LoopVectorize/RISCV/pr154103.ll (+108)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 675a230bd2c94..af5273bd210d3 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7949,6 +7949,7 @@ VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
       VPValue *One =
           Plan.getOrAddLiveIn(ConstantInt::get(I->getType(), 1u, false));
       auto *SafeRHS = Builder.createSelect(Mask, Ops[1], One, I->getDebugLoc());
+      SafeRHS->setUnderlyingValue(I);
       Ops[1] = SafeRHS;
       return new VPWidenRecipe(*I, Ops);
     }
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index fa62547d374cd..da11f922b5362 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -972,6 +972,19 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
     return Ctx.TTI.getVectorInstrCost(Instruction::ExtractElement, VecTy,
                                       Ctx.CostKind);
   }
+  case Instruction::Select: {
+    // TODO: Compute cost for VPInstructions without underlying values once
+    // the legacy cost model has been retired.
+    if (!getUnderlyingValue())
+      return 0;
+    Type *ResTy = Ctx.Types.inferScalarType(this);
+    if (!vputils::onlyFirstLaneUsed(this))
+      ResTy = toVectorTy(ResTy, VF);
+    return Ctx.TTI.getCmpSelInstrCost(
+        Instruction::Select, ResTy,
+        ResTy->getWithNewType(Type::getInt1Ty(ResTy->getContext())),
+        CmpInst::BAD_ICMP_PREDICATE, Ctx.CostKind);
+  }
   case VPInstruction::AnyOf: {
     auto *VecTy = toVectorTy(Ctx.Types.inferScalarType(this), VF);
     return Ctx.TTI.getArithmeticReductionCost(
@@ -2037,8 +2050,6 @@ InstructionCost VPWidenRecipe::computeCost(ElementCount VF,
   case Instruction::SDiv:
   case Instruction::SRem:
   case Instruction::URem:
-    // More complex computation, let the legacy cost-model handle this for now.
-    return Ctx.getLegacyCost(cast<Instruction>(getUnderlyingValue()), VF);
   case Instruction::Add:
   case Instruction::FAdd:
   case Instruction::Sub:
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/pr154103.ll b/llvm/test/Transforms/LoopVectorize/RISCV/pr154103.ll
new file mode 100644
index 0000000000000..d4b3cfffa4501
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/pr154103.ll
@@ -0,0 +1,108 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 5
+; RUN: opt -p loop-vectorize -mtriple riscv64 -mattr=+v < %s -S | FileCheck %s
+
+define void @pr154103(ptr noalias %a, ptr noalias %b, ptr noalias %c, ptr noalias %d) {
+; CHECK-LABEL: define void @pr154103(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]], ptr noalias [[C:%.*]], ptr noalias [[D:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x ptr> poison, ptr [[B]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x ptr> [[BROADCAST_SPLATINSERT]], <vscale x 4 x ptr> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 4 x ptr> poison, ptr [[C]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 4 x ptr> [[BROADCAST_SPLATINSERT1]], <vscale x 4 x ptr> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP0:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; CHECK-NEXT:    [[TMP1:%.*]] = mul <vscale x 4 x i64> [[TMP0]], splat (i64 7)
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add <vscale x 4 x i64> splat (i64 1), [[TMP1]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[EVL_BASED_IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[AVL:%.*]] = phi i64 [ -7905747460161236406, %[[VECTOR_PH]] ], [ [[AVL_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP2:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[AVL]], i32 4, i1 true)
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT5:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP2]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT6:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT5]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP3:%.*]] = zext i32 [[TMP2]] to i64
+; CHECK-NEXT:    [[TMP4:%.*]] = mul i64 7, [[TMP3]]
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP4]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP5:%.*]] = call <vscale x 4 x i32> @llvm.stepvector.nxv4i32()
+; CHECK-NEXT:    [[TMP6:%.*]] = icmp ult <vscale x 4 x i32> [[TMP5]], [[BROADCAST_SPLAT6]]
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr i8, ptr [[A]], <vscale x 4 x i64> [[VEC_IND]]
+; CHECK-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i8> @llvm.vp.gather.nxv4i8.nxv4p0(<vscale x 4 x ptr> align 1 [[TMP7]], <vscale x 4 x i1> splat (i1 true), i32 [[TMP2]])
+; CHECK-NEXT:    [[TMP8:%.*]] = zext <vscale x 4 x i8> [[WIDE_MASKED_GATHER]] to <vscale x 4 x i64>
+; CHECK-NEXT:    [[TMP9:%.*]] = call <vscale x 4 x i64> @llvm.vp.merge.nxv4i64(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i64> [[TMP8]], <vscale x 4 x i64> splat (i64 1), i32 [[TMP2]])
+; CHECK-NEXT:    [[TMP10:%.*]] = sdiv <vscale x 4 x i64> zeroinitializer, [[TMP9]]
+; CHECK-NEXT:    [[TMP11:%.*]] = icmp sgt <vscale x 4 x i64> [[TMP10]], zeroinitializer
+; CHECK-NEXT:    [[TMP12:%.*]] = select <vscale x 4 x i1> [[TMP6]], <vscale x 4 x i1> [[TMP11]], <vscale x 4 x i1> zeroinitializer
+; CHECK-NEXT:    [[WIDE_MASKED_GATHER7:%.*]] = call <vscale x 4 x i8> @llvm.vp.gather.nxv4i8.nxv4p0(<vscale x 4 x ptr> align 1 [[BROADCAST_SPLAT]], <vscale x 4 x i1> [[TMP11]], i32 [[TMP2]])
+; CHECK-NEXT:    [[TMP13:%.*]] = zext <vscale x 4 x i8> [[WIDE_MASKED_GATHER7]] to <vscale x 4 x i64>
+; CHECK-NEXT:    [[TMP14:%.*]] = xor <vscale x 4 x i64> [[TMP13]], zeroinitializer
+; CHECK-NEXT:    [[PREDPHI:%.*]] = select <vscale x 4 x i1> [[TMP12]], <vscale x 4 x i64> [[TMP14]], <vscale x 4 x i64> zeroinitializer
+; CHECK-NEXT:    [[TMP15:%.*]] = trunc <vscale x 4 x i64> [[PREDPHI]] to <vscale x 4 x i16>
+; CHECK-NEXT:    call void @llvm.vp.scatter.nxv4i16.nxv4p0(<vscale x 4 x i16> [[TMP15]], <vscale x 4 x ptr> align 2 [[BROADCAST_SPLAT2]], <vscale x 4 x i1> splat (i1 true), i32 [[TMP2]])
+; CHECK-NEXT:    store i32 0, ptr [[D]], align 4
+; CHECK-NEXT:    [[TMP16:%.*]] = zext i32 [[TMP2]] to i64
+; CHECK-NEXT:    [[INDEX_EVL_NEXT]] = add i64 [[TMP16]], [[EVL_BASED_IV]]
+; CHECK-NEXT:    [[AVL_NEXT]] = sub nuw i64 [[AVL]], [[TMP16]]
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[BROADCAST_SPLAT4]]
+; CHECK-NEXT:    [[TMP17:%.*]] = icmp eq i64 [[INDEX_EVL_NEXT]], -7905747460161236406
+; CHECK-NEXT:    br i1 [[TMP17]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 1, %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LATCH:.*]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i8, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT:    [[X:%.*]] = load i8, ptr [[GEP]], align 1
+; CHECK-NEXT:    [[CONV:%.*]] = zext i8 [[X]] to i64
+; CHECK-NEXT:    [[DIV:%.*]] = sdiv i64 0, [[CONV]]
+; CHECK-NEXT:    [[CMP:%.*]] = icmp sgt i64 [[DIV]], 0
+; CHECK-NEXT:    br i1 [[CMP]], label %[[THEN:.*]], label %[[LATCH]]
+; CHECK:       [[THEN]]:
+; CHECK-NEXT:    [[Y:%.*]] = load i8, ptr [[B]], align 1
+; CHECK-NEXT:    [[ZEXT:%.*]] = zext i8 [[Y]] to i64
+; CHECK-NEXT:    [[NOT:%.*]] = xor i64 [[ZEXT]], 0
+; CHECK-NEXT:    br label %[[LATCH]]
+; CHECK:       [[LATCH]]:
+; CHECK-NEXT:    [[COND:%.*]] = phi i64 [ [[NOT]], %[[THEN]] ], [ 0, %[[LOOP]] ]
+; CHECK-NEXT:    [[TRUNC:%.*]] = trunc i64 [[COND]] to i16
+; CHECK-NEXT:    store i16 [[TRUNC]], ptr [[C]], align 2
+; CHECK-NEXT:    store i32 0, ptr [[D]], align 4
+; CHECK-NEXT:    [[IV_NEXT]] = add i64 [[IV]], 7
+; CHECK-NEXT:    [[DONE:%.*]] = icmp eq i64 [[IV]], 0
+; CHECK-NEXT:    br i1 [[DONE]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 1, %entry ], [ %iv.next, %latch ]
+  %gep = getelementptr i8, ptr %a, i64 %iv
+  %x = load i8, ptr %gep, align 1
+  %conv = zext i8 %x to i64
+  %div = sdiv i64 0, %conv
+  %cmp = icmp sgt i64 %div, 0
+  br i1 %cmp, label %then, label %latch
+
+then:
+  %y = load i8, ptr %b
+  %zext = zext i8 %y to i64
+  %not = xor i64 %zext, 0
+  br label %latch
+
+latch:
+  %cond = phi i64 [ %not, %then ], [ 0, %loop ]
+  %trunc = trunc i64 %cond to i16
+  store i16 %trunc, ptr %c
+  store i32 0, ptr %d
+  %iv.next = add i64 %iv, 7
+  %done = icmp eq i64 %iv, 0
+  br i1 %done, label %exit, label %loop
+
+exit:
+  ret void
+}

@Mel-Chen
Copy link
Contributor

Related work #152707?

@lukel97
Copy link
Contributor Author

lukel97 commented Aug 20, 2025

Related work #152707?

Oh woops yes, I didn't see that PR. I will close this PR then and leave some comments on the other one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth still pre-commiting this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It causes a crash without the cost model changes so I don't think we can precommit it. But feel free to add it into #152707

@@ -7949,6 +7949,7 @@ VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
VPValue *One =
Plan.getOrAddLiveIn(ConstantInt::get(I->getType(), 1u, false));
auto *SafeRHS = Builder.createSelect(Mask, Ops[1], One, I->getDebugLoc());
SafeRHS->setUnderlyingValue(I);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this confuse VPSlotTracker::assignName, since there will be two VPValues with the same underlying value? I guess it also means that the select would be ignored along with the div in both calculateRegisterUsageForPlan and VPRecipeBase::cost, which perhaps isn't a bad thing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I'm happy to pull this into #152707 if we're happy with the potential impact of having the same underlying value for multiple recipes. It would reduce the scope of my PR to strictly divides. Any thoughts @fhahn?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your approach of adding the cost of the select to the legacy cost model in #152707 actually makes more sense! Since its one less legacy-bailout we need to have in the VPlan cost model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[riscv64] [LoopVectorize] Assertion Failure in computeBestVF with VPlan Cost Model
4 participants