
Conversation

alexey-bataev (Member)

Better to check whether the extracts for external uses of the user nodes are
expensive and would be better represented as the original scalars. In that
case, it is better to keep their operands scalar too, i.e. as a buildvector
node. Otherwise, expensive extracts will be created, pessimizing the cost of
the tree.

Created using spr 1.3.5
@llvmbot (Member)

llvmbot commented Jul 18, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-risc-v

Author: Alexey Bataev (alexey-bataev)

Changes

Better to check whether the extracts for external uses of the user nodes are
expensive and would be better represented as the original scalars. In that
case, it is better to keep their operands scalar too, i.e. as a buildvector
node. Otherwise, expensive extracts will be created, pessimizing the cost of
the tree.
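In essence, the new check compares the cost of extracting the externally used lanes from the vectorized user node against the cost of keeping those lanes as the original scalars plus gathering the operand node. A minimal standalone sketch of that comparison, using simplified integer costs and illustrative names (the in-tree check, BoUpSLP::isProfitableToVectorizeWithNonVecUsers in the diff below, additionally folds in the user node cost and the SLP cost threshold):

  #include <cstdint>

  // Sketch only: decide whether a buildvector-like operand node should still
  // be vectorized when some lanes of its user node have non-vectorized users.
  bool profitableWithNonVecUsers(int64_t ExtractCost,      // extract the used lanes
                                 int64_t UserScalarsCost,  // keep user lanes scalar
                                 int64_t OperandGatherCost /* buildvector the operands */) {
    // Cheap extracts: external users can read from the vector directly.
    if (ExtractCost <= UserScalarsCost)
      return true;
    // Expensive extracts: only worthwhile if they still beat keeping the user
    // scalars alive and gathering the operands into a vector.
    return ExtractCost < UserScalarsCost + OperandGatherCost;
  }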


Patch is 23.38 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/149572.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+142-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/reordered-buildvector-scalars.ll (+55-47)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 6ad5c60105a28..14d4e45d1ab6f 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -1906,6 +1906,7 @@ class BoUpSLP {
   void deleteTree() {
     VectorizableTree.clear();
     ScalarToTreeEntries.clear();
+    PostponedNodesWithNonVecUsers.clear();
     OperandsToTreeEntry.clear();
     ScalarsInSplitNodes.clear();
     MustGather.clear();
@@ -3896,6 +3897,9 @@ class BoUpSLP {
 
     bool hasState() const { return S.valid(); }
 
+    /// Returns the state of the operations.
+    const InstructionsState &getOperations() const { return S; }
+
     /// When ReuseReorderShuffleIndices is empty it just returns position of \p
     /// V within vector of Scalars. Otherwise, try to remap on its reuse index.
     unsigned findLaneForValue(Value *V) const {
@@ -4290,6 +4294,13 @@ class BoUpSLP {
                                OrdersType &CurrentOrder,
                                SmallVectorImpl<Value *> &PointerOps);
 
+  /// Checks if it is profitable to vectorize the specified list of the
+  /// instructions if not all users are vectorized.
+  bool isProfitableToVectorizeWithNonVecUsers(const InstructionsState &S,
+                                              const EdgeInfo &UserTreeIdx,
+                                              ArrayRef<Value *> Scalars,
+                                              ArrayRef<int> ScalarsMask);
+
   /// Maps a specific scalar to its tree entry(ies).
   SmallDenseMap<Value *, SmallVector<TreeEntry *>> ScalarToTreeEntries;
 
@@ -4300,6 +4311,9 @@ class BoUpSLP {
   /// Scalars, used in split vectorize nodes.
   SmallDenseMap<Value *, SmallVector<TreeEntry *>> ScalarsInSplitNodes;
 
+  /// List of tree nodes indices, which have non-vectorized users.
+  SmallSet<unsigned, 4> PostponedNodesWithNonVecUsers;
+
   /// Maps a value to the proposed vectorizable size.
   SmallDenseMap<Value *, unsigned> InstrElementSize;
 
@@ -8993,6 +9007,81 @@ getVectorCallCosts(CallInst *CI, FixedVectorType *VecTy,
   return {IntrinsicCost, LibCost};
 }
 
+bool BoUpSLP::isProfitableToVectorizeWithNonVecUsers(
+    const InstructionsState &S, const EdgeInfo &UserTreeIdx,
+    ArrayRef<Value *> Scalars, ArrayRef<int> ScalarsMask) {
+  assert(S && "Expected valid instructions state.");
+  // Loads, extracts and geps are immediately scalarizable, so no need to check.
+  if (S.getOpcode() == Instruction::Load ||
+      S.getOpcode() == Instruction::ExtractElement ||
+      S.getOpcode() == Instruction::GetElementPtr)
+    return true;
+  // Check only vectorized users, others scalarized (potentially, at least)
+  // already.
+  if (!UserTreeIdx.UserTE || UserTreeIdx.UserTE->isGather() ||
+      UserTreeIdx.UserTE->State == TreeEntry::SplitVectorize)
+    return true;
+  // PHI nodes may have cyclic deps, so cannot check here.
+  if (UserTreeIdx.UserTE->getOpcode() == Instruction::PHI)
+    return true;
+  // Do not check root reduction nodes, they do not have non-vectorized users.
+  if (UserIgnoreList && UserTreeIdx.UserTE->Idx == 0)
+    return true;
+  constexpr TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+  ArrayRef<Value *> VL = UserTreeIdx.UserTE->Scalars;
+  Type *UserScalarTy = getValueType(VL.front());
+  if (!isValidElementType(UserScalarTy))
+    return true;
+  Type *ScalarTy = getValueType(Scalars.front());
+  if (!isValidElementType(ScalarTy))
+    return true;
+  // Ignore subvectors extracts.
+  if (UserScalarTy->isVectorTy())
+    return true;
+  auto *UserVecTy =
+      getWidenedType(UserScalarTy, UserTreeIdx.UserTE->getVectorFactor());
+  APInt DemandedElts = APInt::getZero(UserTreeIdx.UserTE->getVectorFactor());
+  // Check the external uses and check, if vector node + extracts is not
+  // profitable for the vectorization.
+  InstructionCost UserScalarsCost = 0;
+  for (Value *V : VL) {
+    auto *I = dyn_cast<Instruction>(V);
+    if (!I)
+      continue;
+    if (areAllUsersVectorized(I, UserIgnoreList))
+      continue;
+    DemandedElts.setBit(UserTreeIdx.UserTE->findLaneForValue(V));
+    UserScalarsCost += TTI->getInstructionCost(I, CostKind);
+  }
+  // No non-vectorized users - success.
+  if (DemandedElts.isZero())
+    return true;
+  // If extracts are cheaper than the original scalars - success.
+  InstructionCost ExtractCost =
+      ::getScalarizationOverhead(*TTI, UserScalarTy, UserVecTy, DemandedElts,
+                                 /*Insert=*/false, /*Extract=*/true, CostKind);
+  if (ExtractCost <= UserScalarsCost)
+    return true;
+  SmallPtrSet<Value *, 4> CheckedExtracts;
+  InstructionCost NodeCost =
+      UserTreeIdx.UserTE->State == TreeEntry::CombinedVectorize
+          ? InstructionCost(0)
+          : getEntryCost(UserTreeIdx.UserTE, {}, CheckedExtracts);
+  // The node is profitable for vectorization - success.
+  if (ExtractCost + NodeCost <= -SLPCostThreshold)
+    return true;
+  auto *VecTy = getWidenedType(ScalarTy, Scalars.size());
+  InstructionCost ScalarsCost = ::getScalarizationOverhead(
+      *TTI, ScalarTy, VecTy, APInt::getAllOnes(Scalars.size()),
+      /*Insert=*/true, /*Extract=*/false, CostKind);
+  if (!ScalarsMask.empty())
+    ScalarsCost += getShuffleCost(*TTI, TTI::SK_PermuteSingleSrc, VecTy,
+                                  ScalarsMask, CostKind);
+
+  // User extracts are cheaper than user scalars + immediate scalars - success.
+  return ExtractCost - (UserScalarsCost + ScalarsCost) < -SLPCostThreshold;
+}
+
 BoUpSLP::TreeEntry::EntryState BoUpSLP::getScalarsVectorizationState(
     const InstructionsState &S, ArrayRef<Value *> VL,
     bool IsScatterVectorizeUserTE, OrdersType &CurrentOrder,
@@ -10283,6 +10372,17 @@ void BoUpSLP::buildTreeRec(ArrayRef<Value *> VLRef, unsigned Depth,
     return;
   }
 
+  // Postpone vectorization, if the node is not profitable because of the
+  // external uses.
+  if (!isProfitableToVectorizeWithNonVecUsers(S, UserTreeIdx, VL,
+                                              ReuseShuffleIndices)) {
+    PostponedNodesWithNonVecUsers.insert(VectorizableTree.size());
+    auto Invalid = ScheduleBundle::invalid();
+    newTreeEntry(VL, Invalid /*not vectorized*/, S, UserTreeIdx,
+                 ReuseShuffleIndices);
+    return;
+  }
+
   Instruction *VL0 = S.getMainOp();
   BasicBlock *BB = VL0->getParent();
   auto &BSRef = BlocksSchedules[BB];
@@ -11668,6 +11768,27 @@ void BoUpSLP::transformNodes() {
       ArrayRef<Value *> VL = E.Scalars;
       const unsigned Sz = getVectorElementSize(VL.front());
       unsigned MinVF = getMinVF(2 * Sz);
+      const EdgeInfo &EI = E.UserTreeIndex;
+      // Try to vectorized postponed scalars, if external uses are vectorized.
+      if (PostponedNodesWithNonVecUsers.contains(E.Idx) &&
+          isProfitableToVectorizeWithNonVecUsers(
+              E.getOperations(), EI, E.Scalars, E.ReuseShuffleIndices)) {
+        assert(E.hasState() && "Expected to have state");
+        unsigned PrevSize = VectorizableTree.size();
+        [[maybe_unused]] unsigned PrevEntriesSize =
+            LoadEntriesToVectorize.size();
+        buildTreeRec(VL, 0, EdgeInfo(&E, UINT_MAX));
+        if (PrevSize + 1 == VectorizableTree.size() &&
+            VectorizableTree[PrevSize]->isGather()) {
+          VectorizableTree.pop_back();
+          assert(PrevEntriesSize == LoadEntriesToVectorize.size() &&
+                 "LoadEntriesToVectorize expected to remain the same");
+        } else {
+          E.CombinedEntriesWithIndices.emplace_back(PrevSize, 0);
+          continue;
+        }
+      }
+
       // Do not try partial vectorization for small nodes (<= 2), nodes with the
       // same opcode and same parent block or all constants.
       if (VL.size() <= 2 || LoadEntriesToVectorize.contains(Idx) ||
@@ -12828,7 +12949,9 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
       });
       InVectors.front() = V;
     }
-    if (!SubVectors.empty()) {
+    if (!SubVectors.empty() &&
+        (SubVectors.size() > 1 || SubVectors.front().second != 0 ||
+         SubVectors.front().first->getVectorFactor() != CommonMask.size())) {
       const PointerUnion<Value *, const TreeEntry *> &Vec = InVectors.front();
       if (InVectors.size() == 2)
         Cost += createShuffle(Vec, InVectors.back(), CommonMask);
@@ -13348,7 +13471,8 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
   case Instruction::Trunc:
   case Instruction::FPTrunc:
   case Instruction::BitCast: {
-    auto SrcIt = MinBWs.find(getOperandEntry(E, 0));
+    auto SrcIt =
+        MinBWs.empty() ? MinBWs.end() : MinBWs.find(getOperandEntry(E, 0));
     Type *SrcScalarTy = VL0->getOperand(0)->getType();
     auto *SrcVecTy = getWidenedType(SrcScalarTy, VL.size());
     unsigned Opcode = ShuffleOrOp;
@@ -13795,7 +13919,8 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
         Type *SrcSclTy = E->getMainOp()->getOperand(0)->getType();
         auto *SrcTy = getWidenedType(SrcSclTy, VL.size());
         if (SrcSclTy->isIntegerTy() && ScalarTy->isIntegerTy()) {
-          auto SrcIt = MinBWs.find(getOperandEntry(E, 0));
+          auto SrcIt = MinBWs.empty() ? MinBWs.end()
+                                      : MinBWs.find(getOperandEntry(E, 0));
           unsigned BWSz = DL->getTypeSizeInBits(ScalarTy);
           unsigned SrcBWSz =
               DL->getTypeSizeInBits(E->getMainOp()->getOperand(0)->getType());
@@ -14793,6 +14918,8 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals,
       auto *Inst = cast<Instruction>(EU.Scalar);
       InstructionCost ScalarCost = TTI->getInstructionCost(Inst, CostKind);
       auto OperandIsScalar = [&](Value *V) {
+        if (!isa<Instruction>(V))
+          return true;
         if (!isVectorized(V)) {
           // Some extractelements might be not vectorized, but
           // transformed into shuffle and removed from the function,
@@ -16873,7 +17000,18 @@ class BoUpSLP::ShuffleInstructionBuilder final : public BaseShuffleAnalysis {
       });
       InVectors.front() = Vec;
     }
-    if (!SubVectors.empty()) {
+    if (SubVectors.size() == 1 && SubVectors.front().second == 0 &&
+        SubVectors.front().first->getVectorFactor() == CommonMask.size()) {
+      Value *Vec = SubVectors.front().first->VectorizedValue;
+      if (Vec->getType()->isIntOrIntVectorTy())
+        Vec = castToScalarTyElem(
+            Vec, any_of(SubVectors.front().first->Scalars, [&](Value *V) {
+              if (isa<PoisonValue>(V))
+                return false;
+              return !isKnownNonNegative(V, SimplifyQuery(*R.DL));
+            }));
+      transformMaskAfterShuffle(CommonMask, CommonMask);
+    } else if (!SubVectors.empty()) {
       Value *Vec = InVectors.front();
       if (InVectors.size() == 2) {
         Vec = createShuffle(Vec, InVectors.back(), CommonMask);
diff --git a/llvm/test/Transforms/SLPVectorizer/RISCV/reordered-buildvector-scalars.ll b/llvm/test/Transforms/SLPVectorizer/RISCV/reordered-buildvector-scalars.ll
index d4e323819402c..117c892b79f9a 100644
--- a/llvm/test/Transforms/SLPVectorizer/RISCV/reordered-buildvector-scalars.ll
+++ b/llvm/test/Transforms/SLPVectorizer/RISCV/reordered-buildvector-scalars.ll
@@ -102,80 +102,88 @@ define fastcc i32 @test(i32 %0, i32 %add111.i.i, <4 x i32> %PredPel.i.sroa.86.72
 ; THRESH-NEXT:    [[SHR143_5_I_I9:%.*]] = ashr i32 [[TMP0]], 1
 ; THRESH-NEXT:    [[ADD1392_I:%.*]] = add i32 [[TMP0]], 1
 ; THRESH-NEXT:    [[MUL1445_I:%.*]] = shl i32 [[TMP0]], 1
-; THRESH-NEXT:    [[ADD2136_I:%.*]] = or i32 [[LOOPARRAY_SROA_24_0_I_I3]], [[TMP0]]
-; THRESH-NEXT:    [[SHR2137_I:%.*]] = lshr i32 [[ADD2136_I]], 1
-; THRESH-NEXT:    [[CONV2138_I:%.*]] = trunc i32 [[SHR2137_I]] to i16
 ; THRESH-NEXT:    [[ADD2174_I:%.*]] = add i32 [[MUL1445_I]], 2
 ; THRESH-NEXT:    [[SHR2175_I:%.*]] = lshr i32 [[ADD2174_I]], 2
-; THRESH-NEXT:    [[CONV2176_I:%.*]] = trunc i32 [[SHR2175_I]] to i16
 ; THRESH-NEXT:    [[ADD2190_I:%.*]] = or i32 [[ADD1392_I]], 1
 ; THRESH-NEXT:    [[ADD2191_I:%.*]] = add i32 [[ADD2190_I]], [[TMP0]]
-; THRESH-NEXT:    [[CONV2193_I:%.*]] = trunc i32 [[ADD2191_I]] to i16
-; THRESH-NEXT:    [[ADD2203_I:%.*]] = or i32 [[TMP0]], 1
-; THRESH-NEXT:    [[ADD2204_I:%.*]] = add i32 [[ADD2203_I]], [[TMP0]]
-; THRESH-NEXT:    [[CONV2206_I:%.*]] = trunc i32 [[ADD2204_I]] to i16
+; THRESH-NEXT:    [[TMP2:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> poison, <2 x i32> <i32 poison, i32 0>
+; THRESH-NEXT:    [[TMP3:%.*]] = insertelement <2 x i32> [[TMP2]], i32 [[LOOPARRAY_SROA_24_0_I_I3]], i32 0
+; THRESH-NEXT:    [[TMP4:%.*]] = add <2 x i32> [[TMP3]], splat (i32 1)
+; THRESH-NEXT:    [[TMP5:%.*]] = lshr <2 x i32> [[TMP4]], splat (i32 1)
 ; THRESH-NEXT:    [[ADD2235_I16:%.*]] = or i32 [[TMP0]], 1
 ; THRESH-NEXT:    [[ADD2236_I:%.*]] = add i32 [[ADD2235_I16]], 1
 ; THRESH-NEXT:    [[SHR2237_I:%.*]] = lshr i32 [[ADD2236_I]], 1
-; THRESH-NEXT:    [[CONV2238_I:%.*]] = trunc i32 [[SHR2237_I]] to i16
-; THRESH-NEXT:    store i16 [[CONV2238_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8196), align 4
-; THRESH-NEXT:    store i16 [[CONV2238_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8176), align 8
-; THRESH-NEXT:    [[ADD2258_I:%.*]] = or i32 [[ADD111_I_I]], [[TMP0]]
-; THRESH-NEXT:    [[SHR2259_I:%.*]] = lshr i32 [[ADD2258_I]], 1
-; THRESH-NEXT:    [[CONV2260_I:%.*]] = trunc i32 [[SHR2259_I]] to i16
-; THRESH-NEXT:    store i16 [[CONV2260_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8212), align 4
-; THRESH-NEXT:    store i16 [[CONV2260_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8192), align 8
-; THRESH-NEXT:    store i16 [[CONV2260_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8172), align 4
+; THRESH-NEXT:    [[TMP6:%.*]] = insertelement <2 x i32> poison, i32 [[ADD111_I_I]], i32 0
+; THRESH-NEXT:    [[TMP19:%.*]] = insertelement <2 x i32> [[TMP6]], i32 [[LOOPARRAY_SROA_24_0_I_I3]], i32 1
+; THRESH-NEXT:    [[TMP20:%.*]] = insertelement <2 x i32> poison, i32 [[TMP0]], i32 0
+; THRESH-NEXT:    [[TMP21:%.*]] = shufflevector <2 x i32> [[TMP20]], <2 x i32> poison, <2 x i32> zeroinitializer
+; THRESH-NEXT:    [[TMP22:%.*]] = or <2 x i32> [[TMP19]], [[TMP21]]
 ; THRESH-NEXT:    [[ADD2302_I:%.*]] = add i32 [[TMP0]], 1
 ; THRESH-NEXT:    [[SHR2303_I:%.*]] = lshr i32 [[ADD2302_I]], 1
-; THRESH-NEXT:    [[CONV2304_I:%.*]] = trunc i32 [[SHR2303_I]] to i16
-; THRESH-NEXT:    store i16 [[CONV2304_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8224), align 8
-; THRESH-NEXT:    store i16 [[CONV2304_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8204), align 4
-; THRESH-NEXT:    store i16 [[CONV2304_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8184), align 8
 ; THRESH-NEXT:    [[ADD2323_I:%.*]] = add i32 [[TMP0]], 1
 ; THRESH-NEXT:    [[ADD2324_I:%.*]] = or i32 [[ADD2323_I]], [[TMP0]]
 ; THRESH-NEXT:    [[SHR2325_I:%.*]] = lshr i32 [[ADD2324_I]], 1
-; THRESH-NEXT:    [[CONV2326_I:%.*]] = trunc i32 [[SHR2325_I]] to i16
-; THRESH-NEXT:    store i16 [[CONV2326_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8220), align 4
-; THRESH-NEXT:    store i16 [[CONV2326_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8200), align 8
-; THRESH-NEXT:    [[ADD2342_I:%.*]] = add i32 [[SHR143_5_I_I9]], 1
-; THRESH-NEXT:    [[SHR2343_I:%.*]] = lshr i32 [[ADD2342_I]], 1
-; THRESH-NEXT:    [[CONV2344_I:%.*]] = trunc i32 [[SHR2343_I]] to i16
-; THRESH-NEXT:    store i16 [[CONV2344_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8216), align 8
+; THRESH-NEXT:    [[TMP25:%.*]] = insertelement <2 x i32> poison, i32 [[TMP0]], i32 1
+; THRESH-NEXT:    [[TMP12:%.*]] = insertelement <2 x i32> [[TMP25]], i32 [[SHR143_5_I_I9]], i32 0
+; THRESH-NEXT:    [[TMP13:%.*]] = add <2 x i32> [[TMP12]], splat (i32 1)
+; THRESH-NEXT:    [[TMP14:%.*]] = or <2 x i32> [[TMP12]], splat (i32 1)
+; THRESH-NEXT:    [[TMP15:%.*]] = shufflevector <2 x i32> [[TMP13]], <2 x i32> [[TMP14]], <2 x i32> <i32 0, i32 3>
 ; THRESH-NEXT:    [[ADD2355_I:%.*]] = or i32 [[SHR143_5_I_I9]], 1
 ; THRESH-NEXT:    [[ADD2356_I:%.*]] = add i32 [[ADD2355_I]], [[TMP0]]
 ; THRESH-NEXT:    [[CONV2358_I:%.*]] = trunc i32 [[ADD2356_I]] to i16
 ; THRESH-NEXT:    store i16 [[CONV2358_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8232), align 8
-; THRESH-NEXT:    [[TMP2:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> poison, <2 x i32> <i32 poison, i32 0>
-; THRESH-NEXT:    [[TMP3:%.*]] = insertelement <2 x i32> [[TMP2]], i32 [[LOOPARRAY_SROA_24_0_I_I3]], i32 0
-; THRESH-NEXT:    [[TMP4:%.*]] = add <2 x i32> [[TMP3]], splat (i32 1)
-; THRESH-NEXT:    [[TMP5:%.*]] = lshr <2 x i32> [[TMP4]], splat (i32 1)
-; THRESH-NEXT:    [[TMP6:%.*]] = trunc <2 x i32> [[TMP5]] to <2 x i16>
-; THRESH-NEXT:    store <2 x i16> [[TMP6]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8180), align 4
 ; THRESH-NEXT:    [[ADD2393_I:%.*]] = or i32 [[LOOPARRAY_SROA_24_0_I_I3]], 1
 ; THRESH-NEXT:    [[ADD2394_I:%.*]] = add i32 [[ADD2393_I]], [[TMP0]]
-; THRESH-NEXT:    [[CONV2396_I:%.*]] = trunc i32 [[ADD2394_I]] to i16
-; THRESH-NEXT:    store i16 [[CONV2396_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8198), align 2
-; THRESH-NEXT:    store i16 [[CONV2396_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8178), align 2
-; THRESH-NEXT:    store i16 [[CONV2138_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8214), align 2
-; THRESH-NEXT:    store i16 [[CONV2138_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8194), align 2
-; THRESH-NEXT:    store i16 [[CONV2138_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8174), align 2
+; THRESH-NEXT:    [[TMP16:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; THRESH-NEXT:    [[TMP17:%.*]] = insertelement <8 x i32> [[TMP16]], i32 [[SHR2303_I]], i32 2
+; THRESH-NEXT:    [[TMP18:%.*]] = insertelement <8 x i32> [[TMP17]], i32 [[SHR2175_I]], i32 3
+; THRESH-NEXT:    [[CONV2176_I:%.*]] = trunc i32 [[SHR2175_I]] to i16
+; THRESH-NEXT:    [[CONV2238_I:%.*]] = trunc i32 [[SHR2237_I]] to i16
+; THRESH-NEXT:    store i16 [[CONV2238_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8196), align 4
 ; THRESH-NEXT:    [[TMP7:%.*]] = shufflevector <4 x i32> [[PREDPEL_I_SROA_86_72_VEC_EXTRACT]], <4 x i32> poison, <2 x i32> <i32 poison, i32 0>
 ; THRESH-NEXT:    [[TMP8:%.*]] = insertelement <2 x i32> [[TMP7]], i32 [[ADD111_I_I]], i32 0
 ; THRESH-NEXT:    [[TMP9:%.*]] = add <2 x i32> [[TMP8]], splat (i32 1)
 ; THRESH-NEXT:    [[TMP10:%.*]] = lshr <2 x i32> [[TMP9]], splat (i32 1)
+; THRESH-NEXT:    [[TMP23:%.*]] = shufflevector <2 x i32> [[TMP10]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; THRESH-NEXT:    [[TMP24:%.*]] = shufflevector <8 x i32> [[TMP18]], <8 x i32> [[TMP23]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 poison, i32 poison>
 ; THRESH-NEXT:    [[TMP11:%.*]] = trunc <2 x i32> [[TMP10]] to <2 x i16>
-; THRESH-NEXT:    [[TMP12:%.*]] = extractelement <2 x i16> [[TMP11]], i32 1
-; THRESH-NEXT:    store <2 x i16> [[TMP11]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8228), align 4
+; THRESH-NEXT:    [[TMP26:%.*]] = shufflevector <2 x i32> [[TMP10]], <2 x i32> poison, <4 x i32> <i32 1, i32 poison, i32 poison, i32 poison>
+; THRESH-NEXT:    [[TMP27:%.*]] = trunc <2 x i32> [[TMP10]] to <2 x i16>
+; THRESH-NEXT:    store <2 x i16> [[TMP27]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8228), align 4
 ; THRESH-NEXT:    store <2 x i16> [[TMP11]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8208), align 8
-; THRESH-NEXT:    store <2 x i16> [[TMP11]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8188), align 4
-; THRESH-NEXT:    store i16 [[TMP12]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8170), align 2
 ; THRESH-NEXT:    store i16 [[CONV2176_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8226), align 2
 ; THRESH-NEXT:    store i16 [[CONV2176_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8206), align 2
-; THRESH-NEXT:    store i16 [[CONV2176_I]], ptr getelementptr inbounds nuw (i8, ptr @images, i64 8186), align 2
+; THRESH-NEXT:    [[CONV...
[truncated]

@alexey-bataev (Member, Author)

Ping!

@RKSimon changed the title from "[SLP]Check for extracts, being replaced by original scalares, for user nodes" to "[SLP] Check for extracts, being replaced by original scalars, for user nodes" Jul 21, 2025
Created using spr 1.3.5
@gbossu (Contributor) left a comment

Drive-by comments. I think I understand what this is attempting to do, but I don't understand how the implementation works. I left a couple of comments, I would be happy to get your thoughts on them. Then I'll probably do another round of review.

Created using spr 1.3.5
Created using spr 1.3.5
Created using spr 1.3.5
@alexey-bataev (Member, Author)

Ping!

@gbossu (Contributor) left a comment

Sorry for getting back to this only now. I left a couple more comments, but I don't feel comfortable approving that change, as I think I'm still missing something.

AFAIU, we are trying to find out if it is worth vectorising instructions when some of them might need to be kept as scalars as well due to scalar users. That is why we first check if the user TreeEntry will have some of its instructions kept as scalars:

  • If there are none, then we can go ahead and vectorise VL
  • But if there are some, then we also need the operands of those scalar users to be scalarised. On top of that, we would need to gather all the instructions into a vector so that they can be used by the user TreeEntry as well. I think this is what you're checking with ExtractCost < UserScalarsCost + ScalarsCost

I still think the cost checks are too "local" to really answer the question "Is this worth vectorising if some lanes need to be kept as scalars?". It's not clear to me if looking at a single user TreeEntry is a good enough heuristic. To me this should be done when the whole tree is built. At this point, we can accurately compute the cost of keeping some lanes of a vectorised instruction as scalars because we also know which operands need to be scalarised as well as vectorised. If a TreeEntry turns out to be too costly because it is both partially scalarised and vectorised, then we can replace it with a Gather node, and essentially prune the tree.

I'll let the usual reviewers do a more thorough review, I don't understand enough about the patch (and the SLPVectorizer in general) to give an approval.

return true;
// The node is profitable for vectorization - success.
if (ExtractCost <= UserEntryCost)
return true;
gbossu (Contributor):

I don't understand why we are comparing the extraction cost from the user TreeEntry to the cost of that same user TreeEntry. Could that early exit be removed?

I think that function here is only meant to compare the cost of keeping some lanes from a user TreeEntry as scalars to the cost of extracting them.

alexey-bataev (Member, Author):

If the scalarization of the user node is not required, or is too small (less than the difference between the vector node and the original scalars), we can live with that and safely vectorize the operands. This is what it checks.

gbossu (Contributor):

I understand the first ExtractCost <= UserScalarsCost check, because if the lanes are already extracted from the user TreeEntry, then there's no need to extract or scalarize operands.

But the check below is still a bit of a mystery to me:

  // The node is profitable for vectorization - success.
  if (ExtractCost <= UserEntryCost)
    return true;

Which node are we talking about? The user node has already been vectorized, and this is checking nothing about the to-be-created node, so I feel this check should be removed.

alexey-bataev (Member, Author):

It just checks whether the extracts from the user node will be actual extracts, not the original scalars. In that case, we can vectorize the current (operand) node, because we don't need to keep it scalar. As I said, this estimation mimics the similar one in the actual analysis for scalars that should remain in the code instead of generating the extractelement instructions.

if (!Mask.empty())
ScalarsCost +=
getShuffleCost(TTI, TTI::SK_PermuteSingleSrc, VecTy, Mask, CostKind);
return ExtractCost < UserScalarsCost + ScalarsCost;
gbossu (Contributor):

I think that function would be more readable if it had a single return statement. E.g. something like:

return UserExtractCost < UserScalarsCost + OperandScalarizationCost;

alexey-bataev (Member, Author):

The original logic is easier to understand:

  1. If extracts are cheaper than scalarization - we can live with extracts.
  2. If extracts are cheaper than the vectorization benefits - we can live with extracts.

These 2 checks are extracted from the extract-to-scalars analysis.

  3. Last shot: if extracts are cheaper than scalarization of the user + its operands - still try to live with extracts, it will be more profitable.
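Those three checks can be restated compactly. A sketch with simplified integer costs (variable names follow the patch; Threshold stands for the -slp-threshold option; returning true means the extracts are tolerable and the operands can be vectorized):

  #include <cstdint>

  // Sketch of the early-exit cascade in isProfitableToVectorizeWithNonVecUsers.
  bool canLiveWithExtracts(int64_t ExtractCost, int64_t UserScalarsCost,
                           int64_t UserNodeCost, int64_t OperandScalarizationCost,
                           int64_t Threshold) {
    // 1. Extracts are cheaper than keeping the original user scalars.
    if (ExtractCost <= UserScalarsCost)
      return true;
    // 2. Even paying for the extracts, the vectorized user node stays profitable.
    if (ExtractCost + UserNodeCost <= -Threshold)
      return true;
    // 3. Last shot: extracts beat scalarizing the user and its operands.
    return ExtractCost - (UserScalarsCost + OperandScalarizationCost) < -Threshold;
  }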

if (S.getOpcode() == Instruction::Load ||
S.getOpcode() == Instruction::ExtractElement ||
S.getOpcode() == Instruction::GetElementPtr)
return true;
gbossu (Contributor):

But the cost of keeping a scalar load might be pretty high compared to just extracting it. What would happen if you remove this early exit and run through the cost checks below just like for other opcodes?

alexey-bataev (Member, Author):

This analysis is not a part of this logic; it is performed later. Here we just skip loads, as we may either extract them or keep them scalar without checking the operands. The decision on what to do with them is made in getTreeCost.

gbossu (Contributor):

Where would that analysis be done? Do you mean the analysis done at the very end to estimate the cost of external scalar uses?

alexey-bataev (Member, Author):

It is part of getTreeCost, in the loop running over ExternalUses.

@@ -10700,6 +10801,15 @@ void BoUpSLP::buildTreeRec(ArrayRef<Value *> VLRef, unsigned Depth,
return;
}

// Postpone vectorization, if the node is not profitable because of the
// external uses.
gbossu (Contributor):

Nit: "external uses" is a bit confusing, as we are looking at instructions in the tree, not outside of it. Maybe rephrase as because of scalar uses?

alexey-bataev (Member, Author):

We're estimating the cost of keeping the scalar instruction scalar because of the external uses, not because of the tree's internal uses. Internal uses are always vectorized (well, almost; except for the vector pointers being the addresses of the vectorized loads), so we don't need to check internal uses.

gbossu (Contributor):

Fair enough, eventually we will look at uses outside the tree because we check if UserTree has non-vectorized uses. I would still update the comment as:

  // Postpone vectorization if the node is not profitable because of
  // external uses of UserTreeIdx.UserTE

Otherwise it feels like we are checking external uses of VL, which is not what this function is doing.

alexey-bataev (Member, Author):

Ok, will try to make it more clear

@alexey-bataev (Member, Author)

> Sorry for getting back to this only now. I left a couple more comments, but I don't feel comfortable approving that change, as I think I'm still missing something.
>
> AFAIU, we are trying to find out if it is worth vectorising instructions when some of them might need to be kept as scalars as well due to scalar users. That is why we first check if the user TreeEntry will have some of its instructions kept as scalars:
>
>   • If there are none, then we can go ahead and vectorise VL
>   • But if there are some, then we also need the operands of those scalar users to be scalarised. On top of that, we would need to gather all the instructions into a vector so that they can be used by the user TreeEntry as well. I think this is what you're checking with ExtractCost < UserScalarsCost + ScalarsCost
>
> I still think the cost checks are too "local" to really answer the question "Is this worth vectorising if some lanes need to be kept as scalars?". It's not clear to me if looking at a single user TreeEntry is a good enough heuristic.

TreeEntries always have a single user (by design).

> To me this should be done when the whole tree is built. At this point, we can accurately compute the cost of keeping some lanes of a vectorised instruction as scalars because we also know which operands need to be scalarised as well as vectorised. If a TreeEntry turns out to be too costly because it is both partially scalarised and vectorised, then we can replace it with a Gather node, and essentially prune the tree.

It is too late already for making the decision.

> I'll let the usual reviewers do a more thorough review, I don't understand enough about the patch (and the SLPVectorizer in general) to give an approval.

@gbossu (Contributor) left a comment

Thanks for taking the time to answer my questions.

> > I still think the cost checks are too "local" to really answer the question "Is this worth vectorising if some lanes need to be kept as scalars?". It's not clear to me if looking at a single user TreeEntry is a good enough heuristic.
>
> TreeEntries always have a single user (by design).

Sorry it wasn't entirely clear, I understand that TreeEntries have a single user because of the way the SLPVectorizer works. What I meant is: if we want to estimate the scalarization cost of operands because some lanes down the tree need to be extracted, then we need to look at more than just one user node. We also need to look at the user of that user node, and so on, until we find the node that has decided to scalarize some lanes. Essentially, we would need to find the whole dependence tree that needs to be scalarized. Now, I understand this is expensive; that's why I suggested to do that after the whole tree is built. At that point we would be able to properly estimate the scalarization cost, and we can prune the tree if needed.

> > If a TreeEntry turns out to be too costly because it is both partially scalarised and vectorised, then we can replace it with a Gather node, and essentially prune the tree.
>
> It is too late already for making the decision.

Why is that? We do plenty of optimisations once the tree is built.


@alexey-bataev (Member, Author)

> Thanks for taking the time to answer my questions.
>
> > > I still think the cost checks are too "local" to really answer the question "Is this worth vectorising if some lanes need to be kept as scalars?". It's not clear to me if looking at a single user TreeEntry is a good enough heuristic.
> >
> > TreeEntries always have a single user (by design).
>
> Sorry it wasn't entirely clear, I understand that TreeEntries have a single user because of the way the SLPVectorizer works. What I meant is: if we want to estimate the scalarization cost of operands because some lanes down the tree need to be extracted, then we need to look at more than just one user node. We also need to look at the user of that user node, and so on, until we find the node that has decided to scalarize some lanes.

No, we don't need that. This patch looks at the node and its user, and checks the user node to see if the scalars should be extracted. Previous users were checked already on the previous iteration.

> Essentially, we would need to find the whole dependence tree that needs to be scalarized. Now, I understand this is expensive; that's why I suggested to do that after the whole tree is built. At that point we would be able to properly estimate the scalarization cost, and we can prune the tree if needed.

In general, yes, but it is not necessary. It would explode compile time and would not give significant benefits. The SLP vectorizer uses a greedy approach, so all the analysis tries to follow the same approach to avoid compile-time regressions.

> > If a TreeEntry turns out to be too costly because it is both partially scalarised and vectorised, then we can replace it with a Gather node, and essentially prune the tree.

This is another approach, which I have a plan for. It is called SLP throttling. There is a ticket to support it, but it will require lots of time to implement.
The approach implemented in this patch is much simpler and faster: it just postpones potentially expensive nodes and then vectorizes them in the transformNodes() pass.

> > It is too late already for making the decision.
>
> Why is that? We do plenty of optimisations once the tree is built.

Yes, but as I said, that is a completely different and more expensive approach.
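For context, the postpone-and-retry flow mentioned above is visible in the two hunks of the diff: buildTreeRec records the node instead of vectorizing it, and transformNodes() retries it once the rest of the tree is built. Condensed from the patch itself (no new API):

  // In buildTreeRec: a node whose user has expensive external uses is not
  // vectorized; its index is recorded and a non-vectorized entry is emitted.
  if (!isProfitableToVectorizeWithNonVecUsers(S, UserTreeIdx, VL,
                                              ReuseShuffleIndices)) {
    PostponedNodesWithNonVecUsers.insert(VectorizableTree.size());
    auto Invalid = ScheduleBundle::invalid();
    newTreeEntry(VL, Invalid /*not vectorized*/, S, UserTreeIdx,
                 ReuseShuffleIndices);
    return;
  }

  // In transformNodes(): re-run the profitability check; if the user's
  // external uses ended up vectorized, retry vectorizing the postponed node.
  if (PostponedNodesWithNonVecUsers.contains(E.Idx) &&
      isProfitableToVectorizeWithNonVecUsers(E.getOperations(), EI, E.Scalars,
                                             E.ReuseShuffleIndices))
    buildTreeRec(VL, 0, EdgeInfo(&E, UINT_MAX));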

Created using spr 1.3.5
@alexey-bataev (Member, Author)

Ping!

1 similar comment
@alexey-bataev (Member, Author)

Ping!

@gbossu (Contributor)

gbossu commented Sep 2, 2025

Just in case you were expecting a review from me, as I wrote in this comment, I don't feel comfortable approving the change because I don't completely understand what it is doing. I'll let the usual experts do the actual thorough review. Sorry if that was not clear :/
