[SROA] Use tree-structure merge to remove alloca #152793

Chengjunp · 2025-08-08T20:51:52Z

This patch introduces a new optimization in SROA that handles the pattern where multiple non-overlapping vector stores completely fill an alloca.

The current approach to handle this pattern introduces many .vecexpand and .vecblend instructions, which can dramatically slow down compilation when dealing with large allocas built from many small vector stores. For example, consider an alloca of type <128 x float> filled by 64 stores of <2 x float> each. The current implementation requires:

64 shufflevectors( .vecexpand)
64 selects ( .vecblend )
All operations use masks of size 128
These operations form a long dependency chain

This kind of IR is both difficult to optimize and slow to compile, particularly impacting the InstCombine pass.

This patch introduces a tree-structured merge approach that significantly reduces the number of operations and improves compilation performance.

Key features:

Detects when vector stores completely fill an alloca without gaps
Ensures no loads occur in the middle of the store sequence
Uses a tree-based approach with shufflevectors to merge stored values
Reduces the number of intermediate operations compared to linear merging
Eliminates the long dependency chains that hurt optimization

Example transformation:

// Before: (stores do not have to be in order) 
%alloca = alloca <8 x float>
store <2 x float> %val0, ptr %alloca       ; offset 0-1
store <2 x float> %val2, ptr %alloca+16    ; offset 4-5  
store <2 x float> %val1, ptr %alloca+8     ; offset 2-3
store <2 x float> %val3, ptr %alloca+24    ; offset 6-7
%result = load <8 x float>, ptr %alloca

// After (tree-structured merge):
%shuffle0 = shufflevector %val0, %val1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%shuffle1 = shufflevector %val2, %val3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>  
%result = shufflevector %shuffle0, %shuffle1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

Benefits:

Logarithmic depth (O(log n)) instead of linear dependency chains
Fewer total operations for large vectors
Better optimization opportunities for subsequent passes
Significant compilation time improvements for large vector patterns

For some large cases, the compile time can be reduced from about 60s to less than 3s.

llvmbot · 2025-08-08T20:52:13Z

@llvm/pr-subscribers-llvm-transforms

Author: Chengjun (Chengjunp)

Changes

This patch introduces a new optimization in SROA that handles the pattern where multiple non-overlapping vector stores completely fill an alloca.

The current approach to handle this pattern introduces many .vecexpand and .vecblend instructions, which can dramatically slow down compilation when dealing with large allocas built from many small vector stores. For example, consider an alloca of type <128 x float> filled by 64 stores of <2 x float> each. The current implementation requires:

64 shufflevectors( .vecexpand)
64 selects ( .vecblend )
All operations use masks of size 128
These operations form a long dependency chain

This kind of IR is both difficult to optimize and slow to compile, particularly impacting the InstCombine pass.

This patch introduces a tree-structured merge approach that significantly reduces the number of operations and improves compilation performance.

Key features:

Detects when vector stores completely fill an alloca without gaps
Ensures no loads occur in the middle of the store sequence
Uses a tree-based approach with shufflevectors to merge stored values
Reduces the number of intermediate operations compared to linear merging
Eliminates the long dependency chains that hurt optimization

Example transformation:

// Before: (stores do not have to be in order) 
%alloca = alloca &lt;8 x float&gt;
store &lt;2 x float&gt; %val0, ptr %alloca       ; offset 0-1
store &lt;2 x float&gt; %val2, ptr %alloca+16    ; offset 4-5  
store &lt;2 x float&gt; %val1, ptr %alloca+8     ; offset 2-3
store &lt;2 x float&gt; %val3, ptr %alloca+24    ; offset 6-7
%result = load &lt;8 x float&gt;, ptr %alloca

// After (tree-structured merge):
%shuffle0 = shufflevector %val0, %val1, &lt;4 x i32&gt; &lt;i32 0, i32 1, i32 2, i32 3&gt;
%shuffle1 = shufflevector %val2, %val3, &lt;4 x i32&gt; &lt;i32 0, i32 1, i32 2, i32 3&gt;  
%result = shufflevector %shuffle0, %shuffle1, &lt;8 x i32&gt; &lt;i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7&gt;

Benefits:

Logarithmic depth (O(log n)) instead of linear dependency chains
Fewer total operations for large vectors
Better optimization opportunities for subsequent passes
Significant compilation time improvements for large vector patterns

For some large cases, the compile time can be reduced from about 60s to less than 3s.

Patch is 50.01 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/152793.diff

3 Files Affected:

(modified) llvm/lib/Transforms/Scalar/SROA.cpp (+288-7)
(added) llvm/test/Transforms/SROA/vector-promotion-cannot-tree-structure-merge.ll (+214)
(added) llvm/test/Transforms/SROA/vector-promotion-via-tree-structure-merge.ll (+408)

diff --git a/llvm/lib/Transforms/Scalar/SROA.cpp b/llvm/lib/Transforms/Scalar/SROA.cpp
index d6e27aa20730b..2bbaf7813c3c0 100644
--- a/llvm/lib/Transforms/Scalar/SROA.cpp
+++ b/llvm/lib/Transforms/Scalar/SROA.cpp
@@ -91,6 +91,7 @@
 #include <cstdint>
 #include <cstring>
 #include <iterator>
+#include <queue>
 #include <string>
 #include <tuple>
 #include <utility>
@@ -2678,6 +2679,53 @@ static Value *insertVector(IRBuilderTy &IRB, Value *Old, Value *V,
   return V;
 }
 
+static Value *mergeTwoVectors(Value *V0, Value *V1, IRBuilder<> &Builder) {
+  assert(V0->getType()->isVectorTy() && V1->getType()->isVectorTy() &&
+         "Can not merge two non-vector values");
+
+  // V0 and V1 are vectors
+  // Create a new vector type with combined elements
+  // Use ShuffleVector to concatenate the vectors
+  auto *VecType0 = cast<FixedVectorType>(V0->getType());
+  auto *VecType1 = cast<FixedVectorType>(V1->getType());
+
+  assert(VecType0->getElementType() == VecType1->getElementType() &&
+         "Can not merge two vectors with different element types");
+  unsigned NumElts0 = VecType0->getNumElements();
+  unsigned NumElts1 = VecType1->getNumElements();
+
+  SmallVector<int, 16> ShuffleMask;
+
+  if (NumElts0 == NumElts1) {
+    for (unsigned i = 0; i < NumElts0 + NumElts1; ++i)
+      ShuffleMask.push_back(i);
+  } else {
+    // If two vectors have different sizes, we need to extend
+    // the smaller vector to the size of the larger vector.
+    unsigned SmallSize = std::min(NumElts0, NumElts1);
+    unsigned LargeSize = std::max(NumElts0, NumElts1);
+    bool IsV0Smaller = NumElts0 < NumElts1;
+    Value *SmallVec = IsV0Smaller ? V0 : V1;
+
+    SmallVector<int, 16> ExtendMask;
+    for (unsigned i = 0; i < SmallSize; ++i)
+      ExtendMask.push_back(i);
+    for (unsigned i = SmallSize; i < LargeSize; ++i)
+      ExtendMask.push_back(PoisonMaskElem);
+    Value *ExtendedVec = Builder.CreateShuffleVector(
+        SmallVec, PoisonValue::get(SmallVec->getType()), ExtendMask);
+    LLVM_DEBUG(dbgs() << "    shufflevector: " << *ExtendedVec << "\n");
+    V0 = IsV0Smaller ? ExtendedVec : V0;
+    V1 = IsV0Smaller ? V1 : ExtendedVec;
+    for (unsigned i = 0; i < NumElts0; ++i)
+      ShuffleMask.push_back(i);
+    for (unsigned i = 0; i < NumElts1; ++i)
+      ShuffleMask.push_back(LargeSize + i);
+  }
+
+  return Builder.CreateShuffleVector(V0, V1, ShuffleMask);
+}
+
 namespace {
 
 /// Visitor to rewrite instructions using p particular slice of an alloca
@@ -2822,6 +2870,230 @@ class AllocaSliceRewriter : public InstVisitor<AllocaSliceRewriter, bool> {
     return CanSROA;
   }
 
+  /// Attempts to rewrite a partition using tree-structured merge optimization.
+  ///
+  /// This function analyzes a partition to determine if it can be optimized
+  /// using a tree-structured merge pattern, where multiple non-overlapping
+  /// stores completely fill an alloca. And there is no load from the alloca in
+  /// the middle of the stores. Such patterns can be optimized by eliminating
+  /// the intermediate stores and directly constructing the final vector by
+  /// using shufflevectors.
+  ///
+  /// Example transformation:
+  /// Before: (stores do not have to be in order)
+  ///   %alloca = alloca <8 x float>
+  ///   store <2 x float> %val0, ptr %alloca             ; offset 0-1
+  ///   store <2 x float> %val2, ptr %alloca+16          ; offset 4-5
+  ///   store <2 x float> %val1, ptr %alloca+8           ; offset 2-3
+  ///   store <2 x float> %val3, ptr %alloca+24          ; offset 6-7
+  ///
+  /// After:
+  ///   %alloca = alloca <8 x float>
+  ///   %shuffle0 = shufflevector %val0, %val1, <4 x i32> <i32 0, i32 1, i32 2,
+  ///                 i32 3>
+  ///   %shuffle1 = shufflevector %val2, %val3, <4 x i32> <i32 0, i32 1, i32 2,
+  ///                 i32 3>
+  ///   %shuffle2 = shufflevector %shuffle0, %shuffle1, <8 x i32> <i32 0, i32 1,
+  ///                 i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ///   store %shuffle2, ptr %alloca
+  ///
+  /// The optimization looks for partitions that:
+  /// 1. Have no overlapping split slice tails
+  /// 2. Contain non-overlapping stores that cover the entire alloca
+  /// 3. Have exactly one load that reads the complete alloca structure and not
+  ///    in the middle of the stores (TODO: maybe we can relax the constraint
+  ///    about reading the entire alloca structure)
+  ///
+  /// \param P The partition to analyze and potentially rewrite
+  /// \return An optional vector of values that were deleted during the rewrite
+  ///         process, or std::nullopt if the partition cannot be optimized
+  ///         using tree-structured merge
+  std::optional<SmallVector<Value *, 4>>
+  rewriteTreeStructuredMerge(Partition &P) {
+    // No tail slices that overlap with the partition
+    if (P.splitSliceTails().size() > 0)
+      return std::nullopt;
+
+    SmallVector<Value *, 4> DeletedValues;
+    LoadInst *TheLoad = nullptr;
+
+    // Structure to hold store information
+    struct StoreInfo {
+      StoreInst *Store;
+      uint64_t BeginOffset;
+      uint64_t EndOffset;
+      Value *StoredValue;
+      TypeSize StoredTypeSize = TypeSize::getZero();
+
+      StoreInfo(StoreInst *SI, uint64_t Begin, uint64_t End, Value *Val,
+                TypeSize StoredTypeSize)
+          : Store(SI), BeginOffset(Begin), EndOffset(End), StoredValue(Val),
+            StoredTypeSize(StoredTypeSize) {}
+    };
+
+    SmallVector<StoreInfo, 4> StoreInfos;
+
+    // The alloca must be a fixed vector type
+    auto *AllocatedTy = NewAI.getAllocatedType();
+    if (!isa<FixedVectorType>(AllocatedTy))
+      return std::nullopt;
+
+    Slice *LoadSlice = nullptr;
+    Type *LoadElementType = nullptr;
+    Type *StoreElementType = nullptr;
+    for (Slice &S : P) {
+      auto *User = cast<Instruction>(S.getUse()->getUser());
+      if (auto *LI = dyn_cast<LoadInst>(User)) {
+        // Do not handle the case where there is more than one load
+        // TODO: maybe we can handle this case
+        if (TheLoad)
+          return std::nullopt;
+        // If load is not a fixed vector type, we do not handle it
+        // If the number of loaded bits is not the same as the new alloca type
+        // size, we do not handle it
+        auto *FixedVecTy = dyn_cast<FixedVectorType>(LI->getType());
+        if (!FixedVecTy)
+          return std::nullopt;
+        if (DL.getTypeSizeInBits(FixedVecTy) !=
+            DL.getTypeSizeInBits(NewAI.getAllocatedType()))
+          return std::nullopt;
+        LoadElementType = FixedVecTy->getElementType();
+        TheLoad = LI;
+        LoadSlice = &S;
+      } else if (auto *SI = dyn_cast<StoreInst>(User)) {
+        // The store needs to be a fixed vector type
+        // All the stores should have the same element type
+        Type *StoredValueType = SI->getValueOperand()->getType();
+        Type *CurrentElementType = nullptr;
+        TypeSize StoredTypeSize = TypeSize::getZero();
+        if (auto *FixedVecTy = dyn_cast<FixedVectorType>(StoredValueType)) {
+          // Fixed vector type - use its element type
+          CurrentElementType = FixedVecTy->getElementType();
+          StoredTypeSize = DL.getTypeSizeInBits(FixedVecTy);
+        } else
+          return std::nullopt;
+        // Check element type consistency across all stores
+        if (StoreElementType && StoreElementType != CurrentElementType)
+          return std::nullopt;
+        StoreElementType = CurrentElementType;
+        StoreInfos.emplace_back(SI, S.beginOffset(), S.endOffset(),
+                                SI->getValueOperand(), StoredTypeSize);
+      } else {
+        // If we have instructions other than load and store, we cannot do the
+        // tree structured merge
+        return std::nullopt;
+      }
+    }
+    // If we do not have any load, we cannot do the tree structured merge
+    if (!TheLoad)
+      return std::nullopt;
+
+    // If we do not have any stores, we cannot do the tree structured merge
+    if (StoreInfos.empty())
+      return std::nullopt;
+
+    // The load and store element types should be the same
+    if (LoadElementType != StoreElementType)
+      return std::nullopt;
+
+    // The load should cover the whole alloca
+    // TODO: maybe we can relax this constraint
+    if (!LoadSlice || LoadSlice->beginOffset() != NewAllocaBeginOffset ||
+        LoadSlice->endOffset() != NewAllocaEndOffset)
+      return std::nullopt;
+
+    // Stores should not overlap and should cover the whole alloca
+    // Sort by begin offset
+    llvm::sort(StoreInfos, [](const StoreInfo &A, const StoreInfo &B) {
+      return A.BeginOffset < B.BeginOffset;
+    });
+
+    // Check for overlaps and coverage
+    uint64_t ExpectedStart = NewAllocaBeginOffset;
+    TypeSize TotalStoreBits = TypeSize::getZero();
+    Instruction *PrevStore = nullptr;
+    for (auto &StoreInfo : StoreInfos) {
+      uint64_t BeginOff = StoreInfo.BeginOffset;
+      uint64_t EndOff = StoreInfo.EndOffset;
+
+      // Check for gap or overlap
+      if (BeginOff != ExpectedStart)
+        return std::nullopt;
+
+      ExpectedStart = EndOff;
+      TotalStoreBits += StoreInfo.StoredTypeSize;
+      PrevStore = StoreInfo.Store;
+    }
+    // Check that stores cover the entire alloca
+    // We need check both the end offset and the total store bits
+    if (ExpectedStart != NewAllocaEndOffset ||
+        TotalStoreBits != DL.getTypeSizeInBits(NewAI.getAllocatedType()))
+      return std::nullopt;
+
+    // Stores should be in the same basic block
+    // The load should not be in the middle of the stores
+    BasicBlock *LoadBB = TheLoad->getParent();
+    BasicBlock *StoreBB = StoreInfos[0].Store->getParent();
+
+    for (auto &StoreInfo : StoreInfos) {
+      if (StoreInfo.Store->getParent() != StoreBB)
+        return std::nullopt;
+      if (LoadBB == StoreBB && !StoreInfo.Store->comesBefore(TheLoad))
+        return std::nullopt;
+    }
+
+    // If we reach here, the partition can be merged with a tree structured
+    // merge
+    LLVM_DEBUG({
+      dbgs() << "Tree structured merge rewrite:\n  Load: " << *TheLoad
+             << "\n Ordered stores:\n";
+      for (auto [i, Info] : enumerate(StoreInfos))
+        dbgs() << "    [" << i << "] Range[" << Info.BeginOffset << ", "
+               << Info.EndOffset << ") \tStore: " << *Info.Store
+               << "\tValue: " << *Info.StoredValue << "\n";
+    });
+
+    // Instead of having these stores, we merge all the stored values into a
+    // vector and store the merged value into the alloca
+    std::queue<Value *> VecElements;
+    IRBuilder<> Builder(StoreInfos.back().Store);
+    for (const auto &Info : StoreInfos) {
+      DeletedValues.push_back(Info.Store);
+      VecElements.push(Info.StoredValue);
+    }
+
+    LLVM_DEBUG(dbgs() << "  Rewrite stores into shufflevectors:\n");
+    while (VecElements.size() > 1) {
+      uint64_t NumElts = VecElements.size();
+      for (uint64_t i = 0; i < NumElts / 2; i++) {
+        Value *V0 = VecElements.front();
+        VecElements.pop();
+        Value *V1 = VecElements.front();
+        VecElements.pop();
+        Value *Merged = mergeTwoVectors(V0, V1, Builder);
+        LLVM_DEBUG(dbgs() << "    shufflevector: " << *Merged << "\n");
+        VecElements.push(Merged);
+      }
+      if (NumElts % 2 == 1) {
+        Value *V = VecElements.front();
+        VecElements.pop();
+        VecElements.push(V);
+      }
+    }
+
+    // Store the merged value into the alloca
+    Value *MergedValue = VecElements.front();
+    Builder.CreateAlignedStore(MergedValue, &NewAI, getSliceAlign());
+
+    IRBuilder<> LoadBuilder(TheLoad);
+    TheLoad->replaceAllUsesWith(LoadBuilder.CreateAlignedLoad(
+        TheLoad->getType(), &NewAI, getSliceAlign(), TheLoad->isVolatile(),
+        TheLoad->getName() + ".sroa.new.load"));
+    DeletedValues.push_back(TheLoad);
+
+    return DeletedValues;
+  }
+
 private:
   // Make sure the other visit overloads are visible.
   using Base::visit;
@@ -4996,13 +5268,22 @@ AllocaInst *SROA::rewritePartition(AllocaInst &AI, AllocaSlices &AS,
                                P.endOffset(), IsIntegerPromotable, VecTy,
                                PHIUsers, SelectUsers);
   bool Promotable = true;
-  for (Slice *S : P.splitSliceTails()) {
-    Promotable &= Rewriter.visit(S);
-    ++NumUses;
-  }
-  for (Slice &S : P) {
-    Promotable &= Rewriter.visit(&S);
-    ++NumUses;
+  // Check whether we can have tree-structured merge.
+  std::optional<SmallVector<Value *, 4>> DeletedValues =
+      Rewriter.rewriteTreeStructuredMerge(P);
+  if (DeletedValues) {
+    NumUses += DeletedValues->size() + 1;
+    for (Value *V : *DeletedValues)
+      DeadInsts.push_back(V);
+  } else {
+    for (Slice *S : P.splitSliceTails()) {
+      Promotable &= Rewriter.visit(S);
+      ++NumUses;
+    }
+    for (Slice &S : P) {
+      Promotable &= Rewriter.visit(&S);
+      ++NumUses;
+    }
   }
 
   NumAllocaPartitionUses += NumUses;
diff --git a/llvm/test/Transforms/SROA/vector-promotion-cannot-tree-structure-merge.ll b/llvm/test/Transforms/SROA/vector-promotion-cannot-tree-structure-merge.ll
new file mode 100644
index 0000000000000..61d77478e0b59
--- /dev/null
+++ b/llvm/test/Transforms/SROA/vector-promotion-cannot-tree-structure-merge.ll
@@ -0,0 +1,214 @@
+; REQUIRES: asserts
+; RUN: opt < %s -passes='sroa<preserve-cfg>' -disable-output -debug-only=sroa 2>&1 | FileCheck %s
+; RUN: opt < %s -passes='sroa<modify-cfg>' -disable-output -debug-only=sroa 2>&1 | FileCheck %s
+
+target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n8:16:32:64"
+
+; CHECK-NOT: Tree structured merge rewrite
+define i32 @test_alloca_not_fixed_vector() {
+entry:
+  %alloca = alloca [4 x float]
+
+  %ptr0 = getelementptr inbounds [4 x float], ptr %alloca, i32 0, i32 0
+  store float 1.0, ptr %ptr0
+
+  %ptr1 = getelementptr inbounds [4 x float], ptr %alloca, i32 0, i32 1
+  store float 2.0, ptr %ptr1
+
+  %result = load i32, ptr %alloca
+  ret i32 %result
+}
+
+define <4 x float> @test_more_than_one_load(<2 x float> %a, <2 x float> %b) {
+entry:
+  %alloca = alloca <4 x float>
+
+  %ptr0 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 0
+  store <2 x float> %a, ptr %ptr0
+
+  %ptr1 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 2
+  store <2 x float> %b, ptr %ptr1
+
+  %result1 = load <4 x float>, ptr %alloca
+  %result2 = load <4 x float>, ptr %alloca
+
+  %final = fadd <4 x float> %result1, %result2
+  ret <4 x float> %final
+}
+
+define void @test_no_load(<4 x float> %a) {
+entry:
+  %alloca = alloca <4 x float>
+  store <4 x float> %a, ptr %alloca
+  ret void
+}
+
+define i32 @test_load_not_fixed_vector(<2 x float> %a, <2 x float> %b) {
+entry:
+  %alloca = alloca <4 x float>
+
+  %ptr0 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 0
+  store <2 x float> %a, ptr %ptr0
+
+  %ptr1 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 2
+  store <2 x float> %b, ptr %ptr1
+
+  %result = load i32, ptr %alloca
+  ret i32 %result
+}
+
+define <3 x float> @test_load_not_covering_alloca(<2 x float> %a, <2 x float> %b) {
+entry:
+  %alloca = alloca <4 x float>
+
+  %ptr0 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 0
+  store <2 x float> %a, ptr %ptr0
+
+  %ptr1 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 2
+  store <2 x float> %b, ptr %ptr1
+
+  %result = load <3 x float>, ptr %ptr0
+  ret <3 x float> %result
+}
+
+define <4 x float> @test_store_not_fixed_vector(<vscale x 2 x float> %a) {
+entry:
+  %alloca = alloca <4 x float>
+
+  %ptr0 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 0
+  %fixed = extractelement <vscale x 2 x float> %a, i32 0
+  store float %fixed, ptr %ptr0
+
+  %result = load <4 x float>, ptr %alloca
+  ret <4 x float> %result
+}
+
+define <4 x float> @test_store_not_same_element_type() {
+entry:
+  %alloca = alloca <4 x float>
+
+  %ptr0 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 0
+  %float_vec = insertelement <2 x float> undef, float 1.0, i32 0
+  %float_vec2 = insertelement <2 x float> %float_vec, float 2.0, i32 1
+  store <2 x float> %float_vec2, ptr %ptr0
+
+  %ptr1 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 2
+  %int_vec = insertelement <2 x i32> undef, i32 3, i32 0
+  %int_vec2 = insertelement <2 x i32> %int_vec, i32 4, i32 1
+  store <2 x i32> %int_vec2, ptr %ptr1
+
+  %result = load <4 x float>, ptr %alloca
+  ret <4 x float> %result
+}
+
+define <4 x i32> @test_load_store_different_element_type() {
+entry:
+  %alloca = alloca <4 x float>
+
+  %ptr0 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 0
+  %float_vec = insertelement <2 x float> undef, float 1.0, i32 0
+  %float_vec2 = insertelement <2 x float> %float_vec, float 2.0, i32 1
+  store <2 x float> %float_vec2, ptr %ptr0
+
+  %ptr1 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 2
+  %float_vec3 = insertelement <2 x float> undef, float 3.0, i32 0
+  %float_vec4 = insertelement <2 x float> %float_vec3, float 4.0, i32 1
+  store <2 x float> %float_vec4, ptr %ptr1
+
+  %result = load <4 x i32>, ptr %alloca
+  ret <4 x i32> %result
+}
+
+define <4 x float> @test_no_stores() {
+entry:
+  %alloca = alloca <4 x float>
+
+  %result = load <4 x float>, ptr %alloca
+  ret <4 x float> %result
+}
+
+define <4 x float> @test_stores_overlapping(<2 x float> %a, <2 x float> %b, <2 x float> %c) {
+entry:
+  %alloca = alloca <4 x float>
+
+  %ptr0 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 0
+  store <2 x float> %a, ptr %ptr0
+
+  %ptr1 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 1
+  store <2 x float> %b, ptr %ptr1
+
+  %ptr2 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 2
+  store <2 x float> %c, ptr %ptr2
+
+  %result = load <4 x float>, ptr %alloca
+  ret <4 x float> %result
+}
+
+define <4 x float> @test_stores_not_covering_alloca(<2 x float> %a) {
+entry:
+  %alloca = alloca <4 x float>
+
+  %ptr0 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 0
+  store <2 x float> %a, ptr %ptr0
+
+  %result = load <4 x float>, ptr %alloca
+  ret <4 x float> %result
+}
+
+define <4 x float> @test_stores_not_same_basic_block(<2 x float> %a, <2 x float> %b, i1 %cond) {
+entry:
+  %alloca = alloca <4 x float>
+
+  %ptr0 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 0
+  store <2 x float> %a, ptr %ptr0
+
+  br i1 %cond, label %then, label %else
+
+then:
+  %ptr1 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 2
+  store <2 x float> %b, ptr %ptr1
+  br label %merge
+
+else:
+  br label %merge
+
+merge:
+  %result = load <4 x float>, ptr %alloca
+  ret <4 x float> %result
+}
+
+define <4 x float> @test_load_before_stores(<2 x float> %a, <2 x float> %b) {
+entry:
+  %alloca = alloca <4 x float>
+
+  %ptr0 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 0
+  store <2 x float> %a, ptr %ptr0
+
+  %intermediate = load <4 x float>, ptr %alloca
+
+  %ptr1 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 2
+  store <2 x float> %b, ptr %ptr1
+
+  ret <4 x float> %intermediate
+}
+
+define <4 x float> @test_other_instructions(<2 x float> %a, <2 x float> %b) {
+entry:
+  %alloca = alloca <4 x float>
+  
+  ; Store first vector
+  %ptr0 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 0
+  store <2 x float> %a, ptr %ptr0
+  
+  ; Other instruction (memset) that's not a simple load/store
+  call void @llvm.memset.p0.i64(ptr %alloca, i8 0, i64 8, i1 false)
+  
+  ; Store second vector
+  %ptr1 = getelementptr inbounds <4 x float>, ptr %alloca, i32 0, i32 2
+  store <2 x float> %b, ptr %ptr1
+  
+  %result = load <4 x float>, ptr %alloca
+  ret <4 x float> %result
+}
+
+declare void @llvm.memset.p0.i64(ptr nocapture writeonly, i8, i64, i1 immarg)
diff --git a/llvm/test/Transforms/SROA/vector-promotion-via-tree-structure-merge.ll b/llvm/test/Transforms/SROA/vector-promotion-via-tree-structure-merge.ll
new file mode 100644
index 0000000000000..c74b0b932ddef
--- /dev/null
+++ b/llvm/test/Transforms/SROA/vector-promotion-via-tree-structure-merge.ll
@@ -0,0 +1,408 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -passes='sroa<preserve-cfg>' -S | FileCheck %s --check-prefixes=CHEC...
[truncated]

github-actions · 2025-08-08T20:54:42Z

✅ With the latest revision this PR passed the undef deprecator.

…com/Chengjunp/llvm-project into chengjunp/SROA_tree_merge_of_stores

Chengjunp · 2025-08-08T22:41:39Z

I have a question in the function bool SROA::deleteDeadInstructions. I see that

We only remove the debug info attached to Allocas
We only remove DbgDeclare, but not DbgValues

    // If the instruction is an alloca, find the possible dbg.declare connected
    // to it, and remove it too. We must do this before calling RAUW or we will
    // not be able to find it.
    if (AllocaInst *AI = dyn_cast<AllocaInst>(I)) {
      DeletedAllocas.insert(AI);
      for (DbgVariableRecord *OldDII : findDVRDeclares(AI))
        OldDII->eraseFromParent();
    }

As a result, the tests run with -passes=debugify,sroa will keep many #dbg_values for the deleted undef values.
Is this expected? Or we should make some changes here.

Chengjunp · 2025-08-13T21:27:33Z

@nikic @dtcxzyw Could you help to have a review? Thanks!

… chengjunp/SROA_tree_merge_of_stores

github-actions · 2025-08-15T19:36:06Z

✅ With the latest revision this PR passed the C/C++ code formatter.

dtcxzyw · 2025-08-16T12:06:03Z

******************** TEST 'LLVM :: Transforms/SROA/vector-conversion.ll' FAILED ********************
Exit Code: 2

Command Output (stderr):
--
/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/opt < /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/Transforms/SROA/vector-conversion.ll -passes='sroa<preserve-cfg>' -S | /home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/FileCheck /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/Transforms/SROA/vector-conversion.ll --check-prefixes=CHECK,CHECK-PRESERVE-CFG # RUN: at line 2
+ /home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/opt '-passes=sroa<preserve-cfg>' -S
+ /home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/FileCheck /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/Transforms/SROA/vector-conversion.ll --check-prefixes=CHECK,CHECK-PRESERVE-CFG
opt: /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/IR/Instructions.cpp:3041: static CastInst *llvm::CastInst::Create(Instruction::CastOps, Value *, Type *, const Twine &, InsertPosition): Assertion `castIsValid(op, S, Ty) && "Invalid cast!"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/opt -passes=sroa<preserve-cfg> -S
1.	Running pass "function(sroa<preserve-cfg>)" on module "<stdin>"
2.	Running pass "sroa<preserve-cfg>" on function "vector_ptrtoint"
 #0 0x000058dd565d5cc8 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Support/Unix/Signals.inc:834:13
 #1 0x000058dd565d32d5 llvm::sys::RunSignalHandlers() /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Support/Signals.cpp:105:18
 #2 0x000058dd565d6ad1 SignalHandler(int, siginfo_t*, void*) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Support/Unix/Signals.inc:426:38
 #3 0x00007bd547646330 (/lib/x86_64-linux-gnu/libc.so.6+0x45330)
 #4 0x00007bd54769fb2c pthread_kill (/lib/x86_64-linux-gnu/libc.so.6+0x9eb2c)
 #5 0x00007bd54764627e raise (/lib/x86_64-linux-gnu/libc.so.6+0x4527e)
 #6 0x00007bd5476298ff abort (/lib/x86_64-linux-gnu/libc.so.6+0x288ff)
 #7 0x00007bd54762981b (/lib/x86_64-linux-gnu/libc.so.6+0x2881b)
 #8 0x00007bd54763c517 (/lib/x86_64-linux-gnu/libc.so.6+0x3b517)
 #9 0x000058dd56714883 llvm::CastInst::Create(llvm::Instruction::CastOps, llvm::Value*, llvm::Type*, llvm::Twine const&, llvm::InsertPosition) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/IR/Instructions.cpp:3061:5
#10 0x000058dd5680a1f5 doit /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/Support/Casting.h:109:5
#11 0x000058dd5680a1f5 doit /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/Support/Casting.h:137:12
#12 0x000058dd5680a1f5 doit /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/Support/Casting.h:127:12
#13 0x000058dd5680a1f5 isPossible /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/Support/Casting.h:255:12
#14 0x000058dd5680a1f5 isPossible /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/Support/Casting.h:509:12
#15 0x000058dd5680a1f5 isa<llvm::FPMathOperator, llvm::Instruction *> /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/Support/Casting.h:549:10
#16 0x000058dd5680a1f5 llvm::IRBuilderBase::CreateCast(llvm::Instruction::CastOps, llvm::Value*, llvm::Type*, llvm::Twine const&, llvm::MDNode*, llvm::FMFSource) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/IR/IRBuilder.h:2246:9
#17 0x000058dd570dca97 mergeTwoVectors(llvm::Value*, llvm::Value*, llvm::DataLayout const&, llvm::Type*, llvm::IRBuilder<llvm::ConstantFolder, llvm::IRBuilderDefaultInserter>&)::$_0::operator()(llvm::Value*&, llvm::FixedVectorType*&, char const*) const /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Transforms/Scalar/SROA.cpp:2719:9
#18 0x000058dd570d62a3 mergeTwoVectors /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Transforms/Scalar/SROA.cpp:2726:3
#19 0x000058dd570d62a3 rewriteTreeStructuredMerge /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Transforms/Scalar/SROA.cpp:3084:25
#20 0x000058dd570d62a3 (anonymous namespace)::SROA::rewritePartition(llvm::AllocaInst&, (anonymous namespace)::AllocaSlices&, (anonymous namespace)::Partition&) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Transforms/Scalar/SROA.cpp:5280:16
#21 0x000058dd570c614b (anonymous namespace)::SROA::splitAlloca(llvm::AllocaInst&, (anonymous namespace)::AllocaSlices&) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Transforms/Scalar/SROA.cpp:5604:21
#22 0x000058dd570c291d (anonymous namespace)::SROA::runOnAlloca(llvm::AllocaInst&) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Transforms/Scalar/SROA.cpp:0:14
#23 0x000058dd570bd61d (anonymous namespace)::SROA::runSROA(llvm::Function&) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Transforms/Scalar/SROA.cpp:0:11
#24 0x000058dd570bd163 llvm::SROAPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/Transforms/Scalar/SROA.cpp:6048:53
#25 0x000058dd57a80c1d llvm::detail::PassModel<llvm::Function, llvm::SROAPass, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:91:5
#26 0x000058dd5681333a llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/IR/PassManagerImpl.h:80:8
#27 0x000058dd57a83f4d llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:91:5
#28 0x000058dd56817c37 llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/lib/IR/PassManager.cpp:132:23
#29 0x000058dd57a15e0d llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:91:5
#30 0x000058dd5681209a llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/IR/PassManagerImpl.h:80:8
#31 0x000058dd57a0ee54 ~SmallPtrSetImplBase /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/ADT/SmallPtrSet.h:89:9
#32 0x000058dd57a0ee54 ~PreservedAnalyses /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/include/llvm/IR/Analysis.h:112:7
#33 0x000058dd57a0ee54 llvm::runPassPipeline(llvm::StringRef, llvm::Module&, llvm::TargetMachine*, llvm::TargetLibraryInfoImpl*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::StringRef, llvm::ArrayRef<llvm::PassPlugin>, llvm::ArrayRef<std::function<void (llvm::PassBuilder&)>>, llvm::opt_tool::OutputKind, llvm::opt_tool::VerifierKind, bool, bool, bool, bool, bool, bool, bool, bool) /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/tools/opt/NewPMDriver.cpp:561:3
#34 0x000058dd565ae206 optMain /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/tools/opt/optdriver.cpp:753:12
#35 0x00007bd54762b1ca (/lib/x86_64-linux-gnu/libc.so.6+0x2a1ca)
#36 0x00007bd54762b28b __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28b)
#37 0x000058dd565a73e5 _start (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/opt+0x49193e5)
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/FileCheck /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/Transforms/SROA/vector-conversion.ll --check-prefixes=CHECK,CHECK-PRESERVE-CFG

… chengjunp/SROA_tree_merge_of_stores

…com/Chengjunp/llvm-project into chengjunp/SROA_tree_merge_of_stores

Chengjunp · 2025-08-19T04:32:07Z

Ping @davemgreen @RKSimon @nikic

Chengjunp · 2025-08-21T21:54:09Z

Ping @davemgreen @RKSimon @nikic

Prince781

Looks good, with some nits.

Prince781 · 2025-08-22T00:33:31Z

llvm/lib/Transforms/Scalar/SROA.cpp

+  assert(V0->getType()->isVectorTy() && V1->getType()->isVectorTy() &&
+         "Can not merge two non-vector values");
+
+  // V0 and V1 are vectors
+  // Create a new vector type with combined elements
+  // Use ShuffleVector to concatenate the vectors
+  auto *VecType0 = cast<FixedVectorType>(V0->getType());
+  auto *VecType1 = cast<FixedVectorType>(V1->getType());


You could have:

Suggested change

assert(V0->getType()->isVectorTy() && V1->getType()->isVectorTy() &&

"Can not merge two non-vector values");

// V0 and V1 are vectors

// Create a new vector type with combined elements

// Use ShuffleVector to concatenate the vectors

auto *VecType0 = cast<FixedVectorType>(V0->getType());

auto *VecType1 = cast<FixedVectorType>(V1->getType());

// V0 and V1 are vectors

// Create a new vector type with combined elements

// Use ShuffleVector to concatenate the vectors

auto *VecType0 = dyn_cast<FixedVectorType>(V0->getType());

auto *VecType1 = dyn_cast<FixedVectorType>(V1->getType());

assert(VecType0 && VecType1 && "Cannot merge two non-vector values");

Or drop the assertion if this is always called with FixedVectorTypes.

You can just drop the assert, as cast<> is an assert by itself.

Prince781 · 2025-08-22T00:39:05Z

llvm/lib/Transforms/Scalar/SROA.cpp

+    Value *SmallVec = IsV0Smaller ? V0 : V1;
+
+    SmallVector<int, 16> ExtendMask;
+    for (unsigned i = 0; i < SmallSize; ++i)
+      ExtendMask.push_back(i);
+    for (unsigned i = SmallSize; i < LargeSize; ++i)
+      ExtendMask.push_back(PoisonMaskElem);
+    Value *ExtendedVec = Builder.CreateShuffleVector(
+        SmallVec, PoisonValue::get(SmallVec->getType()), ExtendMask);
+    LLVM_DEBUG(dbgs() << "    shufflevector: " << *ExtendedVec << "\n");
+    V0 = IsV0Smaller ? ExtendedVec : V0;
+    V1 = IsV0Smaller ? V1 : ExtendedVec;


Suggested change

Value *SmallVec = IsV0Smaller ? V0 : V1;

SmallVector<int, 16> ExtendMask;

for (unsigned i = 0; i < SmallSize; ++i)

ExtendMask.push_back(i);

for (unsigned i = SmallSize; i < LargeSize; ++i)

ExtendMask.push_back(PoisonMaskElem);

Value *ExtendedVec = Builder.CreateShuffleVector(

SmallVec, PoisonValue::get(SmallVec->getType()), ExtendMask);

LLVM_DEBUG(dbgs() << " shufflevector: " << *ExtendedVec << "\n");

V0 = IsV0Smaller ? ExtendedVec : V0;

V1 = IsV0Smaller ? V1 : ExtendedVec;

Value *&ExtendedVec = IsV0Smaller ? V0 : V1;

SmallVector<int, 16> ExtendMask;

for (unsigned i = 0; i < SmallSize; ++i)

ExtendMask.push_back(i);

for (unsigned i = SmallSize; i < LargeSize; ++i)

ExtendMask.push_back(PoisonMaskElem);

ExtendedVec = Builder.CreateShuffleVector(

ExtendedVec, PoisonValue::get(SmallVec->getType()), ExtendMask);

LLVM_DEBUG(dbgs() << " shufflevector: " << *ExtendedVec << "\n");

Prince781 · 2025-08-22T00:57:53Z

llvm/lib/Transforms/Scalar/SROA.cpp

+    // The load should not be in the middle of the stores
+    BasicBlock *LoadBB = TheLoad->getParent();
+    BasicBlock *StoreBB = StoreInfos[0].Store->getParent();
+


Maybe add a comment explaining these checks? It looks like you're checking that the stores dominate the load, but you're only replacing stores and leaving the job of store->load forwarding to another part of SROA, however if they're in the same block then it's possible the load may come between the stores, in which case you can't replace them.

Prince781 · 2025-08-22T01:00:21Z

llvm/lib/Transforms/Scalar/SROA.cpp

+    LLVM_DEBUG(dbgs() << "  Rewrite stores into shufflevectors:\n");
+    while (VecElements.size() > 1) {
+      uint64_t NumElts = VecElements.size();
+      for (uint64_t i = 0; i < NumElts / 2; i++) {


nit: i is unused

Suggested change

for (uint64_t i = 0; i < NumElts / 2; i++) {

for (const auto _ : llvm::seq(NumElts/2)) {

Prince781 · 2025-08-22T01:00:40Z

llvm/lib/Transforms/Scalar/SROA.cpp

+
+    LLVM_DEBUG(dbgs() << "  Rewrite stores into shufflevectors:\n");
+    while (VecElements.size() > 1) {
+      uint64_t NumElts = VecElements.size();


nit: add const?

Suggested change

uint64_t NumElts = VecElements.size();

const auto NumElts = VecElements.size();

Prince781 · 2025-08-22T01:18:07Z

llvm/lib/Transforms/Scalar/SROA.cpp

+  std::optional<SmallVector<Value *, 4>> DeletedValues =
+      Rewriter.rewriteTreeStructuredMerge(P);
+  if (DeletedValues) {


nit?:

Suggested change

std::optional<SmallVector<Value *, 4>> DeletedValues =

Rewriter.rewriteTreeStructuredMerge(P);

if (DeletedValues) {

if (auto DeletedValues = Rewriter.rewriteTreeStructuredMerge(P)) {

nikic

Some initial notes.

nikic · 2025-08-22T20:34:23Z

llvm/test/Transforms/SROA/vector-promotion-via-tree-structure-merge.ll

+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -passes='sroa<preserve-cfg>' -S | FileCheck %s --check-prefixes=CHECK,CHECK-PRESERVE-CFG
+; RUN: opt < %s -passes='sroa<modify-cfg>' -S | FileCheck %s --check-prefixes=CHECK,CHECK-MODIFY-CFG
+; RUN: opt < %s -passes=debugify,sroa -S | FileCheck %s --check-prefix=DEBUG


If you want to test debug info, use a separate test. debugify with generated check lines is too noisy.

nikic · 2025-08-22T20:39:22Z

llvm/lib/Transforms/Scalar/SROA.cpp

+
+    // The alloca must be a fixed vector type
+    Type *AllocatedEltTy = nullptr;
+    if (auto *FixedVecTy = dyn_cast<FixedVectorType>(NewAI.getAllocatedType()))


We should not have heuristics based on the alloca type. This should be based on the vector types of loads/stores, not of the alloca.

nikic · 2025-08-22T20:39:57Z

llvm/lib/Transforms/Scalar/SROA.cpp

+      return std::nullopt;
+    // If the allocated element type is a pointer, we do not handle it
+    // TODO: handle this case by using inttoptr/ptrtoint
+    if (AllocatedEltTy->isPtrOrPtrVectorTy())


Suggested change

if (AllocatedEltTy->isPtrOrPtrVectorTy())

if (AllocatedEltTy->isPointerTy())

It's already scalar. Same for checks below.

nikic · 2025-08-22T20:47:43Z

llvm/lib/Transforms/Scalar/SROA.cpp

+    for (auto &StoreInfo : StoreInfos) {
+      if (StoreInfo.Store->getParent() != StoreBB)
+        return std::nullopt;
+      if (LoadBB == StoreBB && !StoreInfo.Store->comesBefore(TheLoad))


Is there test coverage for the load being in a different BB?

I have added a test for that. In that case, we can still do the tree structure merge, and we will introduce phi nodes when we promote the allocas.

nikic · 2025-08-22T20:48:57Z

llvm/lib/Transforms/Scalar/SROA.cpp

+        if (DL.getTypeSizeInBits(StoredValueType) %
+                DL.getTypeSizeInBits(AllocatedEltTy) !=
+            0)
+          return std::nullopt;


Please add test coverage for the element type being non-byte-sizes. Something like <2 x i5>. I don't think you handle it correctly.

Thank you for pointing this out. Now I add a check to only handle byte aligned types.

nikic · 2025-08-22T20:57:09Z

llvm/lib/Transforms/Scalar/SROA.cpp

+    IRBuilder<> LoadBuilder(TheLoad);
+    TheLoad->replaceAllUsesWith(LoadBuilder.CreateAlignedLoad(
+        TheLoad->getType(), &NewAI, getSliceAlign(), TheLoad->isVolatile(),
+        TheLoad->getName() + ".sroa.new.load"));


Why is it necessary to rewrite the load? Doesn't it already have the correct form?

Because TheLoad is a load from the old alloca instead of the new one. NewAI may be only a partition of the original alloca.
For example, considering the following case

entry: %alloca = alloca <8 x float> %ptr0 = getelementptr inbounds <8 x float>, ptr %alloca, i32 0, i32 0 store <2 x float> %a, ptr %ptr0 %ptr1 = getelementptr inbounds <8 x float>, ptr %alloca, i32 0, i32 2 store <2 x float> %b, ptr %ptr1 %ptr2 = getelementptr inbounds <8 x float>, ptr %alloca, i32 0, i32 4 store <2 x float> %c, ptr %ptr2 %ptr3 = getelementptr inbounds <8 x float>, ptr %alloca, i32 0, i32 6 store <2 x float> %d, ptr %ptr3 %result1 = load <4 x float>, ptr %alloca %ptr_offset4 = getelementptr inbounds <8 x float>, ptr %alloca, i32 0, i32 4 %result2 = load <4 x float>, ptr %ptr_offset4 store <4 x float> %result1, ptr %e store <4 x float> %result2, ptr %f ret void

In SROA, %alloca will be divided into two partitions with offsets [0,16) and [16,32). Here %result1 will only cover the first partition, which will be the new alloca. So we need to create a new load from this new alloca and replace the original one. And the new load will be eventually removed when SROA promote the alloca later.

nikic · 2025-08-22T20:57:51Z

llvm/test/Transforms/SROA/vector-promotion-cannot-tree-structure-merge.ll

Can you please also add negative tests for volatile store?

I have added tests for that. And in this case, we will not touch the stores.

nikic · 2025-08-22T20:59:58Z

llvm/lib/Transforms/Scalar/SROA.cpp

+  assert(V0->getType()->isVectorTy() && V1->getType()->isVectorTy() &&
+         "Can not merge two non-vector values");
+
+  // V0 and V1 are vectors
+  // Create a new vector type with combined elements
+  // Use ShuffleVector to concatenate the vectors
+  auto *VecType0 = cast<FixedVectorType>(V0->getType());
+  auto *VecType1 = cast<FixedVectorType>(V1->getType());


You can just drop the assert, as cast<> is an assert by itself.

llvm/lib/Transforms/Scalar/SROA.cpp

Chengjunp · 2025-08-29T21:21:20Z

Ping @nikic @davemgreen @RKSimon

Initial impl of tree structure merge in SROA

a30ca09

Chengjunp requested review from nikic and dtcxzyw August 8, 2025 20:51

Chengjunp self-assigned this Aug 8, 2025

Chengjunp added the llvm:transforms label Aug 8, 2025

Merge branch 'main' into chengjunp/SROA_tree_merge_of_stores

ae5f6eb

chengjunp added 2 commits August 8, 2025 22:03

Not do tree merge when only having one store

46039ce

Merge branch 'chengjunp/SROA_tree_merge_of_stores' of https://github.…

be0aa77

…com/Chengjunp/llvm-project into chengjunp/SROA_tree_merge_of_stores

Prince781 self-requested a review August 8, 2025 23:11

chengjunp added 2 commits August 15, 2025 19:28

Handle the cases where ld/st has different elt types

68eea1e

Merge branch 'main' of https://github.com/Chengjunp/llvm-project into…

8f69c55

… chengjunp/SROA_tree_merge_of_stores

dtcxzyw requested a review from davemgreen August 16, 2025 12:04

dtcxzyw requested a review from RKSimon August 16, 2025 12:16

chengjunp and others added 5 commits August 19, 2025 04:20

Do not handle ptr cases

1d1e3d3

format

a884946

Merge branch 'main' of https://github.com/Chengjunp/llvm-project into…

b6f5364

… chengjunp/SROA_tree_merge_of_stores

Merge branch 'llvm:main' into chengjunp/SROA_tree_merge_of_stores

fe64ae8

Merge branch 'chengjunp/SROA_tree_merge_of_stores' of https://github.…

659dfd7

…com/Chengjunp/llvm-project into chengjunp/SROA_tree_merge_of_stores

Prince781 approved these changes Aug 22, 2025

View reviewed changes

nikic reviewed Aug 22, 2025

View reviewed changes

chengjunp added 2 commits August 27, 2025 20:07

Fix bugs and update tests

3146a3b

Fix nits

a3c0c06

Chengjunp requested a review from nikic August 27, 2025 20:28

Format

8afd484

	for (uint64_t i = 0; i < NumElts / 2; i++) {
	for (const auto _ : llvm::seq(NumElts/2)) {

	uint64_t NumElts = VecElements.size();
	const auto NumElts = VecElements.size();

	if (AllocatedEltTy->isPtrOrPtrVectorTy())
	if (AllocatedEltTy->isPointerTy())

[SROA] Use tree-structure merge to remove alloca #152793

Are you sure you want to change the base?

[SROA] Use tree-structure merge to remove alloca #152793

Conversation

Chengjunp commented Aug 8, 2025

Uh oh!

llvmbot commented Aug 8, 2025

Uh oh!

github-actions bot commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Chengjunp commented Aug 8, 2025

Uh oh!

Chengjunp commented Aug 13, 2025

Uh oh!

github-actions bot commented Aug 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dtcxzyw commented Aug 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Chengjunp commented Aug 19, 2025

Uh oh!

Chengjunp commented Aug 21, 2025

Uh oh!

Prince781 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nikic left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Chengjunp commented Aug 29, 2025

Uh oh!

Uh oh!

github-actions bot commented Aug 8, 2025 •

edited

Loading

github-actions bot commented Aug 15, 2025 •

edited

Loading

dtcxzyw commented Aug 16, 2025 •

edited

Loading