[AMDGPU] Identify vector idiom to unlock SROA #156791
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Yaxun (Sam) Liu (yxsamliu)

Changes

HIP vector types often lower to aggregates and get copied with memcpy. When the source or destination is chosen via a pointer select, SROA cannot split the aggregate. This keeps data in stack slots and increases scratch traffic. By rewriting these memcpy idioms, we enable SROA to promote values, reducing stack usage and improving occupancy and bandwidth on AMD GPUs. For example:

  %p = select i1 %cond, ptr %A, ptr %B
  call void @llvm.memcpy.p0.p0.i32(ptr %dst, ptr %p, i32 16, i1 false)

When the source is a pointer select and conditions allow, the pass replaces the memcpy with two aligned loads, a value-level select of the loaded values, and one aligned store. If it is not safe to speculate both loads, it splits control flow and emits a memcpy in each arm. When the destination is a select, it always splits control flow to avoid speculative stores. Vector element types are chosen based on size and minimum proven alignment to minimize the number of operations.

The pass handles non-volatile, constant-length memcpy up to a small size cap. Source and destination must be in the same address space. It runs early, after inlining and before InferAddressSpaces and SROA. Volatile and cross-address-space memcpys are skipped.

The size cap is controlled by -amdgpu-vector-idiom-max-bytes (default 32), allowing tuning for different workloads.

Fixes: SWDEV-550134

Patch is 21.86 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/156791.diff

6 Files Affected:
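As an illustration of where this idiom comes from, a hypothetical HIP snippet (not part of the patch; kernel and variable names are invented) whose aggregate copy is expected to lower to the pattern above after inlining and early CFG simplification:

#include <hip/hip_runtime.h>

// Hypothetical kernel: the ternary selects between two local float4 values.
// After inlining and early simplification, the aggregate copy typically
// becomes a memcpy whose source is a select of the two stack slots, roughly:
//   %p = select i1 %cond, ptr %a.addr, ptr %b.addr
//   call void @llvm.memcpy.p0.p0.i64(ptr %v.addr, ptr %p, i64 16, i1 false)
__global__ void pick(float4 *out, float s, int cond) {
  float4 a = make_float4(s, s, s, s);
  float4 b = make_float4(-s, -s, -s, -s);
  float4 v = cond ? a : b; // aggregate copy through a selected address
  out[0] = v;
}
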
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 48448833721bf..7fb808b64a1c1 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -67,6 +67,7 @@ FUNCTION_PASS("amdgpu-simplifylib", AMDGPUSimplifyLibCallsPass())
FUNCTION_PASS("amdgpu-unify-divergent-exit-nodes",
AMDGPUUnifyDivergentExitNodesPass())
FUNCTION_PASS("amdgpu-usenative", AMDGPUUseNativeCallsPass())
+FUNCTION_PASS("amdgpu-vector-idiom", AMDGPUVectorIdiomCombinePass())
FUNCTION_PASS("si-annotate-control-flow", SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
#undef FUNCTION_PASS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 4a2f0a13b1325..2b3168a805cf0 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -33,6 +33,7 @@
#include "AMDGPUTargetObjectFile.h"
#include "AMDGPUTargetTransformInfo.h"
#include "AMDGPUUnifyDivergentExitNodes.h"
+#include "AMDGPUVectorIdiom.h"
#include "AMDGPUWaitSGPRHazards.h"
#include "GCNDPPCombine.h"
#include "GCNIterativeScheduler.h"
@@ -922,6 +923,10 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
EnablePromoteKernelArguments)
FPM.addPass(AMDGPUPromoteKernelArgumentsPass());
+ // Run vector-idiom canonicalization early (after inlining) and before
+ // infer-AS / SROA to maximize scalarization opportunities.
+ FPM.addPass(AMDGPUVectorIdiomCombinePass());
+
// Add infer address spaces pass to the opt pipeline after inlining
// but before SROA to increase SROA opportunities.
FPM.addPass(InferAddressSpacesPass());
@@ -973,6 +978,8 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
// We only want to run this with O2 or higher since inliner and SROA
// don't run in O1.
if (Level != OptimizationLevel::O1) {
+ PM.addPass(
+ createModuleToFunctionPassAdaptor(AMDGPUVectorIdiomCombinePass()));
PM.addPass(
createModuleToFunctionPassAdaptor(InferAddressSpacesPass()));
}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUVectorIdiom.cpp b/llvm/lib/Target/AMDGPU/AMDGPUVectorIdiom.cpp
new file mode 100644
index 0000000000000..05f47c648493e
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUVectorIdiom.cpp
@@ -0,0 +1,317 @@
+//===- AMDGPUVectorIdiom.cpp ------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+// AMDGPU-specific vector idiom canonicalizations to unblock SROA and
+// subsequent scalarization/vectorization.
+//
+// Motivation:
+// - HIP vector types are often modeled as structs and copied with memcpy.
+// Address-level selects on such copies block SROA. Converting to value-level
+// operations or splitting the CFG enables SROA to break aggregates, which
+// unlocks scalarization/vectorization on AMDGPU.
+//
+// Example pattern:
+// %src = select i1 %c, ptr %A, ptr %B
+// call void @llvm.memcpy(ptr %dst, ptr %src, i32 16, i1 false)
+//
+// Objectives:
+// - Canonicalize small memcpy patterns where source or destination is a select
+// of pointers.
+// - Prefer value-level selects (on loaded values) over address-level selects
+// when safe.
+// - When speculation is unsafe, split the CFG to isolate each arm.
+//
+// Assumptions:
+// - Only handles non-volatile memcpy with constant length N where 0 < N <=
+// MaxBytes (default 32).
+// - Source and destination must be in the same address space.
+// - Speculative loads are allowed only if a conservative alignment check
+// passes.
+// - No speculative stores are introduced.
+//
+// Transformations:
+// - Source-select memcpy: attempt speculative loads -> value select -> single
+// store.
+// Fallback is CFG split with two memcpy calls.
+// - Destination-select memcpy: always CFG split to avoid speculative stores.
+//
+// Run this pass early, before SROA.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPUVectorIdiom.h"
+#include "AMDGPU.h"
+
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
+#include "llvm/Analysis/AssumptionCache.h"
+#include "llvm/Analysis/TargetLibraryInfo.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/InstIterator.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/IR/Type.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
+#include "llvm/Transforms/Utils/Local.h"
+
+using namespace llvm;
+using namespace PatternMatch;
+
+#define DEBUG_TYPE "amdgpu-vector-idiom"
+
+namespace {
+
+// Default to 32 bytes since the largest HIP vector types are double4 or long4.
+static cl::opt<unsigned> AMDGPUVectorIdiomMaxBytes(
+ "amdgpu-vector-idiom-max-bytes",
+ cl::desc("Max memcpy size (in bytes) to transform in AMDGPUVectorIdiom "
+ "(default 32)"),
+ cl::init(32));
+
+// Selects an integer or integer-vector element type matching NBytes, using the
+// minimum proven alignment to decide the widest safe element width.
+// Assumptions:
+// - Pointee types are opaque; the element choice is based solely on size and
+// alignment.
+// - Falls back to <N x i8> if wider lanes are not safe/aligned.
+static Type *getIntOrVecTypeForSize(uint64_t NBytes, LLVMContext &Ctx,
+ Align MinProvenAlign = Align(1)) {
+ auto CanUseI64 = [&]() { return MinProvenAlign >= Align(8); };
+ auto CanUseI32 = [&]() { return MinProvenAlign >= Align(4); };
+ auto CanUseI16 = [&]() { return MinProvenAlign >= Align(2); };
+
+ if (NBytes == 32 && CanUseI64())
+ return FixedVectorType::get(Type::getInt64Ty(Ctx), 4);
+
+ if ((NBytes % 4) == 0 && CanUseI32())
+ return FixedVectorType::get(Type::getInt32Ty(Ctx), NBytes / 4);
+
+ if ((NBytes % 2) == 0 && CanUseI16())
+ return FixedVectorType::get(Type::getInt16Ty(Ctx), NBytes / 2);
+
+ return FixedVectorType::get(Type::getInt8Ty(Ctx), NBytes);
+}
+
+static Align minAlign(Align A, Align B) { return A < B ? A : B; }
+
+// Checks if both pointer operands can be speculatively loaded for N bytes and
+// computes the minimum alignment to use.
+// Notes:
+// - Intentionally conservative: relies on getOrEnforceKnownAlignment.
+// - AA/TLI are not used for deeper reasoning here.
+static bool bothArmsSafeToSpeculateLoads(Value *A, Value *B, Align &OutAlign,
+ const DataLayout &DL,
+ AssumptionCache *AC,
+ const DominatorTree *DT) {
+ Align AlignA =
+ llvm::getOrEnforceKnownAlignment(A, Align(1), DL, nullptr, AC, DT);
+ Align AlignB =
+ llvm::getOrEnforceKnownAlignment(B, Align(1), DL, nullptr, AC, DT);
+
+ if (AlignA.value() < 1 || AlignB.value() < 1)
+ return false;
+
+ OutAlign = minAlign(AlignA, AlignB);
+ return true;
+}
+
+// With opaque pointers, ensure address spaces match and otherwise return Ptr.
+// Assumes the address space is the only property to validate for this cast.
+static Value *castPtrTo(Value *Ptr, unsigned ExpectedAS) {
+ auto *FromPTy = cast<PointerType>(Ptr->getType());
+ unsigned AS = FromPTy->getAddressSpace();
+ (void)ExpectedAS;
+ assert(AS == ExpectedAS && "Address space mismatch for castPtrTo");
+ return Ptr;
+}
+
+struct AMDGPUVectorIdiomImpl {
+ unsigned MaxBytes;
+
+ AMDGPUVectorIdiomImpl(unsigned MaxBytes) : MaxBytes(MaxBytes) {}
+
+ // Rewrites memcpy when the source is a select of pointers. Prefers a
+ // value-level select (two loads + select + one store) if speculative loads
+ // are safe. Otherwise, falls back to a guarded CFG split with two memcpy
+ // calls. Assumptions:
+ // - Non-volatile, constant length, within MaxBytes.
+ // - Source and destination in the same address space.
+ bool transformSelectMemcpySource(MemTransferInst &MT, SelectInst &Sel,
+ const DataLayout &DL,
+ const DominatorTree *DT,
+ AssumptionCache *AC) {
+ IRBuilder<> B(&MT);
+ Value *Dst = MT.getRawDest();
+ Value *A = Sel.getTrueValue();
+ Value *Bv = Sel.getFalseValue();
+ if (!A->getType()->isPointerTy() || !Bv->getType()->isPointerTy())
+ return false;
+
+ ConstantInt *LenCI = cast<ConstantInt>(MT.getLength());
+ uint64_t N = LenCI->getLimitedValue();
+
+ Align DstAlign = MaybeAlign(MT.getDestAlign()).valueOrOne();
+ Align AlignAB;
+ bool CanSpeculate =
+ bothArmsSafeToSpeculateLoads(A, Bv, AlignAB, DL, AC, DT);
+
+ unsigned AS = cast<PointerType>(A->getType())->getAddressSpace();
+ assert(AS == cast<PointerType>(Bv->getType())->getAddressSpace() &&
+ "Expected same AS");
+
+ if (CanSpeculate) {
+ Align MinAlign = std::min(AlignAB, DstAlign);
+ Type *Ty = getIntOrVecTypeForSize(N, B.getContext(), MinAlign);
+
+ Value *PA = castPtrTo(A, AS);
+ Value *PB = castPtrTo(Bv, AS);
+ LoadInst *LA = B.CreateAlignedLoad(Ty, PA, MinAlign);
+ LoadInst *LB = B.CreateAlignedLoad(Ty, PB, MinAlign);
+ Value *V = B.CreateSelect(Sel.getCondition(), LA, LB);
+
+ Value *PDst =
+ castPtrTo(Dst, cast<PointerType>(Dst->getType())->getAddressSpace());
+ (void)B.CreateAlignedStore(V, PDst, DstAlign);
+
+ LLVM_DEBUG(dbgs() << "[AMDGPUVectorIdiom] Rewrote memcpy(select-src) to "
+ "value-select loads/stores: "
+ << MT << "\n");
+ MT.eraseFromParent();
+ return true;
+ }
+
+ splitCFGForMemcpy(MT, Sel.getCondition(), A, Bv, true);
+ LLVM_DEBUG(
+ dbgs()
+ << "[AMDGPUVectorIdiom] Rewrote memcpy(select-src) by CFG split\n");
+ return true;
+ }
+
+ // Rewrites memcpy when the destination is a select of pointers. To avoid
+ // speculative stores, always splits the CFG and emits a memcpy per branch.
+ // Assumptions mirror the source case.
+ bool transformSelectMemcpyDest(MemTransferInst &MT, SelectInst &Sel) {
+ Value *DA = Sel.getTrueValue();
+ Value *DB = Sel.getFalseValue();
+ if (!DA->getType()->isPointerTy() || !DB->getType()->isPointerTy())
+ return false;
+
+ splitCFGForMemcpy(MT, Sel.getCondition(), DA, DB, false);
+ LLVM_DEBUG(
+ dbgs()
+ << "[AMDGPUVectorIdiom] Rewrote memcpy(select-dst) by CFG split\n");
+ return true;
+ }
+
+ // Splits the CFG around a memcpy whose source or destination depends on a
+ // condition. Clones memcpy in then/else using TruePtr/FalsePtr and rejoins.
+ // Assumptions:
+ // - MT has constant length and is non-volatile.
+ // - TruePtr/FalsePtr are correct replacements for the selected operand.
+ void splitCFGForMemcpy(MemTransferInst &MT, Value *Cond, Value *TruePtr,
+ Value *FalsePtr, bool IsSource) {
+ Function *F = MT.getFunction();
+ BasicBlock *Cur = MT.getParent();
+ BasicBlock *ThenBB = BasicBlock::Create(F->getContext(), "memcpy.then", F);
+ BasicBlock *ElseBB = BasicBlock::Create(F->getContext(), "memcpy.else", F);
+ BasicBlock *JoinBB =
+ Cur->splitBasicBlock(BasicBlock::iterator(&MT), "memcpy.join");
+
+ Cur->getTerminator()->eraseFromParent();
+ IRBuilder<> B(Cur);
+ B.CreateCondBr(Cond, ThenBB, ElseBB);
+
+ ConstantInt *LenCI = cast<ConstantInt>(MT.getLength());
+
+ IRBuilder<> BT(ThenBB);
+ if (IsSource) {
+ (void)BT.CreateMemCpy(MT.getRawDest(), MT.getDestAlign(), TruePtr,
+ MT.getSourceAlign(), LenCI, MT.isVolatile());
+ } else {
+ (void)BT.CreateMemCpy(TruePtr, MT.getDestAlign(), MT.getRawSource(),
+ MT.getSourceAlign(), LenCI, MT.isVolatile());
+ }
+ BT.CreateBr(JoinBB);
+
+ IRBuilder<> BE(ElseBB);
+ if (IsSource) {
+ (void)BE.CreateMemCpy(MT.getRawDest(), MT.getDestAlign(), FalsePtr,
+ MT.getSourceAlign(), LenCI, MT.isVolatile());
+ } else {
+ (void)BE.CreateMemCpy(FalsePtr, MT.getDestAlign(), MT.getRawSource(),
+ MT.getSourceAlign(), LenCI, MT.isVolatile());
+ }
+ BE.CreateBr(JoinBB);
+
+ MT.eraseFromParent();
+ }
+};
+
+} // end anonymous namespace
+
+AMDGPUVectorIdiomCombinePass::AMDGPUVectorIdiomCombinePass()
+ : MaxBytes(AMDGPUVectorIdiomMaxBytes) {}
+
+// Pass driver that locates small, constant-size, non-volatile memcpy calls
+// where source or destination is a select in the same address space. Applies
+// the source/destination transforms described above. Intended to run early to
+// maximize SROA and subsequent optimizations.
+PreservedAnalyses
+AMDGPUVectorIdiomCombinePass::run(Function &F, FunctionAnalysisManager &FAM) {
+ const DataLayout &DL = F.getParent()->getDataLayout();
+ auto &DT = FAM.getResult<DominatorTreeAnalysis>(F);
+ auto &AC = FAM.getResult<AssumptionAnalysis>(F);
+
+ SmallVector<CallInst *, 8> Worklist;
+ for (Instruction &I : instructions(F)) {
+ if (auto *CI = dyn_cast<CallInst>(&I)) {
+ if (isa<MemTransferInst>(CI))
+ Worklist.push_back(CI);
+ }
+ }
+
+ bool Changed = false;
+ AMDGPUVectorIdiomImpl Impl(MaxBytes);
+
+ for (CallInst *CI : Worklist) {
+ auto *MT = cast<MemTransferInst>(CI);
+ if (MT->isVolatile())
+ continue;
+
+ ConstantInt *LenCI = dyn_cast<ConstantInt>(MT->getLength());
+ if (!LenCI)
+ continue;
+
+ uint64_t N = LenCI->getLimitedValue();
+ if (N == 0 || N > MaxBytes)
+ continue;
+
+ Value *Dst = MT->getRawDest();
+ Value *Src = MT->getRawSource();
+
+ unsigned DstAS = cast<PointerType>(Dst->getType())->getAddressSpace();
+ unsigned SrcAS = cast<PointerType>(Src->getType())->getAddressSpace();
+ if (DstAS != SrcAS)
+ continue;
+
+ if (auto *Sel = dyn_cast<SelectInst>(Src)) {
+ Changed |= Impl.transformSelectMemcpySource(*MT, *Sel, DL, &DT, &AC);
+ continue;
+ }
+ if (auto *Sel = dyn_cast<SelectInst>(Dst)) {
+ Changed |= Impl.transformSelectMemcpyDest(*MT, *Sel);
+ continue;
+ }
+ }
+
+ return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUVectorIdiom.h b/llvm/lib/Target/AMDGPU/AMDGPUVectorIdiom.h
new file mode 100644
index 0000000000000..6aebafe3e4e93
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUVectorIdiom.h
@@ -0,0 +1,41 @@
+//===- AMDGPUVectorIdiom.h --------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// AMDGPU-specific vector idiom canonicalizations to unblock SROA and
+// subsequent scalarization/vectorization.
+//
+// This pass rewrites memcpy with select-fed operands into either:
+// - a value-level select (two loads + select + store), when safe to
+// speculatively load both arms, or
+// - a conservative CFG split around the condition to isolate each arm.
+//
+// Run this pass early, before SROA.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPUVECTORIDIOM_H
+#define LLVM_LIB_TARGET_AMDGPU_AMDGPUVECTORIDIOM_H
+
+#include "AMDGPU.h"
+#include "llvm/IR/PassManager.h"
+
+namespace llvm {
+
+class AMDGPUVectorIdiomCombinePass
+ : public PassInfoMixin<AMDGPUVectorIdiomCombinePass> {
+ unsigned MaxBytes;
+
+public:
+ AMDGPUVectorIdiomCombinePass();
+
+ PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM);
+};
+
+} // end namespace llvm
+
+#endif // LLVM_LIB_TARGET_AMDGPU_AMDGPUVECTORIDIOM_H
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 05295ae73be23..abf52f438fbe0 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -116,6 +116,7 @@ add_llvm_target(AMDGPUCodeGen
AMDGPUTargetTransformInfo.cpp
AMDGPUWaitSGPRHazards.cpp
AMDGPUUnifyDivergentExitNodes.cpp
+ AMDGPUVectorIdiom.cpp
R600MachineCFGStructurizer.cpp
GCNCreateVOPD.cpp
GCNDPPCombine.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-vector-idiom-memcpy-select.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-vector-idiom-memcpy-select.ll
new file mode 100644
index 0000000000000..ed5e6700477a1
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-vector-idiom-memcpy-select.ll
@@ -0,0 +1,111 @@
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-vector-idiom -S %s | FileCheck %s
+
+; This test verifies the AMDGPUVectorIdiomCombinePass transforms:
+; 1) memcpy with select-fed source into a value-level select between two loads,
+; followed by one store (when it's safe to speculate both loads).
+; 2) memcpy with select-fed destination into a control-flow split with two memcpys.
+
+@G0 = addrspace(1) global [4 x i32] zeroinitializer, align 16
+@G1 = addrspace(1) global [4 x i32] zeroinitializer, align 16
+
+declare void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) nocapture writeonly, ptr addrspace(1) nocapture readonly, i64, i1 immarg)
+
+; -----------------------------------------------------------------------------
+; Source is a select. Expect value-level select of two <4 x i32> loads
+; and a single store, with no remaining memcpy.
+;
+; CHECK-LABEL: @value_select_src(
+; CHECK-NOT: call void @llvm.memcpy
+; CHECK: [[LA:%.+]] = load <4 x i32>, ptr addrspace(1) [[A:%.+]], align 16
+; CHECK: [[LB:%.+]] = load <4 x i32>, ptr addrspace(1) [[B:%.+]], align 16
+; CHECK: [[SEL:%.+]] = select i1 [[COND:%.+]], <4 x i32> [[LA]], <4 x i32> [[LB]]
+; CHECK: store <4 x i32> [[SEL]], ptr addrspace(1) [[DST:%.+]], align 16
+define amdgpu_kernel void @value_select_src(ptr addrspace(1) %dst, i1 %cond) {
+entry:
+ ; Pointers to two 16-byte aligned buffers in the same addrspace(1).
+ %pa = getelementptr inbounds [4 x i32], ptr addrspace(1) @G0, i64 0, i64 0
+ %pb = getelementptr inbounds [4 x i32], ptr addrspace(1) @G1, i64 0, i64 0
+ %src = select i1 %cond, ptr addrspace(1) %pa, ptr addrspace(1) %pb
+
+ ; Provide explicit operand alignments so the pass can emit an aligned store.
+ call void @llvm.memcpy.p1.p1.i64(
+ ptr addrspace(1) align 16 %dst,
+ ptr addrspace(1) align 16 %src,
+ i64 16, i1 false)
+
+ ret void
+}
+
+; -----------------------------------------------------------------------------
+; Destination is a select. Expect CFG split with two memcpys guarded
+; by a branch (we do not speculate stores in this pass).
+;
+; CHECK-LABEL: @dest_select_cfg_split(
+; CHECK: br i1 %cond, label %memcpy.then, label %memcpy.else
+; CHECK: memcpy.join:
+; CHECK: ret void
+; CHECK: memcpy.then:
+; CHECK: call void @llvm.memcpy.p1.p1.i64(
+; CHECK: br label %memcpy.join
+; CHECK: memcpy.else:
+; CHECK: call void @llvm.memcpy.p1.p1.i64(
+; CHECK: br label %memcpy.join
+define amdgpu_kernel void @dest_select_cfg_split(ptr addrspace(1) %da, ptr addrspace(1) %db,
+ ptr addrspace(1) %src, i1 %cond) {
+entry:
+ %dst = select i1 %cond, ptr addrspace(1) %da, ptr addrspace(1) %db
+ call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst, ptr addrspace(1) %src, i64 16, i1 false)
+ ret void
+}
+
+; -----------------------------------------------------------------------------
+; Source is a select, 4 x double (32 bytes).
+; Expect value-level select of two <4 x i64> loads and a single store, no memcpy.
+;
+; CHECK-LABEL: @value_select_src_4xd(
+; CHECK-NOT: call void @llvm.memcpy
+; CHECK: [[LA4D:%.+]] = load <4 x i64>, ptr addrspace(1) {{%.+}}, align 32
+; CHECK: [[LB4D:%.+]] = load <4 x i64>, ptr addrspace(1) {{%.+}}, align 32
+; CHECK: [[SEL4D:%.+]] = select i1 {{%.+}}, ...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
I think the HIP vector type structure definitions are broken, and should be defined with a union of ext_vector_type such that the original access is emitted as an under-aligned load of IR vector.

Why does this need to be a new pass? instcombine already does replacement of small memcpy and this is a small extension on top of that?
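For reference, a rough sketch of the union-of-ext_vector_type definition this comment refers to; the names are illustrative only, and the real HIP headers differ in naming, alignment, constructors, and supported element types:

// Illustrative union-based vector type: whole-vector access goes through the
// ext_vector_type member and lowers to an IR vector load/store, while named
// element access (.x, .y, ...) goes through the anonymous struct.
struct float4_union {
  typedef float native_float4 __attribute__((ext_vector_type(4)));
  union {
    native_float4 data;            // whole-vector access
    struct { float x, y, z, w; };  // per-element access
  };
};
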
Using a union with ext_vector_type would help codegen, and that is how the HIP vector types used to be implemented. But we had to change them to plain struct types because users rely on two C++ features that break with a union-based implementation. What breaks with a union-based HIP vector type:

1. Structured bindings. The type has an anonymous union, and C++ forbids decomposing such types. Example that fails with the union-based HIP vector type today (works in CUDA and with the struct-based HIP vector type): https://godbolt.org/z/rT7nhjszE

2. Constant expressions. Only one union member is "active", and reading another member in a constant expression is not allowed. Example that fails with the union-based HIP vector type today (works in CUDA and with the struct-based type): https://godbolt.org/z/T1eM9snM4
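A minimal, self-contained illustration of both failures (not taken from the godbolt links; a simplified two-element type is assumed here):

// Simplified union-based vector type used only for this illustration.
struct Vec2U {
  typedef float native2 __attribute__((ext_vector_type(2)));
  union {
    native2 data;
    struct { float x, y; };
  };
};

void decompose(Vec2U v) {
  // Error: cannot decompose a class with an anonymous union member.
  auto [a, b] = v;
}

constexpr Vec2U makeVec() {
  Vec2U v{}; // value-initialization makes `data` the active union member
  return v;
}
// Error: reads `x` while `data` is the active member, which is not allowed
// in a constant expression.
static_assert(makeVec().x == 0.0f, "");
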
The reasons to use a target-specific pass instead of instcombine: