[AMDGPU] Identify vector idiom to unlock SROA #156791
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Yaxun (Sam) Liu (yxsamliu)

Changes

HIP vector types often lower to aggregates and get copied with memcpy. When the source or destination is chosen via a pointer select, SROA cannot split the aggregate. This keeps data in stack slots and increases scratch traffic. By rewriting these memcpy idioms, we enable SROA to promote values, reducing stack usage and improving occupancy and bandwidth on AMD GPUs. For example:

  %p = select i1 %cond, ptr %A, ptr %B
  call void @llvm.memcpy.p0.p0.i32(ptr %dst, ptr %p, i32 16, i1 false)

When the source is a pointer select and conditions allow, the pass replaces the memcpy with two aligned loads, a value-level select of the loaded values, and one aligned store. If it is not safe to speculate both loads, it splits control flow and emits a memcpy in each arm. When the destination is a select, it always splits control flow to avoid speculative stores. Vector element types are chosen based on size and minimum proven alignment to minimize the number of operations.

The pass handles non-volatile, constant-length memcpy up to a small size cap. Source and destination must be in the same address space. It runs early, after inlining and before InferAddressSpaces and SROA. Volatile and cross-address-space memcpys are skipped.

The size cap is controlled by -amdgpu-vector-idiom-max-bytes (default 32), allowing tuning for different workloads.

Fixes: SWDEV-550134

Patch is 21.86 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/156791.diff

6 Files Affected:
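As an illustration of where this idiom comes from, a hypothetical HIP snippet (not part of the patch; kernel and variable names are invented) whose aggregate copy is expected to lower to the pattern above after inlining and early CFG simplification:

#include <hip/hip_runtime.h>

// Hypothetical kernel: the ternary selects between two local float4 values.
// After inlining and early simplification, the aggregate copy typically
// becomes a memcpy whose source is a select of the two stack slots, roughly:
//   %p = select i1 %cond, ptr %a.addr, ptr %b.addr
//   call void @llvm.memcpy.p0.p0.i64(ptr %v.addr, ptr %p, i64 16, i1 false)
__global__ void pick(float4 *out, float s, int cond) {
  float4 a = make_float4(s, s, s, s);
  float4 b = make_float4(-s, -s, -s, -s);
  float4 v = cond ? a : b; // aggregate copy through a selected address
  out[0] = v;
}
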
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 48448833721bf..7fb808b64a1c1 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -67,6 +67,7 @@ FUNCTION_PASS("amdgpu-simplifylib", AMDGPUSimplifyLibCallsPass())
FUNCTION_PASS("amdgpu-unify-divergent-exit-nodes",
AMDGPUUnifyDivergentExitNodesPass())
FUNCTION_PASS("amdgpu-usenative", AMDGPUUseNativeCallsPass())
+FUNCTION_PASS("amdgpu-vector-idiom", AMDGPUVectorIdiomCombinePass())
FUNCTION_PASS("si-annotate-control-flow", SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
#undef FUNCTION_PASS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 4a2f0a13b1325..2b3168a805cf0 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -33,6 +33,7 @@
#include "AMDGPUTargetObjectFile.h"
#include "AMDGPUTargetTransformInfo.h"
#include "AMDGPUUnifyDivergentExitNodes.h"
+#include "AMDGPUVectorIdiom.h"
#include "AMDGPUWaitSGPRHazards.h"
#include "GCNDPPCombine.h"
#include "GCNIterativeScheduler.h"
@@ -922,6 +923,10 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
EnablePromoteKernelArguments)
FPM.addPass(AMDGPUPromoteKernelArgumentsPass());
+ // Run vector-idiom canonicalization early (after inlining) and before
+ // infer-AS / SROA to maximize scalarization opportunities.
+ FPM.addPass(AMDGPUVectorIdiomCombinePass());
+
// Add infer address spaces pass to the opt pipeline after inlining
// but before SROA to increase SROA opportunities.
FPM.addPass(InferAddressSpacesPass());
@@ -973,6 +978,8 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
// We only want to run this with O2 or higher since inliner and SROA
// don't run in O1.
if (Level != OptimizationLevel::O1) {
+ PM.addPass(
+ createModuleToFunctionPassAdaptor(AMDGPUVectorIdiomCombinePass()));
PM.addPass(
createModuleToFunctionPassAdaptor(InferAddressSpacesPass()));
}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUVectorIdiom.cpp b/llvm/lib/Target/AMDGPU/AMDGPUVectorIdiom.cpp
new file mode 100644
index 0000000000000..05f47c648493e
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUVectorIdiom.cpp
@@ -0,0 +1,317 @@
+//===- AMDGPUVectorIdiom.cpp ------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+// AMDGPU-specific vector idiom canonicalizations to unblock SROA and
+// subsequent scalarization/vectorization.
+//
+// Motivation:
+// - HIP vector types are often modeled as structs and copied with memcpy.
+// Address-level selects on such copies block SROA. Converting to value-level
+// operations or splitting the CFG enables SROA to break aggregates, which
+// unlocks scalarization/vectorization on AMDGPU.
+//
+// Example pattern:
+// %src = select i1 %c, ptr %A, ptr %B
+// call void @llvm.memcpy(ptr %dst, ptr %src, i32 16, i1 false)
+//
+// Objectives:
+// - Canonicalize small memcpy patterns where source or destination is a select
+// of pointers.
+// - Prefer value-level selects (on loaded values) over address-level selects
+// when safe.
+// - When speculation is unsafe, split the CFG to isolate each arm.
+//
+// Assumptions:
+// - Only handles non-volatile memcpy with constant length N where 0 < N <=
+// MaxBytes (default 32).
+// - Source and destination must be in the same address space.
+// - Speculative loads are allowed only if a conservative alignment check
+// passes.
+// - No speculative stores are introduced.
+//
+// Transformations:
+// - Source-select memcpy: attempt speculative loads -> value select -> single
+// store.
+// Fallback is CFG split with two memcpy calls.
+// - Destination-select memcpy: always CFG split to avoid speculative stores.
+//
+// Run this pass early, before SROA.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPUVectorIdiom.h"
+#include "AMDGPU.h"
+
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
+#include "llvm/Analysis/AssumptionCache.h"
+#include "llvm/Analysis/TargetLibraryInfo.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/InstIterator.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/IR/Type.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
+#include "llvm/Transforms/Utils/Local.h"
+
+using namespace llvm;
+using namespace PatternMatch;
+
+#define DEBUG_TYPE "amdgpu-vector-idiom"
+
+namespace {
+
+// Default to 32 bytes since the largest HIP vector types are double4 or long4.
+static cl::opt<unsigned> AMDGPUVectorIdiomMaxBytes(
+ "amdgpu-vector-idiom-max-bytes",
+ cl::desc("Max memcpy size (in bytes) to transform in AMDGPUVectorIdiom "
+ "(default 32)"),
+ cl::init(32));
+
+// Selects an integer or integer-vector element type matching NBytes, using the
+// minimum proven alignment to decide the widest safe element width.
+// Assumptions:
+// - Pointee types are opaque; the element choice is based solely on size and
+// alignment.
+// - Falls back to <N x i8> if wider lanes are not safe/aligned.
+static Type *getIntOrVecTypeForSize(uint64_t NBytes, LLVMContext &Ctx,
+ Align MinProvenAlign = Align(1)) {
+ auto CanUseI64 = [&]() { return MinProvenAlign >= Align(8); };
+ auto CanUseI32 = [&]() { return MinProvenAlign >= Align(4); };
+ auto CanUseI16 = [&]() { return MinProvenAlign >= Align(2); };
+
+ if (NBytes == 32 && CanUseI64())
+ return FixedVectorType::get(Type::getInt64Ty(Ctx), 4);
+
+ if ((NBytes % 4) == 0 && CanUseI32())
+ return FixedVectorType::get(Type::getInt32Ty(Ctx), NBytes / 4);
+
+ if ((NBytes % 2) == 0 && CanUseI16())
+ return FixedVectorType::get(Type::getInt16Ty(Ctx), NBytes / 2);
+
+ return FixedVectorType::get(Type::getInt8Ty(Ctx), NBytes);
+}
+
+static Align minAlign(Align A, Align B) { return A < B ? A : B; }
+
+// Checks if both pointer operands can be speculatively loaded for N bytes and
+// computes the minimum alignment to use.
+// Notes:
+// - Intentionally conservative: relies on getOrEnforceKnownAlignment.
+// - AA/TLI are not used for deeper reasoning here.
+static bool bothArmsSafeToSpeculateLoads(Value *A, Value *B, Align &OutAlign,
+ const DataLayout &DL,
+ AssumptionCache *AC,
+ const DominatorTree *DT) {
+ Align AlignA =
+ llvm::getOrEnforceKnownAlignment(A, Align(1), DL, nullptr, AC, DT);
+ Align AlignB =
+ llvm::getOrEnforceKnownAlignment(B, Align(1), DL, nullptr, AC, DT);
+
+ if (AlignA.value() < 1 || AlignB.value() < 1)
+ return false;
+
+ OutAlign = minAlign(AlignA, AlignB);
+ return true;
+}
+
+// With opaque pointers, ensure address spaces match and otherwise return Ptr.
+// Assumes the address space is the only property to validate for this cast.
+static Value *castPtrTo(Value *Ptr, unsigned ExpectedAS) {
+ auto *FromPTy = cast<PointerType>(Ptr->getType());
+ unsigned AS = FromPTy->getAddressSpace();
+ (void)ExpectedAS;
+ assert(AS == ExpectedAS && "Address space mismatch for castPtrTo");
+ return Ptr;
+}
+
+struct AMDGPUVectorIdiomImpl {
+ unsigned MaxBytes;
+
+ AMDGPUVectorIdiomImpl(unsigned MaxBytes) : MaxBytes(MaxBytes) {}
+
+ // Rewrites memcpy when the source is a select of pointers. Prefers a
+ // value-level select (two loads + select + one store) if speculative loads
+ // are safe. Otherwise, falls back to a guarded CFG split with two memcpy
+ // calls. Assumptions:
+ // - Non-volatile, constant length, within MaxBytes.
+ // - Source and destination in the same address space.
+ bool transformSelectMemcpySource(MemTransferInst &MT, SelectInst &Sel,
+ const DataLayout &DL,
+ const DominatorTree *DT,
+ AssumptionCache *AC) {
+ IRBuilder<> B(&MT);
+ Value *Dst = MT.getRawDest();
+ Value *A = Sel.getTrueValue();
+ Value *Bv = Sel.getFalseValue();
+ if (!A->getType()->isPointerTy() || !Bv->getType()->isPointerTy())
+ return false;
+
+ ConstantInt *LenCI = cast<ConstantInt>(MT.getLength());
+ uint64_t N = LenCI->getLimitedValue();
+
+ Align DstAlign = MaybeAlign(MT.getDestAlign()).valueOrOne();
+ Align AlignAB;
+ bool CanSpeculate =
+ bothArmsSafeToSpeculateLoads(A, Bv, AlignAB, DL, AC, DT);
+
+ unsigned AS = cast<PointerType>(A->getType())->getAddressSpace();
+ assert(AS == cast<PointerType>(Bv->getType())->getAddressSpace() &&
+ "Expected same AS");
+
+ if (CanSpeculate) {
+ Align MinAlign = std::min(AlignAB, DstAlign);
+ Type *Ty = getIntOrVecTypeForSize(N, B.getContext(), MinAlign);
+
+ Value *PA = castPtrTo(A, AS);
+ Value *PB = castPtrTo(Bv, AS);
+ LoadInst *LA = B.CreateAlignedLoad(Ty, PA, MinAlign);
+ LoadInst *LB = B.CreateAlignedLoad(Ty, PB, MinAlign);
+ Value *V = B.CreateSelect(Sel.getCondition(), LA, LB);
+
+ Value *PDst =
+ castPtrTo(Dst, cast<PointerType>(Dst->getType())->getAddressSpace());
+ (void)B.CreateAlignedStore(V, PDst, DstAlign);
+
+ LLVM_DEBUG(dbgs() << "[AMDGPUVectorIdiom] Rewrote memcpy(select-src) to "
+ "value-select loads/stores: "
+ << MT << "\n");
+ MT.eraseFromParent();
+ return true;
+ }
+
+ splitCFGForMemcpy(MT, Sel.getCondition(), A, Bv, true);
+ LLVM_DEBUG(
+ dbgs()
+ << "[AMDGPUVectorIdiom] Rewrote memcpy(select-src) by CFG split\n");
+ return true;
+ }
+
+ // Rewrites memcpy when the destination is a select of pointers. To avoid
+ // speculative stores, always splits the CFG and emits a memcpy per branch.
+ // Assumptions mirror the source case.
+ bool transformSelectMemcpyDest(MemTransferInst &MT, SelectInst &Sel) {
+ Value *DA = Sel.getTrueValue();
+ Value *DB = Sel.getFalseValue();
+ if (!DA->getType()->isPointerTy() || !DB->getType()->isPointerTy())
+ return false;
+
+ splitCFGForMemcpy(MT, Sel.getCondition(), DA, DB, false);
+ LLVM_DEBUG(
+ dbgs()
+ << "[AMDGPUVectorIdiom] Rewrote memcpy(select-dst) by CFG split\n");
+ return true;
+ }
+
+ // Splits the CFG around a memcpy whose source or destination depends on a
+ // condition. Clones memcpy in then/else using TruePtr/FalsePtr and rejoins.
+ // Assumptions:
+ // - MT has constant length and is non-volatile.
+ // - TruePtr/FalsePtr are correct replacements for the selected operand.
+ void splitCFGForMemcpy(MemTransferInst &MT, Value *Cond, Value *TruePtr,
+ Value *FalsePtr, bool IsSource) {
+ Function *F = MT.getFunction();
+ BasicBlock *Cur = MT.getParent();
+ BasicBlock *ThenBB = BasicBlock::Create(F->getContext(), "memcpy.then", F);
+ BasicBlock *ElseBB = BasicBlock::Create(F->getContext(), "memcpy.else", F);
+ BasicBlock *JoinBB =
+ Cur->splitBasicBlock(BasicBlock::iterator(&MT), "memcpy.join");
+
+ Cur->getTerminator()->eraseFromParent();
+ IRBuilder<> B(Cur);
+ B.CreateCondBr(Cond, ThenBB, ElseBB);
+
+ ConstantInt *LenCI = cast<ConstantInt>(MT.getLength());
+
+ IRBuilder<> BT(ThenBB);
+ if (IsSource) {
+ (void)BT.CreateMemCpy(MT.getRawDest(), MT.getDestAlign(), TruePtr,
+ MT.getSourceAlign(), LenCI, MT.isVolatile());
+ } else {
+ (void)BT.CreateMemCpy(TruePtr, MT.getDestAlign(), MT.getRawSource(),
+ MT.getSourceAlign(), LenCI, MT.isVolatile());
+ }
+ BT.CreateBr(JoinBB);
+
+ IRBuilder<> BE(ElseBB);
+ if (IsSource) {
+ (void)BE.CreateMemCpy(MT.getRawDest(), MT.getDestAlign(), FalsePtr,
+ MT.getSourceAlign(), LenCI, MT.isVolatile());
+ } else {
+ (void)BE.CreateMemCpy(FalsePtr, MT.getDestAlign(), MT.getRawSource(),
+ MT.getSourceAlign(), LenCI, MT.isVolatile());
+ }
+ BE.CreateBr(JoinBB);
+
+ MT.eraseFromParent();
+ }
+};
+
+} // end anonymous namespace
+
+AMDGPUVectorIdiomCombinePass::AMDGPUVectorIdiomCombinePass()
+ : MaxBytes(AMDGPUVectorIdiomMaxBytes) {}
+
+// Pass driver that locates small, constant-size, non-volatile memcpy calls
+// where source or destination is a select in the same address space. Applies
+// the source/destination transforms described above. Intended to run early to
+// maximize SROA and subsequent optimizations.
+PreservedAnalyses
+AMDGPUVectorIdiomCombinePass::run(Function &F, FunctionAnalysisManager &FAM) {
+ const DataLayout &DL = F.getParent()->getDataLayout();
+ auto &DT = FAM.getResult<DominatorTreeAnalysis>(F);
+ auto &AC = FAM.getResult<AssumptionAnalysis>(F);
+
+ SmallVector<CallInst *, 8> Worklist;
+ for (Instruction &I : instructions(F)) {
+ if (auto *CI = dyn_cast<CallInst>(&I)) {
+ if (isa<MemTransferInst>(CI))
+ Worklist.push_back(CI);
+ }
+ }
+
+ bool Changed = false;
+ AMDGPUVectorIdiomImpl Impl(MaxBytes);
+
+ for (CallInst *CI : Worklist) {
+ auto *MT = cast<MemTransferInst>(CI);
+ if (MT->isVolatile())
+ continue;
+
+ ConstantInt *LenCI = dyn_cast<ConstantInt>(MT->getLength());
+ if (!LenCI)
+ continue;
+
+ uint64_t N = LenCI->getLimitedValue();
+ if (N == 0 || N > MaxBytes)
+ continue;
+
+ Value *Dst = MT->getRawDest();
+ Value *Src = MT->getRawSource();
+
+ unsigned DstAS = cast<PointerType>(Dst->getType())->getAddressSpace();
+ unsigned SrcAS = cast<PointerType>(Src->getType())->getAddressSpace();
+ if (DstAS != SrcAS)
+ continue;
+
+ if (auto *Sel = dyn_cast<SelectInst>(Src)) {
+ Changed |= Impl.transformSelectMemcpySource(*MT, *Sel, DL, &DT, &AC);
+ continue;
+ }
+ if (auto *Sel = dyn_cast<SelectInst>(Dst)) {
+ Changed |= Impl.transformSelectMemcpyDest(*MT, *Sel);
+ continue;
+ }
+ }
+
+ return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUVectorIdiom.h b/llvm/lib/Target/AMDGPU/AMDGPUVectorIdiom.h
new file mode 100644
index 0000000000000..6aebafe3e4e93
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUVectorIdiom.h
@@ -0,0 +1,41 @@
+//===- AMDGPUVectorIdiom.h --------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// AMDGPU-specific vector idiom canonicalizations to unblock SROA and
+// subsequent scalarization/vectorization.
+//
+// This pass rewrites memcpy with select-fed operands into either:
+// - a value-level select (two loads + select + store), when safe to
+// speculatively load both arms, or
+// - a conservative CFG split around the condition to isolate each arm.
+//
+// Run this pass early, before SROA.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPUVECTORIDIOM_H
+#define LLVM_LIB_TARGET_AMDGPU_AMDGPUVECTORIDIOM_H
+
+#include "AMDGPU.h"
+#include "llvm/IR/PassManager.h"
+
+namespace llvm {
+
+class AMDGPUVectorIdiomCombinePass
+ : public PassInfoMixin<AMDGPUVectorIdiomCombinePass> {
+ unsigned MaxBytes;
+
+public:
+ AMDGPUVectorIdiomCombinePass();
+
+ PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM);
+};
+
+} // end namespace llvm
+
+#endif // LLVM_LIB_TARGET_AMDGPU_AMDGPUVECTORIDIOM_H
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 05295ae73be23..abf52f438fbe0 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -116,6 +116,7 @@ add_llvm_target(AMDGPUCodeGen
AMDGPUTargetTransformInfo.cpp
AMDGPUWaitSGPRHazards.cpp
AMDGPUUnifyDivergentExitNodes.cpp
+ AMDGPUVectorIdiom.cpp
R600MachineCFGStructurizer.cpp
GCNCreateVOPD.cpp
GCNDPPCombine.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-vector-idiom-memcpy-select.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-vector-idiom-memcpy-select.ll
new file mode 100644
index 0000000000000..ed5e6700477a1
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-vector-idiom-memcpy-select.ll
@@ -0,0 +1,111 @@
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-vector-idiom -S %s | FileCheck %s
+
+; This test verifies the AMDGPUVectorIdiomCombinePass transforms:
+; 1) memcpy with select-fed source into a value-level select between two loads,
+; followed by one store (when it's safe to speculate both loads).
+; 2) memcpy with select-fed destination into a control-flow split with two memcpys.
+
+@G0 = addrspace(1) global [4 x i32] zeroinitializer, align 16
+@G1 = addrspace(1) global [4 x i32] zeroinitializer, align 16
+
+declare void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) nocapture writeonly, ptr addrspace(1) nocapture readonly, i64, i1 immarg)
+
+; -----------------------------------------------------------------------------
+; Source is a select. Expect value-level select of two <4 x i32> loads
+; and a single store, with no remaining memcpy.
+;
+; CHECK-LABEL: @value_select_src(
+; CHECK-NOT: call void @llvm.memcpy
+; CHECK: [[LA:%.+]] = load <4 x i32>, ptr addrspace(1) [[A:%.+]], align 16
+; CHECK: [[LB:%.+]] = load <4 x i32>, ptr addrspace(1) [[B:%.+]], align 16
+; CHECK: [[SEL:%.+]] = select i1 [[COND:%.+]], <4 x i32> [[LA]], <4 x i32> [[LB]]
+; CHECK: store <4 x i32> [[SEL]], ptr addrspace(1) [[DST:%.+]], align 16
+define amdgpu_kernel void @value_select_src(ptr addrspace(1) %dst, i1 %cond) {
+entry:
+ ; Pointers to two 16-byte aligned buffers in the same addrspace(1).
+ %pa = getelementptr inbounds [4 x i32], ptr addrspace(1) @G0, i64 0, i64 0
+ %pb = getelementptr inbounds [4 x i32], ptr addrspace(1) @G1, i64 0, i64 0
+ %src = select i1 %cond, ptr addrspace(1) %pa, ptr addrspace(1) %pb
+
+ ; Provide explicit operand alignments so the pass can emit an aligned store.
+ call void @llvm.memcpy.p1.p1.i64(
+ ptr addrspace(1) align 16 %dst,
+ ptr addrspace(1) align 16 %src,
+ i64 16, i1 false)
+
+ ret void
+}
+
+; -----------------------------------------------------------------------------
+; Destination is a select. Expect CFG split with two memcpys guarded
+; by a branch (we do not speculate stores in this pass).
+;
+; CHECK-LABEL: @dest_select_cfg_split(
+; CHECK: br i1 %cond, label %memcpy.then, label %memcpy.else
+; CHECK: memcpy.join:
+; CHECK: ret void
+; CHECK: memcpy.then:
+; CHECK: call void @llvm.memcpy.p1.p1.i64(
+; CHECK: br label %memcpy.join
+; CHECK: memcpy.else:
+; CHECK: call void @llvm.memcpy.p1.p1.i64(
+; CHECK: br label %memcpy.join
+define amdgpu_kernel void @dest_select_cfg_split(ptr addrspace(1) %da, ptr addrspace(1) %db,
+ ptr addrspace(1) %src, i1 %cond) {
+entry:
+ %dst = select i1 %cond, ptr addrspace(1) %da, ptr addrspace(1) %db
+ call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst, ptr addrspace(1) %src, i64 16, i1 false)
+ ret void
+}
+
+; -----------------------------------------------------------------------------
+; Source is a select, 4 x double (32 bytes).
+; Expect value-level select of two <4 x i64> loads and a single store, no memcpy.
+;
+; CHECK-LABEL: @value_select_src_4xd(
+; CHECK-NOT: call void @llvm.memcpy
+; CHECK: [[LA4D:%.+]] = load <4 x i64>, ptr addrspace(1) {{%.+}}, align 32
+; CHECK: [[LB4D:%.+]] = load <4 x i64>, ptr addrspace(1) {{%.+}}, align 32
+; CHECK: [[SEL4D:%.+]] = select i1 {{%.+}}, ...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
I think the HIP vector type structure definitions are broken, and should be defined with a union of ext_vector_type such that the original access is emitted as an under-aligned load of IR vector.

Why does this need to be a new pass? instcombine already does replacement of small memcpy and this is a small extension on top of that?
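For reference, a rough sketch of the union-of-ext_vector_type definition this comment refers to; the names are illustrative only, and the real HIP headers differ in naming, alignment, constructors, and supported element types:

// Illustrative union-based vector type: whole-vector access goes through the
// ext_vector_type member and lowers to an IR vector load/store, while named
// element access (.x, .y, ...) goes through the anonymous struct.
struct float4_union {
  typedef float native_float4 __attribute__((ext_vector_type(4)));
  union {
    native_float4 data;            // whole-vector access
    struct { float x, y, z, w; };  // per-element access
  };
};
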
Using a union with ext_vector_type would help codegen, and that is how the HIP vector types used to be implemented. But we had to change them to plain struct types because users rely on two C++ features that break with a union-based implementation. What breaks with a union-based HIP vector type:

1. Structured bindings. The type has an anonymous union, and C++ forbids decomposing such types. Example that fails with the union-based HIP vector type today (works in CUDA and with the struct-based HIP vector type): https://godbolt.org/z/rT7nhjszE

2. Constant expressions. Only one union member is "active", and reading another member in a constant expression is not allowed. Example that fails with the union-based HIP vector type today (works in CUDA and with the struct-based type): https://godbolt.org/z/T1eM9snM4
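A minimal, self-contained illustration of both failures (not taken from the godbolt links; a simplified two-element type is assumed here):

// Simplified union-based vector type used only for this illustration.
struct Vec2U {
  typedef float native2 __attribute__((ext_vector_type(2)));
  union {
    native2 data;
    struct { float x, y; };
  };
};

void decompose(Vec2U v) {
  // Error: cannot decompose a class with an anonymous union member.
  auto [a, b] = v;
}

constexpr Vec2U makeVec() {
  Vec2U v{}; // value-initialization makes `data` the active union member
  return v;
}
// Error: reads `x` while `data` is the active member, which is not allowed
// in a constant expression.
static_assert(makeVec().x == 0.0f, "");
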
The reasons to use a target-specific pass instead of instcombine: