Conversation

Pierre-vh
Contributor

@Pierre-vh Pierre-vh commented Aug 26, 2025

Re-enable it by adding a wait on vm_vsrc before every barrier "start" instruction in GFX10/11/12 CU mode.

This is a weaker wait than the one we emit without BackOffBarrier, so it should not introduce
any new guarantees that could be relied upon; instead, it relaxes the guarantees we have now to the
bare minimum needed to support the behavior users want (fence release + barrier works).

There is an exact memory model in the works which will be documented separately.


@Pierre-vh Pierre-vh marked this pull request as ready for review August 26, 2025 08:16
@llvmbot
Member

llvmbot commented Aug 26, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

Re-enable it by adding a wait on vm_vsrc before every barrier "start" instruction.
This is a less strong wait than what we do without BackOffBarrier, thus this shouldn't introduce
any new guarantees that can be abused, instead it relaxes the guarantees we have now to the bare
minimum needed to support the behavior users want (fence release + barrier works).

There is an exact memory model in the works which will be documented separately.


Full diff: https://github.com/llvm/llvm-project/pull/155370.diff

8 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+2)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+9-3)
  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+30-1)
  • (modified) llvm/test/CodeGen/AMDGPU/back-off-barrier-subtarget-feature.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.update.dpp.ll (+1-1)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll (+183)
  • (modified) llvm/test/CodeGen/AMDGPU/waitcnt-preexisting-vscnt.mir (+2-2)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index edd3ce72d7df3..1038797374de3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1902,6 +1902,7 @@ def FeatureISAVersion10_3_Generic: FeatureSet<
 
 def FeatureISAVersion11_Common : FeatureSet<
   [FeatureGFX11,
+   FeatureBackOffBarrier,
    FeatureLDSBankCount32,
    FeatureDLInsts,
    FeatureDot5Insts,
@@ -1985,6 +1986,7 @@ def FeatureISAVersion11_5_3 : FeatureSet<
 
 def FeatureISAVersion12 : FeatureSet<
   [FeatureGFX12,
+   FeatureBackOffBarrier,
    FeatureAddressableLocalMemorySize65536,
    FeatureLDSBankCount32,
    FeatureDLInsts,
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index fdbd9ce4a66bf..f7c7bb509c9ef 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -983,13 +983,19 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
     return MI.getDesc().TSFlags & SIInstrFlags::IsNeverUniform;
   }
 
-  bool isBarrier(unsigned Opcode) const {
+  // Check to see if opcode is for a barrier start. Pre gfx12 this is just the
+  // S_BARRIER, but after support for S_BARRIER_SIGNAL* / S_BARRIER_WAIT we want
+  // to check for the barrier start (S_BARRIER_SIGNAL*)
+  bool isBarrierStart(unsigned Opcode) const {
     return Opcode == AMDGPU::S_BARRIER ||
            Opcode == AMDGPU::S_BARRIER_SIGNAL_M0 ||
            Opcode == AMDGPU::S_BARRIER_SIGNAL_ISFIRST_M0 ||
            Opcode == AMDGPU::S_BARRIER_SIGNAL_IMM ||
-           Opcode == AMDGPU::S_BARRIER_SIGNAL_ISFIRST_IMM ||
-           Opcode == AMDGPU::S_BARRIER_WAIT ||
+           Opcode == AMDGPU::S_BARRIER_SIGNAL_ISFIRST_IMM;
+  }
+
+  bool isBarrier(unsigned Opcode) const {
+    return isBarrierStart(Opcode) || Opcode == AMDGPU::S_BARRIER_WAIT ||
            Opcode == AMDGPU::S_BARRIER_INIT_M0 ||
            Opcode == AMDGPU::S_BARRIER_INIT_IMM ||
            Opcode == AMDGPU::S_BARRIER_JOIN_IMM ||
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 53f554eccb1fb..3006eaf468bdd 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -359,6 +359,12 @@ class SICacheControl {
                              bool IsCrossAddrSpaceOrdering,
                              Position Pos) const = 0;
 
+  /// Inserts any necessary instructions before the barrier start instruction
+  /// \p MI in order to support pairing of barriers and fences.
+  virtual bool insertBarrierStart(MachineBasicBlock::iterator &MI) const {
+    return false;
+  };
+
   /// Virtual destructor to allow derivations to be deleted.
   virtual ~SICacheControl() = default;
 };
@@ -547,6 +553,8 @@ class SIGfx10CacheControl : public SIGfx7CacheControl {
                      SIAtomicScope Scope,
                      SIAtomicAddrSpace AddrSpace,
                      Position Pos) const override;
+
+  bool insertBarrierStart(MachineBasicBlock::iterator &MI) const override;
 };
 
 class SIGfx11CacheControl : public SIGfx10CacheControl {
@@ -2169,6 +2177,21 @@ bool SIGfx10CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
   return Changed;
 }
 
+bool SIGfx10CacheControl::insertBarrierStart(
+    MachineBasicBlock::iterator &MI) const {
+  // We need to wait on vm_vsrc so barriers can pair with fences in GFX10+ CU
+  // mode. This is because a CU mode release fence does not emit any wait, which
+  // is fine when only dealing with vmem, but isn't sufficient in the presence
+  // of barriers which do not go through vmem.
+  if (!ST.isCuModeEnabled())
+    return false;
+
+  BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
+          TII->get(AMDGPU::S_WAITCNT_DEPCTR))
+      .addImm(AMDGPU::DepCtr::encodeFieldVmVsrc(0));
+  return true;
+}
+
 bool SIGfx11CacheControl::enableLoadCacheBypass(
     const MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
     SIAtomicAddrSpace AddrSpace) const {
@@ -2840,7 +2863,8 @@ bool SIMemoryLegalizer::run(MachineFunction &MF) {
   bool Changed = false;
 
   SIMemOpAccess MOA(MMI.getObjFileInfo<AMDGPUMachineModuleInfo>());
-  CC = SICacheControl::create(MF.getSubtarget<GCNSubtarget>());
+  const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
+  CC = SICacheControl::create(ST);
 
   for (auto &MBB : MF) {
     for (auto MI = MBB.begin(); MI != MBB.end(); ++MI) {
@@ -2860,6 +2884,11 @@ bool SIMemoryLegalizer::run(MachineFunction &MF) {
         MI = II->getIterator();
       }
 
+      if (ST.getInstrInfo()->isBarrierStart(MI->getOpcode())) {
+        Changed |= CC->insertBarrierStart(MI);
+        continue;
+      }
+
       if (!(MI->getDesc().TSFlags & SIInstrFlags::maybeAtomic))
         continue;
 
diff --git a/llvm/test/CodeGen/AMDGPU/back-off-barrier-subtarget-feature.ll b/llvm/test/CodeGen/AMDGPU/back-off-barrier-subtarget-feature.ll
index b584f6df647ce..122e683f79664 100644
--- a/llvm/test/CodeGen/AMDGPU/back-off-barrier-subtarget-feature.ll
+++ b/llvm/test/CodeGen/AMDGPU/back-off-barrier-subtarget-feature.ll
@@ -54,9 +54,8 @@ define void @back_off_barrier_no_fence(ptr %in, ptr %out) #0 {
 ; GFX11-BACKOFF:       ; %bb.0:
 ; GFX11-BACKOFF-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX11-BACKOFF-NEXT:    flat_load_b32 v0, v[0:1]
-; GFX11-BACKOFF-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
-; GFX11-BACKOFF-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-BACKOFF-NEXT:    s_barrier
+; GFX11-BACKOFF-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX11-BACKOFF-NEXT:    flat_store_b32 v[2:3], v0
 ; GFX11-BACKOFF-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-BACKOFF-NEXT:    s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
index d23509b5aa812..b91963f08681c 100644
--- a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
+++ b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
@@ -150,6 +150,7 @@ define amdgpu_kernel void @barrier_release(<4 x i32> inreg %rsrc,
 ; GFX10CU-NEXT:    buffer_load_dword v0, s[8:11], 0 offen lds
 ; GFX10CU-NEXT:    v_mov_b32_e32 v0, s13
 ; GFX10CU-NEXT:    s_waitcnt vmcnt(0)
+; GFX10CU-NEXT:    s_waitcnt_depctr 0xffe3
 ; GFX10CU-NEXT:    s_barrier
 ; GFX10CU-NEXT:    ds_read_b32 v0, v0
 ; GFX10CU-NEXT:    s_waitcnt lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.update.dpp.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.update.dpp.ll
index 7d44d9178f941..4b1d1560ebe0e 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.update.dpp.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.update.dpp.ll
@@ -214,9 +214,9 @@ define weak_odr amdgpu_kernel void @dpp_test1(ptr %arg) local_unnamed_addr {
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX11-NEXT:    v_and_b32_e32 v0, 0xffc, v0
 ; GFX11-NEXT:    ds_load_b32 v1, v0
-; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    s_barrier
 ; GFX11-NEXT:    buffer_gl0_inv
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    v_add_co_u32 v0, s0, s0, v0
 ; GFX11-NEXT:    v_add_nc_u32_e32 v1, v1, v1
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll
new file mode 100644
index 0000000000000..e921f581c00a7
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll
@@ -0,0 +1,183 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx1010 < %s | FileCheck --check-prefixes=GFX10-WGP %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx1010 -mattr=+cumode < %s | FileCheck --check-prefixes=GFX10-CU %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx1100 < %s | FileCheck --check-prefixes=GFX11-WGP %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx1100 -mattr=+cumode < %s | FileCheck --check-prefixes=GFX11-CU %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx1200 < %s | FileCheck --check-prefixes=GFX12-WGP %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx1200 -mattr=+cumode < %s | FileCheck --check-prefixes=GFX12-CU %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -O0 -mcpu=gfx1250 < %s | FileCheck --check-prefixes=GFX1250 %s
+
+define amdgpu_kernel void @test_s_barrier() {
+; GFX10-WGP-LABEL: test_s_barrier:
+; GFX10-WGP:       ; %bb.0: ; %entry
+; GFX10-WGP-NEXT:    s_barrier
+; GFX10-WGP-NEXT:    s_endpgm
+;
+; GFX10-CU-LABEL: test_s_barrier:
+; GFX10-CU:       ; %bb.0: ; %entry
+; GFX10-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX10-CU-NEXT:    s_barrier
+; GFX10-CU-NEXT:    s_endpgm
+;
+; GFX11-WGP-LABEL: test_s_barrier:
+; GFX11-WGP:       ; %bb.0: ; %entry
+; GFX11-WGP-NEXT:    s_barrier
+; GFX11-WGP-NEXT:    s_endpgm
+;
+; GFX11-CU-LABEL: test_s_barrier:
+; GFX11-CU:       ; %bb.0: ; %entry
+; GFX11-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX11-CU-NEXT:    s_barrier
+; GFX11-CU-NEXT:    s_endpgm
+;
+; GFX12-WGP-LABEL: test_s_barrier:
+; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_barrier_signal -1
+; GFX12-WGP-NEXT:    s_barrier_wait -1
+; GFX12-WGP-NEXT:    s_endpgm
+;
+; GFX12-CU-LABEL: test_s_barrier:
+; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_alu 0xffe3
+; GFX12-CU-NEXT:    s_barrier_signal -1
+; GFX12-CU-NEXT:    s_barrier_wait -1
+; GFX12-CU-NEXT:    s_endpgm
+;
+; GFX1250-LABEL: test_s_barrier:
+; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    s_wait_alu 0xffe3
+; GFX1250-NEXT:    s_barrier_signal -1
+; GFX1250-NEXT:    s_barrier_wait -1
+; GFX1250-NEXT:    s_endpgm
+entry:
+  call void @llvm.amdgcn.s.barrier()
+  ret void
+}
+
+define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
+; GFX10-WGP-LABEL: test_s_barrier_workgroup_fence:
+; GFX10-WGP:       ; %bb.0: ; %entry
+; GFX10-WGP-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX10-WGP-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX10-WGP-NEXT:    s_barrier
+; GFX10-WGP-NEXT:    s_endpgm
+;
+; GFX10-CU-LABEL: test_s_barrier_workgroup_fence:
+; GFX10-CU:       ; %bb.0: ; %entry
+; GFX10-CU-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX10-CU-NEXT:    s_barrier
+; GFX10-CU-NEXT:    s_endpgm
+;
+; GFX11-WGP-LABEL: test_s_barrier_workgroup_fence:
+; GFX11-WGP:       ; %bb.0: ; %entry
+; GFX11-WGP-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX11-WGP-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-WGP-NEXT:    s_barrier
+; GFX11-WGP-NEXT:    s_endpgm
+;
+; GFX11-CU-LABEL: test_s_barrier_workgroup_fence:
+; GFX11-CU:       ; %bb.0: ; %entry
+; GFX11-CU-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX11-CU-NEXT:    s_barrier
+; GFX11-CU-NEXT:    s_endpgm
+;
+; GFX12-WGP-LABEL: test_s_barrier_workgroup_fence:
+; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_storecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_barrier_signal -1
+; GFX12-WGP-NEXT:    s_barrier_wait -1
+; GFX12-WGP-NEXT:    s_endpgm
+;
+; GFX12-CU-LABEL: test_s_barrier_workgroup_fence:
+; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
+; GFX12-CU-NEXT:    s_wait_alu 0xffe3
+; GFX12-CU-NEXT:    s_barrier_signal -1
+; GFX12-CU-NEXT:    s_barrier_wait -1
+; GFX12-CU-NEXT:    s_endpgm
+;
+; GFX1250-LABEL: test_s_barrier_workgroup_fence:
+; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    s_wait_dscnt 0x0
+; GFX1250-NEXT:    s_wait_alu 0xffe3
+; GFX1250-NEXT:    s_barrier_signal -1
+; GFX1250-NEXT:    s_barrier_wait -1
+; GFX1250-NEXT:    s_endpgm
+entry:
+  fence syncscope("workgroup") release
+  call void @llvm.amdgcn.s.barrier()
+  ret void
+}
+
+define amdgpu_kernel void @test_s_barrier_agent_fence() {
+; GFX10-WGP-LABEL: test_s_barrier_agent_fence:
+; GFX10-WGP:       ; %bb.0: ; %entry
+; GFX10-WGP-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX10-WGP-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX10-WGP-NEXT:    s_barrier
+; GFX10-WGP-NEXT:    s_endpgm
+;
+; GFX10-CU-LABEL: test_s_barrier_agent_fence:
+; GFX10-CU:       ; %bb.0: ; %entry
+; GFX10-CU-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX10-CU-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX10-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX10-CU-NEXT:    s_barrier
+; GFX10-CU-NEXT:    s_endpgm
+;
+; GFX11-WGP-LABEL: test_s_barrier_agent_fence:
+; GFX11-WGP:       ; %bb.0: ; %entry
+; GFX11-WGP-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX11-WGP-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-WGP-NEXT:    s_barrier
+; GFX11-WGP-NEXT:    s_endpgm
+;
+; GFX11-CU-LABEL: test_s_barrier_agent_fence:
+; GFX11-CU:       ; %bb.0: ; %entry
+; GFX11-CU-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX11-CU-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX11-CU-NEXT:    s_barrier
+; GFX11-CU-NEXT:    s_endpgm
+;
+; GFX12-WGP-LABEL: test_s_barrier_agent_fence:
+; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_storecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_barrier_signal -1
+; GFX12-WGP-NEXT:    s_barrier_wait -1
+; GFX12-WGP-NEXT:    s_endpgm
+;
+; GFX12-CU-LABEL: test_s_barrier_agent_fence:
+; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-CU-NEXT:    s_wait_samplecnt 0x0
+; GFX12-CU-NEXT:    s_wait_storecnt 0x0
+; GFX12-CU-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-CU-NEXT:    s_wait_alu 0xffe3
+; GFX12-CU-NEXT:    s_barrier_signal -1
+; GFX12-CU-NEXT:    s_barrier_wait -1
+; GFX12-CU-NEXT:    s_endpgm
+;
+; GFX1250-LABEL: test_s_barrier_agent_fence:
+; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    s_wait_bvhcnt 0x0
+; GFX1250-NEXT:    s_wait_samplecnt 0x0
+; GFX1250-NEXT:    s_wait_storecnt 0x0
+; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX1250-NEXT:    s_wait_alu 0xffe3
+; GFX1250-NEXT:    s_barrier_signal -1
+; GFX1250-NEXT:    s_barrier_wait -1
+; GFX1250-NEXT:    s_endpgm
+entry:
+  fence syncscope("agent") release
+  call void @llvm.amdgcn.s.barrier()
+  ret void
+}
diff --git a/llvm/test/CodeGen/AMDGPU/waitcnt-preexisting-vscnt.mir b/llvm/test/CodeGen/AMDGPU/waitcnt-preexisting-vscnt.mir
index 292091ad6e1e7..bb47392e7b7d7 100644
--- a/llvm/test/CodeGen/AMDGPU/waitcnt-preexisting-vscnt.mir
+++ b/llvm/test/CodeGen/AMDGPU/waitcnt-preexisting-vscnt.mir
@@ -62,7 +62,7 @@ body:             |
     ; GFX11-NEXT: {{  $}}
     ; GFX11-NEXT: S_WAITCNT 0
     ; GFX11-NEXT: GLOBAL_STORE_DWORD $vgpr0_vgpr1, $vgpr2, 0, 0, implicit $exec
-    ; GFX11-NEXT: S_WAITCNT_VSCNT undef $sgpr_null, 0
+    ; GFX11-NEXT: S_WAITCNT_VSCNT undef $sgpr_null, 1
     ; GFX11-NEXT: S_BARRIER
     ; GFX11-NEXT: $vgpr0 = FLAT_LOAD_DWORD $vgpr0_vgpr1, 0, 0, implicit $exec, implicit $flat_scr
     ; GFX11-NEXT: S_WAITCNT 7
@@ -176,7 +176,7 @@ body:             |
     ; GFX11-NEXT: {{  $}}
     ; GFX11-NEXT: S_WAITCNT 0
     ; GFX11-NEXT: GLOBAL_STORE_DWORD $vgpr0_vgpr1, $vgpr2, 0, 0, implicit $exec
-    ; GFX11-NEXT: S_WAITCNT_VSCNT undef $sgpr_null, 0
+    ; GFX11-NEXT: S_WAITCNT_VSCNT undef $sgpr_null, 1
     ; GFX11-NEXT: S_BARRIER
     ; GFX11-NEXT: $vgpr0 = FLAT_LOAD_DWORD $vgpr0_vgpr1, 0, 0, implicit $exec, implicit $flat_scr
     ; GFX11-NEXT: S_WAITCNT 7

// Check to see if opcode is for a barrier start. Pre gfx12 this is just the
// S_BARRIER, but after support for S_BARRIER_SIGNAL* / S_BARRIER_WAIT we want
// to check for the barrier start (S_BARRIER_SIGNAL*)
bool isBarrierStart(unsigned Opcode) const {
Collaborator


Out of curiosity, is there a way to encode whether an instruction is a barrier-start directly in the TD file where it is declared? In general, that would be useful for encoding many different properties of an instruction.

Contributor Author


I think it's possible; we just generally don't do it for things that can be handled by a small switch case. For example, waitcnts don't have such a flag either.
Other things like memory instructions do have one, because there are just too many variants.

@@ -54,9 +54,8 @@ define void @back_off_barrier_no_fence(ptr %in, ptr %out) #0 {
; GFX11-BACKOFF: ; %bb.0:
; GFX11-BACKOFF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-BACKOFF-NEXT: flat_load_b32 v0, v[0:1]
; GFX11-BACKOFF-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-BACKOFF-NEXT: s_waitcnt_vscnt null, 0x0
Collaborator


What causes the vscnt to disappear?

Contributor Author


Because restoring BackOffBarrier removes the waitcnts before barriers, the s_waitcnt vmcnt(0) lgkmcnt(0) moves down: SIInsertWaitcnts still needs a wait for the flat load to complete before its result is used.
Before this patch it did not insert one, because the forced wait before the barrier already covered it.

@Pierre-vh Pierre-vh requested a review from ssahasra August 28, 2025 09:15
Collaborator

@nhaehnle nhaehnle left a comment


Thanks!

@jayfoad
Contributor

jayfoad commented Aug 28, 2025

Re-enable it by adding a wait on vm_vsrc before every barrier "start" instruction.
This is a less strong wait than what we do without BackOffBarrier,

Description should mention that we only add the wait in CU mode? Is that why you describe it as "less strong" than without BackOffBarrier?

@Pierre-vh
Contributor Author

Re-enable it by adding a wait on vm_vsrc before every barrier "start" instruction.
This is a less strong wait than what we do without BackOffBarrier,

Description should mention that we only add the wait in CU mode? Is that why you describe it as "less strong" than without BackOffBarrier?

Yes, and yes. I updated the description.

The wait is less strong because in CU mode it only waits for operations to reach the workgroup's "coherency point", instead of waiting for them to complete entirely.

Contributor

@jayfoad jayfoad left a comment


Thanks!

@Pierre-vh Pierre-vh merged commit d6edc1a into main Sep 2, 2025
13 checks passed
@Pierre-vh Pierre-vh deleted the users/pierre-vh/barrier-pair-fence branch September 2, 2025 07:37