
Conversation

nhaehnle
Collaborator

The barrier intrinsic itself should not have memory semantics. Frontends should use appropriate fence instructions for memory effects, and some frontends want to rely on that for performance (e.g. wait only for LDS before a barrier).

See the code comment for more detail.

@llvmbot
Member

llvmbot commented Aug 22, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Nicolai Hähnle (nhaehnle)

Changes

The barrier intrinsic itself should not have memory semantics. Frontends should use appropriate fence instructions for memory effects, and some frontends want to rely on that for performance (e.g. wait only for LDS before a barrier).

See the code comment for more detail.
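
To make the division of labor concrete, a frontend that wants memory ordering around a barrier would now emit explicit fences itself. A minimal, hypothetical IR sketch (the function name and operands are illustrative, not taken from this PR):

```llvm
; Sketch: the barrier intrinsic is control-only; memory visibility comes
; from the explicit fences, which the backend can lower to the cheapest
; sufficient waits (e.g. waiting only for LDS when that is all that's needed).
define amdgpu_kernel void @example(ptr addrspace(3) %lds) {
  store i32 1, ptr addrspace(3) %lds
  fence syncscope("workgroup") release   ; publish the LDS store
  call void @llvm.amdgcn.s.barrier()     ; control barrier, no memory semantics
  fence syncscope("workgroup") acquire   ; observe other waves' stores
  ret void
}

declare void @llvm.amdgcn.s.barrier()
```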


Full diff: https://github.com/llvm/llvm-project/pull/154970.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+13-5)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.signal.isfirst.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/s-barrier.ll (+14-18)
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 343e4550e1d93..a65bd1e5e5c42 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -2010,11 +2010,19 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
     }
   }
 
-  // The subtarget may have an implicit S_WAITCNT 0 before barriers. If it does
-  // not, we need to ensure the subtarget is capable of backing off barrier
-  // instructions in case there are any outstanding memory operations that may
-  // cause an exception. Otherwise, insert an explicit S_WAITCNT 0 here.
-  if (TII->isBarrierStart(MI.getOpcode()) &&
+  // Ensure safety against exceptions from outstanding memory operations while
+  // waiting for a barrier:
+  //
+  //  * Some subtargets safely handle backing off the barrier in hardware
+  //    when an exception occurs.
+  //  * Some subtargets have an implicit S_WAITCNT 0 before barriers, so that
+  //    there can be no outstanding memory operations during the wait.
+  //  * Subtargets with split barriers don't need to back off the barrier; it
+  //    is up to the trap handler to preserve the user barrier state correctly.
+  //
+  // In all other cases, ensure safety by ensuring that there are no outstanding
+  // memory operations.
+  if (MI.getOpcode() == AMDGPU::S_BARRIER &&
       !ST->hasAutoWaitcntBeforeBarrier() && !ST->supportsBackOffBarrier()) {
     Wait = Wait.combined(WCG->getAllZeroWaitcnt(/*IncludeVSCnt=*/true));
   }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll
index 90e150c89955b..9003251253740 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll
@@ -98,7 +98,6 @@ define amdgpu_kernel void @test_barrier(ptr addrspace(1) %out, i32 %size) #0 {
 ; VARIANT4-NEXT:    s_wait_kmcnt 0x0
 ; VARIANT4-NEXT:    v_xad_u32 v0, v2, -1, s2
 ; VARIANT4-NEXT:    global_store_b32 v3, v2, s[0:1]
-; VARIANT4-NEXT:    s_wait_storecnt 0x0
 ; VARIANT4-NEXT:    s_barrier_signal -1
 ; VARIANT4-NEXT:    s_barrier_wait -1
 ; VARIANT4-NEXT:    v_ashrrev_i32_e32 v1, 31, v0
@@ -145,7 +144,6 @@ define amdgpu_kernel void @test_barrier(ptr addrspace(1) %out, i32 %size) #0 {
 ; VARIANT6-NEXT:    v_sub_nc_u32_e32 v0, s2, v4
 ; VARIANT6-NEXT:    global_store_b32 v5, v4, s[0:1]
 ; VARIANT6-NEXT:    v_ashrrev_i32_e32 v1, 31, v0
-; VARIANT6-NEXT:    s_wait_storecnt 0x0
 ; VARIANT6-NEXT:    s_barrier_signal -1
 ; VARIANT6-NEXT:    s_barrier_wait -1
 ; VARIANT6-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.signal.isfirst.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.signal.isfirst.ll
index 651d204f65b6c..248e0c716b975 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.signal.isfirst.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.signal.isfirst.ll
@@ -11,7 +11,6 @@ define i1 @func1() {
 ; GFX12-SDAG-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-SDAG-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-SDAG-NEXT:    s_cmp_eq_u32 0, 0
-; GFX12-SDAG-NEXT:    s_wait_storecnt 0x0
 ; GFX12-SDAG-NEXT:    s_barrier_signal_isfirst -1
 ; GFX12-SDAG-NEXT:    s_cselect_b32 s0, -1, 0
 ; GFX12-SDAG-NEXT:    s_wait_alu 0xfffe
@@ -27,7 +26,6 @@ define i1 @func1() {
 ; GFX12-GISEL-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-GISEL-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-GISEL-NEXT:    s_cmp_eq_u32 0, 0
-; GFX12-GISEL-NEXT:    s_wait_storecnt 0x0
 ; GFX12-GISEL-NEXT:    s_barrier_signal_isfirst -1
 ; GFX12-GISEL-NEXT:    s_cselect_b32 s0, 1, 0
 ; GFX12-GISEL-NEXT:    s_wait_alu 0xfffe
diff --git a/llvm/test/CodeGen/AMDGPU/s-barrier.ll b/llvm/test/CodeGen/AMDGPU/s-barrier.ll
index 1821bd45dc1cc..a4fa8e4b3c8e2 100644
--- a/llvm/test/CodeGen/AMDGPU/s-barrier.ll
+++ b/llvm/test/CodeGen/AMDGPU/s-barrier.ll
@@ -14,11 +14,10 @@ define void @func1() {
 ; GFX12-SDAG-NEXT:    s_wait_samplecnt 0x0
 ; GFX12-SDAG-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-SDAG-NEXT:    s_wait_kmcnt 0x0
-; GFX12-SDAG-NEXT:    s_mov_b32 m0, 0x70003
-; GFX12-SDAG-NEXT:    s_wait_storecnt 0x0
-; GFX12-SDAG-NEXT:    s_barrier_signal m0
 ; GFX12-SDAG-NEXT:    s_mov_b32 m0, 3
 ; GFX12-SDAG-NEXT:    s_barrier_join m0
+; GFX12-SDAG-NEXT:    s_mov_b32 m0, 0x70003
+; GFX12-SDAG-NEXT:    s_barrier_signal m0
 ; GFX12-SDAG-NEXT:    s_barrier_wait 1
 ; GFX12-SDAG-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -30,13 +29,12 @@ define void @func1() {
 ; GFX12-GISEL-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-GISEL-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-GISEL-NEXT:    s_mov_b32 m0, 0x70003
-; GFX12-GISEL-NEXT:    s_wait_storecnt 0x0
-; GFX12-GISEL-NEXT:    s_barrier_signal m0
 ; GFX12-GISEL-NEXT:    s_barrier_join 3
+; GFX12-GISEL-NEXT:    s_barrier_signal m0
 ; GFX12-GISEL-NEXT:    s_barrier_wait 1
 ; GFX12-GISEL-NEXT:    s_setpc_b64 s[30:31]
-    call void @llvm.amdgcn.s.barrier.signal.var(ptr addrspace(3) @bar3, i32 7)
     call void @llvm.amdgcn.s.barrier.join(ptr addrspace(3) @bar3)
+    call void @llvm.amdgcn.s.barrier.signal.var(ptr addrspace(3) @bar3, i32 7)
     call void @llvm.amdgcn.s.barrier.wait(i16 1)
     ret void
 }
@@ -49,11 +47,10 @@ define void @func2() {
 ; GFX12-SDAG-NEXT:    s_wait_samplecnt 0x0
 ; GFX12-SDAG-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-SDAG-NEXT:    s_wait_kmcnt 0x0
-; GFX12-SDAG-NEXT:    s_mov_b32 m0, 0x70001
-; GFX12-SDAG-NEXT:    s_wait_storecnt 0x0
-; GFX12-SDAG-NEXT:    s_barrier_signal m0
 ; GFX12-SDAG-NEXT:    s_mov_b32 m0, 1
 ; GFX12-SDAG-NEXT:    s_barrier_join m0
+; GFX12-SDAG-NEXT:    s_mov_b32 m0, 0x70001
+; GFX12-SDAG-NEXT:    s_barrier_signal m0
 ; GFX12-SDAG-NEXT:    s_barrier_wait 1
 ; GFX12-SDAG-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -65,13 +62,12 @@ define void @func2() {
 ; GFX12-GISEL-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-GISEL-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-GISEL-NEXT:    s_mov_b32 m0, 0x70001
-; GFX12-GISEL-NEXT:    s_wait_storecnt 0x0
-; GFX12-GISEL-NEXT:    s_barrier_signal m0
 ; GFX12-GISEL-NEXT:    s_barrier_join 1
+; GFX12-GISEL-NEXT:    s_barrier_signal m0
 ; GFX12-GISEL-NEXT:    s_barrier_wait 1
 ; GFX12-GISEL-NEXT:    s_setpc_b64 s[30:31]
-    call void @llvm.amdgcn.s.barrier.signal.var(ptr addrspace(3) @bar2, i32 7)
     call void @llvm.amdgcn.s.barrier.join(ptr addrspace(3) @bar2)
+    call void @llvm.amdgcn.s.barrier.signal.var(ptr addrspace(3) @bar2, i32 7)
     call void @llvm.amdgcn.s.barrier.wait(i16 1)
     ret void
 }
@@ -102,9 +98,9 @@ define amdgpu_kernel void @kernel1(ptr addrspace(1) %out, ptr addrspace(3) %in)
 ; GFX12-SDAG-NEXT:    s_barrier_signal m0
 ; GFX12-SDAG-NEXT:    s_mov_b32 m0, s2
 ; GFX12-SDAG-NEXT:    s_barrier_signal -1
-; GFX12-SDAG-NEXT:    s_barrier_signal_isfirst -1
 ; GFX12-SDAG-NEXT:    s_barrier_join m0
 ; GFX12-SDAG-NEXT:    s_mov_b32 m0, 2
+; GFX12-SDAG-NEXT:    s_barrier_signal_isfirst -1
 ; GFX12-SDAG-NEXT:    s_barrier_wait 1
 ; GFX12-SDAG-NEXT:    s_barrier_leave
 ; GFX12-SDAG-NEXT:    s_get_barrier_state s3, m0
@@ -155,11 +151,11 @@ define amdgpu_kernel void @kernel1(ptr addrspace(1) %out, ptr addrspace(3) %in)
 ; GFX12-GISEL-NEXT:    s_barrier_signal m0
 ; GFX12-GISEL-NEXT:    s_mov_b32 m0, s1
 ; GFX12-GISEL-NEXT:    s_barrier_signal m0
+; GFX12-GISEL-NEXT:    s_mov_b32 m0, s0
 ; GFX12-GISEL-NEXT:    s_barrier_signal -1
+; GFX12-GISEL-NEXT:    s_barrier_join m0
 ; GFX12-GISEL-NEXT:    s_barrier_signal_isfirst -1
-; GFX12-GISEL-NEXT:    s_mov_b32 m0, s0
 ; GFX12-GISEL-NEXT:    s_add_co_u32 s8, s12, 48
-; GFX12-GISEL-NEXT:    s_barrier_join m0
 ; GFX12-GISEL-NEXT:    s_barrier_wait 1
 ; GFX12-GISEL-NEXT:    s_barrier_leave
 ; GFX12-GISEL-NEXT:    s_get_barrier_state s0, 2
@@ -194,8 +190,8 @@ define amdgpu_kernel void @kernel1(ptr addrspace(1) %out, ptr addrspace(3) %in)
     call void @llvm.amdgcn.s.barrier.signal.var(ptr addrspace(3) @bar, i32 12)
     call void @llvm.amdgcn.s.barrier.signal.var(ptr addrspace(3) %in, i32 9)
     call void @llvm.amdgcn.s.barrier.signal(i32 -1)
-    %isfirst = call i1 @llvm.amdgcn.s.barrier.signal.isfirst(i32 -1)
     call void @llvm.amdgcn.s.barrier.join(ptr addrspace(3) %in)
+    %isfirst = call i1 @llvm.amdgcn.s.barrier.signal.isfirst(i32 -1)
     call void @llvm.amdgcn.s.barrier.wait(i16 1)
     call void @llvm.amdgcn.s.barrier.leave(i16 1)
     %state = call i32 @llvm.amdgcn.s.get.named.barrier.state(ptr addrspace(3) @bar)
@@ -219,7 +215,6 @@ define amdgpu_kernel void @kernel2(ptr addrspace(1) %out, ptr addrspace(3) %in)
 ; GFX12-SDAG-NEXT:    s_load_b64 s[12:13], s[6:7], 0x0
 ; GFX12-SDAG-NEXT:    s_mov_b32 m0, 0x70002
 ; GFX12-SDAG-NEXT:    s_add_nc_u64 s[8:9], s[4:5], 48
-; GFX12-SDAG-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-SDAG-NEXT:    s_barrier_signal m0
 ; GFX12-SDAG-NEXT:    s_mov_b32 m0, 2
 ; GFX12-SDAG-NEXT:    s_mov_b64 s[4:5], s[0:1]
@@ -227,6 +222,7 @@ define amdgpu_kernel void @kernel2(ptr addrspace(1) %out, ptr addrspace(3) %in)
 ; GFX12-SDAG-NEXT:    s_mov_b32 s32, 0
 ; GFX12-SDAG-NEXT:    s_barrier_join m0
 ; GFX12-SDAG-NEXT:    s_barrier_wait 1
+; GFX12-SDAG-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-SDAG-NEXT:    s_swappc_b64 s[30:31], s[12:13]
 ; GFX12-SDAG-NEXT:    s_endpgm
 ;
@@ -245,10 +241,10 @@ define amdgpu_kernel void @kernel2(ptr addrspace(1) %out, ptr addrspace(3) %in)
 ; GFX12-GISEL-NEXT:    s_mov_b64 s[4:5], s[0:1]
 ; GFX12-GISEL-NEXT:    s_mov_b64 s[6:7], s[2:3]
 ; GFX12-GISEL-NEXT:    s_mov_b32 s32, 0
-; GFX12-GISEL-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-GISEL-NEXT:    s_barrier_signal m0
 ; GFX12-GISEL-NEXT:    s_barrier_join 2
 ; GFX12-GISEL-NEXT:    s_barrier_wait 1
+; GFX12-GISEL-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-GISEL-NEXT:    s_swappc_b64 s[30:31], s[12:13]
 ; GFX12-GISEL-NEXT:    s_endpgm
     call void @llvm.amdgcn.s.barrier.signal.var(ptr addrspace(3) @bar, i32 7)

@jayfoad
Contributor

jayfoad commented Aug 22, 2025

We could get the same effect by saying GFX12 has FeatureBackOffBarrier, right? But perhaps your way is more honest.

I have a slight concern that this could trigger the same kind of problems we have seen on GFX10 and GFX11, which we worked around by removing FeatureBackOffBarrier from them. But if not then this seems fine to me.

The barrier intrinsic itself should not have memory semantics. Frontends
should use appropriate fence instructions for memory effects, and some
frontends want to rely on that for performance (e.g. wait only for LDS
before a barrier).

See the code comment for more detail.
@nhaehnle force-pushed the pub-gfx12-barrier-wait branch from 8c3b7d1 to 3fe4723 on August 22, 2025 16:18
@nhaehnle
Collaborator Author

We could get the same effect by saying GFX12 has FeatureBackOffBarrier, right? But perhaps your way is more honest.

Yeah, that was pretty much my thinking. The split barriers in GFX12 eliminate the need to "back off" from the barrier.

I have a slight concern that this could trigger the same kind of problems we have seen on GFX10 and GFX11, which we worked around by removing FeatureBackOffBarrier from them. But if not then this seems fine to me.

Yes, there is a risk with this change: there may be code out there that doesn't have the correct fences ("memory barriers") around these (control) barriers.

We've had people asking about this though because it is a performance problem for correctly written code.

@nhaehnle merged commit 4676242 into llvm:main on Aug 23, 2025
9 checks passed
@nhaehnle deleted the pub-gfx12-barrier-wait branch on August 23, 2025 07:08
@Pierre-vh
Contributor

Pierre-vh commented Aug 25, 2025

I have a slight concern that this could trigger the same kind of problems we have seen on GFX10 and GFX11, which we worked around by removing FeatureBackOffBarrier from them. But if not then this seems fine to me.

It can likely cause the exact same issue on GFX12 (I'd be very surprised if it didn't), seeing as we also had to disable BackOffBarrier downstream for that target. I am working on a proper fix for GFX10/11/12 which will allow us to re-enable BackOffBarrier.

@Pierre-vh
Contributor

Yeah, that was pretty much my thinking. The split barriers in GFX12 eliminate the need to "back off" from the barrier.

Can you elaborate on why the back-off barrier is not necessary with split barriers?
We still have a barrier_wait that needs to be able to "back off" to process signals.

Or are you just talking about the BackOffBarrier compiler feature, not the actual HW behavior?

@nhaehnle
Collaborator Author

Pre-GFX12 hardware has a single barrier object per workgroup, and the trap handler needs to be able to use barriers, at least for CWSR. So on pre-GFX12, when a wave is at S_BARRIER and a trap arrives, handling the trap safely requires the hardware to reliably back off the barrier, in the sense of undoing the barrier signal: the signal and wait parts of the barrier are part of the same instruction, so the instruction has already affected state if it gets interrupted by a trap.

With split barriers, the signal and wait parts are separate, and there are separate user and trap barrier objects. So, once the user barrier is signaled, that signal never has to be undone. The trap handler uses the trap barrier object to do its thing, and in CWSR the state of the user barrier can simply be saved and then later restored by the trap handler. S_BARRIER_WAIT doesn't need to back off the barrier because it doesn't actually do anything (doesn't change any state) until the barrier is signaled as complete. So there literally is no back-off.
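
In instruction terms, an illustrative assembly sketch mirroring the sequences in the updated tests (comments are my reading of the behavior described above, not authoritative hardware documentation):

```
s_barrier_signal -1    ; marks this wave's arrival; under CWSR the user
                       ; barrier state is saved and restored by the trap
                       ; handler, so the signal never needs undoing
s_barrier_wait -1      ; pure wait: changes no state until the barrier
                       ; completes, so a trap here has nothing to back off
```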

@nhaehnle
Collaborator Author

I have a slight concern that this could trigger the same kind of problems we have seen on GFX10 and GFX11, which we worked around by removing FeatureBackOffBarrier from them. But if not then this seems fine to me.

It likely can cause the exact same issue on gfx12 (I'd be very surprised if it didn't), seeing as we had to also disable the BackOffBarrier on downstream for that target. I am working on a proper fix for GFX10/11/12 which will allow us to reenable BackOffBarrier.

Can you elaborate on this? I would expect any issues to ultimately come from incorrect source programs. And yeah, we likely have to add some workarounds if we run into that, but pessimizing everything is just not the right answer here.

@nhaehnle
Collaborator Author

Ah, I remember now -- is this the issue of VMEM instructions getting re-ordered across barriers because a workgroup release fence doesn't wait for anything? Do you have relevant changes already in upstream review somewhere?

@Pierre-vh
Contributor

The trap handler uses the trap barrier object to do its thing, and in CWSR the state of the user barrier can simply be saved and then later restored by the trap handler. S_BARRIER_WAIT doesn't need to back off the barrier because it doesn't actually do anything (doesn't change any state) until the barrier is signaled as complete. So, there literally is no back-off.

Ah I understand now, thanks. That makes sense.

Ah, I remember now -- is this the issue of VMEM instructions getting re-ordered across barriers because a workgroup release fence doesn't wait for anything?

Yes

Do you have relevant changes already in upstream review somewhere?

No, but I hope to post something very soon. We just need to wait on vm_vsrc before barriers in CU mode, and that's it.
The actual docs/memory model will take a bit longer before they're available.
