Skip to content

Conversation

arun-thmn
Copy link
Contributor

@arun-thmn arun-thmn commented Aug 26, 2025

Adds MemWrite side effect to amx.tile_zero and amx.tile_load ops.

Memory write models hardware populating AMX tiles with specified values through tile zero and load ops.
Making the side effect explicit allows to use multiple op instances as a compilation hint to use different AMX tile registers. This can prevent less efficient lowering through tile store-load copies compared to directly populating tiles with values.

To illustrate the trade off:
Without explicit side effects, CSE optimizes two amx.tile_zero into a single op which lowers to a copy for the second tile:

  tilezero        %tmm0
  tilestored      %tmm0, -2032(%rbp,%rbx) # 1024-byte Folded Spill
  tileloadd       -2032(%rbp,%rbx), %tmm1 # 1024-byte Folded Reload

By keeping the two amx.tile_zero ops and, thus, lowering to two separate intrinsic invocations, the two tile registers are zeroed out directly without the additional round trip through memory:

  tilezero        %tmm0
  tilezero        %tmm1

The same principle applies to amx.tile_load ops.

@arun-thmn arun-thmn marked this pull request as ready for review August 26, 2025 12:45
@llvmbot
Copy link
Member

llvmbot commented Aug 26, 2025

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-amx

Author: Arun Thangamani (arun-thmn)

Changes

Updating amx.tile_zero with memory effect (MemWrite).

CSE optimizes multiple amx.tile_zero into one amx.tile_zero. It is okay for reads but for writes to multiple amx.tile_zero the CSE optimized code leads to a spill.

Below is the example IR, where four amx.tile_zero get CSE optimized into one amx.tile_zero but it lowers to assembly with three spills. Adding, memory effect (MemWrite) will avoid CSE if there is a write.

    tilezero        %tmm0
    tilestored      %tmm0, -2032(%rbp,%rbx) # 1024-byte Folded Spill
    tileloadd       -2032(%rbp,%rbx), %tmm3 # 1024-byte Folded Reload
    tilestored      %tmm0, -3056(%rbp,%rbx) # 1024-byte Folded Spill
    tileloadd       -3056(%rbp,%rbx), %tmm2 # 1024-byte Folded Reload
    tilestored      %tmm0, -4080(%rbp,%rbx) # 1024-byte Folded Spill
    tileloadd       -4080(%rbp,%rbx), %tmm1 # 1024-byte Folded Reload

Full diff: https://github.com/llvm/llvm-project/pull/155403.diff

1 Files Affected:

  • (modified) mlir/include/mlir/Dialect/AMX/AMX.td (+3-2)
diff --git a/mlir/include/mlir/Dialect/AMX/AMX.td b/mlir/include/mlir/Dialect/AMX/AMX.td
index 6bbde43e2d011..91fe1bfb5cd35 100644
--- a/mlir/include/mlir/Dialect/AMX/AMX.td
+++ b/mlir/include/mlir/Dialect/AMX/AMX.td
@@ -142,8 +142,9 @@ class AMX_Op<string mnemonic, list<Trait> traits = []> :
 // Tile reset.
 //
 
-def TileZeroOp : AMX_Op<"tile_zero", [Pure,
-    AMXIntrinsicOpInterface
+def TileZeroOp : AMX_Op<"tile_zero", [
+    AMXIntrinsicOpInterface,
+    MemoryEffects<[MemWrite]>
   ]> {
   let summary = "tile zero operation";
   let description = [{

@arun-thmn
Copy link
Contributor Author

@adam-smnk @rengolin Please review this PR to avoid CSE on amx.tile_zero if there is a write.

@rengolin
Copy link
Member

rengolin commented Aug 26, 2025

Can you add a test? Runs CSE and make sure the tile_zero doesn't get removed.

What happens if there isn't a write?

@adam-smnk
Copy link
Contributor

Can you add a test? Runs CSE and make sure the tile_zero doesn't get removed.

Not sure if that in particular needs testing. It's a generic feature of memory effect traits.
To really test the desired behavior, we'd need to pipe it all the way to llc IMO.

@arun-thmn
Copy link
Contributor Author

arun-thmn commented Aug 26, 2025

What happens if there isn't a write?

For no writes the CSE should optimize the amx.tile_zero as we added memory effect only for MemWrite

@rengolin
Copy link
Member

Can you add a test? Runs CSE and make sure the tile_zero doesn't get removed.

Not sure if that in particular needs testing. It's a generic feature of memory effect traits. To really test the desired behavior, we'd need to pipe it all the way to llc IMO.

Just that IR on the description through cse and show that only three are left?

@adam-smnk
Copy link
Contributor

Can you add a test? Runs CSE and make sure the tile_zero doesn't get removed.

Not sure if that in particular needs testing. It's a generic feature of memory effect traits. To really test the desired behavior, we'd need to pipe it all the way to llc IMO.

Just that IR on the description through cse and show that only three are left?

Fair enough, doesn't hurt to have extra coverage.

@arun-thmn Could you add a new test file under test/Dialect/AMX for this side effect testing?
For consistency, tile_load should also have MemWrite to model side effect of writing into an AMX tile.
And the op descriptions should mention the updated semantics.

@rengolin
Copy link
Member

This is a variation of https://discourse.llvm.org/t/bug-in-operationequivalence-breaks-cse-on-linalg-index/85773/43.

Here, each tile_zero op is meant to reserve a new register that will be used in parallel on the main FMA loop. When CSE runs without side-effects, 4 calls get collapsed into 1 and all FMAs use the same "tile".

There are two main lowering strategies here, both bad:

  1. (current) one tile_zero + 3 register copies. This is bad because AMX has no move instruction, so you need store+load for the other three.
  2. (possible) one tile_zero and all 4 FMAs run on the same register, spilling every time. This is equivalent to the one above because of the memory semantics of AMX.

There's a potential "fix" in LLVM's back-end for AMX, to simplify zero+store+load into two zeros. (@phoebewang), but I'm not sure what else this could break.

Currently, MemWrite side-effect is "wrong" for tile_zero, but there's no other way to represent register side-effects. Ultimately, we want tile_zero to "reserve a new concurrent SSA value", so InitValue side-effect of some sort would apply, which currently doesn't exist.

My advice here is to use MemWrite for now on both tile_zero and tile_load as @adam-smnk mentions, and discuss the other side-effect in the forums, then update the ops with the result of that discussion.

@phoebewang
Copy link
Contributor

There's a potential "fix" in LLVM's back-end for AMX, to simplify zero+store+load into two zeros. (@phoebewang), but I'm not sure what else this could break.

We have improved similar case before: 6753eb0

@arun-thmn
Copy link
Contributor Author

With MemoryEffects<[MemWrite] the CSE is restricted even for reads - I guess. For the below example, %0 and %1 doesn't get CSE to one amx.tile_zero. Now, I'm not very sure that MemoryEffects<[MemWrite] can be a valid fix.

Example:
%alloca = memref.alloca() : memref<16x32xf32>
%0 = amx.tile_zero : !amx.tile<16x16xf32>
%1 = amx.tile_zero : !amx.tile<16x16xf32>
amx.tile_store %alloca[%c0, %c0], %0 : memref<16x32xf32>, !amx.tile<16x16xf32>
amx.tile_store %alloca[%c0, %c16], %1 : memref<16x32xf32>, !amx.tile<16x16xf32>

@adam-smnk
Copy link
Contributor

With MemoryEffects<[MemWrite] the CSE is restricted even for reads - I guess.

Good 👍

For the below example, %0 and %1 doesn't get CSE to one amx.tile_zero. Now, I'm not very sure that MemoryEffects<[MemWrite] can be a valid fix.

Example: %alloca = memref.alloca() : memref<16x32xf32> %0 = amx.tile_zero : !amx.tile<16x16xf32> %1 = amx.tile_zero : !amx.tile<16x16xf32> amx.tile_store %alloca[%c0, %c0], %0 : memref<16x32xf32>, !amx.tile<16x16xf32> amx.tile_store %alloca[%c0, %c16], %1 : memref<16x32xf32>, !amx.tile<16x16xf32>

I think it is in line with the updated semantics we want to achieve. Each tile load/zero serves as a hint to use a separate tile register. User will have to be a bit more deliberate when creating AMX ops and not rely as much on an automatic cleanup.

I'd suggest to move forward with MemWrite to model side effects of tile zero and load.
Please also update op descriptions to explain the new intended behavior/usage.

@arun-thmn
Copy link
Contributor Author

Updated MemEffects for tileload + added a test-case ensuring CSE doesn't fuse multiple tilezero and tileload

Copy link
Contributor

@adam-smnk adam-smnk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rengolin Any more comments?

@rengolin
Copy link
Member

@rengolin Any more comments?

Nope, all good, thanks!

@adam-smnk adam-smnk changed the title [mlir][AMX] Memory effects to amx.tile_zero [mlir][amx] Add write side effect to AMX tile creation ops Aug 29, 2025
@adam-smnk adam-smnk merged commit 448811d into llvm:main Aug 29, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants