
Conversation

Contributor

@arsenm arsenm commented Aug 20, 2025

This strengthens the check to ensure the new mov's source class
is compatible with the source register. This avoids using the
register-size-based checks in getMovOpcode, which don't quite understand
AV superclasses correctly. As a side effect, it also enables more folds
into true16 movs.

getMovOpcode should probably be deleted, or at least replaced
with class-check-based logic. In this particular case, other
legality checks need to be mixed in with the attempted IR changes,
so I didn't try to push all of that into the opcode selection.
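
For readers skimming the change, here is a condensed sketch of the new selection loop, paraphrasing the SIFoldOperands.cpp hunk in the diff below (asserts and the subreg handling are omitted; variable names follow the patch):

for (unsigned MovOp :
     {AMDGPU::S_MOV_B32, AMDGPU::V_MOV_B32_e32, AMDGPU::S_MOV_B64,
      AMDGPU::V_MOV_B64_PSEUDO, AMDGPU::V_MOV_B16_t16_e64}) {
  const MCInstrDesc &MovDesc = TII->get(MovOp);
  // Accept this mov only if its result class can hold the copy's destination
  // class, rather than matching on register size as getMovOpcode does.
  const TargetRegisterClass *MovDstRC =
      TRI->getRegClass(MovDesc.operands()[0].RegClass);
  if (!DestRC->hasSuperClassEq(MovDstRC))
    continue;

  // true16 movs carry src0_modifiers, so the value operand is at index 2.
  const int SrcIdx = MovOp == AMDGPU::V_MOV_B16_t16_e64 ? 2 : 1;
  const TargetRegisterClass *MovSrcRC =
      TRI->getRegClass(MovDesc.operands()[SrcIdx].RegClass);
  if (!MRI->constrainRegClass(SrcReg, MovSrcRC))
    break; // Source register cannot satisfy the mov's source class; give up.

  // Compatible mov found: rewrite the COPY into MovOp (see the full diff).
  break;
}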

Contributor Author

arsenm commented Aug 20, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@arsenm arsenm marked this pull request as ready for review August 20, 2025 09:58
Member

llvmbot commented Aug 20, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

This strengthens the check to ensure the new mov's source class
is compatible with the source register. This avoids using the
register-size-based checks in getMovOpcode, which don't quite understand
AV superclasses correctly. As a side effect, it also enables more folds
into true16 movs.

getMovOpcode should probably be deleted, or at least replaced
with class-check-based logic. In this particular case, other
legality checks need to be mixed in with the attempted IR changes,
so I didn't try to push all of that into the opcode selection.


Patch is 62.79 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154501.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (+51-34)
  • (modified) llvm/test/CodeGen/AMDGPU/br_cc.f16.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/call-argument-types.ll (+63-30)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll (+165-81)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll (+11-5)
  • (modified) llvm/test/CodeGen/AMDGPU/imm16.ll (+268-127)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.interp.inreg.ll (+14-12)
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index 962c276bc2123..d72af06ac566e 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -1248,6 +1248,7 @@ void SIFoldOperandsImpl::foldOperand(
   if (FoldingImmLike && UseMI->isCopy()) {
     Register DestReg = UseMI->getOperand(0).getReg();
     Register SrcReg = UseMI->getOperand(1).getReg();
+    unsigned UseSubReg = UseMI->getOperand(1).getSubReg();
     assert(SrcReg.isVirtual());
 
     const TargetRegisterClass *SrcRC = MRI->getRegClass(SrcReg);
@@ -1278,44 +1279,60 @@ void SIFoldOperandsImpl::foldOperand(
       DestRC = &AMDGPU::SGPR_32RegClass;
     }
 
-    // In order to fold immediates into copies, we need to change the
-    // copy to a MOV.
+    // In order to fold immediates into copies, we need to change the copy to a
+    // MOV. Find a compatible mov instruction with the value.
+    for (unsigned MovOp :
+         {AMDGPU::S_MOV_B32, AMDGPU::V_MOV_B32_e32, AMDGPU::S_MOV_B64,
+          AMDGPU::V_MOV_B64_PSEUDO, AMDGPU::V_MOV_B16_t16_e64}) {
+      const MCInstrDesc &MovDesc = TII->get(MovOp);
+      assert(MovDesc.getNumDefs() > 0 && MovDesc.operands()[0].RegClass != -1);
+
+      const TargetRegisterClass *MovDstRC =
+          TRI->getRegClass(MovDesc.operands()[0].RegClass);
+
+      // Fold if the destination register class of the MOV instruction (ResRC)
+      // is a superclass of (or equal to) the destination register class of the
+      // COPY (DestRC). If this condition fails, folding would be illegal.
+      if (!DestRC->hasSuperClassEq(MovDstRC))
+        continue;
 
-    unsigned MovOp = TII->getMovOpcode(DestRC);
-    if (MovOp == AMDGPU::COPY)
-      return;
+      const int SrcIdx = MovOp == AMDGPU::V_MOV_B16_t16_e64 ? 2 : 1;
+      const TargetRegisterClass *MovSrcRC =
+          TRI->getRegClass(MovDesc.operands()[SrcIdx].RegClass);
 
-    // Fold if the destination register class of the MOV instruction (ResRC)
-    // is a superclass of (or equal to) the destination register class of the
-    // COPY (DestRC). If this condition fails, folding would be illegal.
-    const MCInstrDesc &MovDesc = TII->get(MovOp);
-    assert(MovDesc.getNumDefs() > 0 && MovDesc.operands()[0].RegClass != -1);
-    const TargetRegisterClass *ResRC =
-        TRI->getRegClass(MovDesc.operands()[0].RegClass);
-    if (!DestRC->hasSuperClassEq(ResRC))
-      return;
+      if (UseSubReg)
+        MovSrcRC = TRI->getMatchingSuperRegClass(SrcRC, MovSrcRC, UseSubReg);
+      if (!MRI->constrainRegClass(SrcReg, MovSrcRC))
+        break;
 
-    MachineInstr::mop_iterator ImpOpI = UseMI->implicit_operands().begin();
-    MachineInstr::mop_iterator ImpOpE = UseMI->implicit_operands().end();
-    while (ImpOpI != ImpOpE) {
-      MachineInstr::mop_iterator Tmp = ImpOpI;
-      ImpOpI++;
-      UseMI->removeOperand(UseMI->getOperandNo(Tmp));
-    }
-    UseMI->setDesc(TII->get(MovOp));
-
-    if (MovOp == AMDGPU::V_MOV_B16_t16_e64) {
-      const auto &SrcOp = UseMI->getOperand(UseOpIdx);
-      MachineOperand NewSrcOp(SrcOp);
-      MachineFunction *MF = UseMI->getParent()->getParent();
-      UseMI->removeOperand(1);
-      UseMI->addOperand(*MF, MachineOperand::CreateImm(0)); // src0_modifiers
-      UseMI->addOperand(NewSrcOp);                          // src0
-      UseMI->addOperand(*MF, MachineOperand::CreateImm(0)); // op_sel
-      UseOpIdx = 2;
-      UseOp = &UseMI->getOperand(UseOpIdx);
+      MachineInstr::mop_iterator ImpOpI = UseMI->implicit_operands().begin();
+      MachineInstr::mop_iterator ImpOpE = UseMI->implicit_operands().end();
+      while (ImpOpI != ImpOpE) {
+        MachineInstr::mop_iterator Tmp = ImpOpI;
+        ImpOpI++;
+        UseMI->removeOperand(UseMI->getOperandNo(Tmp));
+      }
+      UseMI->setDesc(MovDesc);
+
+      if (MovOp == AMDGPU::V_MOV_B16_t16_e64) {
+        const auto &SrcOp = UseMI->getOperand(UseOpIdx);
+        MachineOperand NewSrcOp(SrcOp);
+        MachineFunction *MF = UseMI->getParent()->getParent();
+        UseMI->removeOperand(1);
+        UseMI->addOperand(*MF, MachineOperand::CreateImm(0)); // src0_modifiers
+        UseMI->addOperand(NewSrcOp);                          // src0
+        UseMI->addOperand(*MF, MachineOperand::CreateImm(0)); // op_sel
+        UseOpIdx = SrcIdx;
+        UseOp = &UseMI->getOperand(UseOpIdx);
+      }
+      CopiesToReplace.push_back(UseMI);
+      break;
     }
-    CopiesToReplace.push_back(UseMI);
+
+    // We failed to replace the copy, so give up.
+    if (UseMI->getOpcode() == AMDGPU::COPY)
+      return;
+
   } else {
     if (UseMI->isCopy() && OpToFold.isReg() &&
         UseMI->getOperand(0).getReg().isVirtual() &&
diff --git a/llvm/test/CodeGen/AMDGPU/br_cc.f16.ll b/llvm/test/CodeGen/AMDGPU/br_cc.f16.ll
index 2761cba5ea71b..bfef88cdba9ed 100644
--- a/llvm/test/CodeGen/AMDGPU/br_cc.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/br_cc.f16.ll
@@ -197,7 +197,7 @@ define amdgpu_kernel void @br_cc_f16_imm_a(
 ; GFX11-TRUE16-NEXT:    v_cmp_nlt_f16_e32 vcc_lo, 0.5, v1.l
 ; GFX11-TRUE16-NEXT:    s_cbranch_vccnz .LBB1_2
 ; GFX11-TRUE16-NEXT:  ; %bb.1: ; %one
-; GFX11-TRUE16-NEXT:    v_mov_b32_e32 v0, 0x3800
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v0.l, 0x3800
 ; GFX11-TRUE16-NEXT:  .LBB1_2: ; %two
 ; GFX11-TRUE16-NEXT:    s_mov_b32 s2, s6
 ; GFX11-TRUE16-NEXT:    s_mov_b32 s3, s7
@@ -303,7 +303,7 @@ define amdgpu_kernel void @br_cc_f16_imm_b(
 ; GFX11-TRUE16-NEXT:    v_cmp_ngt_f16_e32 vcc_lo, 0.5, v1.l
 ; GFX11-TRUE16-NEXT:    s_cbranch_vccz .LBB2_2
 ; GFX11-TRUE16-NEXT:  ; %bb.1: ; %two
-; GFX11-TRUE16-NEXT:    v_mov_b32_e32 v0, 0x3800
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v0.l, 0x3800
 ; GFX11-TRUE16-NEXT:  .LBB2_2: ; %one
 ; GFX11-TRUE16-NEXT:    s_mov_b32 s2, s6
 ; GFX11-TRUE16-NEXT:    s_mov_b32 s3, s7
diff --git a/llvm/test/CodeGen/AMDGPU/call-argument-types.ll b/llvm/test/CodeGen/AMDGPU/call-argument-types.ll
index 2a1be99dff5d2..b8dd377377dab 100644
--- a/llvm/test/CodeGen/AMDGPU/call-argument-types.ll
+++ b/llvm/test/CodeGen/AMDGPU/call-argument-types.ll
@@ -426,16 +426,27 @@ define amdgpu_kernel void @test_call_external_void_func_i8_imm(i32) #0 {
 ; GFX9-NEXT:    s_swappc_b64 s[30:31], s[4:5]
 ; GFX9-NEXT:    s_endpgm
 ;
-; GFX11-LABEL: test_call_external_void_func_i8_imm:
-; GFX11:       ; %bb.0:
-; GFX11-NEXT:    v_mov_b32_e32 v0, 0x7b
-; GFX11-NEXT:    s_getpc_b64 s[2:3]
-; GFX11-NEXT:    s_add_u32 s2, s2, external_void_func_i8@rel32@lo+4
-; GFX11-NEXT:    s_addc_u32 s3, s3, external_void_func_i8@rel32@hi+12
-; GFX11-NEXT:    s_mov_b64 s[6:7], s[0:1]
-; GFX11-NEXT:    s_mov_b32 s32, 0
-; GFX11-NEXT:    s_swappc_b64 s[30:31], s[2:3]
-; GFX11-NEXT:    s_endpgm
+; GFX11-TRUE16-LABEL: test_call_external_void_func_i8_imm:
+; GFX11-TRUE16:       ; %bb.0:
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v0.l, 0x7b
+; GFX11-TRUE16-NEXT:    s_getpc_b64 s[2:3]
+; GFX11-TRUE16-NEXT:    s_add_u32 s2, s2, external_void_func_i8@rel32@lo+4
+; GFX11-TRUE16-NEXT:    s_addc_u32 s3, s3, external_void_func_i8@rel32@hi+12
+; GFX11-TRUE16-NEXT:    s_mov_b64 s[6:7], s[0:1]
+; GFX11-TRUE16-NEXT:    s_mov_b32 s32, 0
+; GFX11-TRUE16-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; GFX11-TRUE16-NEXT:    s_endpgm
+;
+; GFX11-FAKE16-LABEL: test_call_external_void_func_i8_imm:
+; GFX11-FAKE16:       ; %bb.0:
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v0, 0x7b
+; GFX11-FAKE16-NEXT:    s_getpc_b64 s[2:3]
+; GFX11-FAKE16-NEXT:    s_add_u32 s2, s2, external_void_func_i8@rel32@lo+4
+; GFX11-FAKE16-NEXT:    s_addc_u32 s3, s3, external_void_func_i8@rel32@hi+12
+; GFX11-FAKE16-NEXT:    s_mov_b64 s[6:7], s[0:1]
+; GFX11-FAKE16-NEXT:    s_mov_b32 s32, 0
+; GFX11-FAKE16-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; GFX11-FAKE16-NEXT:    s_endpgm
 ;
 ; HSA-LABEL: test_call_external_void_func_i8_imm:
 ; HSA:       ; %bb.0:
@@ -723,16 +734,27 @@ define amdgpu_kernel void @test_call_external_void_func_i16_imm() #0 {
 ; GFX9-NEXT:    s_swappc_b64 s[30:31], s[4:5]
 ; GFX9-NEXT:    s_endpgm
 ;
-; GFX11-LABEL: test_call_external_void_func_i16_imm:
-; GFX11:       ; %bb.0:
-; GFX11-NEXT:    v_mov_b32_e32 v0, 0x7b
-; GFX11-NEXT:    s_getpc_b64 s[2:3]
-; GFX11-NEXT:    s_add_u32 s2, s2, external_void_func_i16@rel32@lo+4
-; GFX11-NEXT:    s_addc_u32 s3, s3, external_void_func_i16@rel32@hi+12
-; GFX11-NEXT:    s_mov_b64 s[6:7], s[0:1]
-; GFX11-NEXT:    s_mov_b32 s32, 0
-; GFX11-NEXT:    s_swappc_b64 s[30:31], s[2:3]
-; GFX11-NEXT:    s_endpgm
+; GFX11-TRUE16-LABEL: test_call_external_void_func_i16_imm:
+; GFX11-TRUE16:       ; %bb.0:
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v0.l, 0x7b
+; GFX11-TRUE16-NEXT:    s_getpc_b64 s[2:3]
+; GFX11-TRUE16-NEXT:    s_add_u32 s2, s2, external_void_func_i16@rel32@lo+4
+; GFX11-TRUE16-NEXT:    s_addc_u32 s3, s3, external_void_func_i16@rel32@hi+12
+; GFX11-TRUE16-NEXT:    s_mov_b64 s[6:7], s[0:1]
+; GFX11-TRUE16-NEXT:    s_mov_b32 s32, 0
+; GFX11-TRUE16-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; GFX11-TRUE16-NEXT:    s_endpgm
+;
+; GFX11-FAKE16-LABEL: test_call_external_void_func_i16_imm:
+; GFX11-FAKE16:       ; %bb.0:
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v0, 0x7b
+; GFX11-FAKE16-NEXT:    s_getpc_b64 s[2:3]
+; GFX11-FAKE16-NEXT:    s_add_u32 s2, s2, external_void_func_i16@rel32@lo+4
+; GFX11-FAKE16-NEXT:    s_addc_u32 s3, s3, external_void_func_i16@rel32@hi+12
+; GFX11-FAKE16-NEXT:    s_mov_b64 s[6:7], s[0:1]
+; GFX11-FAKE16-NEXT:    s_mov_b32 s32, 0
+; GFX11-FAKE16-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; GFX11-FAKE16-NEXT:    s_endpgm
 ;
 ; HSA-LABEL: test_call_external_void_func_i16_imm:
 ; HSA:       ; %bb.0:
@@ -1642,16 +1664,27 @@ define amdgpu_kernel void @test_call_external_void_func_f16_imm() #0 {
 ; GFX9-NEXT:    s_swappc_b64 s[30:31], s[4:5]
 ; GFX9-NEXT:    s_endpgm
 ;
-; GFX11-LABEL: test_call_external_void_func_f16_imm:
-; GFX11:       ; %bb.0:
-; GFX11-NEXT:    v_mov_b32_e32 v0, 0x4400
-; GFX11-NEXT:    s_getpc_b64 s[2:3]
-; GFX11-NEXT:    s_add_u32 s2, s2, external_void_func_f16@rel32@lo+4
-; GFX11-NEXT:    s_addc_u32 s3, s3, external_void_func_f16@rel32@hi+12
-; GFX11-NEXT:    s_mov_b64 s[6:7], s[0:1]
-; GFX11-NEXT:    s_mov_b32 s32, 0
-; GFX11-NEXT:    s_swappc_b64 s[30:31], s[2:3]
-; GFX11-NEXT:    s_endpgm
+; GFX11-TRUE16-LABEL: test_call_external_void_func_f16_imm:
+; GFX11-TRUE16:       ; %bb.0:
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v0.l, 0x4400
+; GFX11-TRUE16-NEXT:    s_getpc_b64 s[2:3]
+; GFX11-TRUE16-NEXT:    s_add_u32 s2, s2, external_void_func_f16@rel32@lo+4
+; GFX11-TRUE16-NEXT:    s_addc_u32 s3, s3, external_void_func_f16@rel32@hi+12
+; GFX11-TRUE16-NEXT:    s_mov_b64 s[6:7], s[0:1]
+; GFX11-TRUE16-NEXT:    s_mov_b32 s32, 0
+; GFX11-TRUE16-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; GFX11-TRUE16-NEXT:    s_endpgm
+;
+; GFX11-FAKE16-LABEL: test_call_external_void_func_f16_imm:
+; GFX11-FAKE16:       ; %bb.0:
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v0, 0x4400
+; GFX11-FAKE16-NEXT:    s_getpc_b64 s[2:3]
+; GFX11-FAKE16-NEXT:    s_add_u32 s2, s2, external_void_func_f16@rel32@lo+4
+; GFX11-FAKE16-NEXT:    s_addc_u32 s3, s3, external_void_func_f16@rel32@hi+12
+; GFX11-FAKE16-NEXT:    s_mov_b64 s[6:7], s[0:1]
+; GFX11-FAKE16-NEXT:    s_mov_b32 s32, 0
+; GFX11-FAKE16-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; GFX11-FAKE16-NEXT:    s_endpgm
 ;
 ; HSA-LABEL: test_call_external_void_func_f16_imm:
 ; HSA:       ; %bb.0:
diff --git a/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll b/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
index 2fdc1a8854863..a844b6ceceadc 100644
--- a/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
+++ b/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
@@ -559,33 +559,61 @@ define amdgpu_gfx void @test_call_external_void_func_i8_imm(i32) #0 {
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
-; GFX11-LABEL: test_call_external_void_func_i8_imm:
-; GFX11:       ; %bb.0:
-; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-NEXT:    s_mov_b32 s0, s33
-; GFX11-NEXT:    s_mov_b32 s33, s32
-; GFX11-NEXT:    s_or_saveexec_b32 s1, -1
-; GFX11-NEXT:    scratch_store_b32 off, v40, s33 ; 4-byte Folded Spill
-; GFX11-NEXT:    s_mov_b32 exec_lo, s1
-; GFX11-NEXT:    v_writelane_b32 v40, s0, 2
-; GFX11-NEXT:    v_mov_b32_e32 v0, 0x7b
-; GFX11-NEXT:    s_mov_b32 s1, external_void_func_i8@abs32@hi
-; GFX11-NEXT:    s_mov_b32 s0, external_void_func_i8@abs32@lo
-; GFX11-NEXT:    s_add_i32 s32, s32, 16
-; GFX11-NEXT:    v_writelane_b32 v40, s30, 0
-; GFX11-NEXT:    v_writelane_b32 v40, s31, 1
-; GFX11-NEXT:    s_swappc_b64 s[30:31], s[0:1]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_readlane_b32 s31, v40, 1
-; GFX11-NEXT:    v_readlane_b32 s30, v40, 0
-; GFX11-NEXT:    s_mov_b32 s32, s33
-; GFX11-NEXT:    v_readlane_b32 s0, v40, 2
-; GFX11-NEXT:    s_or_saveexec_b32 s1, -1
-; GFX11-NEXT:    scratch_load_b32 v40, off, s33 ; 4-byte Folded Reload
-; GFX11-NEXT:    s_mov_b32 exec_lo, s1
-; GFX11-NEXT:    s_mov_b32 s33, s0
-; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    s_setpc_b64 s[30:31]
+; GFX11-TRUE16-LABEL: test_call_external_void_func_i8_imm:
+; GFX11-TRUE16:       ; %bb.0:
+; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-TRUE16-NEXT:    s_mov_b32 s0, s33
+; GFX11-TRUE16-NEXT:    s_mov_b32 s33, s32
+; GFX11-TRUE16-NEXT:    s_or_saveexec_b32 s1, -1
+; GFX11-TRUE16-NEXT:    scratch_store_b32 off, v40, s33 ; 4-byte Folded Spill
+; GFX11-TRUE16-NEXT:    s_mov_b32 exec_lo, s1
+; GFX11-TRUE16-NEXT:    v_writelane_b32 v40, s0, 2
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v0.l, 0x7b
+; GFX11-TRUE16-NEXT:    s_mov_b32 s1, external_void_func_i8@abs32@hi
+; GFX11-TRUE16-NEXT:    s_mov_b32 s0, external_void_func_i8@abs32@lo
+; GFX11-TRUE16-NEXT:    s_add_i32 s32, s32, 16
+; GFX11-TRUE16-NEXT:    v_writelane_b32 v40, s30, 0
+; GFX11-TRUE16-NEXT:    v_writelane_b32 v40, s31, 1
+; GFX11-TRUE16-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT:    v_readlane_b32 s31, v40, 1
+; GFX11-TRUE16-NEXT:    v_readlane_b32 s30, v40, 0
+; GFX11-TRUE16-NEXT:    s_mov_b32 s32, s33
+; GFX11-TRUE16-NEXT:    v_readlane_b32 s0, v40, 2
+; GFX11-TRUE16-NEXT:    s_or_saveexec_b32 s1, -1
+; GFX11-TRUE16-NEXT:    scratch_load_b32 v40, off, s33 ; 4-byte Folded Reload
+; GFX11-TRUE16-NEXT:    s_mov_b32 exec_lo, s1
+; GFX11-TRUE16-NEXT:    s_mov_b32 s33, s0
+; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-FAKE16-LABEL: test_call_external_void_func_i8_imm:
+; GFX11-FAKE16:       ; %bb.0:
+; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-FAKE16-NEXT:    s_mov_b32 s0, s33
+; GFX11-FAKE16-NEXT:    s_mov_b32 s33, s32
+; GFX11-FAKE16-NEXT:    s_or_saveexec_b32 s1, -1
+; GFX11-FAKE16-NEXT:    scratch_store_b32 off, v40, s33 ; 4-byte Folded Spill
+; GFX11-FAKE16-NEXT:    s_mov_b32 exec_lo, s1
+; GFX11-FAKE16-NEXT:    v_writelane_b32 v40, s0, 2
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v0, 0x7b
+; GFX11-FAKE16-NEXT:    s_mov_b32 s1, external_void_func_i8@abs32@hi
+; GFX11-FAKE16-NEXT:    s_mov_b32 s0, external_void_func_i8@abs32@lo
+; GFX11-FAKE16-NEXT:    s_add_i32 s32, s32, 16
+; GFX11-FAKE16-NEXT:    v_writelane_b32 v40, s30, 0
+; GFX11-FAKE16-NEXT:    v_writelane_b32 v40, s31, 1
+; GFX11-FAKE16-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-FAKE16-NEXT:    v_readlane_b32 s31, v40, 1
+; GFX11-FAKE16-NEXT:    v_readlane_b32 s30, v40, 0
+; GFX11-FAKE16-NEXT:    s_mov_b32 s32, s33
+; GFX11-FAKE16-NEXT:    v_readlane_b32 s0, v40, 2
+; GFX11-FAKE16-NEXT:    s_or_saveexec_b32 s1, -1
+; GFX11-FAKE16-NEXT:    scratch_load_b32 v40, off, s33 ; 4-byte Folded Reload
+; GFX11-FAKE16-NEXT:    s_mov_b32 exec_lo, s1
+; GFX11-FAKE16-NEXT:    s_mov_b32 s33, s0
+; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-FAKE16-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX10-SCRATCH-LABEL: test_call_external_void_func_i8_imm:
 ; GFX10-SCRATCH:       ; %bb.0:
@@ -978,33 +1006,61 @@ define amdgpu_gfx void @test_call_external_void_func_i16_imm() #0 {
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
-; GFX11-LABEL: test_call_external_void_func_i16_imm:
-; GFX11:       ; %bb.0:
-; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-NEXT:    s_mov_b32 s0, s33
-; GFX11-NEXT:    s_mov_b32 s33, s32
-; GFX11-NEXT:    s_or_saveexec_b32 s1, -1
-; GFX11-NEXT:    scratch_store_b32 off, v40, s33 ; 4-byte Folded Spill
-; GFX11-NEXT:    s_mov_b32 exec_lo, s1
-; GFX11-NEXT:    v_writelane_b32 v40, s0, 2
-; GFX11-NEXT:    v_mov_b32_e32 v0, 0x7b
-; GFX11-NEXT:    s_mov_b32 s1, external_void_func_i16@abs32@hi
-; GFX11-NEXT:    s_mov_b32 s0, external_void_func_i16@abs32@lo
-; GFX11-NEXT:    s_add_i32 s32, s32, 16
-; GFX11-NEXT:    v_writelane_b32 v40, s30, 0
-; GFX11-NEXT:    v_writelane_b32 v40, s31, 1
-; GFX11-NEXT:    s_swappc_b64 s[30:31], s[0:1]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_readlane_b32 s31, v40, 1
-; GFX11-NEXT:    v_readlane_b32 s30, v40, 0
-; GFX11-NEXT:    s_mov_b32 s32, s33
-; GFX11-NEXT:    v_readlane_b32 s0, v40, 2
-; GFX11-NEXT:    s_or_saveexec_b32 s1, -1
-; GFX11-NEXT:    scratch_load_b32 v40, off, s33 ; 4-byte Folded Reload
-; GFX11-NEXT:    s_mov_b32 exec_lo, s1
-; GFX11-NEXT:    s_mov_b32 s33, s0
-; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    s_setpc_b64 s[30:31]
+; GFX11-TRUE16-LABEL: test_call_external_void_func_i16_imm:
+; GFX11-TRUE16:       ; %bb.0:
+; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-TRUE16-NEXT:    s_mov_b32 s0, s33
+; GFX11-TRUE16-NEXT:    s_mov_b32 s33, s32
+; GFX11-TRUE16-NEXT:    s_or_saveexec_b32 s1, -1
+; GFX11-TRUE16-NEXT:    scratch_store_b32 off, v40, s33 ; 4-byte Folded Spill
+; GFX11-TRUE16-NEXT:    s_mov_b32 exec_lo, s1
+; GFX11-TRUE16-NEXT:    v_writelane_b32 v40, s0, 2
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v0.l, 0x7b
+; GFX11-TRUE16-NEXT:    s_mov_b32 s1, external_void_func_i16@abs32@hi
+; GFX11-TRUE16-NEXT:    s_mov_b32 s0, external_void_func_i16@abs32@lo
+; GFX11-TRUE16-NEXT:    s_add_i32 s32, s32, 16
+; GFX11-TRUE16-NEXT:    v_writelane_b32 v40, s30, 0
+; GFX11-TRUE16-NEXT:    v_writelane_b32 v40, s31, 1
+; GFX11-TRUE16-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT:    v_readlane_b32 s31, v40, 1
+; GFX11-TRUE16-NEXT:    v_readlane_b32 s30, v40, 0
+; GFX11-TRUE16-NEXT:    s_mov_b32 s32, s33
+; GFX11-TRUE16-NEXT:    v_readlane_b32 s0, v40, 2
+; GFX11-TRUE16-NEXT:    s_or_saveexec_b32 s1, -1
+; GFX11-TRUE16-NEXT:    scratch_load_b32 v40, off, s33 ; 4-byte Folded Reload
+; GFX11-TRUE16-NEXT:    s_mov_b32 exec_lo, s1
+; GFX11-TRUE16-NEXT:    s_mov_b32 s33, s0
+; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-FAKE16-LABEL: test_call_external_void_func_i16_imm:
+; GFX11-FAKE16:       ; %bb.0:
+; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-FAKE16-NEXT:    s_mov_b32 s0, s33
+; GFX11-FAKE16-NEXT:    s_mov_b32 s33, s32
+; GFX11-FAKE16-NEXT:    s_or_saveexec_b32 s1, -1
+; GFX11-FAKE16-NEXT:    scratch_store_b32 off, v40, s33 ; 4-byte Folded Spill
+; GFX11-FAKE16-NEXT:    s_mov_b32 exec_lo, s1
+; GFX11-FAKE16-NEXT:    v_writelane_b32 v40, s0, 2
+; GFX11-FAKE16-NEXT:    v_mov_b32_e32 v0, 0x7b
+; GFX11-FAKE16-NEXT:    s_mov_b32 s1, external_void_func_i16@abs32@hi
+; GFX11-FAKE16-NEXT:    s_mov_b32 s0, external_void_func_i16@abs32@lo
+; GFX11-FAKE16-NEXT:    s_add_i32 s32, s32, 16
+; GFX11-FAKE16-NEXT:    v_writelane_b32 v40, s30, 0
+; GFX11-FAKE16-NEXT:    v_writelane_b32 v40, s31, 1
+; GFX11-FAKE16-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; GFX11-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-FAKE16-NEXT:    v_readlane_b32 s31, v40, 1
+; GFX11-FAKE16-NEXT:    v_readlane_b32 s30, v40, 0
+; GFX11-FAKE16-NEXT:    s_mov_b32 s32, s33
+; GFX11-FAKE16-NEXT:    v_readlane_b32 s0, v40, 2
+; GFX11-FAKE16-NEXT:    s_or_saveexec_b32 s1, -1
+; GFX11-FAKE16-NEXT:    scratch_load_b32 v40, off, s33 ; 4-byte Folded Reload
+; GFX11-FAKE16-NEXT:    s_mov_b32 exec_lo, s1
+; GFX11-FAKE16-NEXT:    s_mov_b32 s33, s0
+; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-FAKE16-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX10-SCRATCH-LABEL: test_call_external_void_func_i16_imm:
 ; GFX10-SCRATCH:       ; %bb.0:
@@ -2161,33 +2217,61 @@ define amdgpu_gfx v...
[truncated]

@arsenm arsenm force-pushed the users/arsenm/replace-getMovOpcode-with-class-compat-checks branch from f594b23 to 75b0db1 Compare August 23, 2025 11:08
Contributor

@Pierre-vh Pierre-vh left a comment

small nits but otherwise LGTM

Comment on lines +391 to +395
; GFX12-TRUE16-NEXT: v_interp_p10_f16_f32 v2, v0.l, v2, v0.l wait_exp:7
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_interp_p2_f16_f32 v0.l, v0.l, v3, v1 wait_exp:7
; GFX12-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.h, v2
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
Contributor

This has one more instruction (s_delay_alu instid0(VALU_DEP_1)); is that expected?

Contributor Author

Not the concern of immediate folding

Comment on lines +1308 to +1314
      MachineInstr::mop_iterator ImpOpI = UseMI->implicit_operands().begin();
      MachineInstr::mop_iterator ImpOpE = UseMI->implicit_operands().end();
      while (ImpOpI != ImpOpE) {
        MachineInstr::mop_iterator Tmp = ImpOpI;
        ImpOpI++;
        UseMI->removeOperand(UseMI->getOperandNo(Tmp));
      }
Contributor

Don't we have helpers to remove all implicit operands? It feels like we should have one.
Maybe put this in a small static helper?

Contributor Author

This is just re-indenting existing code, not touching this here
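
For reference, a helper along the lines suggested above might look like the following (a hypothetical sketch, not part of this patch; it relies on implicit operands always following the explicit ones in a MachineInstr):

// Hypothetical helper: drop every implicit operand from MI. Walk backwards so
// removeOperand does not shift indices that still need to be visited.
static void stripImplicitOperands(MachineInstr &MI) {
  for (unsigned I = MI.getNumOperands(); I-- > MI.getNumExplicitOperands();)
    MI.removeOperand(I);
}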

@arsenm arsenm merged commit 4454152 into main Aug 26, 2025
8 of 9 checks passed
@arsenm arsenm deleted the users/arsenm/replace-getMovOpcode-with-class-compat-checks branch August 26, 2025 14:41
Collaborator

llvm-ci commented Aug 26, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-rel-x86-64 running on ml-opt-rel-x86-64-b2 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/185/builds/24117

Here is the relevant piece of the build log for reference:
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/llvm.amdgcn.rcp.bf16.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/ml-opt-rel-x86-64-b1/build/bin/llc -global-isel=0 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 -mattr=+real-true16 < /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rcp.bf16.ll | /b/ml-opt-rel-x86-64-b1/build/bin/FileCheck -check-prefix=SDAG-TRUE16 /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rcp.bf16.ll # RUN: at line 2
+ /b/ml-opt-rel-x86-64-b1/build/bin/FileCheck -check-prefix=SDAG-TRUE16 /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rcp.bf16.ll
+ /b/ml-opt-rel-x86-64-b1/build/bin/llc -global-isel=0 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 -mattr=+real-true16
/b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rcp.bf16.ll:38:21: error: SDAG-TRUE16-NEXT: expected string not found in input
; SDAG-TRUE16-NEXT: v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, 0x3e80
                    ^
<stdin>:103:32: note: scanning from here
 s_load_b64 s[0:1], s[4:5], 0x0
                               ^
<stdin>:104:2: note: possible intended match here
 v_mov_b16_e32 v0.l, 0x3e80
 ^
/b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rcp.bf16.ll:59:21: error: SDAG-TRUE16-NEXT: expected string not found in input
; SDAG-TRUE16-NEXT: v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, 0x3c24
                    ^
<stdin>:197:32: note: scanning from here
 s_load_b64 s[0:1], s[4:5], 0x0
                               ^
<stdin>:198:2: note: possible intended match here
 v_mov_b16_e32 v0.l, 0x3c24
 ^
/b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rcp.bf16.ll:80:21: error: SDAG-TRUE16-NEXT: expected string not found in input
; SDAG-TRUE16-NEXT: v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, 0x7fc0
                    ^
<stdin>:291:32: note: scanning from here
 s_load_b64 s[0:1], s[4:5], 0x0
                               ^
<stdin>:292:2: note: possible intended match here
 v_mov_b16_e32 v0.l, 0x7fc0
 ^

Input file: <stdin>
Check file: /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rcp.bf16.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
           .
           .
           .
          98:  .globl rcp_bf16_constant_4 ; -- Begin function rcp_bf16_constant_4 
          99:  .p2align 8 
         100:  .type rcp_bf16_constant_4,@function 
         101: rcp_bf16_constant_4: ; @rcp_bf16_constant_4 
...

Contributor

jplehr commented Aug 26, 2025

Contributor

boomanaiden154 commented Aug 26, 2025

It broke the premerge bot as well https://lab.llvm.org/staging/#/builders/21/builds/1619.

It looks like a mid-air collision given this wasn't caught by the premerge checks. Those LLDB failures in the premerge runs are also old.

@boomanaiden154
Contributor

Looks like this was already reverted in 665da0a.
