
Conversation

arsenm
Contributor

@arsenm arsenm commented Aug 21, 2025

In some cases this will require an avoidable redefinition of
a register, but it works out better most of the time. Also allow
folding 64-bit immediates into subregister extracts, unless doing so
would break an inline constant.

We could be more aggressive here, but this set of conditions seems
to do a reasonable job without introducing too many regressions.

@arsenm arsenm marked this pull request as ready for review August 21, 2025 13:11
@llvmbot
Member

llvmbot commented Aug 21, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes


Patch is 453.67 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154757.diff

46 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+24-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.interp.inreg.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mubuf-global.ll (+9-11)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll (+15-11)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+26-26)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll (+10-7)
  • (modified) llvm/test/CodeGen/AMDGPU/addrspacecast-gas.ll (+3-2)
  • (modified) llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll (+80-80)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+178-178)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcomb-extract-vec-elt-different-sizes.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-fmul-sel.ll (+76-36)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+56-56)
  • (modified) llvm/test/CodeGen/AMDGPU/div_v2i128.ll (+555-555)
  • (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+15-13)
  • (modified) llvm/test/CodeGen/AMDGPU/extract_vector_elt-f16.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/extract_vector_elt-i16.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll (+29-20)
  • (modified) llvm/test/CodeGen/AMDGPU/fptoi.i128.ll (+196-194)
  • (modified) llvm/test/CodeGen/AMDGPU/fsqrt.f64.ll (+32-48)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+5-6)
  • (modified) llvm/test/CodeGen/AMDGPU/iglp-no-clobber.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.AFLCustomIRMutator.opt.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.frexp.ll (+62-33)
  • (modified) llvm/test/CodeGen/AMDGPU/mad-combine.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/masked-load-vectortypes.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_uint24-amdgcn.ll (+1-1)
  • (added) llvm/test/CodeGen/AMDGPU/peephole-fold-imm-multi-use.mir (+94)
  • (modified) llvm/test/CodeGen/AMDGPU/rem_i128.ll (+116-116)
  • (modified) llvm/test/CodeGen/AMDGPU/roundeven.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/rsq.f64.ll (+166-186)
  • (modified) llvm/test/CodeGen/AMDGPU/sdiv64.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/shift-and-i64-ubfe.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/sint_to_fp.f64.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/spill-agpr.ll (+116-116)
  • (modified) llvm/test/CodeGen/AMDGPU/srem64.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/srl.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/subreg-coalescer-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv64.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/uint_to_fp.f64.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/undef-handling-crash-in-ra.ll (+19-21)
  • (modified) llvm/test/CodeGen/AMDGPU/urem64.ll (+47-49)
  • (modified) llvm/test/CodeGen/AMDGPU/v_cndmask.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/valu-i1.ll (+1-1)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 75b303086163b..1be8d99834f93 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -3559,13 +3559,12 @@ static unsigned getNewFMAMKInst(const GCNSubtarget &ST, unsigned Opc) {
 
 bool SIInstrInfo::foldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
                                 Register Reg, MachineRegisterInfo *MRI) const {
-  if (!MRI->hasOneNonDBGUse(Reg))
-    return false;
-
   int64_t Imm;
   if (!getConstValDefinedInReg(DefMI, Reg, Imm))
     return false;
 
+  const bool HasMultipleUses = !MRI->hasOneNonDBGUse(Reg);
+
   assert(!DefMI.getOperand(0).getSubReg() && "Expected SSA form");
 
   unsigned Opc = UseMI.getOpcode();
@@ -3577,6 +3576,25 @@ bool SIInstrInfo::foldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
 
     const TargetRegisterClass *DstRC = RI.getRegClassForReg(*MRI, DstReg);
 
+    if (HasMultipleUses) {
+      // TODO: This should fold in more cases with multiple use, but we need to
+      // more carefully consider what those uses are.
+      unsigned ImmDefSize = RI.getRegSizeInBits(*MRI->getRegClass(Reg));
+
+      // Avoid breaking up a 64-bit inline immediate into a subregister extract.
+      if (UseSubReg != AMDGPU::NoSubRegister && ImmDefSize == 64)
+        return false;
+
+      // Most of the time folding a 32-bit inline constant is free (though this
+      // might not be true if we can't later fold it into a real user).
+      //
+      // FIXME: This isInlineConstant check is imprecise if
+      // getConstValDefinedInReg handled the tricky non-mov cases.
+      if (ImmDefSize == 32 &&
+          !isInlineConstant(Imm, AMDGPU::OPERAND_REG_IMM_INT32))
+        return false;
+    }
+
     bool Is16Bit = UseSubReg != AMDGPU::NoSubRegister &&
                    RI.getSubRegIdxSize(UseSubReg) == 16;
 
@@ -3664,6 +3682,9 @@ bool SIInstrInfo::foldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
     return true;
   }
 
+  if (HasMultipleUses)
+    return false;
+
   if (Opc == AMDGPU::V_MAD_F32_e64 || Opc == AMDGPU::V_MAC_F32_e64 ||
       Opc == AMDGPU::V_MAD_F16_e64 || Opc == AMDGPU::V_MAC_F16_e64 ||
       Opc == AMDGPU::V_FMA_F32_e64 || Opc == AMDGPU::V_FMAC_F32_e64 ||
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.interp.inreg.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.interp.inreg.ll
index a09703285087c..bd6634f250777 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.interp.inreg.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.interp.inreg.ll
@@ -358,12 +358,12 @@ main_body:
 define amdgpu_ps half @v_interp_f16_imm_params(float inreg %i, float inreg %j) #0 {
 ; GFX11-TRUE16-LABEL: v_interp_f16_imm_params:
 ; GFX11-TRUE16:       ; %bb.0: ; %main_body
-; GFX11-TRUE16-NEXT:    v_dual_mov_b32 v1, s0 :: v_dual_mov_b32 v2, 0
+; GFX11-TRUE16-NEXT:    v_dual_mov_b32 v1, s0 :: v_dual_mov_b32 v2, s1
 ; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT:    v_mov_b32_e32 v3, s1
+; GFX11-TRUE16-NEXT:    v_mov_b32_e32 v3, 0
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
 ; GFX11-TRUE16-NEXT:    v_interp_p10_f16_f32 v1, v0.l, v1, v0.l wait_exp:7
-; GFX11-TRUE16-NEXT:    v_interp_p2_f16_f32 v0.l, v0.l, v3, v2 wait_exp:7
+; GFX11-TRUE16-NEXT:    v_interp_p2_f16_f32 v0.l, v0.l, v2, v3 wait_exp:7
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX11-TRUE16-NEXT:    v_cvt_f16_f32_e32 v0.h, v1
 ; GFX11-TRUE16-NEXT:    v_add_f16_e32 v0.l, v0.h, v0.l
@@ -383,12 +383,12 @@ define amdgpu_ps half @v_interp_f16_imm_params(float inreg %i, float inreg %j) #
 ;
 ; GFX12-TRUE16-LABEL: v_interp_f16_imm_params:
 ; GFX12-TRUE16:       ; %bb.0: ; %main_body
-; GFX12-TRUE16-NEXT:    v_dual_mov_b32 v1, s0 :: v_dual_mov_b32 v2, 0
+; GFX12-TRUE16-NEXT:    v_dual_mov_b32 v1, s0 :: v_dual_mov_b32 v2, s1
 ; GFX12-TRUE16-NEXT:    v_mov_b16_e32 v0.l, 0
-; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v3, s1
+; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v3, 0
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
 ; GFX12-TRUE16-NEXT:    v_interp_p10_f16_f32 v1, v0.l, v1, v0.l wait_exp:7
-; GFX12-TRUE16-NEXT:    v_interp_p2_f16_f32 v0.l, v0.l, v3, v2 wait_exp:7
+; GFX12-TRUE16-NEXT:    v_interp_p2_f16_f32 v0.l, v0.l, v2, v3 wait_exp:7
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX12-TRUE16-NEXT:    v_cvt_f16_f32_e32 v0.h, v1
 ; GFX12-TRUE16-NEXT:    v_add_f16_e32 v0.l, v0.h, v0.l
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/mubuf-global.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/mubuf-global.ll
index 07d5ff2036d93..b75eb737534e9 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/mubuf-global.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/mubuf-global.ll
@@ -1379,45 +1379,43 @@ define amdgpu_ps float @mubuf_atomicrmw_sgpr_ptr_vgpr_offset(ptr addrspace(1) in
 ; GFX6-LABEL: mubuf_atomicrmw_sgpr_ptr_vgpr_offset:
 ; GFX6:       ; %bb.0:
 ; GFX6-NEXT:    v_ashrrev_i32_e32 v1, 31, v0
-; GFX6-NEXT:    v_lshl_b64 v[0:1], v[0:1], 2
+; GFX6-NEXT:    v_lshl_b64 v[1:2], v[0:1], 2
 ; GFX6-NEXT:    s_mov_b32 s0, s2
 ; GFX6-NEXT:    s_mov_b32 s1, s3
-; GFX6-NEXT:    v_mov_b32_e32 v2, 2
+; GFX6-NEXT:    v_mov_b32_e32 v0, 2
 ; GFX6-NEXT:    s_mov_b32 s2, 0
 ; GFX6-NEXT:    s_mov_b32 s3, 0xf000
-; GFX6-NEXT:    buffer_atomic_add v2, v[0:1], s[0:3], 0 addr64 glc
+; GFX6-NEXT:    buffer_atomic_add v0, v[1:2], s[0:3], 0 addr64 glc
 ; GFX6-NEXT:    s_waitcnt vmcnt(0)
 ; GFX6-NEXT:    buffer_wbinvl1
-; GFX6-NEXT:    v_mov_b32_e32 v0, v2
 ; GFX6-NEXT:    s_waitcnt expcnt(0)
 ; GFX6-NEXT:    ; return to shader part epilog
 ;
 ; GFX7-LABEL: mubuf_atomicrmw_sgpr_ptr_vgpr_offset:
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    v_ashrrev_i32_e32 v1, 31, v0
-; GFX7-NEXT:    v_lshl_b64 v[0:1], v[0:1], 2
+; GFX7-NEXT:    v_lshl_b64 v[1:2], v[0:1], 2
 ; GFX7-NEXT:    s_mov_b32 s0, s2
 ; GFX7-NEXT:    s_mov_b32 s1, s3
-; GFX7-NEXT:    v_mov_b32_e32 v2, 2
+; GFX7-NEXT:    v_mov_b32_e32 v0, 2
 ; GFX7-NEXT:    s_mov_b32 s2, 0
 ; GFX7-NEXT:    s_mov_b32 s3, 0xf000
-; GFX7-NEXT:    buffer_atomic_add v2, v[0:1], s[0:3], 0 addr64 glc
+; GFX7-NEXT:    buffer_atomic_add v0, v[1:2], s[0:3], 0 addr64 glc
 ; GFX7-NEXT:    s_waitcnt vmcnt(0)
 ; GFX7-NEXT:    buffer_wbinvl1
-; GFX7-NEXT:    v_mov_b32_e32 v0, v2
 ; GFX7-NEXT:    ; return to shader part epilog
 ;
 ; GFX12-LABEL: mubuf_atomicrmw_sgpr_ptr_vgpr_offset:
 ; GFX12:       ; %bb.0:
 ; GFX12-NEXT:    v_ashrrev_i32_e32 v1, 31, v0
 ; GFX12-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s3
-; GFX12-NEXT:    v_mov_b32_e32 v4, 2
-; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX12-NEXT:    v_lshlrev_b64_e32 v[0:1], 2, v[0:1]
 ; GFX12-NEXT:    v_add_co_u32 v0, vcc_lo, v2, v0
 ; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX12-NEXT:    v_add_co_ci_u32_e64 v1, null, v3, v1, vcc_lo
-; GFX12-NEXT:    global_atomic_add_u32 v0, v[0:1], v4, off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
+; GFX12-NEXT:    v_mov_b32_e32 v2, 2
+; GFX12-NEXT:    global_atomic_add_u32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
 ; GFX12-NEXT:    global_inv scope:SCOPE_DEV
 ; GFX12-NEXT:    s_wait_loadcnt 0x0
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll
index 832f066adaa84..2f956d7a0a534 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll
@@ -229,21 +229,23 @@ define i16 @v_saddsat_v2i8(i16 %lhs.arg, i16 %rhs.arg) {
 ; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX6-NEXT:    v_lshrrev_b32_e32 v2, 8, v0
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GFX6-NEXT:    v_min_i32_e32 v5, 0, v0
+; GFX6-NEXT:    v_min_i32_e32 v6, 0, v0
+; GFX6-NEXT:    v_bfrev_b32_e32 v7, 1
 ; GFX6-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
 ; GFX6-NEXT:    v_max_i32_e32 v4, 0, v0
-; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, 0x80000000, v5
+; GFX6-NEXT:    v_sub_i32_e32 v6, vcc, v7, v6
 ; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0x7fffffff, v4
-; GFX6-NEXT:    v_max_i32_e32 v1, v5, v1
+; GFX6-NEXT:    v_max_i32_e32 v1, v6, v1
 ; GFX6-NEXT:    v_min_i32_e32 v1, v1, v4
 ; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v1, 24, v2
 ; GFX6-NEXT:    v_min_i32_e32 v4, 0, v1
+; GFX6-NEXT:    v_bfrev_b32_e32 v5, -2
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v2, 24, v3
 ; GFX6-NEXT:    v_max_i32_e32 v3, 0, v1
 ; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0x80000000, v4
-; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 0x7fffffff, v3
+; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, v5, v3
 ; GFX6-NEXT:    v_max_i32_e32 v2, v4, v2
 ; GFX6-NEXT:    v_min_i32_e32 v2, v2, v3
 ; GFX6-NEXT:    v_add_i32_e32 v1, vcc, v1, v2
@@ -2951,20 +2953,22 @@ define amdgpu_ps float @saddsat_v2i16_vs(<2 x i16> %lhs, <2 x i16> inreg %rhs) {
 ; GFX6-LABEL: saddsat_v2i16_vs:
 ; GFX6:       ; %bb.0:
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GFX6-NEXT:    v_min_i32_e32 v3, 0, v0
+; GFX6-NEXT:    v_min_i32_e32 v4, 0, v0
+; GFX6-NEXT:    v_bfrev_b32_e32 v5, 1
 ; GFX6-NEXT:    s_lshl_b32 s0, s0, 16
 ; GFX6-NEXT:    v_max_i32_e32 v2, 0, v0
-; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 0x80000000, v3
+; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, v5, v4
 ; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, 0x7fffffff, v2
-; GFX6-NEXT:    v_max_i32_e32 v3, s0, v3
+; GFX6-NEXT:    v_max_i32_e32 v4, s0, v4
+; GFX6-NEXT:    v_min_i32_e32 v2, v4, v2
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; GFX6-NEXT:    v_min_i32_e32 v2, v3, v2
-; GFX6-NEXT:    v_min_i32_e32 v3, 0, v1
+; GFX6-NEXT:    v_bfrev_b32_e32 v3, -2
 ; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
-; GFX6-NEXT:    s_lshl_b32 s0, s1, 16
 ; GFX6-NEXT:    v_max_i32_e32 v2, 0, v1
+; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, v3, v2
+; GFX6-NEXT:    v_min_i32_e32 v3, 0, v1
+; GFX6-NEXT:    s_lshl_b32 s0, s1, 16
 ; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 0x80000000, v3
-; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, 0x7fffffff, v2
 ; GFX6-NEXT:    v_max_i32_e32 v3, s0, v3
 ; GFX6-NEXT:    v_min_i32_e32 v2, v3, v2
 ; GFX6-NEXT:    v_add_i32_e32 v1, vcc, v1, v2
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
index 8d8eca162257a..19dc20c510041 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
@@ -1067,24 +1067,24 @@ define i64 @v_srem_i64_pow2k_denom(i64 %num) {
 ; CHECK-NEXT:    v_add_i32_e32 v2, vcc, v3, v2
 ; CHECK-NEXT:    v_add_i32_e32 v2, vcc, v7, v2
 ; CHECK-NEXT:    v_mad_u64_u32 v[1:2], s[4:5], v5, v2, v[1:2]
-; CHECK-NEXT:    v_sub_i32_e64 v0, s[4:5], v4, v0
-; CHECK-NEXT:    v_subb_u32_e64 v2, vcc, v9, v1, s[4:5]
-; CHECK-NEXT:    v_sub_i32_e32 v1, vcc, v9, v1
-; CHECK-NEXT:    v_cmp_ge_u32_e32 vcc, v0, v5
-; CHECK-NEXT:    v_cndmask_b32_e64 v3, 0, -1, vcc
-; CHECK-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v2
-; CHECK-NEXT:    v_cndmask_b32_e32 v3, -1, v3, vcc
+; CHECK-NEXT:    v_sub_i32_e32 v0, vcc, v4, v0
+; CHECK-NEXT:    v_subb_u32_e64 v2, s[4:5], v9, v1, vcc
+; CHECK-NEXT:    v_sub_i32_e64 v1, s[4:5], v9, v1
+; CHECK-NEXT:    v_subbrev_u32_e32 v1, vcc, 0, v1, vcc
 ; CHECK-NEXT:    v_sub_i32_e32 v4, vcc, v0, v5
-; CHECK-NEXT:    v_subbrev_u32_e64 v1, s[4:5], 0, v1, s[4:5]
 ; CHECK-NEXT:    v_subbrev_u32_e32 v1, vcc, 0, v1, vcc
 ; CHECK-NEXT:    v_cmp_ge_u32_e32 vcc, v4, v5
-; CHECK-NEXT:    v_cndmask_b32_e64 v5, 0, -1, vcc
+; CHECK-NEXT:    v_cndmask_b32_e64 v7, 0, -1, vcc
 ; CHECK-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v1
-; CHECK-NEXT:    v_cndmask_b32_e32 v5, -1, v5, vcc
-; CHECK-NEXT:    v_subrev_i32_e32 v7, vcc, 0x1000, v4
+; CHECK-NEXT:    v_cmp_ge_u32_e64 s[4:5], v0, v5
+; CHECK-NEXT:    v_cndmask_b32_e32 v7, -1, v7, vcc
+; CHECK-NEXT:    v_sub_i32_e32 v5, vcc, v4, v5
+; CHECK-NEXT:    v_cndmask_b32_e64 v3, 0, -1, s[4:5]
+; CHECK-NEXT:    v_cmp_eq_u32_e64 s[4:5], 0, v2
 ; CHECK-NEXT:    v_subbrev_u32_e32 v8, vcc, 0, v1, vcc
-; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
-; CHECK-NEXT:    v_cndmask_b32_e32 v4, v4, v7, vcc
+; CHECK-NEXT:    v_cndmask_b32_e64 v3, -1, v3, s[4:5]
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v7
+; CHECK-NEXT:    v_cndmask_b32_e32 v4, v4, v5, vcc
 ; CHECK-NEXT:    v_cndmask_b32_e32 v1, v1, v8, vcc
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v3
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc
@@ -1660,24 +1660,24 @@ define i64 @v_srem_i64_oddk_denom(i64 %num) {
 ; CHECK-NEXT:    v_add_i32_e32 v2, vcc, v3, v2
 ; CHECK-NEXT:    v_add_i32_e32 v2, vcc, v7, v2
 ; CHECK-NEXT:    v_mad_u64_u32 v[1:2], s[4:5], v5, v2, v[1:2]
-; CHECK-NEXT:    v_sub_i32_e64 v0, s[4:5], v4, v0
-; CHECK-NEXT:    v_subb_u32_e64 v2, vcc, v9, v1, s[4:5]
-; CHECK-NEXT:    v_sub_i32_e32 v1, vcc, v9, v1
-; CHECK-NEXT:    v_cmp_ge_u32_e32 vcc, v0, v5
-; CHECK-NEXT:    v_cndmask_b32_e64 v3, 0, -1, vcc
-; CHECK-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v2
-; CHECK-NEXT:    v_cndmask_b32_e32 v3, -1, v3, vcc
+; CHECK-NEXT:    v_sub_i32_e32 v0, vcc, v4, v0
+; CHECK-NEXT:    v_subb_u32_e64 v2, s[4:5], v9, v1, vcc
+; CHECK-NEXT:    v_sub_i32_e64 v1, s[4:5], v9, v1
+; CHECK-NEXT:    v_subbrev_u32_e32 v1, vcc, 0, v1, vcc
 ; CHECK-NEXT:    v_sub_i32_e32 v4, vcc, v0, v5
-; CHECK-NEXT:    v_subbrev_u32_e64 v1, s[4:5], 0, v1, s[4:5]
 ; CHECK-NEXT:    v_subbrev_u32_e32 v1, vcc, 0, v1, vcc
 ; CHECK-NEXT:    v_cmp_ge_u32_e32 vcc, v4, v5
-; CHECK-NEXT:    v_cndmask_b32_e64 v5, 0, -1, vcc
+; CHECK-NEXT:    v_cndmask_b32_e64 v7, 0, -1, vcc
 ; CHECK-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v1
-; CHECK-NEXT:    v_cndmask_b32_e32 v5, -1, v5, vcc
-; CHECK-NEXT:    v_subrev_i32_e32 v7, vcc, 0x12d8fb, v4
+; CHECK-NEXT:    v_cmp_ge_u32_e64 s[4:5], v0, v5
+; CHECK-NEXT:    v_cndmask_b32_e32 v7, -1, v7, vcc
+; CHECK-NEXT:    v_sub_i32_e32 v5, vcc, v4, v5
+; CHECK-NEXT:    v_cndmask_b32_e64 v3, 0, -1, s[4:5]
+; CHECK-NEXT:    v_cmp_eq_u32_e64 s[4:5], 0, v2
 ; CHECK-NEXT:    v_subbrev_u32_e32 v8, vcc, 0, v1, vcc
-; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v5
-; CHECK-NEXT:    v_cndmask_b32_e32 v4, v4, v7, vcc
+; CHECK-NEXT:    v_cndmask_b32_e64 v3, -1, v3, s[4:5]
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v7
+; CHECK-NEXT:    v_cndmask_b32_e32 v4, v4, v5, vcc
 ; CHECK-NEXT:    v_cndmask_b32_e32 v1, v1, v8, vcc
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v3
 ; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll
index 2673ac4fb5bae..c1b225562b77b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll
@@ -233,16 +233,17 @@ define i16 @v_ssubsat_v2i8(i16 %lhs.arg, i16 %rhs.arg) {
 ; GFX6-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
 ; GFX6-NEXT:    v_add_i32_e32 v4, vcc, 0x80000001, v4
-; GFX6-NEXT:    v_min_i32_e32 v5, -1, v0
-; GFX6-NEXT:    v_bfrev_b32_e32 v6, 1
-; GFX6-NEXT:    v_add_i32_e32 v5, vcc, v5, v6
+; GFX6-NEXT:    v_min_i32_e32 v6, -1, v0
+; GFX6-NEXT:    v_bfrev_b32_e32 v7, 1
+; GFX6-NEXT:    v_add_i32_e32 v6, vcc, v6, v7
 ; GFX6-NEXT:    v_max_i32_e32 v1, v4, v1
-; GFX6-NEXT:    v_min_i32_e32 v1, v1, v5
+; GFX6-NEXT:    v_min_i32_e32 v1, v1, v6
 ; GFX6-NEXT:    v_sub_i32_e32 v0, vcc, v0, v1
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v1, 24, v2
+; GFX6-NEXT:    v_mov_b32_e32 v5, 0x80000001
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v2, 24, v3
 ; GFX6-NEXT:    v_max_i32_e32 v3, -1, v1
-; GFX6-NEXT:    v_add_i32_e32 v3, vcc, 0x80000001, v3
+; GFX6-NEXT:    v_add_i32_e32 v3, vcc, v3, v5
 ; GFX6-NEXT:    v_min_i32_e32 v4, -1, v1
 ; GFX6-NEXT:    v_add_i32_e32 v4, vcc, 0x80000000, v4
 ; GFX6-NEXT:    v_max_i32_e32 v2, v3, v2
@@ -1260,7 +1261,8 @@ define <2 x i32> @v_ssubsat_v2i32(<2 x i32> %lhs, <2 x i32> %rhs) {
 ; GFX6-NEXT:    v_max_i32_e32 v4, -1, v0
 ; GFX6-NEXT:    v_add_i32_e32 v4, vcc, 0x80000001, v4
 ; GFX6-NEXT:    v_min_i32_e32 v5, -1, v0
-; GFX6-NEXT:    v_add_i32_e32 v5, vcc, 0x80000000, v5
+; GFX6-NEXT:    v_bfrev_b32_e32 v6, 1
+; GFX6-NEXT:    v_add_i32_e32 v5, vcc, v5, v6
 ; GFX6-NEXT:    v_max_i32_e32 v2, v4, v2
 ; GFX6-NEXT:    v_min_i32_e32 v2, v2, v5
 ; GFX6-NEXT:    v_sub_i32_e32 v0, vcc, v0, v2
@@ -1279,7 +1281,8 @@ define <2 x i32> @v_ssubsat_v2i32(<2 x i32> %lhs, <2 x i32> %rhs) {
 ; GFX8-NEXT:    v_max_i32_e32 v4, -1, v0
 ; GFX8-NEXT:    v_add_u32_e32 v4, vcc, 0x80000001, v4
 ; GFX8-NEXT:    v_min_i32_e32 v5, -1, v0
-; GFX8-NEXT:    v_add_u32_e32 v5, vcc, 0x80000000, v5
+; GFX8-NEXT:    v_bfrev_b32_e32 v6, 1
+; GFX8-NEXT:    v_add_u32_e32 v5, vcc, v5, v6
 ; GFX8-NEXT:    v_max_i32_e32 v2, v4, v2
 ; GFX8-NEXT:    v_min_i32_e32 v2, v2, v5
 ; GFX8-NEXT:    v_sub_u32_e32 v0, vcc, v0, v2
diff --git a/llvm/test/CodeGen/AMDGPU/addrspacecast-gas.ll b/llvm/test/CodeGen/AMDGPU/addrspacecast-gas.ll
index 4b6375cc60800..153898560fc31 100644
--- a/llvm/test/CodeGen/AMDGPU/addrspacecast-gas.ll
+++ b/llvm/test/CodeGen/AMDGPU/addrspacecast-gas.ll
@@ -74,12 +74,13 @@ define amdgpu_kernel void @use_private_to_flat_addrspacecast_nonnull(ptr addrspa
 ; GFX1250-GISEL-NEXT:    v_mbcnt_lo_u32_b32 v2, -1, 0
 ; GFX1250-GISEL-NEXT:    v_mov_b64_e32 v[0:1], s[0:1]
 ; GFX1250-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX1250-GISEL-NEXT:    v_dual_mov_b32 v3, 0 :: v_dual_lshlrev_b32 v2, 20, v2
+; GFX1250-GISEL-NEXT:    v_lshlrev_b32_e32 v2, 20, v2
 ; GFX1250-GISEL-NEXT:    s_wait_kmcnt 0x0
 ; GFX1250-GISEL-NEXT:    v_add_co_u32 v0, vcc_lo, s2, v0
 ; GFX1250-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX1250-GISEL-NEXT:    v_add_co_ci_u32_e64 v1, null, v2, v1, vcc_lo
-; GFX1250-GISEL-NEXT:    flat_store_b32 v[0:1], v3 scope:SCOPE_SYS
+; GFX1250-GISEL-NEXT:    v_mov_b32_e32 v2, 0
+; GFX1250-GISEL-NEXT:    flat_store_b32 v[0:1], v2 scope:SCOPE_SYS
 ; GFX1250-GISEL-NEXT:    s_wait_storecnt 0x0
 ; GFX1250-GISEL-NEXT:    s_endpgm
   %stof = call ptr @llvm.amdgcn.addrspacecast.nonnull.p0.p5(ptr addrspace(5) %ptr)
diff --git a/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll b/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
index 3160e38df5e3f..4fd28be3b8425 100644
--- a/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
+++ b/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
@@ -513,51 +513,51 @@ define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i32 %arg1, i32 %arg
 ; GFX908-LABEL: introduced_copy_to_sgpr:
 ; GFX908:       ; %bb.0: ; %bb
 ; GFX908-NEXT:    global_load_ushort v16, v[0:1], off glc
-; GFX908-NEXT:    s_load_dwordx4 s[4:7], s[8:9], 0x0
-; GFX908-NEXT:    s_load_dwordx2 s[10:11], s[8:9], 0x10
-; GFX908-NEXT:    s_load_dword s0, s[8:9], 0x18
-; GFX908-NEXT:    s_mov_b32 s12, 0
-; GFX908-NEXT:    s_mov_b32 s9, s12
+; GFX908-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
+; GFX908-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x10
+; GFX908-NEXT:    s_load_dword s7, s[8:9], 0x18
+; GFX908-NEXT:    s_mov_b32 s6, 0
+; GFX908-NEXT:    s_mov_b32 s9, s6
 ; GFX908-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX908-NEXT:    v_cvt_f32_u32_e32 v0, s7
-; GFX908-NEXT:    s_sub_i32 s1, 0, s7
-; GFX908-NEXT:    v_cvt_f32_f16_e32 v17, s0
-; GFX908-NEXT:    v_mov_b32_e32 v19, 0
+; GFX908-NEXT:    v_cvt_f32_u32_e32 v0, s3
+; GFX908-NEXT:    s_sub_i32 s8, 0, s3
+; GFX908-NEXT:    v_cvt_f32_f16_e32 v18, s7
+; GFX908-NEXT:    v_mov_b32_e32 v17, 0
 ; GFX908-NEXT:    v_rcp_iflag_f32_e32 v2, v0
 ; GFX908-NEXT:    v_mov_b32_e32 v0, 0
 ; GFX908-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX908-NEXT:    v_mul_f32_e32 v2, 0x4f7ffffe, v2
 ; GFX908-NEXT:    v_cvt_u32_f32_e32 v2, v2
-; GFX908-NEXT:    v_readfirstlane_b32 s2, v2
-; GFX908-NEXT:    s_mul_i32 s1, s1, s2
-; GFX908-NEXT:    s_mul_hi_u32 s1, s2, s1
-; GFX908-NEXT:    s_add_i32 s2, s2, s1
-; GFX908-NEXT:    s_mul_hi_u32 s1, s6, s2
-; GFX908-NEXT:    s_mul_i32 s2, s1, s7
-; GFX908-NEXT:    s_sub_i32 s2, s6, s2
-; GFX908-NEXT:    s_add_i32 s3, s1, 1
-; GFX908-NEXT:    s_sub_i32 s6, s2, s7
-; GFX908-NEXT:    s_cmp_ge_u32 s2, s7
-; GFX908-NEXT:    s_cselect_b32 s1, s3, s1
-; GFX908-NEXT:    s_cselect_b32 s2, s6, s2
-; GFX908-NEXT:    s_add_i32 s3, s1, 1
-; GFX908-NEXT:    s_cmp_ge_u32 s2, s7
-; GFX908-NEXT:    s_cselect_b32 s8, s3, s1
-; GFX908-NEXT:    s_lshr_b32 s2, s0, 16
-; GFX9...
[truncated]

@arsenm arsenm force-pushed the users/arsenm/amdgpu/fold-immediate-allow-multiple-uses branch from 5b5f004 to 3e8998a Compare August 22, 2025 11:31
@arsenm arsenm force-pushed the users/arsenm/amdgpu/simplify-foldImmediate branch from a5d4616 to a32d42e Compare August 22, 2025 11:31
@arsenm arsenm force-pushed the users/arsenm/amdgpu/fold-immediate-allow-multiple-uses branch from 3e8998a to e8e8163 Compare August 22, 2025 11:34
@arsenm arsenm force-pushed the users/arsenm/amdgpu/simplify-foldImmediate branch from a32d42e to dfd9b00 Compare August 22, 2025 15:03
@arsenm arsenm force-pushed the users/arsenm/amdgpu/fold-immediate-allow-multiple-uses branch from e8e8163 to 8443487 Compare August 22, 2025 15:03
@arsenm arsenm force-pushed the users/arsenm/amdgpu/simplify-foldImmediate branch from dfd9b00 to c9ad541 Compare August 23, 2025 01:10
@arsenm arsenm force-pushed the users/arsenm/amdgpu/fold-immediate-allow-multiple-uses branch 2 times, most recently from b00ce3c to 0a92fda Compare August 23, 2025 01:48
@arsenm arsenm force-pushed the users/arsenm/amdgpu/simplify-foldImmediate branch from c9ad541 to 0a06817 Compare August 23, 2025 01:48
Base automatically changed from users/arsenm/amdgpu/simplify-foldImmediate to main August 23, 2025 02:13
@arsenm arsenm force-pushed the users/arsenm/amdgpu/fold-immediate-allow-multiple-uses branch from 0a92fda to 833b7fa Compare August 27, 2025 15:23