
Conversation

@tonykuttai tonykuttai commented Aug 19, 2025

This change implements a DAG combine, backed by a TableGen selection pattern, that folds consecutive VSRO (Vector Shift Right Octet) and VSR (Vector Shift Right) instructions into a single VSRQ (Vector Shift Right Quadword) instruction on Power10 and later processors.

Vector right shift operations like vec_srl(vec_sro(input, byte_shift), bit_shift) currently generate two separate instructions (VSRO + VSR) when they could be optimized into a single VSRQ instruction that performs the equivalent operation.

vsr(vsro(input, vsro_byte_shift), vsr_bit_shift)  ->  vsrq(input, vsrq_bit_shift)
where vsrq_bit_shift = (vsro_byte_shift * 8) + vsr_bit_shift
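The identity behind this fold can be checked with a small model (a sketch only: 128-bit vectors are modeled as Python integers, and the function names are illustrative, not the intrinsic API):

```python
MASK128 = (1 << 128) - 1

def vsro(x, byte_shift):
    """Model of vsro: shift the 128-bit value right by whole bytes (0-15)."""
    return (x >> (byte_shift * 8)) & MASK128

def vsr(x, bit_shift):
    """Model of vsr: shift the 128-bit value right by 0-7 bits."""
    return (x >> bit_shift) & MASK128

def vsrq(x, shift):
    """Model of vsrq: shift the 128-bit value right by 0-127 bits."""
    return (x >> (shift & 0x7F)) & MASK128

# For every 7-bit shift amount n, splitting it as n = 8*byte + bit gives
# the same result as a single quadword shift.
x = 0x0123456789ABCDEF0123456789ABCDEF
for n in range(128):
    byte_shift, bit_shift = n >> 3, n & 7
    assert vsr(vsro(x, byte_shift), bit_shift) == vsrq(x, n)
```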

Note:

 vsro : Vector Shift Right by Octet VX-form
- vsro VRT, VRA, VRB
- The contents of VSR[VRA+32] are shifted right by the number of bytes specified in bits 121:124 of VSR[VRB+32].
	- Bytes shifted out of byte 15 are lost. 
	- Zeros are supplied to the vacated bytes on the left.
- The result is placed into VSR[VRT+32].

vsr : Vector Shift Right VX-form
- vsr VRT, VRA, VRB
- The contents of VSR[VRA+32] are shifted right by the number of bits specified in bits 125:127 of VSR[VRB+32] (3 bits).
	- Bits shifted out of bit 127 are lost.
	- Zeros are supplied to the vacated bits on the left.
- The result is placed into VSR[VRT+32]; however, if for any byte element of VSR[VRB+32] the low-order 3 bits are not equal to the shift amount, the contents of VSR[VRT+32] are undefined.

vsrq : Vector Shift Right Quadword VX-form
- vsrq VRT,VRA,VRB 
- Let src1 be the contents of VSR[VRA+32]. Let src2 be the contents of VSR[VRB+32]. 
- src1 is shifted right by the number of bits specified in the low-order 7 bits of src2.
	- Bits shifted out of the least-significant bit are lost. 
	- Zeros are supplied to the vacated bits on the left. 
	- The result is placed into VSR[VRT+32].

@tonykuttai tonykuttai requested review from RolandF77 and lei137 August 19, 2025 17:09
@tonykuttai tonykuttai marked this pull request as ready for review August 19, 2025 17:10
@tonykuttai tonykuttai requested review from amy-kwan and AditiRM August 19, 2025 17:10
llvmbot commented Aug 19, 2025

@llvm/pr-subscribers-backend-powerpc

Author: Tony Varghese (tonykuttai)

Changes

This change implements a DAG combiner that folds consecutive VSRO (Vector Shift Right Octet) and VSR (Vector Shift Right) instructions into a single VSRQ (Vector Shift Right Quadword) instruction on Power10+ processors. Vector right shift operations like vec_srl(vec_sro(v, s), s) generate two separate instructions (VSRO + VSR) when they could be optimized into a single VSRQ instruction that performs the equivalent operation.


Patch is 20.16 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/154388.diff

5 Files Affected:

  • (modified) llvm/lib/Target/PowerPC/PPCISelLowering.cpp (+66)
  • (modified) llvm/lib/Target/PowerPC/PPCISelLowering.h (+4)
  • (modified) llvm/lib/Target/PowerPC/PPCInstrInfo.td (+6)
  • (modified) llvm/lib/Target/PowerPC/PPCInstrP10.td (+2-1)
  • (added) llvm/test/CodeGen/PowerPC/vsro-vsr-vsrq-dag-combine.ll (+337)
diff --git a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
index 652edd4e04c60..c7c17c2d0f85a 100644
--- a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
+++ b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
@@ -1697,6 +1697,7 @@ const char *PPCTargetLowering::getTargetNodeName(unsigned Opcode) const {
   case PPCISD::XXPERM:
     return "PPCISD::XXPERM";
   case PPCISD::VECSHL:          return "PPCISD::VECSHL";
+  case PPCISD::VSRQ:            return "PPCISD::VSRQ";
   case PPCISD::CMPB:            return "PPCISD::CMPB";
   case PPCISD::Hi:              return "PPCISD::Hi";
   case PPCISD::Lo:              return "PPCISD::Lo";
@@ -16680,6 +16681,67 @@ SDValue PPCTargetLowering::combineVReverseMemOP(ShuffleVectorSDNode *SVN,
   llvm_unreachable("Expected a load or store node here");
 }
 
+// Combine VSR(VSRO input, shift), shift) to VSRQ(input, shift)
+//
+// PowerPC Vector Shift Instructions:
+// - vsro (Vector Shift Right by Octet): Shifts vector right by N bytes,
+//   where N is specified in bits 121:124 of the shift vector (4 bits, 0-15
+//   bytes)
+// - vsr (Vector Shift Right): Shifts vector right by N bits,
+//   where N is specified in bits 125:127 of the shift vector (3 bits, 0-7 bits)
+// - vsrq (Vector Shift Right Quadword): Shifts vector right by N bits,
+//   where N is specified in bits 57:63 of the shift vector (7 bits, 0-127 bits)
+//
+// Input DAG pattern: vsr(vsro(input, shift_vector), shift_vector)
+// performs the following shifts:
+//   1. vsro: input >> (bits[121:124] * 8) bits   [byte shifts converted to
+//   bits]
+//   2. vsr:  result >> bits[125:127] bits        [additional bit shifts]
+//   Total shift = (bits[121:124] * 8) + bits[125:127] bits
+//
+// Since bits 121:127 form a 7-bit value representing the total shift amount,
+// and vsrq uses the same 7-bit shift amount (assuming bits 57:63 map to
+// 121:127), we can replace the two-instruction sequence with a single vsrq
+// instruction.
+//
+// Optimization: vsr(vsro(input, shift), shift) -> vsrq(input, shift)
+SDValue PPCTargetLowering::combineVSROVSRToVSRQ(SDNode *N,
+                                                DAGCombinerInfo &DCI) const {
+
+  // Only available on ISA 3.1+ (Power10+)
+  if (!Subtarget.isISA3_1())
+    return SDValue();
+
+  SelectionDAG &DAG = DCI.DAG;
+  SDValue VSRInput = N->getOperand(1);
+  SDValue VSRShift = N->getOperand(2);
+
+  // Check if VSR input comes from a VSRO intrinsic
+  if (VSRInput.getOpcode() != ISD::INTRINSIC_WO_CHAIN)
+    return SDValue();
+
+  unsigned VSROIntrinsicID = VSRInput->getConstantOperandVal(0);
+  if (VSROIntrinsicID != Intrinsic::ppc_altivec_vsro)
+    return SDValue();
+
+  // Check if VSRO uses the same shift amount register as VSR
+  SDValue VSROShift = VSRInput.getOperand(2);
+  if (VSRShift != VSROShift)
+    return SDValue();
+
+  // Check single use - VSRO result should only be used by this VSR
+  if (!VSRInput.hasOneUse())
+    return SDValue();
+
+  // Get the original input to VSRO instruction
+  SDValue VSROOrigInput = VSRInput.getOperand(1);
+
+  return DAG.getNode(PPCISD::VSRQ, SDLoc(N),
+                     N->getValueType(0), // Preserve original result type
+                     VSROOrigInput,      // Original input vector
+                     VSRShift);          // Shift amount
+}
+
 static bool isStoreConditional(SDValue Intrin, unsigned &StoreWidth) {
   unsigned IntrinsicID = Intrin.getConstantOperandVal(1);
   if (IntrinsicID == Intrinsic::ppc_stdcx)
@@ -17207,6 +17269,10 @@ SDValue PPCTargetLowering::PerformDAGCombine(SDNode *N,
           }
         }
       }
+
+      // combine VSRO + VSR intrinsic calls to optimize with VSRQ
+      if (IID == Intrinsic::ppc_altivec_vsr)
+        return combineVSROVSRToVSRQ(N, DCI);
     }
 
     break;
diff --git a/llvm/lib/Target/PowerPC/PPCISelLowering.h b/llvm/lib/Target/PowerPC/PPCISelLowering.h
index 5e0d6bf184f20..362ccdfb26efa 100644
--- a/llvm/lib/Target/PowerPC/PPCISelLowering.h
+++ b/llvm/lib/Target/PowerPC/PPCISelLowering.h
@@ -498,6 +498,9 @@ namespace llvm {
     /// SETBCR - The ISA 3.1 (P10) SETBCR instruction.
     SETBCR,
 
+    /// VSRQ - The ISA 3.1 (P10) Vector Shift right quadword instruction
+    VSRQ,
+
     // NOTE: The nodes below may require PC-Rel specific patterns if the
     // address could be PC-Relative. When adding new nodes below, consider
     // whether or not the address can be PC-Relative and add the corresponding
@@ -1447,6 +1450,7 @@ namespace llvm {
                                  SelectionDAG &DAG) const;
     SDValue combineVReverseMemOP(ShuffleVectorSDNode *SVN, LSBaseSDNode *LSBase,
                                  DAGCombinerInfo &DCI) const;
+    SDValue combineVSROVSRToVSRQ(SDNode *N, DAGCombinerInfo &DCI) const;
 
     /// ConvertSETCCToSubtract - looks at SETCC that compares ints. It replaces
     /// SETCC with integer subtraction when (1) there is a legal way of doing it
diff --git a/llvm/lib/Target/PowerPC/PPCInstrInfo.td b/llvm/lib/Target/PowerPC/PPCInstrInfo.td
index c2f91ce8e6b96..1a57d622a5f6c 100644
--- a/llvm/lib/Target/PowerPC/PPCInstrInfo.td
+++ b/llvm/lib/Target/PowerPC/PPCInstrInfo.td
@@ -58,6 +58,10 @@ def SDT_PPCVecShift : SDTypeProfile<1, 3, [ SDTCisVec<0>,
   SDTCisVec<1>, SDTCisVec<2>, SDTCisPtrTy<3>
 ]>;
 
+def SDT_PPCVecShiftQuad : SDTypeProfile<1, 2, [
+  SDTCisVec<0>, SDTCisSameAs<0,1>, SDTCisSameAs<0,2>
+]>;
+
 def SDT_PPCVecInsert : SDTypeProfile<1, 3, [ SDTCisVec<0>,
   SDTCisVec<1>, SDTCisVec<2>, SDTCisInt<3>
 ]>;
@@ -157,6 +161,8 @@ def PPCfctiwz : SDNode<"PPCISD::FCTIWZ", SDTFPUnaryOp, []>;
 def PPCfctiduz: SDNode<"PPCISD::FCTIDUZ",SDTFPUnaryOp, []>;
 def PPCfctiwuz: SDNode<"PPCISD::FCTIWUZ",SDTFPUnaryOp, []>;
 
+def PPCvsrq: SDNode<"PPCISD::VSRQ", SDT_PPCVecShiftQuad, []>;
+
 def PPCstrict_fcfid : SDNode<"PPCISD::STRICT_FCFID",
                              SDTFPUnaryOp, [SDNPHasChain]>;
 def PPCstrict_fcfidu : SDNode<"PPCISD::STRICT_FCFIDU",
diff --git a/llvm/lib/Target/PowerPC/PPCInstrP10.td b/llvm/lib/Target/PowerPC/PPCInstrP10.td
index 98dd8464c0ac8..902c40544ac28 100644
--- a/llvm/lib/Target/PowerPC/PPCInstrP10.td
+++ b/llvm/lib/Target/PowerPC/PPCInstrP10.td
@@ -1919,7 +1919,8 @@ let Predicates = [IsISA3_1] in {
                         RegConstraint<"$VDi = $VD">, NoEncode<"$VDi">;
   def VSLQ : VX1_VT5_VA5_VB5<261, "vslq", []>;
   def VSRAQ : VX1_VT5_VA5_VB5<773, "vsraq", []>;
-  def VSRQ : VX1_VT5_VA5_VB5<517, "vsrq", []>;
+  def VSRQ : VX1_VT5_VA5_VB5<517, "vsrq", 
+                            [(set v4i32:$VD, (PPCvsrq v4i32:$VA, v4i32:$VB))]>;
   def VRLQ : VX1_VT5_VA5_VB5<5, "vrlq", []>;
   def XSCVQPUQZ : X_VT5_XO5_VB5<63, 0, 836, "xscvqpuqz", []>;
   def XSCVQPSQZ : X_VT5_XO5_VB5<63, 8, 836, "xscvqpsqz", []>;
diff --git a/llvm/test/CodeGen/PowerPC/vsro-vsr-vsrq-dag-combine.ll b/llvm/test/CodeGen/PowerPC/vsro-vsr-vsrq-dag-combine.ll
new file mode 100644
index 0000000000000..c2599c8f6af13
--- /dev/null
+++ b/llvm/test/CodeGen/PowerPC/vsro-vsr-vsrq-dag-combine.ll
@@ -0,0 +1,337 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+
+; RUN: llc -verify-machineinstrs -mcpu=pwr10 -mtriple=powerpc64le-unknown-linux-gnu \
+; RUN:   -ppc-asm-full-reg-names --ppc-vsr-nums-as-vr < %s | FileCheck %s --check-prefix=POWER10-LE
+
+; RUN: llc -verify-machineinstrs -mcpu=pwr10 -mtriple=powerpc64-ibm-aix-xcoff \
+; RUN:   -ppc-asm-full-reg-names --ppc-vsr-nums-as-vr < %s | FileCheck %s --check-prefix=POWER10-BE
+
+; RUN: llc -verify-machineinstrs -mcpu=pwr10 -mtriple=powerpc-ibm-aix-xcoff \
+; RUN:   -ppc-asm-full-reg-names --ppc-vsr-nums-as-vr < %s | FileCheck %s --check-prefix=POWER3210-BE
+
+; RUN: llc -verify-machineinstrs -mcpu=pwr9 -mtriple=powerpc64le-unknown-linux-gnu \
+; RUN:   -ppc-asm-full-reg-names --ppc-vsr-nums-as-vr < %s | FileCheck %s --check-prefix=POWER9-LE
+
+; RUN: llc -verify-machineinstrs -mcpu=pwr9 -mtriple=powerpc64-ibm-aix-xcoff \
+; RUN:   -ppc-asm-full-reg-names --ppc-vsr-nums-as-vr < %s | FileCheck %s --check-prefix=POWER9-BE
+
+; Test VSRO + VSR peephole optimization to VSRQ on Power10+
+; This should combine consecutive VSRO (Vector Shift Right Octet) and VSR (Vector Shift Right)
+; instructions using the same shift amount into a single VSRQ (Vector Shift Right Quadword)
+; instruction when targeting Power10 or later processors.
+declare <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32>, <4 x i32>)
+declare <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32>, <4 x i32>)
+
+define <16 x i8> @shiftright128_v16i8(<16 x i8> %in, i8 zeroext %sh) {
+; POWER10-LE-LABEL: shiftright128_v16i8:
+; POWER10-LE:       # %bb.0: # %entry
+; POWER10-LE-NEXT:    mtvsrd v3, r5
+; POWER10-LE-NEXT:    vspltb v3, v3, 7
+; POWER10-LE-NEXT:    vsrq v2, v2, v3
+; POWER10-LE-NEXT:    blr
+;
+; POWER10-BE-LABEL: shiftright128_v16i8:
+; POWER10-BE:       # %bb.0: # %entry
+; POWER10-BE-NEXT:    mtvsrwz v3, r3
+; POWER10-BE-NEXT:    vspltb v3, v3, 7
+; POWER10-BE-NEXT:    vsrq v2, v2, v3
+; POWER10-BE-NEXT:    blr
+;
+; POWER3210-BE-LABEL: shiftright128_v16i8:
+; POWER3210-BE:       # %bb.0: # %entry
+; POWER3210-BE-NEXT:    mtvsrwz v3, r3
+; POWER3210-BE-NEXT:    vspltb v3, v3, 7
+; POWER3210-BE-NEXT:    vsrq v2, v2, v3
+; POWER3210-BE-NEXT:    blr
+;
+; POWER9-LE-LABEL: shiftright128_v16i8:
+; POWER9-LE:       # %bb.0: # %entry
+; POWER9-LE-NEXT:    mtvsrd v3, r5
+; POWER9-LE-NEXT:    vspltb v3, v3, 7
+; POWER9-LE-NEXT:    vsro v2, v2, v3
+; POWER9-LE-NEXT:    vsr v2, v2, v3
+; POWER9-LE-NEXT:    blr
+;
+; POWER9-BE-LABEL: shiftright128_v16i8:
+; POWER9-BE:       # %bb.0: # %entry
+; POWER9-BE-NEXT:    mtvsrwz v3, r3
+; POWER9-BE-NEXT:    vspltb v3, v3, 7
+; POWER9-BE-NEXT:    vsro v2, v2, v3
+; POWER9-BE-NEXT:    vsr v2, v2, v3
+; POWER9-BE-NEXT:    blr
+entry:
+  %splat.splatinsert.i = insertelement <16 x i8> poison, i8 %sh, i64 0
+  %splat.splat.i = shufflevector <16 x i8> %splat.splatinsert.i, <16 x i8> poison, <16 x i32> zeroinitializer
+  %0 = bitcast <16 x i8> %in to <4 x i32>
+  %1 = bitcast <16 x i8> %splat.splat.i to <4 x i32>
+  %2 = tail call <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32> %0, <4 x i32> %1)
+  %3 = tail call <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32> %2, <4 x i32> %1)
+  %4 = bitcast <4 x i32> %3 to <16 x i8>
+  ret <16 x i8> %4
+}
+
+define <4 x i32> @shiftright128_v4i32(<4 x i32> %in, i8 zeroext %sh) {
+; POWER10-LE-LABEL: shiftright128_v4i32:
+; POWER10-LE:       # %bb.0: # %entry
+; POWER10-LE-NEXT:    mtvsrd v3, r5
+; POWER10-LE-NEXT:    vspltb v3, v3, 7
+; POWER10-LE-NEXT:    vsrq v2, v2, v3
+; POWER10-LE-NEXT:    blr
+;
+; POWER10-BE-LABEL: shiftright128_v4i32:
+; POWER10-BE:       # %bb.0: # %entry
+; POWER10-BE-NEXT:    mtvsrwz v3, r3
+; POWER10-BE-NEXT:    vspltb v3, v3, 7
+; POWER10-BE-NEXT:    vsrq v2, v2, v3
+; POWER10-BE-NEXT:    blr
+;
+; POWER3210-BE-LABEL: shiftright128_v4i32:
+; POWER3210-BE:       # %bb.0: # %entry
+; POWER3210-BE-NEXT:    mtvsrwz v3, r3
+; POWER3210-BE-NEXT:    vspltb v3, v3, 7
+; POWER3210-BE-NEXT:    vsrq v2, v2, v3
+; POWER3210-BE-NEXT:    blr
+;
+; POWER9-LE-LABEL: shiftright128_v4i32:
+; POWER9-LE:       # %bb.0: # %entry
+; POWER9-LE-NEXT:    mtvsrd v3, r5
+; POWER9-LE-NEXT:    vspltb v3, v3, 7
+; POWER9-LE-NEXT:    vsro v2, v2, v3
+; POWER9-LE-NEXT:    vsr v2, v2, v3
+; POWER9-LE-NEXT:    blr
+;
+; POWER9-BE-LABEL: shiftright128_v4i32:
+; POWER9-BE:       # %bb.0: # %entry
+; POWER9-BE-NEXT:    mtvsrwz v3, r3
+; POWER9-BE-NEXT:    vspltb v3, v3, 7
+; POWER9-BE-NEXT:    vsro v2, v2, v3
+; POWER9-BE-NEXT:    vsr v2, v2, v3
+; POWER9-BE-NEXT:    blr
+entry:
+  %splat.splatinsert.i = insertelement <16 x i8> poison, i8 %sh, i64 0
+  %splat.splat.i = shufflevector <16 x i8> %splat.splatinsert.i, <16 x i8> poison, <16 x i32> zeroinitializer
+  %0 = bitcast <16 x i8> %splat.splat.i to <4 x i32>
+  %1 = tail call  <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32> %in, <4 x i32> %0)
+  %2 = tail call  <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32> %1, <4 x i32> %0)
+  ret <4 x i32> %2
+}
+
+define <2 x i64> @shiftright128_v2i64(<2 x i64> %in, i8 zeroext %sh) {
+; POWER10-LE-LABEL: shiftright128_v2i64:
+; POWER10-LE:       # %bb.0: # %entry
+; POWER10-LE-NEXT:    mtvsrd v3, r5
+; POWER10-LE-NEXT:    vspltb v3, v3, 7
+; POWER10-LE-NEXT:    vsrq v2, v2, v3
+; POWER10-LE-NEXT:    blr
+;
+; POWER10-BE-LABEL: shiftright128_v2i64:
+; POWER10-BE:       # %bb.0: # %entry
+; POWER10-BE-NEXT:    mtvsrwz v3, r3
+; POWER10-BE-NEXT:    vspltb v3, v3, 7
+; POWER10-BE-NEXT:    vsrq v2, v2, v3
+; POWER10-BE-NEXT:    blr
+;
+; POWER3210-BE-LABEL: shiftright128_v2i64:
+; POWER3210-BE:       # %bb.0: # %entry
+; POWER3210-BE-NEXT:    mtvsrwz v3, r3
+; POWER3210-BE-NEXT:    vspltb v3, v3, 7
+; POWER3210-BE-NEXT:    vsrq v2, v2, v3
+; POWER3210-BE-NEXT:    blr
+;
+; POWER9-LE-LABEL: shiftright128_v2i64:
+; POWER9-LE:       # %bb.0: # %entry
+; POWER9-LE-NEXT:    mtvsrd v3, r5
+; POWER9-LE-NEXT:    vspltb v3, v3, 7
+; POWER9-LE-NEXT:    vsro v2, v2, v3
+; POWER9-LE-NEXT:    vsr v2, v2, v3
+; POWER9-LE-NEXT:    blr
+;
+; POWER9-BE-LABEL: shiftright128_v2i64:
+; POWER9-BE:       # %bb.0: # %entry
+; POWER9-BE-NEXT:    mtvsrwz v3, r3
+; POWER9-BE-NEXT:    vspltb v3, v3, 7
+; POWER9-BE-NEXT:    vsro v2, v2, v3
+; POWER9-BE-NEXT:    vsr v2, v2, v3
+; POWER9-BE-NEXT:    blr
+entry:
+  %splat.splatinsert.i = insertelement <16 x i8> poison, i8 %sh, i64 0
+  %splat.splat.i = shufflevector <16 x i8> %splat.splatinsert.i, <16 x i8> poison, <16 x i32> zeroinitializer
+  %0 = bitcast <2 x i64> %in to <4 x i32>
+  %1 = bitcast <16 x i8> %splat.splat.i to <4 x i32>
+  %2 = tail call <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32> %0, <4 x i32> %1)
+  %3 = tail call <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32> %2, <4 x i32> %1)
+  %4 = bitcast <4 x i32> %3 to <2 x i64>
+  ret <2 x i64> %4
+}
+
+define <8 x i16> @shiftright128_v8i16(<8 x i16> %in, i8 zeroext %sh) {
+; POWER10-LE-LABEL: shiftright128_v8i16:
+; POWER10-LE:       # %bb.0: # %entry
+; POWER10-LE-NEXT:    mtvsrd v3, r5
+; POWER10-LE-NEXT:    vspltb v3, v3, 7
+; POWER10-LE-NEXT:    vsrq v2, v2, v3
+; POWER10-LE-NEXT:    blr
+;
+; POWER10-BE-LABEL: shiftright128_v8i16:
+; POWER10-BE:       # %bb.0: # %entry
+; POWER10-BE-NEXT:    mtvsrwz v3, r3
+; POWER10-BE-NEXT:    vspltb v3, v3, 7
+; POWER10-BE-NEXT:    vsrq v2, v2, v3
+; POWER10-BE-NEXT:    blr
+;
+; POWER3210-BE-LABEL: shiftright128_v8i16:
+; POWER3210-BE:       # %bb.0: # %entry
+; POWER3210-BE-NEXT:    mtvsrwz v3, r3
+; POWER3210-BE-NEXT:    vspltb v3, v3, 7
+; POWER3210-BE-NEXT:    vsrq v2, v2, v3
+; POWER3210-BE-NEXT:    blr
+;
+; POWER9-LE-LABEL: shiftright128_v8i16:
+; POWER9-LE:       # %bb.0: # %entry
+; POWER9-LE-NEXT:    mtvsrd v3, r5
+; POWER9-LE-NEXT:    vspltb v3, v3, 7
+; POWER9-LE-NEXT:    vsro v2, v2, v3
+; POWER9-LE-NEXT:    vsr v2, v2, v3
+; POWER9-LE-NEXT:    blr
+;
+; POWER9-BE-LABEL: shiftright128_v8i16:
+; POWER9-BE:       # %bb.0: # %entry
+; POWER9-BE-NEXT:    mtvsrwz v3, r3
+; POWER9-BE-NEXT:    vspltb v3, v3, 7
+; POWER9-BE-NEXT:    vsro v2, v2, v3
+; POWER9-BE-NEXT:    vsr v2, v2, v3
+; POWER9-BE-NEXT:    blr
+entry:
+  %splat.splatinsert.i = insertelement <16 x i8> poison, i8 %sh, i64 0
+  %splat.splat.i = shufflevector <16 x i8> %splat.splatinsert.i, <16 x i8> poison, <16 x i32> zeroinitializer
+  %0 = bitcast <8 x i16> %in to <4 x i32>
+  %1 = bitcast <16 x i8> %splat.splat.i to <4 x i32>
+  %2 = tail call <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32> %0, <4 x i32> %1)
+  %3 = tail call <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32> %2, <4 x i32> %1)
+  %4 = bitcast <4 x i32> %3 to <8 x i16>
+  ret <8 x i16> %4
+}
+
+; Test case with different vectors (should not optimize - different shift amounts)
+define <16 x i8> @no_optimization_different_shifts(<16 x i8> %in, i8 zeroext %sh1, i8 zeroext %sh2)  {
+; POWER10-LE-LABEL: no_optimization_different_shifts:
+; POWER10-LE:       # %bb.0: # %entry
+; POWER10-LE-NEXT:    mtvsrd v3, r5
+; POWER10-LE-NEXT:    mtvsrd v4, r6
+; POWER10-LE-NEXT:    vspltb v3, v3, 7
+; POWER10-LE-NEXT:    vspltb v4, v4, 7
+; POWER10-LE-NEXT:    vsro v2, v2, v3
+; POWER10-LE-NEXT:    vsr v2, v2, v4
+; POWER10-LE-NEXT:    blr
+;
+; POWER10-BE-LABEL: no_optimization_different_shifts:
+; POWER10-BE:       # %bb.0: # %entry
+; POWER10-BE-NEXT:    mtvsrwz v3, r3
+; POWER10-BE-NEXT:    mtvsrwz v4, r4
+; POWER10-BE-NEXT:    vspltb v3, v3, 7
+; POWER10-BE-NEXT:    vspltb v4, v4, 7
+; POWER10-BE-NEXT:    vsro v2, v2, v3
+; POWER10-BE-NEXT:    vsr v2, v2, v4
+; POWER10-BE-NEXT:    blr
+;
+; POWER3210-BE-LABEL: no_optimization_different_shifts:
+; POWER3210-BE:       # %bb.0: # %entry
+; POWER3210-BE-NEXT:    mtvsrwz v3, r3
+; POWER3210-BE-NEXT:    mtvsrwz v4, r4
+; POWER3210-BE-NEXT:    vspltb v3, v3, 7
+; POWER3210-BE-NEXT:    vspltb v4, v4, 7
+; POWER3210-BE-NEXT:    vsro v2, v2, v3
+; POWER3210-BE-NEXT:    vsr v2, v2, v4
+; POWER3210-BE-NEXT:    blr
+;
+; POWER9-LE-LABEL: no_optimization_different_shifts:
+; POWER9-LE:       # %bb.0: # %entry
+; POWER9-LE-NEXT:    mtvsrd v3, r5
+; POWER9-LE-NEXT:    mtvsrd v4, r6
+; POWER9-LE-NEXT:    vspltb v3, v3, 7
+; POWER9-LE-NEXT:    vspltb v4, v4, 7
+; POWER9-LE-NEXT:    vsro v2, v2, v3
+; POWER9-LE-NEXT:    vsr v2, v2, v4
+; POWER9-LE-NEXT:    blr
+;
+; POWER9-BE-LABEL: no_optimization_different_shifts:
+; POWER9-BE:       # %bb.0: # %entry
+; POWER9-BE-NEXT:    mtvsrwz v3, r3
+; POWER9-BE-NEXT:    mtvsrwz v4, r4
+; POWER9-BE-NEXT:    vspltb v3, v3, 7
+; POWER9-BE-NEXT:    vspltb v4, v4, 7
+; POWER9-BE-NEXT:    vsro v2, v2, v3
+; POWER9-BE-NEXT:    vsr v2, v2, v4
+; POWER9-BE-NEXT:    blr
+entry:
+  %splat.splatinsert.i = insertelement <16 x i8> poison, i8 %sh1, i64 0
+  %splat.splat.i = shufflevector <16 x i8> %splat.splatinsert.i, <16 x i8> poison, <16 x i32> zeroinitializer
+  %splat.splatinsert.i2 = insertelement <16 x i8> poison, i8 %sh2, i64 0
+  %splat.splat.i2 = shufflevector <16 x i8> %splat.splatinsert.i2, <16 x i8> poison, <16 x i32> zeroinitializer
+  %0 = bitcast <16 x i8> %in to <4 x i32>
+  %1 = bitcast <16 x i8> %splat.splat.i to <4 x i32>
+  %2 = bitcast <16 x i8> %splat.splat.i2 to <4 x i32>
+  %3 = tail call <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32> %0, <4 x i32> %1)
+  %4 = tail call <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32> %3, <4 x i32> %2)
+  %5 = bitcast <4 x i32> %4 to <16 x i8>
+  ret <16 x i8> %5
+}
+
+; Test case with multiple uses of VSRO result (should not optimize)
+define <16 x i8> @no_optimization_multiple_uses(<16 x i8> %in, i8 zeroext %sh)  {
+; POWER10-LE-LABEL: no_optimization_multiple_uses:
+; POWER10-LE:       # %bb.0: # %entry
+; POWER10-LE-NEXT:    mtvsrd v3, r5
+; POWER10-LE-NEXT:    vspltb v3, v3, 7
+; POWER10-LE-NEXT:    vsro v2, v2, v3
+; POWER10-LE-NEXT:    vsr v3, v2, v3
+; POWER10-LE-NEXT:    vaddubm v2, v2, v3
+; POWER10-LE-NEXT:    blr
+;
+; POWER10-BE-LABEL: no_optimization_multiple_uses:
+; POWER10-BE:       # %bb.0: # %entry
+; POWER10-BE-NEXT:    mtvsrwz v3, r3
+; POWER10-BE-NEXT:    vspltb v3, v3, 7
+; POWER10-BE-NEXT:    vsro v2, v2, v3
+; POWER10-BE-NEXT:    vsr v3, v2, v3
+; POWER10-BE-NEXT:    vaddubm v2, v2, v3
+; POWER10-BE-NEXT:    blr
+;
+; POWER3210-BE-LABEL: no_optimization_multiple_uses:
+; POWER3210-BE:       # %bb.0: # %entry
+; POWER3210-BE-NEXT:    mtvsrwz v3, r3
+; POWER3210-BE-NEXT:    vspltb v3, v3, 7
+; POWER3210-BE-NEXT:    vsro v2, v2, v3
+; POWER3210-BE-NEXT:    vsr v3, v2, v3
+; POWER3210-BE-NEXT:    vaddubm v2, v2, v3
+; POWER3210-BE-NEXT:    blr
+;
+; POWER9-LE-LABEL: no_optimization_multiple_uses:
+; POWER9-LE:       # %bb.0: # %entry
+; POWER9-LE-NEXT:    mtvsrd v3, r5
+; POWER9-LE-NEXT:    vspltb v3, v3, 7
+; POWER9-LE-NEXT:    vsro v2, v2, v3
+; POWER9-LE-NEXT:    vsr v3, v2, v3
+; POWER9-LE-NEXT:    vaddubm v2, v2, v3
+; POWER9-LE-NEXT:    blr
+;
+; POWER9-BE-LABEL: no_optimization_multiple_uses:
+; POWER9-BE:       # %bb.0: # %entry
+; POWER9-BE-NEXT:    mtvsrwz v3, r3
+; POWER9-BE-NEXT:    vspltb v3, v3, 7
+; POWER9-BE-NEXT:    vsro v2, v2, v3
+; POWER9-BE-NEXT:    vsr v3, v2, v3
+; POWER9-BE-NEXT:    vaddubm v2, v2, v3
+; POWER9-BE-NEXT:    blr
+entry:
+  %splat.splatinsert.i = insertelement <16 x i8> poison, i8 %sh, i64 0
+  %splat.splat.i = shufflevector <16 x i8> %splat.splatinsert.i, <16 x i8> poison, <16 x i32> zeroinitializer
+  %0 = bitcast <16 x i8> %in to <4 x i32>
+  %1 = bitcast <16 x i8> %splat.splat.i to <4 x i32>
+  %2 = tail call <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32> %0, <4 x i32> %1)
+  %3 = tail call <4 x i32> @llvm.ppc.altivec.vsr(<...
[truncated]


github-actions bot commented Aug 19, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@tonykuttai tonykuttai force-pushed the tvarghese/vsrqcombine branch from a972eb8 to d288a26 Compare August 19, 2025 17:39
@tonykuttai tonykuttai requested a review from redstar August 19, 2025 18:13
@tonykuttai tonykuttai changed the title [PowerPC] Dag Combine to merge vsr(vsro(in, shift), shift) to vsrq(input,shift) [PowerPC] Merge vsr(vsro(in, byte_shift), bit_shift) to vsrq(input, res_bit_shift) Aug 20, 2025
@tonykuttai tonykuttai changed the title [PowerPC] Merge vsr(vsro(in, byte_shift), bit_shift) to vsrq(input, res_bit_shift) [PowerPC] Merge vsr(vsro(input, byte_shift), bit_shift) to vsrq(input, res_bit_shift) Aug 20, 2025
@tonykuttai

gentle ping @RolandF77 @amy-kwan @redstar @lei137


@RolandF77 RolandF77 left a comment


LGTM
