Conversation

@hazzlim (Contributor) commented Sep 3, 2025

Add a DAGCombine to perform the following transformations:

  • add(x, abs(y)) -> saba(x, y, 0)
  • add(x, zext(abs(y))) -> sabal(x, y, 0)

As well as being a useful generic transformation, this also fixes an issue where LLVM de-optimises [US]ABA NEON ACLE intrinsics into separate ABD+ADD instructions when one of the operands is a zero vector.

@davemgreen (Collaborator) left a comment

Nice, I'm surprised this is the first use of SDPatternMatch.

Do we need the new combine / node, or could this be done with a tablegen pattern directly?

@hazzlim (Contributor, Author) commented Sep 3, 2025

Nice, I'm surprised this is the first use of SDPatternMatch.

I was surprised too, but it does seem to be (llvm::PatternMatch is used, but not llvm::SDPatternMatch).

Do we need the new combine / node, or could this be done with a tablegen pattern directly?

I was thinking that this couldn't be done via tablegen, because in order to transform add(x, abs(y)) -> saba(x, y, 0) we need to create a new operand (the zeros vector), which I wasn't sure was possible in a tablegen pattern. But maybe I'm wrong there?

@davemgreen (Collaborator)

We generate the instructions directly in some of the raddhn patterns

def : Pat<(v8i8 (trunc (AArch64vlshr (add (v8i16 V128:$Vn), VImm0080), (i32 8)))),
          (RADDHNv8i16_v8i8 V128:$Vn, (v8i16 (MOVIv2d_ns (i32 0))))>;

Generating two instructions in a pattern has its downsides, so both approaches have advantages and disadvantages, but it is likely a little better than a new AArch64ISD node that isn't otherwise optimized.
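
As an illustration only (not necessarily the exact pattern that ends up in the patch), an SABA pattern that materialises the zero operand itself, in the same style as the raddhn pattern above, might look roughly like this for the v4i32 case:

// Hypothetical sketch: match add(x, abs(y)) and emit SABA with a MOVI zero.
def : Pat<(v4i32 (add (v4i32 V128:$Rd), (abs (v4i32 V128:$Rn)))),
          (SABAv4i32 V128:$Rd, V128:$Rn, (v4i32 (MOVIv2d_ns (i32 0))))>;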

IsZExt = false;
return true;
}
if (sd_match(V0, SDPatternMatch::m_ZExt(

Review comment (Contributor):

I think you can also look for ANY_EXTEND opcodes too and treat it as a zero-extend.
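
A rough sketch of that suggestion, reusing the V0/AbsOp/IsZExt names from the patch (the high bits of an any_extend are undefined, so zero-extending them via SABAL is a valid choice for them):

// Also accept any_extend(abs(y)) and treat it like zext(abs(y)).
if (sd_match(V0, SDPatternMatch::m_ZExt(
                     m_Abs(SDPatternMatch::m_Value(AbsOp)))) ||
    (V0.getOpcode() == ISD::ANY_EXTEND &&
     sd_match(V0.getOperand(0), m_Abs(SDPatternMatch::m_Value(AbsOp))))) {
  IsZExt = true;
  return true;
}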


auto MatchAbsOrZExtAbs = [](SDValue V0, SDValue V1, SDValue &AbsOp,
SDValue &Other, bool &IsZExt) {
Other = V1;

Review comment (Contributor):

I realise with your code below it's not a problem right now, but I think it's still worth only setting Other if we found a match.
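
One way the lambda could be restructured along those lines (same names as in the patch; a sketch only):

auto MatchAbsOrZExtAbs = [](SDValue V0, SDValue V1, SDValue &AbsOp,
                            SDValue &Other, bool &IsZExt) {
  // Only write the outputs once V0 is known to match.
  if (sd_match(V0, m_Abs(SDPatternMatch::m_Value(AbsOp)))) {
    Other = V1;
    IsZExt = false;
    return true;
  }
  if (sd_match(V0, SDPatternMatch::m_ZExt(
                       m_Abs(SDPatternMatch::m_Value(AbsOp))))) {
    Other = V1;
    IsZExt = true;
    return true;
  }
  return false;
};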

return SDValue();

SDLoc DL(N);
SDValue Zero = DCI.DAG.getConstant(0, DL, MVT::i64);

Review comment (Contributor):

I think you need to get a constant that matches the element type of AbsOp, right?
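
One way to address that, as a sketch: build the splat directly in AbsOp's vector type, so the element type matches by construction (SelectionDAG::getConstant splats a scalar value when given a vector type):

// Zero splat in AbsOp's own type rather than a hard-coded MVT::i64 scalar.
EVT AbsVT = AbsOp.getValueType();
SDValue Zeros = DCI.DAG.getConstant(0, DL, AbsVT);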

// Transform the following:
// - add(x, abs(y)) -> saba(x, y, 0)
// - add(x, zext(abs(y))) -> sabal(x, y, 0)
static SDValue performAddSABACombine(SDNode *N,

Review comment (Contributor):

Should we limit this to post-legalisation only or do we specifically want to catch cases pre-legalisation?
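
If the answer is to restrict it, a minimal sketch would be an early bail-out at the top of the combine:

// Assuming the pre-legalisation cases are not needed.
if (DCI.isBeforeLegalizeOps())
  return SDValue();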

SDValue Zeros = DCI.DAG.getSplatVector(AbsOp.getValueType(), DL, Zero);

unsigned Opcode = IsZExt ? AArch64ISD::SABAL : AArch64ISD::SABA;
return DCI.DAG.getNode(Opcode, DL, VT, Other, AbsOp, Zeros);

Review comment (Contributor):

Shouldn't we also be checking if the operation is legal for the instruction? If we allow this DAG combine to run pre-legalisation then we could be zero-extending 64-bit to 128-bit and I don't think we support that.
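
One possible shape for such a guard, using the names from the patch (illustrative only; the exact legality condition would need checking against the supported SABAL variants):

// Only form SABAL for shapes the instruction supports: a 64-bit narrow
// source widening to a 128-bit result.
if (IsZExt &&
    !(AbsOp.getValueType().is64BitVector() && VT.is128BitVector()))
  return SDValue();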


@hazzlim (Contributor, Author) replied:

Thanks @david-arm for reviewing this - at @davemgreen's suggestion I've removed this function in favor of the tablegen-only approach.

@llvmbot (Member) commented Sep 3, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Hari Limaye (hazzlim)

Changes

Add a DAGCombine to perform the following transformations:

  • add(x, abs(y)) -> saba(x, y, 0)
  • add(x, zext(abs(y))) -> sabal(x, y, 0)

As well as being a useful generic transformation, this also fixes an issue where LLVM de-optimises [US]ABA NEON ACLE intrinsics into separate ABD+ADD instructions when one of the operands is a zero vector.


Full diff: https://github.com/llvm/llvm-project/pull/156615.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+54)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.td (+25)
  • (modified) llvm/test/CodeGen/AArch64/neon-saba.ll (+262)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index f1e8fb7f77652..b22cd04a6f3c0 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -50,6 +50,7 @@
 #include "llvm/CodeGen/MachineInstrBuilder.h"
 #include "llvm/CodeGen/MachineMemOperand.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/SDPatternMatch.h"
 #include "llvm/CodeGen/SelectionDAG.h"
 #include "llvm/CodeGen/SelectionDAGNodes.h"
 #include "llvm/CodeGen/TargetCallingConv.h"
@@ -21913,6 +21914,56 @@ static SDValue performExtBinopLoadFold(SDNode *N, SelectionDAG &DAG) {
   return DAG.getNode(N->getOpcode(), DL, VT, Ext0, NShift);
 }
 
+// Transform the following:
+// - add(x, abs(y)) -> saba(x, y, 0)
+// - add(x, zext(abs(y))) -> sabal(x, y, 0)
+static SDValue performAddSABACombine(SDNode *N,
+                                     TargetLowering::DAGCombinerInfo &DCI) {
+  if (N->getOpcode() != ISD::ADD)
+    return SDValue();
+
+  EVT VT = N->getValueType(0);
+  if (!VT.isFixedLengthVector())
+    return SDValue();
+
+  SDValue N0 = N->getOperand(0);
+  SDValue N1 = N->getOperand(1);
+
+  auto MatchAbsOrZExtAbs = [](SDValue V0, SDValue V1, SDValue &AbsOp,
+                              SDValue &Other, bool &IsZExt) {
+    Other = V1;
+    if (sd_match(V0, m_Abs(SDPatternMatch::m_Value(AbsOp)))) {
+      IsZExt = false;
+      return true;
+    }
+    if (sd_match(V0, SDPatternMatch::m_ZExt(
+                         m_Abs(SDPatternMatch::m_Value(AbsOp))))) {
+      IsZExt = true;
+      return true;
+    }
+
+    return false;
+  };
+
+  SDValue AbsOp;
+  SDValue Other;
+  bool IsZExt;
+  if (!MatchAbsOrZExtAbs(N0, N1, AbsOp, Other, IsZExt) &&
+      !MatchAbsOrZExtAbs(N1, N0, AbsOp, Other, IsZExt))
+    return SDValue();
+
+  // Don't perform this on abs(sub), as this will become an ABD/ABA anyway.
+  if (AbsOp.getOpcode() == ISD::SUB)
+    return SDValue();
+
+  SDLoc DL(N);
+  SDValue Zero = DCI.DAG.getConstant(0, DL, MVT::i64);
+  SDValue Zeros = DCI.DAG.getSplatVector(AbsOp.getValueType(), DL, Zero);
+
+  unsigned Opcode = IsZExt ? AArch64ISD::SABAL : AArch64ISD::SABA;
+  return DCI.DAG.getNode(Opcode, DL, VT, Other, AbsOp, Zeros);
+}
+
 static SDValue performAddSubCombine(SDNode *N,
                                     TargetLowering::DAGCombinerInfo &DCI) {
   // Try to change sum of two reductions.
@@ -21938,6 +21989,9 @@ static SDValue performAddSubCombine(SDNode *N,
   if (SDValue Val = performExtBinopLoadFold(N, DCI.DAG))
     return Val;
 
+  if (SDValue Val = performAddSABACombine(N, DCI))
+    return Val;
+
   return performAddSubLongCombine(N, DCI);
 }
 
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index 62b26b5239365..fdfde5ea1dc37 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -1059,6 +1059,10 @@ def AArch64sdot     : SDNode<"AArch64ISD::SDOT", SDT_AArch64Dot>;
 def AArch64udot     : SDNode<"AArch64ISD::UDOT", SDT_AArch64Dot>;
 def AArch64usdot    : SDNode<"AArch64ISD::USDOT", SDT_AArch64Dot>;
 
+// saba/sabal
+def AArch64neonsaba     : SDNode<"AArch64ISD::SABA", SDT_AArch64trivec>;
+def AArch64neonsabal    : SDNode<"AArch64ISD::SABAL", SDT_AArch64Dot>;
+
 // Vector across-lanes addition
 // Only the lower result lane is defined.
 def AArch64saddv    : SDNode<"AArch64ISD::SADDV", SDT_AArch64UnaryVec>;
@@ -6121,6 +6125,19 @@ defm SQRDMLAH : SIMDThreeSameVectorSQRDMLxHTiedHS<1,0b10000,"sqrdmlah",
 defm SQRDMLSH : SIMDThreeSameVectorSQRDMLxHTiedHS<1,0b10001,"sqrdmlsh",
                                                     int_aarch64_neon_sqrdmlsh>;
 
+def : Pat<(AArch64neonsaba (v8i8 V64:$Rd), V64:$Rn, V64:$Rm),
+          (SABAv8i8 V64:$Rd, V64:$Rn, V64:$Rm)>;
+def : Pat<(AArch64neonsaba (v4i16 V64:$Rd), V64:$Rn, V64:$Rm),
+          (SABAv4i16 V64:$Rd, V64:$Rn, V64:$Rm)>;
+def : Pat<(AArch64neonsaba (v2i32 V64:$Rd), V64:$Rn, V64:$Rm),
+          (SABAv2i32 V64:$Rd, V64:$Rn, V64:$Rm)>;
+def : Pat<(AArch64neonsaba (v16i8 V128:$Rd), V128:$Rn, V128:$Rm),
+          (SABAv16i8 V128:$Rd, V128:$Rn, V128:$Rm)>;
+def : Pat<(AArch64neonsaba (v8i16 V128:$Rd), V128:$Rn, V128:$Rm),
+          (SABAv8i16 V128:$Rd, V128:$Rn, V128:$Rm)>;
+def : Pat<(AArch64neonsaba (v4i32 V128:$Rd), V128:$Rn, V128:$Rm),
+          (SABAv4i32 V128:$Rd, V128:$Rn, V128:$Rm)>;
+
 defm AND : SIMDLogicalThreeVector<0, 0b00, "and", and>;
 defm BIC : SIMDLogicalThreeVector<0, 0b01, "bic",
                                   BinOpFrag<(and node:$LHS, (vnot node:$RHS))> >;
@@ -7008,6 +7025,14 @@ defm : AddSubHNPatterns<ADDHNv2i64_v2i32, ADDHNv2i64_v4i32,
                         SUBHNv2i64_v2i32, SUBHNv2i64_v4i32,
                         v2i32, v2i64, 32>;
 
+// Patterns for SABAL
+def : Pat<(AArch64neonsabal (v8i16 V128:$Rd), (v8i8 V64:$Rn), (v8i8 V64:$Rm)),
+          (SABALv8i8_v8i16 V128:$Rd, V64:$Rn, V64:$Rm)>;
+def : Pat<(AArch64neonsabal (v4i32 V128:$Rd), (v4i16 V64:$Rn), (v4i16 V64:$Rm)),
+          (SABALv4i16_v4i32 V128:$Rd, V64:$Rn, V64:$Rm)>;
+def : Pat<(AArch64neonsabal (v2i64 V128:$Rd), (v2i32 V64:$Rn), (v2i32 V64:$Rm)),
+          (SABALv2i32_v2i64 V128:$Rd, V64:$Rn, V64:$Rm)>;
+
 //----------------------------------------------------------------------------
 // AdvSIMD bitwise extract from vector instruction.
 //----------------------------------------------------------------------------
diff --git a/llvm/test/CodeGen/AArch64/neon-saba.ll b/llvm/test/CodeGen/AArch64/neon-saba.ll
index 19967bd1a69ec..c8de6b21e9764 100644
--- a/llvm/test/CodeGen/AArch64/neon-saba.ll
+++ b/llvm/test/CodeGen/AArch64/neon-saba.ll
@@ -174,6 +174,268 @@ define <8 x i8> @saba_sabd_8b(<8 x i8> %a, <8 x i8> %b, <8 x i8> %c) #0 {
   ret <8 x i8> %add
 }
 
+; SABA from ADD(SABD(X, ZEROS))
+
+define <4 x i32> @saba_sabd_zeros_4s(<4 x i32> %a, <4 x i32> %b) #0 {
+; CHECK-LABEL: saba_sabd_zeros_4s:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-NEXT:    saba v0.4s, v1.4s, v2.4s
+; CHECK-NEXT:    ret
+  %sabd = call <4 x i32> @llvm.aarch64.neon.sabd.v4i32(<4 x i32> %b, <4 x i32> zeroinitializer)
+  %add = add <4 x i32> %sabd, %a
+  ret <4 x i32> %add
+}
+
+define <2 x i32> @saba_sabd_zeros_2s(<2 x i32> %a, <2 x i32> %b) #0 {
+; CHECK-LABEL: saba_sabd_zeros_2s:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-NEXT:    saba v0.2s, v1.2s, v2.2s
+; CHECK-NEXT:    ret
+  %sabd = call <2 x i32> @llvm.aarch64.neon.sabd.v2i32(<2 x i32> %b, <2 x i32> zeroinitializer)
+  %add = add <2 x i32> %sabd, %a
+  ret <2 x i32> %add
+}
+
+define <8 x i16> @saba_sabd_zeros_8h(<8 x i16> %a, <8 x i16> %b) #0 {
+; CHECK-LABEL: saba_sabd_zeros_8h:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-NEXT:    saba v0.8h, v1.8h, v2.8h
+; CHECK-NEXT:    ret
+  %sabd = call <8 x i16> @llvm.aarch64.neon.sabd.v8i16(<8 x i16> %b, <8 x i16> zeroinitializer)
+  %add = add <8 x i16> %sabd, %a
+  ret <8 x i16> %add
+}
+
+define <4 x i16> @saba_sabd_zeros_4h(<4 x i16> %a, <4 x i16> %b) #0 {
+; CHECK-LABEL: saba_sabd_zeros_4h:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-NEXT:    saba v0.4h, v1.4h, v2.4h
+; CHECK-NEXT:    ret
+  %sabd = call <4 x i16> @llvm.aarch64.neon.sabd.v4i16(<4 x i16> %b, <4 x i16> zeroinitializer)
+  %add = add <4 x i16> %sabd, %a
+  ret <4 x i16> %add
+}
+
+define <16 x i8> @saba_sabd_zeros_16b(<16 x i8> %a, <16 x i8> %b) #0 {
+; CHECK-LABEL: saba_sabd_zeros_16b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-NEXT:    saba v0.16b, v1.16b, v2.16b
+; CHECK-NEXT:    ret
+  %sabd = call <16 x i8> @llvm.aarch64.neon.sabd.v16i8(<16 x i8> %b, <16 x i8> zeroinitializer)
+  %add = add <16 x i8> %sabd, %a
+  ret <16 x i8> %add
+}
+
+define <8 x i8> @saba_sabd_zeros_8b(<8 x i8> %a, <8 x i8> %b) #0 {
+; CHECK-LABEL: saba_sabd_zeros_8b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-NEXT:    saba v0.8b, v1.8b, v2.8b
+; CHECK-NEXT:    ret
+  %sabd = call <8 x i8> @llvm.aarch64.neon.sabd.v8i8(<8 x i8> %b, <8 x i8> zeroinitializer)
+  %add = add <8 x i8> %sabd, %a
+  ret <8 x i8> %add
+}
+
+define <4 x i32> @saba_abs_zeros_4s(<4 x i32> %a, <4 x i32> %b) #0 {
+; CHECK-SD-LABEL: saba_abs_zeros_4s:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-SD-NEXT:    saba v0.4s, v1.4s, v2.4s
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: saba_abs_zeros_4s:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    abs v1.4s, v1.4s
+; CHECK-GI-NEXT:    add v0.4s, v0.4s, v1.4s
+; CHECK-GI-NEXT:    ret
+  %abs = call <4 x i32> @llvm.abs.v4i32(<4 x i32> %b, i1 true)
+  %add = add <4 x i32> %a, %abs
+  ret <4 x i32> %add
+}
+
+define <2 x i32> @saba_abs_zeros_2s(<2 x i32> %a, <2 x i32> %b) #0 {
+; CHECK-SD-LABEL: saba_abs_zeros_2s:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-SD-NEXT:    saba v0.2s, v1.2s, v2.2s
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: saba_abs_zeros_2s:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    abs v1.2s, v1.2s
+; CHECK-GI-NEXT:    add v0.2s, v0.2s, v1.2s
+; CHECK-GI-NEXT:    ret
+  %abs = call <2 x i32> @llvm.abs.v2i32(<2 x i32> %b, i1 true)
+  %add = add <2 x i32> %a, %abs
+  ret <2 x i32> %add
+}
+
+define <8 x i16> @saba_abs_zeros_8h(<8 x i16> %a, <8 x i16> %b) #0 {
+; CHECK-SD-LABEL: saba_abs_zeros_8h:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-SD-NEXT:    saba v0.8h, v1.8h, v2.8h
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: saba_abs_zeros_8h:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    abs v1.8h, v1.8h
+; CHECK-GI-NEXT:    add v0.8h, v0.8h, v1.8h
+; CHECK-GI-NEXT:    ret
+  %abs = call <8 x i16> @llvm.abs.v8i16(<8 x i16> %b, i1 true)
+  %add = add <8 x i16> %a, %abs
+  ret <8 x i16> %add
+}
+
+define <4 x i16> @saba_abs_zeros_4h(<4 x i16> %a, <4 x i16> %b) #0 {
+; CHECK-SD-LABEL: saba_abs_zeros_4h:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-SD-NEXT:    saba v0.4h, v1.4h, v2.4h
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: saba_abs_zeros_4h:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    abs v1.4h, v1.4h
+; CHECK-GI-NEXT:    add v0.4h, v0.4h, v1.4h
+; CHECK-GI-NEXT:    ret
+  %abs = call <4 x i16> @llvm.abs.v4i16(<4 x i16> %b, i1 true)
+  %add = add <4 x i16> %a, %abs
+  ret <4 x i16> %add
+}
+
+define <16 x i8> @saba_abs_zeros_16b(<16 x i8> %a, <16 x i8> %b) #0 {
+; CHECK-SD-LABEL: saba_abs_zeros_16b:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-SD-NEXT:    saba v0.16b, v1.16b, v2.16b
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: saba_abs_zeros_16b:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    abs v1.16b, v1.16b
+; CHECK-GI-NEXT:    add v0.16b, v0.16b, v1.16b
+; CHECK-GI-NEXT:    ret
+  %abs = call <16 x i8> @llvm.abs.v16i8(<16 x i8> %b, i1 true)
+  %add = add <16 x i8> %a, %abs
+  ret <16 x i8> %add
+}
+
+define <8 x i8> @saba_abs_zeros_8b(<8 x i8> %a, <8 x i8> %b) #0 {
+; CHECK-SD-LABEL: saba_abs_zeros_8b:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-SD-NEXT:    saba v0.8b, v1.8b, v2.8b
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: saba_abs_zeros_8b:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    abs v1.8b, v1.8b
+; CHECK-GI-NEXT:    add v0.8b, v0.8b, v1.8b
+; CHECK-GI-NEXT:    ret
+  %abs = call <8 x i8> @llvm.abs.v8i8(<8 x i8> %b, i1 true)
+  %add = add <8 x i8> %a, %abs
+  ret <8 x i8> %add
+}
+
+; SABAL from ADD(ZEXT(SABD(X, ZEROS)))
+
+define <2 x i64> @sabal_sabd_zeros_2s(<2 x i64> %a, <2 x i32> %b) #0 {
+; CHECK-LABEL: sabal_sabd_zeros_2s:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-NEXT:    sabal v0.2d, v1.2s, v2.2s
+; CHECK-NEXT:    ret
+  %sabd = call <2 x i32> @llvm.aarch64.neon.sabd.v2i32(<2 x i32> %b, <2 x i32> zeroinitializer)
+  %sabd.zext = zext <2 x i32> %sabd to <2 x i64>
+  %add = add <2 x i64> %a, %sabd.zext
+  ret <2 x i64> %add
+}
+
+define <4 x i32> @sabal_sabd_zeros_4h(<4 x i32> %a, <4 x i16> %b) #0 {
+; CHECK-LABEL: sabal_sabd_zeros_4h:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-NEXT:    sabal v0.4s, v1.4h, v2.4h
+; CHECK-NEXT:    ret
+  %sabd = call <4 x i16> @llvm.aarch64.neon.sabd.v4i16(<4 x i16> %b, <4 x i16> zeroinitializer)
+  %sabd.zext = zext <4 x i16> %sabd to <4 x i32>
+  %add = add <4 x i32> %a, %sabd.zext
+  ret <4 x i32> %add
+}
+
+define <8 x i16> @sabal_sabd_zeros_8b(<8 x i16> %a, <8 x i8> %b) #0 {
+; CHECK-LABEL: sabal_sabd_zeros_8b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-NEXT:    sabal v0.8h, v1.8b, v2.8b
+; CHECK-NEXT:    ret
+  %sabd = call <8 x i8> @llvm.aarch64.neon.sabd.v8i8(<8 x i8> %b, <8 x i8> zeroinitializer)
+  %sabd.zext = zext <8 x i8> %sabd to <8 x i16>
+  %add = add <8 x i16> %a, %sabd.zext
+  ret <8 x i16> %add
+}
+
+define <2 x i64> @sabal_abs_zeros_2s(<2 x i64> %a, <2 x i32> %b) #0 {
+; CHECK-SD-LABEL: sabal_abs_zeros_2s:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-SD-NEXT:    sabal v0.2d, v1.2s, v2.2s
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: sabal_abs_zeros_2s:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    abs v1.2s, v1.2s
+; CHECK-GI-NEXT:    uaddw v0.2d, v0.2d, v1.2s
+; CHECK-GI-NEXT:    ret
+  %abs = call <2 x i32> @llvm.abs.v2i32(<2 x i32> %b, i1 true)
+  %abs.zext = zext <2 x i32> %abs to <2 x i64>
+  %add = add <2 x i64> %a, %abs.zext
+  ret <2 x i64> %add
+}
+
+define <4 x i32> @sabal_abs_zeros_4h(<4 x i32> %a, <4 x i16> %b) #0 {
+; CHECK-SD-LABEL: sabal_abs_zeros_4h:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-SD-NEXT:    sabal v0.4s, v1.4h, v2.4h
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: sabal_abs_zeros_4h:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    abs v1.4h, v1.4h
+; CHECK-GI-NEXT:    uaddw v0.4s, v0.4s, v1.4h
+; CHECK-GI-NEXT:    ret
+  %abs = call <4 x i16> @llvm.abs.v4i16(<4 x i16> %b, i1 true)
+  %abs.zext = zext <4 x i16> %abs to <4 x i32>
+  %add = add <4 x i32> %a, %abs.zext
+  ret <4 x i32> %add
+}
+
+define <8 x i16> @sabal_abs_zeros_8b(<8 x i16> %a, <8 x i8> %b) #0 {
+; CHECK-SD-LABEL: sabal_abs_zeros_8b:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    movi v2.2d, #0000000000000000
+; CHECK-SD-NEXT:    sabal v0.8h, v1.8b, v2.8b
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: sabal_abs_zeros_8b:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    abs v1.8b, v1.8b
+; CHECK-GI-NEXT:    uaddw v0.8h, v0.8h, v1.8b
+; CHECK-GI-NEXT:    ret
+  %abs = call <8 x i8> @llvm.abs.v8i8(<8 x i8> %b, i1 true)
+  %abs.zext = zext <8 x i8> %abs to <8 x i16>
+  %add = add <8 x i16> %a, %abs.zext
+  ret <8 x i16> %add
+}
+
 declare <4 x i32> @llvm.abs.v4i32(<4 x i32>, i1)
 declare <2 x i32> @llvm.abs.v2i32(<2 x i32>, i1)
 declare <8 x i16> @llvm.abs.v8i16(<8 x i16>, i1)

@hazzlim (Contributor, Author) commented Sep 4, 2025

We generate the instructions directly in some of the raddhn patterns

def : Pat<(v8i8 (trunc (AArch64vlshr (add (v8i16 V128:$Vn), VImm0080), (i32 8)))),
          (RADDHNv8i16_v8i8 V128:$Vn, (v8i16 (MOVIv2d_ns (i32 0))))>;

Generating two instructions in a pattern has its downsides, so both approaches have advantages and disadvantages, but it is likely a little better than a new AArch64ISD node that isn't otherwise optimized.

Thanks for the pointer here - I have reverted the AArch64ISD node/combine changes and done this directly in tablegen.

@davemgreen (Collaborator) left a comment

Thank you. LGTM
