[RISCV] Use slideup to lower build_vector when all operand are (extract_element X, 0) #154450
Conversation
@llvm/pr-subscribers-backend-risc-v

Author: Min-Yih Hsu (mshockwave)

Changes: The general lowering of build_vector starts with splatting the first operand before sliding down the other operands one by one. However, the initial splat can be avoided if the last operand is an extract from a reduction result: in that case we can use the original vector reduction result as the start value and slide up the other operands (in reverse order) one by one.

Original context: #154175 (comment)

Full diff: https://github.com/llvm/llvm-project/pull/154450.diff

4 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 4a1db80076530..ce6fc8425856a 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -4512,33 +4512,94 @@ static SDValue lowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG,
"Illegal type which will result in reserved encoding");
const unsigned Policy = RISCVVType::TAIL_AGNOSTIC | RISCVVType::MASK_AGNOSTIC;
+ auto getVSlide = [&](bool SlideUp, EVT ContainerVT, SDValue Passthru,
+ SDValue Vec, SDValue Offset, SDValue Mask,
+ SDValue VL) -> SDValue {
+ if (SlideUp)
+ return getVSlideup(DAG, Subtarget, DL, ContainerVT, Passthru, Vec, Offset,
+ Mask, VL, Policy);
+ return getVSlidedown(DAG, Subtarget, DL, ContainerVT, Passthru, Vec, Offset,
+ Mask, VL, Policy);
+ };
+
+ // General case: splat the first operand and sliding other operands down one
+ // by one to form a vector. Alternatively, if the last operand is an
+ // extraction from a reduction result, we can use the original vector
+ // reduction result as the start value and slide up instead of slide down.
+ // Such that we can avoid the splat.
+ SmallVector<SDValue> Operands(Op->op_begin(), Op->op_end());
+ SDValue Reduce;
+ bool SlideUp = false;
+ // Find the first first non-undef from the tail.
+ auto ItLastNonUndef = find_if(Operands.rbegin(), Operands.rend(),
+ [](SDValue V) { return !V.isUndef(); });
+ if (ItLastNonUndef != Operands.rend()) {
+ using namespace SDPatternMatch;
+ // Check if the last non-undef operand was extracted from a reduction.
+ for (unsigned Opc :
+ {RISCVISD::VECREDUCE_ADD_VL, RISCVISD::VECREDUCE_UMAX_VL,
+ RISCVISD::VECREDUCE_SMAX_VL, RISCVISD::VECREDUCE_UMIN_VL,
+ RISCVISD::VECREDUCE_SMIN_VL, RISCVISD::VECREDUCE_AND_VL,
+ RISCVISD::VECREDUCE_OR_VL, RISCVISD::VECREDUCE_XOR_VL,
+ RISCVISD::VECREDUCE_FADD_VL, RISCVISD::VECREDUCE_SEQ_FADD_VL,
+ RISCVISD::VECREDUCE_FMAX_VL, RISCVISD::VECREDUCE_FMIN_VL}) {
+ SlideUp = sd_match(
+ *ItLastNonUndef,
+ m_ExtractElt(m_AllOf(m_Opc(Opc), m_Value(Reduce)), m_Zero()));
+ if (SlideUp)
+ break;
+ }
+ }
+
+ if (SlideUp) {
+ // Adapt Reduce's type into ContainerVT.
+ if (Reduce.getValueType().getVectorMinNumElements() <
+ ContainerVT.getVectorMinNumElements())
+ Reduce = DAG.getInsertSubvector(DL, DAG.getUNDEF(ContainerVT), Reduce, 0);
+ else
+ Reduce = DAG.getExtractSubvector(DL, ContainerVT, Reduce, 0);
+
+ // Reverse the elements as we're going to slide up from the last element.
+ for (unsigned i = 0U, N = Operands.size(), H = divideCeil(N, 2); i < H; ++i)
+ std::swap(Operands[i], Operands[N - 1 - i]);
+ }
SDValue Vec;
UndefCount = 0;
- for (SDValue V : Op->ops()) {
+ for (SDValue V : Operands) {
if (V.isUndef()) {
UndefCount++;
continue;
}
- // Start our sequence with a TA splat in the hopes that hardware is able to
- // recognize there's no dependency on the prior value of our temporary
- // register.
+ // Start our sequence with either a TA splat or a reduction result in the
+ // hopes that hardware is able to recognize there's no dependency on the
+ // prior value of our temporary register.
if (!Vec) {
- Vec = DAG.getSplatVector(VT, DL, V);
- Vec = convertToScalableVector(ContainerVT, Vec, DAG, Subtarget);
+ if (SlideUp) {
+ Vec = Reduce;
+ } else {
+ Vec = DAG.getSplatVector(VT, DL, V);
+ Vec = convertToScalableVector(ContainerVT, Vec, DAG, Subtarget);
+ }
+
UndefCount = 0;
continue;
}
if (UndefCount) {
const SDValue Offset = DAG.getConstant(UndefCount, DL, Subtarget.getXLenVT());
- Vec = getVSlidedown(DAG, Subtarget, DL, ContainerVT, DAG.getUNDEF(ContainerVT),
- Vec, Offset, Mask, VL, Policy);
+ Vec = getVSlide(SlideUp, ContainerVT, DAG.getUNDEF(ContainerVT), Vec,
+ Offset, Mask, VL);
UndefCount = 0;
}
- auto OpCode =
- VT.isFloatingPoint() ? RISCVISD::VFSLIDE1DOWN_VL : RISCVISD::VSLIDE1DOWN_VL;
+
+ unsigned OpCode;
+ if (VT.isFloatingPoint())
+ OpCode = SlideUp ? RISCVISD::VFSLIDE1UP_VL : RISCVISD::VFSLIDE1DOWN_VL;
+ else
+ OpCode = SlideUp ? RISCVISD::VSLIDE1UP_VL : RISCVISD::VSLIDE1DOWN_VL;
+
if (!VT.isFloatingPoint())
V = DAG.getNode(ISD::ANY_EXTEND, DL, Subtarget.getXLenVT(), V);
Vec = DAG.getNode(OpCode, DL, ContainerVT, DAG.getUNDEF(ContainerVT), Vec,
@@ -4546,8 +4607,8 @@ static SDValue lowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG,
}
if (UndefCount) {
const SDValue Offset = DAG.getConstant(UndefCount, DL, Subtarget.getXLenVT());
- Vec = getVSlidedown(DAG, Subtarget, DL, ContainerVT, DAG.getUNDEF(ContainerVT),
- Vec, Offset, Mask, VL, Policy);
+ Vec = getVSlide(SlideUp, ContainerVT, DAG.getUNDEF(ContainerVT), Vec,
+ Offset, Mask, VL);
}
return convertFromScalableVector(VT, Vec, DAG, Subtarget);
}
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
index 3c3e08d387faa..a806e758d2758 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
@@ -1828,3 +1828,59 @@ define <8 x double> @buildvec_v8f64_zvl512(double %e0, double %e1, double %e2, d
%v7 = insertelement <8 x double> %v6, double %e7, i64 7
ret <8 x double> %v7
}
+
+define <4 x float> @buildvec_vfredusum(float %start, <8 x float> %arg1, <8 x float> %arg2, <8 x float> %arg3, <8 x float> %arg4) nounwind {
+; CHECK-LABEL: buildvec_vfredusum:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; CHECK-NEXT: vfmv.s.f v16, fa0
+; CHECK-NEXT: vfredusum.vs v8, v8, v16
+; CHECK-NEXT: vfredusum.vs v9, v10, v16
+; CHECK-NEXT: vfredusum.vs v10, v12, v16
+; CHECK-NEXT: vfmv.f.s fa5, v8
+; CHECK-NEXT: vfmv.f.s fa4, v9
+; CHECK-NEXT: vfmv.f.s fa3, v10
+; CHECK-NEXT: vfredusum.vs v8, v14, v16
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vfslide1up.vf v9, v8, fa3
+; CHECK-NEXT: vfslide1up.vf v10, v9, fa4
+; CHECK-NEXT: vfslide1up.vf v8, v10, fa5
+; CHECK-NEXT: ret
+ %247 = tail call reassoc float @llvm.vector.reduce.fadd.v8f32(float %start, <8 x float> %arg1)
+ %248 = insertelement <4 x float> poison, float %247, i64 0
+ %250 = tail call reassoc float @llvm.vector.reduce.fadd.v8f32(float %start, <8 x float> %arg2)
+ %251 = insertelement <4 x float> %248, float %250, i64 1
+ %252 = tail call reassoc float @llvm.vector.reduce.fadd.v8f32(float %start, <8 x float> %arg3)
+ %253 = insertelement <4 x float> %251, float %252, i64 2
+ %254 = tail call reassoc float @llvm.vector.reduce.fadd.v8f32(float %start, <8 x float> %arg4)
+ %255 = insertelement <4 x float> %253, float %254, i64 3
+ ret <4 x float> %255
+}
+
+define <4 x float> @buildvec_vfredosum(float %start, <8 x float> %arg1, <8 x float> %arg2, <8 x float> %arg3, <8 x float> %arg4) nounwind {
+; CHECK-LABEL: buildvec_vfredosum:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; CHECK-NEXT: vfmv.s.f v16, fa0
+; CHECK-NEXT: vfredosum.vs v8, v8, v16
+; CHECK-NEXT: vfredosum.vs v9, v10, v16
+; CHECK-NEXT: vfredosum.vs v10, v12, v16
+; CHECK-NEXT: vfmv.f.s fa5, v8
+; CHECK-NEXT: vfmv.f.s fa4, v9
+; CHECK-NEXT: vfmv.f.s fa3, v10
+; CHECK-NEXT: vfredosum.vs v8, v14, v16
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vfslide1up.vf v9, v8, fa3
+; CHECK-NEXT: vfslide1up.vf v10, v9, fa4
+; CHECK-NEXT: vfslide1up.vf v8, v10, fa5
+; CHECK-NEXT: ret
+ %247 = tail call float @llvm.vector.reduce.fadd.v8f32(float %start, <8 x float> %arg1)
+ %248 = insertelement <4 x float> poison, float %247, i64 0
+ %250 = tail call float @llvm.vector.reduce.fadd.v8f32(float %start, <8 x float> %arg2)
+ %251 = insertelement <4 x float> %248, float %250, i64 1
+ %252 = tail call float @llvm.vector.reduce.fadd.v8f32(float %start, <8 x float> %arg3)
+ %253 = insertelement <4 x float> %251, float %252, i64 2
+ %254 = tail call float @llvm.vector.reduce.fadd.v8f32(float %start, <8 x float> %arg4)
+ %255 = insertelement <4 x float> %253, float %254, i64 3
+ ret <4 x float> %255
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
index d9bb007a10f71..a02117fdd2833 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
@@ -3416,5 +3416,204 @@ define <4 x i1> @buildvec_i1_splat(i1 %e1) {
ret <4 x i1> %v4
}
+define <4 x i32> @buildvec_vredsum(<8 x i32> %arg0, <8 x i32> %arg1, <8 x i32> %arg2, <8 x i32> %arg3) nounwind {
+; RV32-LABEL: buildvec_vredsum:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; RV32-NEXT: vmv.s.x v16, zero
+; RV32-NEXT: vredsum.vs v8, v8, v16
+; RV32-NEXT: vredsum.vs v9, v10, v16
+; RV32-NEXT: vredsum.vs v10, v12, v16
+; RV32-NEXT: vmv.x.s a0, v8
+; RV32-NEXT: vmv.x.s a1, v9
+; RV32-NEXT: vmv.x.s a2, v10
+; RV32-NEXT: vredsum.vs v8, v14, v16
+; RV32-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; RV32-NEXT: vslide1up.vx v9, v8, a2
+; RV32-NEXT: vslide1up.vx v10, v9, a1
+; RV32-NEXT: vslide1up.vx v8, v10, a0
+; RV32-NEXT: ret
+;
+; RV64V-ONLY-LABEL: buildvec_vredsum:
+; RV64V-ONLY: # %bb.0:
+; RV64V-ONLY-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; RV64V-ONLY-NEXT: vmv.s.x v16, zero
+; RV64V-ONLY-NEXT: vredsum.vs v8, v8, v16
+; RV64V-ONLY-NEXT: vredsum.vs v9, v10, v16
+; RV64V-ONLY-NEXT: vredsum.vs v10, v12, v16
+; RV64V-ONLY-NEXT: vmv.x.s a0, v8
+; RV64V-ONLY-NEXT: vmv.x.s a1, v9
+; RV64V-ONLY-NEXT: vmv.x.s a2, v10
+; RV64V-ONLY-NEXT: vredsum.vs v8, v14, v16
+; RV64V-ONLY-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; RV64V-ONLY-NEXT: vslide1up.vx v9, v8, a2
+; RV64V-ONLY-NEXT: vslide1up.vx v10, v9, a1
+; RV64V-ONLY-NEXT: vslide1up.vx v8, v10, a0
+; RV64V-ONLY-NEXT: ret
+;
+; RVA22U64-LABEL: buildvec_vredsum:
+; RVA22U64: # %bb.0:
+; RVA22U64-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; RVA22U64-NEXT: vmv.s.x v16, zero
+; RVA22U64-NEXT: vredsum.vs v8, v8, v16
+; RVA22U64-NEXT: vredsum.vs v9, v10, v16
+; RVA22U64-NEXT: vredsum.vs v10, v12, v16
+; RVA22U64-NEXT: vredsum.vs v11, v14, v16
+; RVA22U64-NEXT: vmv.x.s a0, v8
+; RVA22U64-NEXT: vmv.x.s a1, v9
+; RVA22U64-NEXT: vmv.x.s a2, v10
+; RVA22U64-NEXT: slli a1, a1, 32
+; RVA22U64-NEXT: add.uw a0, a0, a1
+; RVA22U64-NEXT: vmv.x.s a1, v11
+; RVA22U64-NEXT: slli a1, a1, 32
+; RVA22U64-NEXT: add.uw a1, a2, a1
+; RVA22U64-NEXT: vsetivli zero, 2, e64, m1, ta, ma
+; RVA22U64-NEXT: vmv.v.x v8, a0
+; RVA22U64-NEXT: vslide1down.vx v8, v8, a1
+; RVA22U64-NEXT: ret
+;
+; RVA22U64-PACK-LABEL: buildvec_vredsum:
+; RVA22U64-PACK: # %bb.0:
+; RVA22U64-PACK-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; RVA22U64-PACK-NEXT: vmv.s.x v16, zero
+; RVA22U64-PACK-NEXT: vredsum.vs v8, v8, v16
+; RVA22U64-PACK-NEXT: vredsum.vs v9, v10, v16
+; RVA22U64-PACK-NEXT: vredsum.vs v10, v12, v16
+; RVA22U64-PACK-NEXT: vredsum.vs v11, v14, v16
+; RVA22U64-PACK-NEXT: vmv.x.s a0, v8
+; RVA22U64-PACK-NEXT: vmv.x.s a1, v9
+; RVA22U64-PACK-NEXT: vmv.x.s a2, v10
+; RVA22U64-PACK-NEXT: pack a0, a0, a1
+; RVA22U64-PACK-NEXT: vmv.x.s a1, v11
+; RVA22U64-PACK-NEXT: pack a1, a2, a1
+; RVA22U64-PACK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
+; RVA22U64-PACK-NEXT: vmv.v.x v8, a0
+; RVA22U64-PACK-NEXT: vslide1down.vx v8, v8, a1
+; RVA22U64-PACK-NEXT: ret
+;
+; RV64ZVE32-LABEL: buildvec_vredsum:
+; RV64ZVE32: # %bb.0:
+; RV64ZVE32-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; RV64ZVE32-NEXT: vmv.s.x v16, zero
+; RV64ZVE32-NEXT: vredsum.vs v8, v8, v16
+; RV64ZVE32-NEXT: vredsum.vs v9, v10, v16
+; RV64ZVE32-NEXT: vredsum.vs v10, v12, v16
+; RV64ZVE32-NEXT: vmv.x.s a0, v8
+; RV64ZVE32-NEXT: vmv.x.s a1, v9
+; RV64ZVE32-NEXT: vmv.x.s a2, v10
+; RV64ZVE32-NEXT: vredsum.vs v8, v14, v16
+; RV64ZVE32-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; RV64ZVE32-NEXT: vslide1up.vx v9, v8, a2
+; RV64ZVE32-NEXT: vslide1up.vx v10, v9, a1
+; RV64ZVE32-NEXT: vslide1up.vx v8, v10, a0
+; RV64ZVE32-NEXT: ret
+ %247 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %arg0)
+ %248 = insertelement <4 x i32> poison, i32 %247, i64 0
+ %250 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %arg1)
+ %251 = insertelement <4 x i32> %248, i32 %250, i64 1
+ %252 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %arg2)
+ %253 = insertelement <4 x i32> %251, i32 %252, i64 2
+ %254 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %arg3)
+ %255 = insertelement <4 x i32> %253, i32 %254, i64 3
+ ret <4 x i32> %255
+}
+
+define <4 x i32> @buildvec_vredmax(<8 x i32> %arg0, <8 x i32> %arg1, <8 x i32> %arg2, <8 x i32> %arg3) nounwind {
+; RV32-LABEL: buildvec_vredmax:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; RV32-NEXT: vredmaxu.vs v8, v8, v8
+; RV32-NEXT: vredmaxu.vs v9, v10, v10
+; RV32-NEXT: vredmaxu.vs v10, v12, v12
+; RV32-NEXT: vmv.x.s a0, v8
+; RV32-NEXT: vmv.x.s a1, v9
+; RV32-NEXT: vmv.x.s a2, v10
+; RV32-NEXT: vredmaxu.vs v8, v14, v14
+; RV32-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; RV32-NEXT: vslide1up.vx v9, v8, a2
+; RV32-NEXT: vslide1up.vx v10, v9, a1
+; RV32-NEXT: vslide1up.vx v8, v10, a0
+; RV32-NEXT: ret
+;
+; RV64V-ONLY-LABEL: buildvec_vredmax:
+; RV64V-ONLY: # %bb.0:
+; RV64V-ONLY-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; RV64V-ONLY-NEXT: vredmaxu.vs v8, v8, v8
+; RV64V-ONLY-NEXT: vredmaxu.vs v9, v10, v10
+; RV64V-ONLY-NEXT: vredmaxu.vs v10, v12, v12
+; RV64V-ONLY-NEXT: vmv.x.s a0, v8
+; RV64V-ONLY-NEXT: vmv.x.s a1, v9
+; RV64V-ONLY-NEXT: vmv.x.s a2, v10
+; RV64V-ONLY-NEXT: vredmaxu.vs v8, v14, v14
+; RV64V-ONLY-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; RV64V-ONLY-NEXT: vslide1up.vx v9, v8, a2
+; RV64V-ONLY-NEXT: vslide1up.vx v10, v9, a1
+; RV64V-ONLY-NEXT: vslide1up.vx v8, v10, a0
+; RV64V-ONLY-NEXT: ret
+;
+; RVA22U64-LABEL: buildvec_vredmax:
+; RVA22U64: # %bb.0:
+; RVA22U64-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; RVA22U64-NEXT: vredmaxu.vs v8, v8, v8
+; RVA22U64-NEXT: vredmaxu.vs v9, v10, v10
+; RVA22U64-NEXT: vredmaxu.vs v10, v12, v12
+; RVA22U64-NEXT: vredmaxu.vs v11, v14, v14
+; RVA22U64-NEXT: vmv.x.s a0, v8
+; RVA22U64-NEXT: vmv.x.s a1, v9
+; RVA22U64-NEXT: vmv.x.s a2, v10
+; RVA22U64-NEXT: slli a1, a1, 32
+; RVA22U64-NEXT: add.uw a0, a0, a1
+; RVA22U64-NEXT: vmv.x.s a1, v11
+; RVA22U64-NEXT: slli a1, a1, 32
+; RVA22U64-NEXT: add.uw a1, a2, a1
+; RVA22U64-NEXT: vsetivli zero, 2, e64, m1, ta, ma
+; RVA22U64-NEXT: vmv.v.x v8, a0
+; RVA22U64-NEXT: vslide1down.vx v8, v8, a1
+; RVA22U64-NEXT: ret
+;
+; RVA22U64-PACK-LABEL: buildvec_vredmax:
+; RVA22U64-PACK: # %bb.0:
+; RVA22U64-PACK-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; RVA22U64-PACK-NEXT: vredmaxu.vs v8, v8, v8
+; RVA22U64-PACK-NEXT: vredmaxu.vs v9, v10, v10
+; RVA22U64-PACK-NEXT: vredmaxu.vs v10, v12, v12
+; RVA22U64-PACK-NEXT: vredmaxu.vs v11, v14, v14
+; RVA22U64-PACK-NEXT: vmv.x.s a0, v8
+; RVA22U64-PACK-NEXT: vmv.x.s a1, v9
+; RVA22U64-PACK-NEXT: vmv.x.s a2, v10
+; RVA22U64-PACK-NEXT: pack a0, a0, a1
+; RVA22U64-PACK-NEXT: vmv.x.s a1, v11
+; RVA22U64-PACK-NEXT: pack a1, a2, a1
+; RVA22U64-PACK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
+; RVA22U64-PACK-NEXT: vmv.v.x v8, a0
+; RVA22U64-PACK-NEXT: vslide1down.vx v8, v8, a1
+; RVA22U64-PACK-NEXT: ret
+;
+; RV64ZVE32-LABEL: buildvec_vredmax:
+; RV64ZVE32: # %bb.0:
+; RV64ZVE32-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; RV64ZVE32-NEXT: vredmaxu.vs v8, v8, v8
+; RV64ZVE32-NEXT: vredmaxu.vs v9, v10, v10
+; RV64ZVE32-NEXT: vredmaxu.vs v10, v12, v12
+; RV64ZVE32-NEXT: vmv.x.s a0, v8
+; RV64ZVE32-NEXT: vmv.x.s a1, v9
+; RV64ZVE32-NEXT: vmv.x.s a2, v10
+; RV64ZVE32-NEXT: vredmaxu.vs v8, v14, v14
+; RV64ZVE32-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; RV64ZVE32-NEXT: vslide1up.vx v9, v8, a2
+; RV64ZVE32-NEXT: vslide1up.vx v10, v9, a1
+; RV64ZVE32-NEXT: vslide1up.vx v8, v10, a0
+; RV64ZVE32-NEXT: ret
+ %247 = tail call i32 @llvm.vector.reduce.umax.v8i32(<8 x i32> %arg0)
+ %248 = insertelement <4 x i32> poison, i32 %247, i64 0
+ %250 = tail call i32 @llvm.vector.reduce.umax.v8i32(<8 x i32> %arg1)
+ %251 = insertelement <4 x i32> %248, i32 %250, i64 1
+ %252 = tail call i32 @llvm.vector.reduce.umax.v8i32(<8 x i32> %arg2)
+ %253 = insertelement <4 x i32> %251, i32 %252, i64 2
+ %254 = tail call i32 @llvm.vector.reduce.umax.v8i32(<8 x i32> %arg3)
+ %255 = insertelement <4 x i32> %253, i32 %254, i64 3
+ ret <4 x i32> %255
+}
+
;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
; RV64: {{.*}}
diff --git a/llvm/test/CodeGen/RISCV/rvv/redundant-vfmvsf.ll b/llvm/test/CodeGen/RISCV/rvv/redundant-vfmvsf.ll
index da912bf401ec0..821d4240827fb 100644
--- a/llvm/test/CodeGen/RISCV/rvv/redundant-vfmvsf.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/redundant-vfmvsf.ll
@@ -9,12 +9,11 @@ define <2 x float> @redundant_vfmv(<2 x float> %arg0, <64 x float> %arg1, <64 x
; CHECK-NEXT: vfredusum.vs v9, v12, v8
; CHECK-NEXT: vsetivli zero, 1, e32, mf2, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 1
+; CHECK-NEXT: vfmv.f.s fa5, v9
; CHECK-NEXT: vsetvli zero, a0, e32, m4, ta, ma
-; CHECK-NEXT: vfredusum.vs v8, v16, v8
-; CHECK-NEXT: vfmv.f.s fa5, v8
+; CHECK-NEXT: vfredusum.vs v9, v16, v8
; CHECK-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
-; CHECK-NEXT: vrgather.vi v8, v9, 0
-; CHECK-NEXT: vfslide1down.vf v8, v8, fa5
+; CHECK-NEXT: vfslide1up.vf v8, v9, fa5
; CHECK-NEXT: ret
%s0 = extractelement <2 x float> %arg0, i64 0
%r0 = tail call reassoc float @llvm.vector.reduce.fadd.v64f32(float %s0, <64 x float> %arg1)
Force-pushed from e5aa2a8 to 217402a
Review comment on:

                        Mask, VL, Policy);
      };

      // General case: splat the first operand and sliding other operands down one

Suggested change:
    - // General case: splat the first operand and sliding other operands down one
    + // General case: splat the first operand and slide other operands down one

Reply: Fixed
Co-authored-by: Craig Topper <craig.topper@sifive.com>
Co-Authored-By: Luke Lau <luke@igalia.com>
Review comment on:

      auto OpCode =
          VT.isFloatingPoint() ? RISCVISD::VFSLIDE1DOWN_VL : RISCVISD::VSLIDE1DOWN_VL;

I think it's spelt as one word in the rest of LLVM.

Suggested change:
    - unsigned OpCode;
    + unsigned Opcode;

Reply: Fixed.
Reviewer: Gentle reverse ping! I don't see any new commits, were they pushed?

Author: Oops, yes, I forgot to push, thanks for reminding me. It should be synced now.
Review comment on:

        if (!VT.isFloatingPoint())
          V = DAG.getNode(ISD::ANY_EXTEND, DL, Subtarget.getXLenVT(), V);
        Vec = DAG.getNode(OpCode, DL, ContainerVT, DAG.getUNDEF(ContainerVT), Vec,
                          V, Mask, VL);
      }
      if (UndefCount) {
        const SDValue Offset = DAG.getConstant(UndefCount, DL, Subtarget.getXLenVT());
        Vec = getVSlide(SlideUp, ContainerVT, DAG.getUNDEF(ContainerVT), Vec,

Reviewer: I don't think we have test coverage for emitting a vslideup here. I think we can add a test for an undef in the middle of the build_vector, and another test with an undef at the start of the build_vector.

Author: I've added tests for both cases.
Reviewer: So, a high-level question on profitability. The reason we prefer slidedown is that slideup has a register constraint which forces the use of two registers through the build sequence. Particularly at high LMUL, doubling the register pressure is really expensive. We do cap at m2 today, so maybe that's not too bad a problem?

For the motivating sequence, have you considered using a normal slidedown sequence, and then a single vslideup (not slide1up) from the source register into the last destination? I think you end up with one extra slide1down (to put the other elements in the right spot), but this might be cheaper overall.

Author: I think this is a good idea, though to give even more context: my actual motivating example is a build_vector where every operand comes from a reduction -- that is, every operand is an (extract_element X, 0). With this patch, the slide-down sequence is turned into a slide-up sequence, but that alone only saves a single instruction (the splat) no matter how many operands there are. However, if we also take the follow-up patch #154847 into consideration, that patch further eliminates all the vector-to-scalar moves, and the number of eliminated moves is proportional to the number of build_vector operands. In other words, #154847 is a bigger win, and it more or less depends on this patch.

The thing is, I don't think we can implement #154847's move-elimination algorithm with vslidedown, because vslidedown reads past the VL, unless we concatenate v9 after v8 before sliding down, which I don't think will be profitable. (Note: the approach you proposed would eliminate the vector-to-scalar move for the last operand, but not for the other operands.)

That being said, I understand your concern about the register pressure imposed by vslideup/vslide1up. Let me think about this.
…extraction from first element
Author: I have limited the scope of this optimization: it now only triggers when every operand in the build_vector is an extraction from the first vector element. This is a tradeoff between the register pressure brought by vslide1up and the potential benefit of eliminating vector-to-scalar moves after #154847 lands.
Review comment on:

      // capture the wrong EVec.
      for (SDValue V : Operands) {
        using namespace SDPatternMatch;
        SlideUp = V.isUndef() || sd_match(V, m_ExtractElt(m_Value(EVec), m_Zero()));

Author: Note that even if we have interleaving undef "intervals", in the worst case the number of those intervals will only be one more than the number of non-undef values.
The general lowering of build_vector starts with splatting the first operand before sliding down the other operands one by one. However, if every operand is an extract_element from the first vector element, we can use the original vector (the source of the extraction) from the last build_vector operand as the start value and slide up the other operands (in reverse order) one by one. By doing so we avoid the initial splat and can later eliminate the vector-to-scalar movement, which is something we cannot do with vslidedown/vslide1down.

Original context: #154175 (comment)