[LoopVectorize] Generate wide active lane masks #147535
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-backend-aarch64

Author: Kerry McLaughlin (kmclaughlin-arm)

Changes

This patch adds a new flag (-enable-wide-lane-mask) which allows
The transform in extractFromWideActiveLaneMask creates vector
An additional operand is passed to the ActiveLaneMask instruction,
The motivation for this change is to improve interleaved loops when
This is based on a PR that was created by @momchil-velikov (#81140)

Patch is 73.24 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/147535.diff

15 Files Affected:
diff --git a/llvm/lib/Analysis/VectorUtils.cpp b/llvm/lib/Analysis/VectorUtils.cpp
index 63fccee63c0ae..1dff9c3513a28 100644
--- a/llvm/lib/Analysis/VectorUtils.cpp
+++ b/llvm/lib/Analysis/VectorUtils.cpp
@@ -163,6 +163,7 @@ bool llvm::isVectorIntrinsicWithScalarOpAtArg(Intrinsic::ID ID,
case Intrinsic::is_fpclass:
case Intrinsic::vp_is_fpclass:
case Intrinsic::powi:
+ case Intrinsic::vector_extract:
return (ScalarOpdIdx == 1);
case Intrinsic::smul_fix:
case Intrinsic::smul_fix_sat:
@@ -195,6 +196,7 @@ bool llvm::isVectorIntrinsicWithOverloadTypeAtArg(
case Intrinsic::vp_llrint:
case Intrinsic::ucmp:
case Intrinsic::scmp:
+ case Intrinsic::vector_extract:
return OpdIdx == -1 || OpdIdx == 0;
case Intrinsic::modf:
case Intrinsic::sincos:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index 144f35e10132f..dd54d964f8883 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -44,6 +44,7 @@ class VPRecipeBuilder;
struct VFRange;
extern cl::opt<bool> EnableVPlanNativePath;
+extern cl::opt<bool> EnableWideActiveLaneMask;
extern cl::opt<unsigned> ForceTargetInstructionCost;
/// VPlan-based builder utility analogous to IRBuilder.
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index e7bae17dd2ceb..6e5f4caf93d23 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -356,6 +356,10 @@ cl::opt<bool> llvm::EnableVPlanNativePath(
cl::desc("Enable VPlan-native vectorization path with "
"support for outer loop vectorization."));
+cl::opt<bool> llvm::EnableWideActiveLaneMask(
+ "enable-wide-lane-mask", cl::init(false), cl::Hidden,
+ cl::desc("Enable use of wide get active lane mask instructions"));
+
cl::opt<bool>
llvm::VerifyEachVPlan("vplan-verify-each",
#ifdef EXPENSIVE_CHECKS
@@ -7328,7 +7332,10 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
VPlanTransforms::runPass(VPlanTransforms::addBranchWeightToMiddleTerminator,
BestVPlan, BestVF, VScale);
}
- VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
+ VPlanTransforms::optimizeForVFAndUF(
+ BestVPlan, BestVF, BestUF, PSE,
+ ILV.Cost->getTailFoldingStyle() ==
+ TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck);
VPlanTransforms::simplifyRecipes(BestVPlan, *Legal->getWidestInductionType());
VPlanTransforms::narrowInterleaveGroups(
BestVPlan, BestVF,
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 356af4a0e74e4..6080aa88ec306 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -954,6 +954,9 @@ class VPInstruction : public VPRecipeWithIRFlags,
// part if it is scalar. In the latter case, the recipe will be removed
// during unrolling.
ExtractPenultimateElement,
+ // Extracts a subvector from a vector (first operand) starting at a given
+ // offset (second operand).
+ ExtractSubvector,
LogicalAnd, // Non-poison propagating logical And.
// Add an offset in bytes (second operand) to a base pointer (first
// operand). Only generates scalar values (either for the first lane only or
@@ -1887,6 +1890,9 @@ class VPHeaderPHIRecipe : public VPSingleDefRecipe, public VPPhiAccessors {
return getOperand(1);
}
+ // Update the incoming value from the loop backedge.
+ void setBackedgeValue(VPValue *V) { setOperand(1, V); }
+
/// Returns the backedge value as a recipe. The backedge value is guaranteed
/// to be a recipe.
virtual VPRecipeBase &getBackedgeRecipe() {
@@ -3234,10 +3240,12 @@ class VPCanonicalIVPHIRecipe : public VPHeaderPHIRecipe {
/// TODO: It would be good to use the existing VPWidenPHIRecipe instead and
/// remove VPActiveLaneMaskPHIRecipe.
class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
+ unsigned UnrollPart = 0;
+
public:
- VPActiveLaneMaskPHIRecipe(VPValue *StartMask, DebugLoc DL)
- : VPHeaderPHIRecipe(VPDef::VPActiveLaneMaskPHISC, nullptr, StartMask,
- DL) {}
+ VPActiveLaneMaskPHIRecipe(VPValue *StartMask, DebugLoc DL, unsigned Part = 0)
+ : VPHeaderPHIRecipe(VPDef::VPActiveLaneMaskPHISC, nullptr, StartMask, DL),
+ UnrollPart(Part) {}
~VPActiveLaneMaskPHIRecipe() override = default;
@@ -3250,6 +3258,9 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
VP_CLASSOF_IMPL(VPDef::VPActiveLaneMaskPHISC)
+ unsigned getUnrollPart() { return UnrollPart; }
+ void setUnrollPart(unsigned Part) { UnrollPart = Part; }
+
/// Generate the active lane mask phi of the vector loop.
void execute(VPTransformState &State) override;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 92db9674ef42b..5e7f797b70978 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -74,6 +74,7 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
switch (Opcode) {
case Instruction::ExtractElement:
case Instruction::Freeze:
+ case VPInstruction::ExtractSubvector:
case VPInstruction::ReductionStartVector:
return inferScalarType(R->getOperand(0));
case Instruction::Select: {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h b/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
index efea99f22d086..62898bf2c1991 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
@@ -384,10 +384,11 @@ m_Broadcast(const Op0_t &Op0) {
return m_VPInstruction<VPInstruction::Broadcast>(Op0);
}
-template <typename Op0_t, typename Op1_t>
-inline BinaryVPInstruction_match<Op0_t, Op1_t, VPInstruction::ActiveLaneMask>
-m_ActiveLaneMask(const Op0_t &Op0, const Op1_t &Op1) {
- return m_VPInstruction<VPInstruction::ActiveLaneMask>(Op0, Op1);
+template <typename Op0_t, typename Op1_t, typename Op2_t>
+inline TernaryVPInstruction_match<Op0_t, Op1_t, Op2_t,
+ VPInstruction::ActiveLaneMask>
+m_ActiveLaneMask(const Op0_t &Op0, const Op1_t &Op1, const Op2_t &Op2) {
+ return m_VPInstruction<VPInstruction::ActiveLaneMask>(Op0, Op1, Op2);
}
template <typename Op0_t, typename Op1_t>
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index ccb7512051d77..c776d5cb91278 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -469,15 +469,16 @@ unsigned VPInstruction::getNumOperandsForOpcode(unsigned Opcode) {
case Instruction::ICmp:
case Instruction::FCmp:
case Instruction::Store:
- case VPInstruction::ActiveLaneMask:
case VPInstruction::BranchOnCount:
case VPInstruction::ComputeReductionResult:
+ case VPInstruction::ExtractSubvector:
case VPInstruction::FirstOrderRecurrenceSplice:
case VPInstruction::LogicalAnd:
case VPInstruction::PtrAdd:
case VPInstruction::WideIVStep:
return 2;
case Instruction::Select:
+ case VPInstruction::ActiveLaneMask:
case VPInstruction::ComputeAnyOfResult:
case VPInstruction::ReductionStartVector:
return 3;
@@ -614,7 +615,9 @@ Value *VPInstruction::generate(VPTransformState &State) {
Name);
auto *Int1Ty = Type::getInt1Ty(Builder.getContext());
- auto *PredTy = VectorType::get(Int1Ty, State.VF);
+ auto PredTy = VectorType::get(
+ Int1Ty, State.VF * cast<ConstantInt>(getOperand(2)->getLiveInIRValue())
+ ->getZExtValue());
return Builder.CreateIntrinsic(Intrinsic::get_active_lane_mask,
{PredTy, ScalarTC->getType()},
{VIVElem0, ScalarTC}, nullptr, Name);
@@ -846,6 +849,14 @@ Value *VPInstruction::generate(VPTransformState &State) {
Res->setName(Name);
return Res;
}
+ case VPInstruction::ExtractSubvector: {
+ Value *Vec = State.get(getOperand(0));
+ assert(State.VF.isVector());
+ auto Idx = cast<ConstantInt>(getOperand(1)->getLiveInIRValue());
+ auto ResTy = VectorType::get(
+ State.TypeAnalysis.inferScalarType(getOperand(0)), State.VF);
+ return Builder.CreateExtractVector(ResTy, Vec, Idx);
+ }
case VPInstruction::LogicalAnd: {
Value *A = State.get(getOperand(0));
Value *B = State.get(getOperand(1));
@@ -1044,6 +1055,7 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::ExtractLastElement:
case VPInstruction::ExtractPenultimateElement:
+ case VPInstruction::ExtractSubvector:
case VPInstruction::FirstActiveLane:
case VPInstruction::FirstOrderRecurrenceSplice:
case VPInstruction::LogicalAnd:
@@ -1174,6 +1186,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
case VPInstruction::ExtractPenultimateElement:
O << "extract-penultimate-element";
break;
+ case VPInstruction::ExtractSubvector:
+ O << "extract-subvector";
+ break;
case VPInstruction::ComputeAnyOfResult:
O << "compute-anyof-result";
break;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 90137b72c83fb..b8f14ca88e8a3 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -12,6 +12,7 @@
//===----------------------------------------------------------------------===//
#include "VPlanTransforms.h"
+#include "LoopVectorizationPlanner.h"
#include "VPRecipeBuilder.h"
#include "VPlan.h"
#include "VPlanAnalysis.h"
@@ -1432,20 +1433,93 @@ static bool isConditionTrueViaVFAndUF(VPValue *Cond, VPlan &Plan,
return SE.isKnownPredicate(CmpInst::ICMP_EQ, TripCount, C);
}
+static void extractFromWideActiveLaneMask(VPlan &Plan, ElementCount VF,
+ unsigned UF) {
+ VPRegionBlock *VectorRegion = Plan.getVectorLoopRegion();
+ auto *Header = cast<VPBasicBlock>(VectorRegion->getEntry());
+ VPBasicBlock *ExitingVPBB = VectorRegion->getExitingBasicBlock();
+ auto *Term = &ExitingVPBB->back();
+
+ VPCanonicalIVPHIRecipe *CanonicalIV = Plan.getCanonicalIV();
+ LLVMContext &Ctx = CanonicalIV->getScalarType()->getContext();
+ using namespace llvm::VPlanPatternMatch;
+
+ auto extractFromALM = [&](VPInstruction *ALM, VPInstruction *InsBefore,
+ SmallVectorImpl<VPValue *> &Extracts) {
+ VPBuilder Builder(InsBefore);
+ DebugLoc DL = ALM->getDebugLoc();
+ for (unsigned Part = 0; Part < UF; ++Part) {
+ SmallVector<VPValue *> Ops;
+ Ops.append({ALM, Plan.getOrAddLiveIn(
+ ConstantInt::get(IntegerType::getInt64Ty(Ctx),
+ VF.getKnownMinValue() * Part))});
+ Extracts.push_back(
+ Builder.createNaryOp(VPInstruction::ExtractSubvector, Ops, DL));
+ }
+ };
+
+ // Create a list of each active lane mask phi, ordered by unroll part.
+ SmallVector<VPActiveLaneMaskPHIRecipe *> Phis(UF, nullptr);
+ for (VPRecipeBase &R : Header->phis())
+ if (auto *Phi = dyn_cast<VPActiveLaneMaskPHIRecipe>(&R))
+ Phis[Phi->getUnrollPart()] = Phi;
+
+ assert(all_of(Phis, [](VPActiveLaneMaskPHIRecipe *Phi) { return Phi; }) &&
+ "Expected one VPActiveLaneMaskPHIRecipe for each unroll part");
+
+ // When using wide lane masks, the return type of the get.active.lane.mask
+ // intrinsic is VF x UF (second operand).
+ VPValue *ALMMultiplier =
+ Plan.getOrAddLiveIn(ConstantInt::get(IntegerType::getInt64Ty(Ctx), UF));
+ cast<VPInstruction>(Phis[0]->getStartValue())->setOperand(2, ALMMultiplier);
+ cast<VPInstruction>(Phis[0]->getBackedgeValue())
+ ->setOperand(2, ALMMultiplier);
+
+ // Create UF x extract vectors and insert into preheader.
+ SmallVector<VPValue *> EntryExtracts;
+ auto *EntryALM = cast<VPInstruction>(Phis[0]->getStartValue());
+ extractFromALM(EntryALM, cast<VPInstruction>(&EntryALM->getParent()->back()),
+ EntryExtracts);
+
+ // Create UF x extract vectors and insert before the loop compare & branch,
+ // updating the compare to use the first extract.
+ SmallVector<VPValue *> LoopExtracts;
+ auto *LoopALM = cast<VPInstruction>(Phis[0]->getBackedgeValue());
+ VPInstruction *Not = cast<VPInstruction>(Term->getOperand(0));
+ extractFromALM(LoopALM, Not, LoopExtracts);
+ Not->setOperand(0, LoopExtracts[0]);
+
+ // Update the incoming values of active lane mask phis.
+ for (unsigned Part = 0; Part < UF; ++Part) {
+ Phis[Part]->setStartValue(EntryExtracts[Part]);
+ Phis[Part]->setBackedgeValue(LoopExtracts[Part]);
+ }
+
+ return;
+}
+
/// Try to simplify the branch condition of \p Plan. This may restrict the
/// resulting plan to \p BestVF and \p BestUF.
-static bool simplifyBranchConditionForVFAndUF(VPlan &Plan, ElementCount BestVF,
- unsigned BestUF,
- PredicatedScalarEvolution &PSE) {
+static bool
+simplifyBranchConditionForVFAndUF(VPlan &Plan, ElementCount BestVF,
+ unsigned BestUF,
+ PredicatedScalarEvolution &PSE,
+ bool DataAndControlFlowWithoutRuntimeCheck) {
VPRegionBlock *VectorRegion = Plan.getVectorLoopRegion();
VPBasicBlock *ExitingVPBB = VectorRegion->getExitingBasicBlock();
auto *Term = &ExitingVPBB->back();
VPValue *Cond;
ScalarEvolution &SE = *PSE.getSE();
using namespace llvm::VPlanPatternMatch;
- if (match(Term, m_BranchOnCount(m_VPValue(), m_VPValue())) ||
- match(Term, m_BranchOnCond(
- m_Not(m_ActiveLaneMask(m_VPValue(), m_VPValue()))))) {
+ auto *Header = cast<VPBasicBlock>(VectorRegion->getEntry());
+ bool BranchALM = match(Term, m_BranchOnCond(m_Not(m_ActiveLaneMask(
+ m_VPValue(), m_VPValue(), m_VPValue()))));
+
+ if (BranchALM || match(Term, m_BranchOnCount(m_VPValue(), m_VPValue()))) {
+ if (BranchALM && DataAndControlFlowWithoutRuntimeCheck &&
+ EnableWideActiveLaneMask && BestVF.isVector() && BestUF > 1)
+ extractFromWideActiveLaneMask(Plan, BestVF, BestUF);
+
// Try to simplify the branch condition if TC <= VF * UF when the latch
// terminator is BranchOnCount or BranchOnCond where the input is
// Not(ActiveLaneMask).
@@ -1470,7 +1544,6 @@ static bool simplifyBranchConditionForVFAndUF(VPlan &Plan, ElementCount BestVF,
// The vector loop region only executes once. If possible, completely remove
// the region, otherwise replace the terminator controlling the latch with
// (BranchOnCond true).
- auto *Header = cast<VPBasicBlock>(VectorRegion->getEntry());
auto *CanIVTy = Plan.getCanonicalIV()->getScalarType();
if (all_of(
Header->phis(),
@@ -1507,14 +1580,15 @@ static bool simplifyBranchConditionForVFAndUF(VPlan &Plan, ElementCount BestVF,
return true;
}
-void VPlanTransforms::optimizeForVFAndUF(VPlan &Plan, ElementCount BestVF,
- unsigned BestUF,
- PredicatedScalarEvolution &PSE) {
+void VPlanTransforms::optimizeForVFAndUF(
+ VPlan &Plan, ElementCount BestVF, unsigned BestUF,
+ PredicatedScalarEvolution &PSE,
+ bool DataAndControlFlowWithoutRuntimeCheck) {
assert(Plan.hasVF(BestVF) && "BestVF is not available in Plan");
assert(Plan.hasUF(BestUF) && "BestUF is not available in Plan");
- bool MadeChange =
- simplifyBranchConditionForVFAndUF(Plan, BestVF, BestUF, PSE);
+ bool MadeChange = simplifyBranchConditionForVFAndUF(
+ Plan, BestVF, BestUF, PSE, DataAndControlFlowWithoutRuntimeCheck);
MadeChange |= optimizeVectorInductionWidthForTCAndVFUF(Plan, BestVF, BestUF);
if (MadeChange) {
@@ -2006,9 +2080,11 @@ static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
"index.part.next");
// Create the active lane mask instruction in the VPlan preheader.
- auto *EntryALM =
- Builder.createNaryOp(VPInstruction::ActiveLaneMask, {EntryIncrement, TC},
- DL, "active.lane.mask.entry");
+ VPValue *ALMMultiplier = Plan.getOrAddLiveIn(
+ ConstantInt::get(Plan.getCanonicalIV()->getScalarType(), 1));
+ auto *EntryALM = Builder.createNaryOp(VPInstruction::ActiveLaneMask,
+ {EntryIncrement, TC, ALMMultiplier}, DL,
+ "active.lane.mask.entry");
// Now create the ActiveLaneMaskPhi recipe in the main loop using the
// preheader ActiveLaneMask instruction.
@@ -2023,8 +2099,8 @@ static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
Builder.createOverflowingOp(VPInstruction::CanonicalIVIncrementForPart,
{IncrementValue}, {false, false}, DL);
auto *ALM = Builder.createNaryOp(VPInstruction::ActiveLaneMask,
- {InLoopIncrement, TripCount}, DL,
- "active.lane.mask.next");
+ {InLoopIncrement, TripCount, ALMMultiplier},
+ DL, "active.lane.mask.next");
LaneMaskPhi->addOperand(ALM);
// Replace the original terminator with BranchOnCond. We have to invert the
@@ -2101,9 +2177,12 @@ void VPlanTransforms::addActiveLaneMask(
Plan, DataAndControlFlowWithoutRuntimeCheck);
} else {
VPBuilder B = VPBuilder::getToInsertAfter(WideCanonicalIV);
- LaneMask = B.createNaryOp(VPInstruction::ActiveLaneMask,
- {WideCanonicalIV, Plan.getTripCount()}, nullptr,
- "active.lane.mask");
+ VPValue *ALMMultiplier = Plan.getOrAddLiveIn(
+ ConstantInt::get(Plan.getCanonicalIV()->getScalarType(), 1));
+ LaneMask =
+ B.createNaryOp(VPInstruction::ActiveLaneMask,
+ {WideCanonicalIV, Plan.getTripCount(), ALMMultiplier},
+ nullptr, "active.lane.mask");
}
// Walk users of WideCanonicalIV and replace all compares of the form
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 8d2eded45da22..920c7aa32cc97 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -109,7 +109,8 @@ struct VPlanTransforms {
/// resulting plan to \p BestVF and \p BestUF.
static void optimizeForVFAndUF(VPlan &Plan, ElementCount BestVF,
unsigned BestUF,
- PredicatedScalarEvolution &PSE);
+ PredicatedScalarEvolution &PSE,
+ bool DataAndControlFlowWithoutRuntimeCheck);
/// Apply VPlan-to-VPlan optimizations to \p Plan, including induction recipe
/// optimizations, dead recipe removal, replicate region optimizations and
diff --git a/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp b/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
index 2dd43c092ff7a..76a37d5ba839b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
@@ -250,6 +250,7 @@ void UnrollState::unrollHeaderPHIByUF(VPHeaderPHIRecipe *R,
} else {
assert(isa<VPActiveLaneMaskPHIRecipe>(R) &&
"unexpected header phi recipe not needing unrolled part");
+ cast<VPActiveLaneMaskPHIRecipe>(Copy)->setUnrollPart(Part);
}
}
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp b/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp
index 81bd21bb904c0..9fdc199fc1dfa 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp
@@ -61,7 +61,7 @@ bool vputils::isHeaderMask(const VPValue *V, VPlan &Plan) {
VPValue *A, *B;
using namespace VPlanPatternMatch;
- if (match(V, m_ActiveLaneMask(m_VPValue(A), m_VPValue(B))))
+ if (match(V, m_ActiveLaneMask(m_VPValue(A...
[truncated]
auto *CanIVTy = Plan.getCanonicalIV()->getScalarType();
if (all_of(
Header->phis(),
@@ -1507,14 +1580,15 @@ static bool simplifyBranchConditionForVFAndUF(VPlan &Plan, ElementCount BestVF,
return true;
}
-void VPlanTransforms::optimizeForVFAndUF(VPlan &Plan, ElementCount BestVF,
- unsigned BestUF,
- PredicatedScalarEvolution &PSE) {
+void VPlanTransforms::optimizeForVFAndUF(
+ VPlan &Plan, ElementCount BestVF, unsigned BestUF,
+ PredicatedScalarEvolution &PSE,
+ bool DataAndControlFlowWithoutRuntimeCheck) {
assert(Plan.hasVF(BestVF) && "BestVF is not available in Plan");
assert(Plan.hasUF(BestUF) && "BestUF is not available in Plan");
- bool MadeChange =
- simplifyBranchConditionForVFAndUF(Plan, BestVF, BestUF, PSE);
+ bool MadeChange = simplifyBranchConditionForVFAndUF(
+ Plan, BestVF, BestUF, PSE, DataAndControlFlowWithoutRuntimeCheck);
MadeChange |= optimizeVectorInductionWidthForTCAndVFUF(Plan, BestVF, BestUF);
if (MadeChange) {
@@ -2006,9 +2080,11 @@ static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
"index.part.next");
// Create the active lane mask instruction in the VPlan preheader.
- auto *EntryALM =
- Builder.createNaryOp(VPInstruction::ActiveLaneMask, {EntryIncrement, TC},
- DL, "active.lane.mask.entry");
+ VPValue *ALMMultiplier = Plan.getOrAddLiveIn(
+ ConstantInt::get(Plan.getCanonicalIV()->getScalarType(), 1));
+ auto *EntryALM = Builder.createNaryOp(VPInstruction::ActiveLaneMask,
+ {EntryIncrement, TC, ALMMultiplier}, DL,
+ "active.lane.mask.entry");
// Now create the ActiveLaneMaskPhi recipe in the main loop using the
// preheader ActiveLaneMask instruction.
@@ -2023,8 +2099,8 @@ static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
Builder.createOverflowingOp(VPInstruction::CanonicalIVIncrementForPart,
{IncrementValue}, {false, false}, DL);
auto *ALM = Builder.createNaryOp(VPInstruction::ActiveLaneMask,
- {InLoopIncrement, TripCount}, DL,
- "active.lane.mask.next");
+ {InLoopIncrement, TripCount, ALMMultiplier},
+ DL, "active.lane.mask.next");
LaneMaskPhi->addOperand(ALM);
// Replace the original terminator with BranchOnCond. We have to invert the
@@ -2101,9 +2177,12 @@ void VPlanTransforms::addActiveLaneMask(
Plan, DataAndControlFlowWithoutRuntimeCheck);
} else {
VPBuilder B = VPBuilder::getToInsertAfter(WideCanonicalIV);
- LaneMask = B.createNaryOp(VPInstruction::ActiveLaneMask,
- {WideCanonicalIV, Plan.getTripCount()}, nullptr,
- "active.lane.mask");
+ VPValue *ALMMultiplier = Plan.getOrAddLiveIn(
+ ConstantInt::get(Plan.getCanonicalIV()->getScalarType(), 1));
+ LaneMask =
+ B.createNaryOp(VPInstruction::ActiveLaneMask,
+ {WideCanonicalIV, Plan.getTripCount(), ALMMultiplier},
+ nullptr, "active.lane.mask");
}
// Walk users of WideCanonicalIV and replace all compares of the form
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 8d2eded45da22..920c7aa32cc97 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -109,7 +109,8 @@ struct VPlanTransforms {
/// resulting plan to \p BestVF and \p BestUF.
static void optimizeForVFAndUF(VPlan &Plan, ElementCount BestVF,
unsigned BestUF,
- PredicatedScalarEvolution &PSE);
+ PredicatedScalarEvolution &PSE,
+ bool DataAndControlFlowWithoutRuntimeCheck);
/// Apply VPlan-to-VPlan optimizations to \p Plan, including induction recipe
/// optimizations, dead recipe removal, replicate region optimizations and
diff --git a/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp b/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
index 2dd43c092ff7a..76a37d5ba839b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
@@ -250,6 +250,7 @@ void UnrollState::unrollHeaderPHIByUF(VPHeaderPHIRecipe *R,
} else {
assert(isa<VPActiveLaneMaskPHIRecipe>(R) &&
"unexpected header phi recipe not needing unrolled part");
+ cast<VPActiveLaneMaskPHIRecipe>(Copy)->setUnrollPart(Part);
}
}
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp b/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp
index 81bd21bb904c0..9fdc199fc1dfa 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp
@@ -61,7 +61,7 @@ bool vputils::isHeaderMask(const VPValue *V, VPlan &Plan) {
VPValue *A, *B;
using namespace VPlanPatternMatch;
- if (match(V, m_ActiveLaneMask(m_VPValue(A), m_VPValue(B))))
+ if (match(V, m_ActiveLaneMask(m_VPValue(A...
[truncated]
(Resolved review thread on llvm/test/Transforms/LoopVectorize/AArch64/fixed-wide-lane-mask.ll)
@@ -954,6 +954,9 @@ class VPInstruction : public VPRecipeWithIRFlags,
// part if it is scalar. In the latter case, the recipe will be removed
// during unrolling.
ExtractPenultimateElement,
// Extracts a subvector from a vector (first operand) starting at a given | |||
// offset (second operand). | |||
ExtractSubvector, |
It would be good to add this to VPInstruction::computeCost to make sure the cost is properly represented in the VPlan, although I know that currently the only use case is during plan execution, after the cost model.
I think this makes sense, though I have removed ExtractSubvector from VPInstruction in favour of using VPWidenIntrinsic to create the extract instead.
auto *PredTy = VectorType::get(Int1Ty, State.VF);
auto PredTy = VectorType::get(
    Int1Ty, State.VF * cast<ConstantInt>(getOperand(2)->getLiveInIRValue())
                ->getZExtValue());
I think that given we're now potentially generating a different mask we should update the cost for VPInstruction::ActiveLaneMask in VPInstruction::computeCost if using a wider mask. Again, it's not going to make much difference because the wider mask is generated after the cost model anyway, but good to have it for completeness.
I've included the extra operand where we calculate the cost of ActiveLaneMask, both in VPInstruction::computeCost and also in LoopVectorizationPlanner::selectVectorizationFactor. As you mentioned, at this point the multiplier is always 1 and so there were no changes to the existing cost model tests for ActiveLaneMask.
@@ -7328,7 +7332,10 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
VPlanTransforms::runPass(VPlanTransforms::addBranchWeightToMiddleTerminator,
                         BestVPlan, BestVF, VScale);
}
VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
VPlanTransforms::optimizeForVFAndUF(
Perhaps instead of passing in DataAndControlFlowWithoutRuntimeCheck to optimizeForVFAndUF and simplifyBranchConditionForVFAndUF, you could just pass a UseWideActiveLaneMask flag when calling optimizeForVFAndUF, i.e.
bool UseWideActiveLaneMask = EnableWideActiveLaneMask && ILV.Cost->getTailFoldingStyle() == TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck;
optimizeForVFAndUF(..., UseWideActiveLaneMask);
What do you think?
I think that would have cleaned this up, thanks! In the latest commit I am now only using the EnableWideActiveLaneMask flag to decide whether to emit a wide mask, so passing DataAndControlFlowWithoutRuntimeCheck is no longer required.
(Resolved review threads on llvm/test/Transforms/LoopVectorize/AArch64/fixed-wide-lane-mask.ll)
@@ -44,6 +44,7 @@ class VPRecipeBuilder;
struct VFRange;

extern cl::opt<bool> EnableVPlanNativePath;
extern cl::opt<bool> EnableWideActiveLaneMask;
This is useful for testing the new feature, although I guess to enable this by default in future you'll either need a new TTI hook to query the target's preference or compare the costs of using a wider mask + UF extracts with the costs of using UF normal masks and see which is cheapest?
Yes, for now this is just to test the feature and keep this first patch small, but enabling it without the flag will require more work to decide whether it's beneficial to use the wider mask.
SmallVector<VPActiveLaneMaskPHIRecipe *> Phis(UF, nullptr);
for (VPRecipeBase &R : Header->phis())
  if (auto *Phi = dyn_cast<VPActiveLaneMaskPHIRecipe>(&R))
    Phis[Phi->getUnrollPart()] = Phi;
Do we need the part here? or can we order the active-lane-mask phis by their backedge values?
I thought it was useful to add the unroll part to help with ordering the phis, but I've tried to get the part from the backedge values instead. With these changes I'm now assuming that anything other than a CanonicalIVIncrementForPart instruction is part 0. I'm happy to do this differently if I've missed a simpler way of ordering these, or is this similar to what you had in mind?
auto *Header = cast<VPBasicBlock>(VectorRegion->getEntry());
bool BranchALM = match(Term, m_BranchOnCond(m_Not(m_ActiveLaneMask(
                           m_VPValue(), m_VPValue(), m_VPValue()))));

if (BranchALM || match(Term, m_BranchOnCount(m_VPValue(), m_VPValue()))) {
  if (BranchALM && DataAndControlFlowWithoutRuntimeCheck &&
      EnableWideActiveLaneMask && BestVF.isVector() && BestUF > 1)
    extractFromWideActiveLaneMask(Plan, BestVF, BestUF);
is there any benefit from having this here? It doesn't seem to fit here, as it does not simplify the branch condition directly?
There's no real benefit, I've moved everything into useWideActiveLaneMask, which is now called directly from optimizeForVFAndUF.
simplifyBranchConditionForVFAndUF(VPlan &Plan, ElementCount BestVF,
                                  unsigned BestUF,
                                  PredicatedScalarEvolution &PSE,
                                  bool DataAndControlFlowWithoutRuntimeCheck) {
Why do we need to check DataAndControlFlowWithoutRuntimeCheck? Can't other active-lane masks be widened similarly?
There's nothing preventing other active lane masks from being widened similarly if the flag is set. I restricted the transform like this because if either DataAndControlFlowWithoutRuntimeCheck is true or IVUpdateMayOverflow is false, then we know that splitting the wider mask later could be done without worrying about overflow.
However, since there is nothing yet which checks if the start value may overflow during codegen of the get_active_lane_mask node, I don't think checking this here adds any value. The decision on whether we should generate the wider mask may need to take the tail folding style into account in the future but the transform itself doesn't require it, so I have removed it from this patch.
@@ -1432,20 +1433,93 @@ static bool isConditionTrueViaVFAndUF(VPValue *Cond, VPlan &Plan,
return SE.isKnownPredicate(CmpInst::ICMP_EQ, TripCount, C);
}

static void extractFromWideActiveLaneMask(VPlan &Plan, ElementCount VF,
                                          unsigned UF) {
VPRegionBlock *VectorRegion = Plan.getVectorLoopRegion();
IIUC this needs to be cost-driven, to only be done when the wider active-lane-mask is profitable?
Yes, to enable this without passing the extra flag we will need to decide whether it's profitable based on the cost of the wider mask, taking into account the features available on the target.
auto Idx = cast<ConstantInt>(getOperand(1)->getLiveInIRValue());
auto ResTy = VectorType::get(
    State.TypeAnalysis.inferScalarType(getOperand(0)), State.VF);
return Builder.CreateExtractVector(ResTy, Vec, Idx);
If this maps 1-1 to an intrinsic, can we just use VPWidenIntrinsic instead?
I did initially use VPWidenIntrinsic when I started working on this, but I wasn't sure if that should only be used when the intrinsic exists in the original IR. I've changed this back to use VPWidenIntrinsic to create a vector_extract now.
Force-pushed from 05a7ba9 to 9d75fe6.
LGTM with nit addressed! I think we can probably clean up the dead active lane masks in a follow-up patch, since the new flag is off by default anyway.
(Resolved review thread on llvm/test/Transforms/LoopVectorize/AArch64/sve-wide-lane-mask.ll)
@@ -1446,6 +1448,92 @@ static bool isConditionTrueViaVFAndUF(VPValue *Cond, VPlan &Plan,
return SE.isKnownPredicate(CmpInst::ICMP_EQ, VectorTripCount, C);
}

static bool useWideActiveLaneMask(VPlan &Plan, ElementCount VF, unsigned UF) {
Please add some documentation for the function. Also the name could probably be clearer, something like tryToReplaceActiveLaneMaskWithWideActiveLandMask.
Renamed to tryToReplaceALMWithWideALM and added documentation.
(Resolved review threads on llvm/test/Transforms/LoopVectorize/AArch64/sve-wide-lane-mask.ll)
@@ -481,6 +480,7 @@ unsigned VPInstruction::getNumOperandsForOpcode(unsigned Opcode) {
case VPInstruction::WideIVStep:
  return 2;
case Instruction::Select:
case VPInstruction::ActiveLaneMask:
Please add a comment to the ActiveLaneMask definition in VPlan.h to document the arguments.
unsigned Multiplier =
    cast<ConstantInt>(VPI->getOperand(2)->getLiveInIRValue())
        ->getZExtValue();
C += VPI->cost(VF * Multiplier, CostCtx);
Why multiply VF * Multiplier here? Looks like you also multiply it in computeCost?
It is also multiplied in computeCost; I have removed these changes.
using namespace llvm::VPlanPatternMatch;
if (!EnableWideActiveLaneMask || !VF.isVector() || UF == 1 ||
    !match(Term, m_BranchOnCond(m_Not(m_ActiveLaneMask(
The description of the patch says "when it is safe to do so (i.e. using tail folding without runtime checks)". Is checking for ActiveLaneMask here enough to ensure that the unsafe cases are rejected?
The "using tail folding without runtime checks" requirement is no longer accurate since removing the restriction that the tail folding style must be DataAndControlFlowWithoutRuntimeCheck. There is nothing preventing this transform from applying in other cases, as long as the ActiveLaneMask instruction is being used for data and control flow in the loop, which I believe checking for the pattern !match(Term, m_BranchOnCond(m_Not(m_ActiveLaneMask(... here is sufficient for. I will update the commit message.
@@ -958,6 +958,10 @@ class LLVM_ABI_FOR_TEST VPInstruction : public VPRecipeWithIRFlags,
Not,
SLPLoad,
SLPStore,
// Creates an active lane mask used by predicated vector operations in
// the loop. Elements in the mask are active if the corrosponding element
// in the source (first op) are less than the counter, starting at index
Is this correct? I thought it was something like: create a mask where each lane i of the mask is true when the current counter (starting with the value specified by operand 0) is less than the second operand, i.e.
mask[i] = icmp ult (op0 + i), op1
Thanks @david-arm, my comment was incorrect.
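For reference, the semantics David describes can be sketched with a small Python model (illustrative only, not the LLVM implementation): lane i of the mask is active when (op0 + i) is unsigned-less-than op1.

```python
def active_lane_mask(base, trip_count, num_lanes):
    # mask[i] = icmp ult (base + i), trip_count
    return [(base + i) < trip_count for i in range(num_lanes)]

# With 4 lanes, base 0 and trip count 3, only the first
# three lanes are active.
print(active_lane_mask(0, 3, 4))  # [True, True, True, False]
```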
This patch adds a new flag (-enable-wide-lane-mask) which allows LoopVectorize to generate wider-than-VF active lane masks when it is safe to do so (i.e. using tail folding without runtime checks). The transform in extractFromWideActiveLaneMask creates vector extracts from the first active lane mask in the header & loop body, modifying the active lane mask phi operands to use the extracts. An additional operand is passed to the ActiveLaneMask instruction, the value of which is used as a multiplier of VF when generating the mask. By default this is 1, and is updated to UF by extractFromWideActiveLaneMask. The motivation for this change is to improve interleaved loops when SVE2.1 is available, where we can make use of the whilelo instruction which returns a predicate pair. This is based on a PR that was created by @momchil-velikov (llvm#81140) and contains tests which were added there.
- Used the --filter-out-after option when regenerating test CHECK lines - Removed CodeGen test as similar tests exist in get-active-lane-mask-extract.ll
- Removed UnrollPart from VPActiveLaneMaskPHIRecipe & order based on phi backedge values - Include multiplier in VPInstruction::computeCost for ActiveLaneMask - Call useActiveLaneMask from optimizeForVFAndUF instead of simplifyBranchConditionForVFAndUF - Always enable wide active lane masks if flag is passed, not only if DataAndControlFlowWithoutRuntimeCheck - Added an assert that the incoming values to the first Phi are ActiveLaneMask instructions - Change -force-vector-interleave=0 to -force-vector-interleave=1 & removed tests for UF2 - Moved EnableWideActiveLaneMask to VPlanTransforms.cpp
- Update tests after rebasing
- Move some checks to the top of useWideActiveLaneMask - Add documentation to functions & rename useWideActiveLaneMask
Force-pushed from 11e89cd to d1419db.
@@ -4214,9 +4214,15 @@ VectorizationFactor LoopVectorizationPlanner::selectVectorizationFactor() {
}
}
}
[[fallthrough]];
Hmm, I don't think this is right, because previously when falling through it was hitting the VPInstruction::ExplicitVectorLength case below and adding on the cost of the instruction, i.e. C += VPI->cost(VF, CostCtx).
If you change this to
C += VPI->cost(VF, CostCtx);
break;
then it will be the same as before.
LGTM. Seems like you've addressed all the comments now - thank you!
This patch adds a new flag (-enable-wide-lane-mask) which allows
LoopVectorize to generate wider-than-VF active lane masks when it
is safe to do so (i.e. the mask is used for data and control flow).
The transform in extractFromWideActiveLaneMask creates vector
extracts from the first active lane mask in the header & loop body,
modifying the active lane mask phi operands to use the extracts.
An additional operand is passed to the ActiveLaneMask instruction,
the value of which is used as a multiplier of VF when generating the mask.
By default this is 1, and is updated to UF by extractFromWideActiveLaneMask.
The motivation for this change is to improve interleaved loops when
SVE2.1 is available, where we can make use of the whilelo instruction
which returns a predicate pair.
This is based on a PR that was created by @momchil-velikov (#81140)
and contains tests which were added there.