
ENH: Convert comparison from C universal intrinsics to C++ using Highway #28490


Open: wants to merge 14 commits into main

Conversation

Contributor

@ixgbe ixgbe commented Mar 13, 2025

No description provided.

@ixgbe
Contributor Author

ixgbe commented Mar 24, 2025

Hello @r-devulap

I apologize for the interruption, as I know you must be very busy. I submitted PR #28490 a while ago and was wondering if you might have time to review it when your schedule permits. I understand you're likely juggling many priorities, but I'd greatly appreciate any feedback you could provide on my contribution.

Thank you for all your work maintaining NumPy!

Best regards!

@r-devulap r-devulap requested a review from seiko2plus March 24, 2025 21:52
@r-devulap
Member

@ixgbe thanks for your contribution. @seiko2plus had a few comments about using a wrapper around Highway tags to make it easier to develop template functions. I have tagged him to add his comments as well.

@r-devulap
Member

From #21057: If I understood correctly, it was along the lines of

template <typename T>
struct OpEq {
#if NPY_SIMD
    template <typename T_ = std::enable_if_t<kSupportLane<T>, T>>
    NPY_FINLINE Vec<T> operator()(const Vec<T_> &a)
    { return a; }
    template <typename T_ = std::enable_if_t<kSupportLane<T>, T>>
    NPY_FINLINE auto operator()(const Vec<T_> &a, const Vec<T_> &b)
    { return Eq(a, b); }
#endif
    NPY_FINLINE T operator()(T a)
    { return a; }
    NPY_FINLINE npy_bool operator()(T a, T b)
    { return a == b; }
};
template <typename T>
struct OpNe {
#if NPY_SIMD
    template <typename T_ = std::enable_if_t<kSupportLane<T>, T>>
    NPY_FINLINE Vec<T> operator()(const Vec<T_> &a)
    { return a; }
    template <typename T_ = std::enable_if_t<kSupportLane<T>, T>>
    NPY_FINLINE auto operator()(const Vec<T_> &a, const Vec<T_> &b)
    { return Ne(a, b); }
#endif
    NPY_FINLINE T operator()(T a)
    { return a; }
    NPY_FINLINE npy_bool operator()(T a, T b)
    { return a != b; }
};
template <typename T>
struct OpLt {
#if NPY_SIMD
    template <typename T_ = std::enable_if_t<kSupportLane<T>, T>>
    NPY_FINLINE Vec<T> operator()(const Vec<T_> &a)
    { return a; }
    template <typename T_ = std::enable_if_t<kSupportLane<T>, T>>
    NPY_FINLINE auto operator()(const Vec<T_> &a, const Vec<T_> &b)
    { return Lt(a, b); }
#endif
    NPY_FINLINE T operator()(T a)
    { return a; }
    NPY_FINLINE npy_bool operator()(T a, T b)
    { return a < b; }
};
template <typename T>
struct OpLe {
#if NPY_SIMD
    template <typename T_ = std::enable_if_t<kSupportLane<T>, T>>
    NPY_FINLINE Vec<T> operator()(const Vec<T_> &a)
    { return a; }
    template <typename T_ = std::enable_if_t<kSupportLane<T>, T>>
    NPY_FINLINE auto operator()(const Vec<T_> &a, const Vec<T_> &b)
    { return Le(a, b); }
#endif
    NPY_FINLINE T operator()(T a)
    { return a; }
    NPY_FINLINE npy_bool operator()(T a, T b)
    { return a <= b; }
};

@ixgbe
Contributor Author

ixgbe commented Mar 27, 2025

Hi @seiko2plus,
I've updated the code to wrap Highway tags as suggested (#28490 (comment)). Could you please review the implementation when you have time?
Let me know if any further adjustments are needed. Thanks!

Member

@seiko2plus seiko2plus left a comment

During the last optimization meeting, I proposed a thin wrapper over Google's Highway
SIMD library to simplify its interface. The wrapper would eliminate the need for
class tags and use lane types directly, which can be deduced from the arguments in
most cases. We can also leverage namespaces for low-level register access and still
rely on lane type only.

While your last commits try to follow the structure of the proposed C++ universal
intrinsics example, the code still counts on tags.

So, I suggest using the following snippets as a reference to implement the wrapper:

We are going to need two headers to implement the wrapper: the first is the simd.hpp
header, which provides the SIMD interface, and the second is simd.inc.hpp, a template
header that is included multiple times under different namespaces to provide the
implementation of the SIMD interface.

The simd.hpp header will look like this:

/**
 * This header provides a thin wrapper over Google's Highway SIMD library.
 *
 * The wrapper aims to simplify the SIMD interface of Google's Highway by
 * getting rid of its class tags and using lane types directly, which can be
 * deduced from the arguments in most cases.
 */

#ifndef NUMPY__CORE_SRC_COMMON_SIMD_SIMD_HPP_
#define NUMPY__CORE_SRC_COMMON_SIMD_SIMD_HPP_

#ifndef NPY_DISABLE_OPTIMIZATION
// Highway SIMD is only available when optimization is enabled
#include <hwy/highway.h>
#define NPY_SIMDX 1  // since NPY_SIMD indicates the C universal intrinsics interface
#else
#define NPY_SIMDX 0
#endif

namespace np {

/// Represents the max SIMD width supported by the platform.
namespace simd {
#if NPY_SIMDX
/// The highway namespace alias.
/// We can not import all the symbols from the HWY_NAMESPACE because it will
/// conflict with the existing symbols in the numpy namespace.
namespace hn = hwy::HWY_NAMESPACE;
// internally used by the template header
template <typename TLane>
using _Tag = hn::ScalableTag<TLane>;
#endif
#include "simd.inc.hpp"
}  // namespace simd

/// Represents the 128-bit SIMD width.
namespace simd128 {
#if NPY_SIMDX
namespace hn = hwy::HWY_NAMESPACE;
template <typename TLane>
using _Tag = hn::Full128<TLane>;
#endif
#include "simd.inc.hpp"
}  // namespace simd128

}  // namespace np

#endif  // NUMPY__CORE_SRC_COMMON_SIMD_SIMD_HPP_

and the simd.inc.hpp header will look like this:

#ifndef NPY_SIMDX
#error "This is not a standalone header; use simd.hpp instead"
#endif

// NOTE: This file is included by simd.hpp multiple times with different namespaces
// so avoid including any headers here

// #define NPY_SIMDX 1  // uncomment to enable Highlighting

/**
 * Determines whether the specified lane type is supported by the SIMD extension.
 * Always defined as false when SIMD is not enabled so it can be used in SFINAE.
 *
 * @tparam TLane The lane type to check for support.
 */
template <typename TLane>
constexpr bool kSupportLane = NPY_SIMDX != 0;

#if !NPY_SIMDX
// defined as void when SIMD is not enabled, for SFINAE purposes
template <typename TLane>
using Vec = void;
// defined as void when SIMD is not enabled, for SFINAE purposes
template <typename TLane>
using Mask = void;
#endif

#if NPY_SIMDX

#if HWY_HAVE_FLOAT16
template <>
constexpr bool kSupportLane<Half> = true;
#endif

#if HWY_HAVE_FLOAT64
template <>
constexpr bool kSupportLane<double> = true;
#endif

/// Represents an N-lane vector based on the specified lane type.
template <typename TLane>
using Vec = hn::Vec<_Tag<TLane>>;

/// Represents a mask vector with boolean values or as a bitmask.
template <typename TLane>
using Mask = hn::Mask<_Tag<TLane>>;

/// Unaligned load of a vector from memory.
template <typename TLane>
HWY_API Vec<TLane>
LoadU(const TLane *ptr)
{
    return hn::LoadU(_Tag<TLane>(), ptr);
}

/// Unaligned store of a vector to memory.
template <typename TLane>
HWY_API void
StoreU(const Vec<TLane> &a, TLane *ptr)
{
    hn::StoreU(a, _Tag<TLane>(), ptr);
}

/// Returns the number of vector lanes based on the lane type.
template <typename TLane>
HWY_API constexpr size_t
Lanes(TLane tag = 0 /*optional tag*/)
{
    return hn::Lanes(_Tag<TLane>());
}

/// Returns an uninitialized N-lane vector.
template <typename TLane>
HWY_API Vec<TLane>
Undefined(TLane tag = 0 /*optional tag*/)
{
    return hn::Undefined(_Tag<TLane>());
}

// Import common Highway intrinsics:
using hn::Add;
using hn::And;
using hn::Mul;
using hn::Not;
using hn::Or;
using hn::Sub;
using hn::Xor;

#endif  // NPY_SIMDX

"@ixgbe, please hold off on applying this suggestion, even if you agree with it, until we reach consensus on the final interface design. @r-devulap, @Mousius, and @jan-wassenberg - your input on this matter would be valuable."

Note this is not my final review; I'm just adding some examples:

#include "simd/simd.h"
#include "loops_utils.h"
#include "loops.h"
#include <hwy/highway.h>
Member

Suggested change
#include <hwy/highway.h>
#include "simd/simd.hpp"

will be used instead; note that Highway SIMD is only available when optimization is enabled (the default, through a meson flag)


namespace {

namespace hn = hwy::HWY_NAMESPACE;
Member

Suggested change
namespace hn = hwy::HWY_NAMESPACE;
using namespace np::simd;

example 0

Comment on lines 12 to 65
const hn::ScalableTag<uint8_t> u8;
const hn::ScalableTag<int8_t> s8;
const hn::ScalableTag<uint16_t> u16;
const hn::ScalableTag<int16_t> s16;
const hn::ScalableTag<uint32_t> u32;
const hn::ScalableTag<int32_t> s32;
const hn::ScalableTag<uint64_t> u64;
const hn::ScalableTag<int64_t> s64;
const hn::ScalableTag<float> f32;
const hn::ScalableTag<double> f64;

using vec_u8 = hn::Vec<decltype(u8)>;
using vec_s8 = hn::Vec<decltype(s8)>;
using vec_u16 = hn::Vec<decltype(u16)>;
using vec_s16 = hn::Vec<decltype(s16)>;
using vec_u32 = hn::Vec<decltype(u32)>;
using vec_s32 = hn::Vec<decltype(s32)>;
using vec_u64 = hn::Vec<decltype(u64)>;
using vec_s64 = hn::Vec<decltype(s64)>;
using vec_f32 = hn::Vec<decltype(f32)>;
using vec_f64 = hn::Vec<decltype(f64)>;

template<typename T>
struct TagSelector;

template<> struct TagSelector<uint8_t> { static const auto& value() { return u8; } };
template<> struct TagSelector<int8_t> { static const auto& value() { return s8; } };
template<> struct TagSelector<uint16_t> { static const auto& value() { return u16; } };
template<> struct TagSelector<int16_t> { static const auto& value() { return s16; } };
template<> struct TagSelector<uint32_t> { static const auto& value() { return u32; } };
template<> struct TagSelector<int32_t> { static const auto& value() { return s32; } };
template<> struct TagSelector<uint64_t> { static const auto& value() { return u64; } };
template<> struct TagSelector<int64_t> { static const auto& value() { return s64; } };
template<> struct TagSelector<float> { static const auto& value() { return f32; } };
template<> struct TagSelector<double> { static const auto& value() { return f64; } };

template<typename T>
constexpr const auto& GetTag() {
    return TagSelector<T>::value();
}

template <typename T>
constexpr bool kSupportLane = false;

template <> constexpr bool kSupportLane<uint8_t> = true;
template <> constexpr bool kSupportLane<int8_t> = true;
template <> constexpr bool kSupportLane<uint16_t> = true;
template <> constexpr bool kSupportLane<int16_t> = true;
template <> constexpr bool kSupportLane<uint32_t> = true;
template <> constexpr bool kSupportLane<int32_t> = true;
template <> constexpr bool kSupportLane<uint64_t> = true;
template <> constexpr bool kSupportLane<int64_t> = true;
template <> constexpr bool kSupportLane<float> = true;
template <> constexpr bool kSupportLane<double> = true;
Member

Suggested change (remove the whole block above);

there will be no need for it.

struct OpEq {
#if NPY_SIMD
template <typename V, typename = std::enable_if_t<kSupportLane<T>>>
HWY_INLINE HWY_ATTR auto operator()(const V &v)
Member

@seiko2plus seiko2plus Mar 28, 2025

Suggested change
HWY_INLINE HWY_ATTR auto operator()(const V &v)
HWY_INLINE auto operator()(const V &v)

No need for it, I suppose; and even if we decide to move to the Highway CPU dispatcher, its start/end macros can be used instead.

Comment on lines 268 to 270
const int vstep = hn::Lanes(u8);
const size_t nlanes = hn::Lanes(GetTag<T>());
const vec_u8 truemask = hn::Set(u8, 0x1);
Member

Suggested change
const int vstep = hn::Lanes(u8);
const size_t nlanes = hn::Lanes(GetTag<T>());
const vec_u8 truemask = hn::Set(u8, 0x1);
const int vstep = Lanes<uint8_t>();
const size_t nlanes = Lanes<T>();
const Vec<uint8_t> truemask = Set(uint8_t(0x1));

example 1

Comment on lines 274 to 275
auto a1 = op(hn::LoadU(GetTag<T>(), src1 + nlanes * 0));
auto b1 = op(hn::LoadU(GetTag<T>(), src2 + nlanes * 0));
Member

Suggested change
auto a1 = op(hn::LoadU(GetTag<T>(), src1 + nlanes * 0));
auto b1 = op(hn::LoadU(GetTag<T>(), src2 + nlanes * 0));
auto a1 = op(LoadU(src1 + nlanes * 0));
auto b1 = op(LoadU(src2 + nlanes * 0));

example 2

Comment on lines 290 to 291
auto m3_vec = hn::VecFromMask(GetTag<T>(), m3);
auto m4_vec = hn::VecFromMask(GetTag<T>(), m4);
Member

Suggested change
auto m3_vec = hn::VecFromMask(GetTag<T>(), m3);
auto m4_vec = hn::VecFromMask(GetTag<T>(), m4);
auto m3_vec = VecFromMask<T>(m3);
auto m4_vec = VecFromMask<T>(m4);

example 3

}
}
else {
ret = hn::BitCast(u8, m1_vec);
Member

Suggested change
ret = hn::BitCast(u8, m1_vec);
ret = BitCast<uint8_t>(m1_vec);

example 4

else {
ret = hn::BitCast(u8, m1_vec);
}
hn::StoreU(hn::And(ret, truemask), u8, dst);
Member

Suggested change
hn::StoreU(hn::And(ret, truemask), u8, dst);
StoreU(And(ret, truemask), dst);

example 5

template <typename T=uint8_t>
struct OpGeBool {};

#if !defined(__s390x__) && !defined(__arm__) && !defined(__loongarch64) && !defined(__loongarch64__)
Member

Suggested change
#if !defined(__s390x__) && !defined(__arm__) && !defined(__loongarch64) && !defined(__loongarch64__)

Snap for the build error, please

Contributor Author

Snap for the build error, please

build error: OrderedTruncate2To is not a member of hn

I use hn::OrderedTruncate2To; however, it seems there is no support for hn::OrderedTruncate2To on the s390x, ARM32, and loongarch64 architectures.

Member

@seiko2plus seiko2plus Mar 28, 2025

OrderedTruncate2To is only available when HWY_TARGET != HWY_SCALAR. Highway appears to be falling back to the scalar implementation, which makes sense for architectures like z13 and loongarch that Highway doesn't directly support. For armhf and risc-v, further investigation is needed.

To avoid performance regression, implementing new Highway intrinsics to pack mask data types (similar to the universal intrinsics) would be better than the current approach of bitmask-to-vector conversion followed by packing. Ideally, only one VecFromMask would be needed for lane types larger than 8 bits. This mask conversion is zero-cost only for SIMD extensions that don't provide native mask operations.

Contributor Author

OrderedTruncate2To is only available when HWY_TARGET != HWY_SCALAR. Highway appears to be falling back to the scalar implementation, which makes sense for architectures like z13 and loongarch that Highway doesn't directly support. For armhf and risc-v, further investigation is needed.

To avoid performance regression, implementing new Highway intrinsics to pack mask data types (similar to the universal intrinsics) would be better than the current approach of bitmask-to-vector conversion followed by packing. Ideally, only one VecFromMask would be needed for lane types larger than 8 bits. This mask conversion is zero-cost only for SIMD extensions that don't provide native mask operations.

Thanks for your advice! I'm not familiar with the new Highway intrinsics for packing mask data types; could you please explain?

Member

OrderedTruncate2To is only available when HWY_TARGET != HWY_SCALAR. Highway appears to be falling back to the scalar implementation, which makes sense for architectures like z13 and loongarch that Highway doesn't directly support. For armhf and risc-v, further investigation is needed.

Don't we already have scalar implementations and therefore we should never be building with HWY_SCALAR for fear of polluting the dispatch sources?

Contributor Author

OrderedTruncate2To is only available when HWY_TARGET != HWY_SCALAR. Highway appears to be falling back to the scalar implementation, which makes sense for architectures like z13 and loongarch that Highway doesn't directly support. For armhf and risc-v, further investigation is needed.

To avoid performance regression, implementing new Highway intrinsics to pack mask data types (similar to the universal intrinsics) would be better than the current approach of bitmask-to-vector conversion followed by packing. Ideally, only one VecFromMask would be needed for lane types larger than 8 bits. This mask conversion is zero-cost only for SIMD extensions that don't provide native mask operations.

For risc-v, OrderedTruncate2To is supported.

Member

Thanks for your advice! I'm not familiar with the new Highway intrinsics for packing mask data types; could you please explain?

As far as I know, there's no available intrinsic that performs mask packing. To implement such functionality, you would need to submit a PR to the upstream repository. Once it gets merged, we can update the Highway submodule accordingly. Alternatively, you could temporarily implement it directly in the source code itself to speed up the process.

We can also postpone this step until we run benchmarks. Modern compilers may optimize out these extra unnecessary operations. In my last benchmark of universal intrinsics in C++, I observed performance gains only for 8-bit types, while higher bit-width types performed similarly to the C code. We expect the new Highway code to show similar performance improvements primarily for 8-bit data types.

Don't we already have scalar implementations and therefore we should never be building with HWY_SCALAR for fear of polluting the dispatch sources?

I agree, meson.build needs to be updated to remove z13's VX:

[
AVX512_SKX, AVX512F, AVX2, SSE42, SSE2,
VSX3, VSX2,
NEON,
VXE, VX,
LSX,
RVV,

We should remove avx512f and keep only AVX512_SKX since Highway doesn't support Xeon Phi processors (which Intel no longer supports).

LSX should remain as-is (HWY WIP), it's part of the LoongArch CPU baseline and is statically dispatched anyway.

However, we should not use Highway when scalar mode is enabled.

For risc-v, OrderedTruncate2To is supported.

Could you provide the build log (build/meson-logs/meson-log.txt)? I need to see which flags are being provided to the CPU dispatcher. Another thing that caught my attention: how were you able to enable the SIMD path for RVV under #if NPY_SIMD? NPY_SIMD is part of the C universal intrinsics and should return 0 on RVV.

Contributor Author

build/meson-logs/meson-log.txt : meson-log.txt

Contributor Author

Could you provide the build log (build/meson-logs/meson-log.txt)? I need to see which flags are being provided to the CPU dispatcher. Another thing that caught my attention: how were you able to enable the SIMD path for RVV under #if NPY_SIMD? NPY_SIMD is part of the C universal intrinsics and should return 0 on RVV.

I am mistaken, you are right. It only compiled successfully on RISC-V.

@ixgbe ixgbe requested a review from seiko2plus March 28, 2025 03:07
@r-devulap
Member

During the last optimization meeting, I proposed a thin wrapper over Google's Highway
SIMD library to simplify its interface. The wrapper would eliminate the need for
class tags and use lane types directly, which can be deduced from the arguments in
most cases. We can also leverage namespaces for low-level register access and still
rely on lane type only.

Can we have Highway own this wrapper which can simplify its usage across other projects too? @jan-wassenberg

@jan-wassenberg
Contributor

During the meeting, I think Sayed preferred it be in numpy. I don't have a strong opinion.

On OrderedTruncate2To: yes, that's supported on all targets except scalar. I think the issue is that the compiler flags are missing the ones that explicitly request RVV, and runtime dispatch is not yet workable due to compiler bugs, e.g. llvm/llvm-project#56592.

@seiko2plus
Member

During the meeting, I think Sayed preferred it be in numpy. I don't have a strong opinion.

I would prefer to keep it in NumPy for a while until it matures enough, and then pass it to the Highway main repository. I initiated this via PR #28622.

5 participants