Skip to content

ENH, SIMD: Optimize the hyperbolic implementation based on Highway wr… #29211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ixgbe
Copy link
Contributor

@ixgbe ixgbe commented Jun 16, 2025

Test environment: Banana Pi BPI-F3 (SpacemiT K1 8-core RISC-V chip) gcc version 14.2.0
Test command: spin bench --compare HEAD~1 -t bench_ufunc.TanhSpecific

Add class TanhSpecific in benchmarks/benchmarks/bench_ufunc.py for this test on Banana Pi BPI-F3

class TanhSpecific(Benchmark):
    timeout = 10

    def setup(self):
        np.seterr(all='ignore')
        self.ufn = np.tanh

        from .common import get_square_

        self.float32_data = (get_square_('float32'),)
        self.float64_data = (get_square_('float64'),)

    def time_FLOAT_tanh(self):
        self.ufn(*self.float32_data)

    def time_DOUBLE_tanh(self):
        self.ufn(*self.float64_data)

(numpy_venv) root@yangwang:/home/yangwang/numpy# git log -2
commit bbe89a87b37f86d4f00e2a93882ef8d642a8db2c (HEAD -> main)
Author: Wang Yang yangwang@iscas.ac.cn
Date: Fri Aug 22 16:19:36 2025 +0800

ENH, SIMD: Optimize the hyperbolic implementation based on Highway wrapper

commit 4cbc80036001af445db0c21e71d0707f64014258
Author: Wang Yang yangwang@iscas.ac.cn
Date: Fri Aug 22 16:15:42 2025 +0800

TST:Add np.float and np.double for tanh test

(numpy_venv) root@yangwang:/home/yangwang/numpy#
(numpy_venv) root@yangwang:/home/yangwang/numpy# spin bench --compare HEAD~1 -t bench_ufunc.TanhSpecific

Change Before [4cbc8003] <main~1> After [bbe89a87] Ratio Benchmark (Parameter)
- 1.10±0.3ms 664±10μs 0.6 bench_ufunc.TanhSpecific.time_DOUBLE_tanh
- 944±300μs 226±30μs 0.24 bench_ufunc.TanhSpecific.time_FLOAT_tanh

…apper

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
Copy link
Member

@r-devulap r-devulap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review based on just the first glance. AFAIK: we need HWY_ATTR for all the functions that use highway ops. From highway docs: each function that calls Highway ops (such as Load) must either be prefixed with HWY_ATTR...

[
AVX512_SKX, [AVX2, FMA3],
VSX4, VSX2,
VSX2,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove VSX4?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image During the build process on PPC64LE, the compiler command for the VSX4 dispatch target contains conflicting CPU flags. The build system seems to be appending the target-specific -mcpu=power10 flag without overriding the base -mcpu=power9 flag that was likely set for the baseline build. This results in a command with mutually exclusive options (-mcpu=power9 and -mcpu=power10), which causes the compiler to fail with a target specific option mismatch error during function inlining. Could you please confirm if this is a known issue or a bug in the Meson build configuration for ppc64le?Thanks !

Copy link
Contributor Author

@ixgbe ixgbe Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove VSX4?

Hi, @r-devulap
Before my modification, NPY_SIMD_F64 was used to control DOUBLE_tanh. For VSX4, the macro NPY_SIMD_F64 evaluates to 0, so DOUBLE_tanh was implemented using scalar code. Therefore, no compilation error occurred.

After my modification, I switched to using NPY_HWY_F64 to control DOUBLE_tanh. For VSX4, NPY_HWY_F64 evaluates to 1, so DOUBLE_tanh is implemented using vectorized code. However, when compiling on a Power9 system (with -mcpu=power9), calling hwy::N_PPC10 functions leads to a compilation error due to target-specific option mismatches.

Therefore, removing VSX4 from numpy/_core/meson.build results in behavior consistent with the original implementation before my change.
image
https://www.ibm.com/support/pages/node/7155156
CI use gcc11.4 now , however processor is Power9 used by CI.

HWY_ATTR NPY_FINLINE vtype
load_vector(type_t* src, npy_intp ssrc, npy_intp len){
auto D = hn::DFromV<vtype>();
HWY_INLINE vtype load_vector(type_t* src, npy_intp ssrc, npy_intp len){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it require HWY_ATTR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

HWY_ATTR NPY_FINLINE void
store_vector(vtype vec, type_t* dst, npy_intp sdst, npy_intp len){
auto D = hn::DFromV<vtype>();
HWY_INLINE void store_vector(vtype vec, type_t* dst, npy_intp sdst, npy_intp len){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: don't we need HWY_ATTR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ixgbe ixgbe requested a review from r-devulap July 15, 2025 01:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants