ENH: Modulate dispatched x86 CPU features #28896

seiko2plus · 2025-05-04T02:46:13Z

Overview

This PR reorganizes NumPy's CPU build options by replacing individual x86 features with microarchitecture levels. This change aligns with the Google Highway project requirements and common Linux distribution practices.

This PR raises the default cpu-baseline on x86 to the x86-64-v2 microarchitecture level. It is 2025, and maintaining SIMD compatibility for CPUs from before 2009 is no longer practical or efficient.
The baseline can be set to cpu-baseline=none at build time to support older CPUs, though manual SIMD optimizations for pre-2009 processors are no longer supported. This change improves performance and reduces binary
size while only affecting hardware that is over 15 years old.

Key Changes

Consolidated into Microarchitecture Groups: Replaced individual features with the X86_V2, X86_V3, and X86_V4 groups
Adjusted Baseline: Set to microarchitecture level 2 (X86_V2), covering features available on CPUs since 2009.
This improves performance and reduces binary size
Improved `-` Operator: Corrected to properly exclude successor features
Backward Compatibility: Added redirection via meson for removed feature names
Stricter Compatibility: Features like AVX512 without full mask operations are now considered unsupported rather than using fallbacks

Detailed CPU Feature Changes

Removed individual features (SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, POPCNT) → now in X86_V2
Removed AMD legacy features (XOP, FMA4)
Removed Xeon Phi support (AVX512_KNL, AVX512_KNM)
Removed individual features (AVX, AVX2, FMA3, F16C) → now in X86_V3
Removed AVX512F, AVX512CD (from dropping Xeon Phi support)
Renamed AVX512_SKX to X86_V4
Removed redundant groups AVX512_CLX and AVX512_CNL
Updated AVX512_ICL to include VAES, GFNI, VPCLMULQDQ

New Feature Group Hierarchy

Name	Implies	Includes
X86_V2		SSE SSE2 SSE3 SSSE3 SSE4_1 SSE4_2 POPCNT CX16 LAHF
X86_V3	X86_V2	AVX AVX2 FMA3 BMI BMI2 LZCNT F16C MOVBE
X86_V4	X86_V3	AVX512F AVX512CD AVX512VL AVX512BW AVX512DQ
AVX512_ICL	X86_V4	AVX512VBMI AVX512VBMI2 AVX512VNNI AVX512BITALG AVX512VPOPCNTDQ AVX512IFMA VAES GFNI VPCLMULQDQ
AVX512_SPR	AVX512_ICL	AVX512FP16

CPU Generation Mapping

X86_V2: x86-64-v2 microarchitectures (CPUs since 2009)
X86_V3: x86-64-v3 microarchitectures (CPUs since 2015)
X86_V4: x86-64-v4 microarchitectures (AVX-512 capable CPUs)
AVX512_ICL: Intel Ice Lake and similar CPUs
AVX512_SPR: Intel Sapphire Rapids and newer CPUs

Note: On 32-bit x86, cx16 is excluded from X86_V2.
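
To make the "Implies" column concrete, here is a minimal Python sketch that expands a group into the full list of features it enables by walking the chain of implied groups. The `GROUPS` dict and the `expand` helper are hypothetical names introduced only for illustration; they mirror the table in this description and are not NumPy or meson code.

```python
# Hypothetical model of the feature-group table above (illustration only).
GROUPS = {
    "X86_V2": {"implies": [],
               "includes": "SSE SSE2 SSE3 SSSE3 SSE4_1 SSE4_2 POPCNT CX16 LAHF".split()},
    "X86_V3": {"implies": ["X86_V2"],
               "includes": "AVX AVX2 FMA3 BMI BMI2 LZCNT F16C MOVBE".split()},
    "X86_V4": {"implies": ["X86_V3"],
               "includes": "AVX512F AVX512CD AVX512VL AVX512BW AVX512DQ".split()},
    "AVX512_ICL": {"implies": ["X86_V4"],
                   "includes": ("AVX512VBMI AVX512VBMI2 AVX512VNNI AVX512BITALG "
                                "AVX512VPOPCNTDQ AVX512IFMA VAES GFNI VPCLMULQDQ").split()},
    "AVX512_SPR": {"implies": ["AVX512_ICL"], "includes": ["AVX512FP16"]},
}


def expand(group, is_32bit=False):
    """Return every feature a group enables, oldest first."""
    features = []
    # Collect the features of implied (older) groups before this group's own.
    for parent in GROUPS[group]["implies"]:
        features += expand(parent, is_32bit)
    features += GROUPS[group]["includes"]
    if is_32bit:
        # Per the note above, CX16 is excluded from X86_V2 on 32-bit x86.
        features = [f for f in features if f != "CX16"]
    return features


print(expand("X86_V3"))
```

Under this model, asking for `X86_V3` yields the nine `X86_V2` features followed by the eight features listed for `X86_V3` itself.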

Documentation

Documentation has been updated to reflect these changes and to fit the current meson build system.

closes #27851

jorenham · 2025-05-14T22:42:54Z

needs a rebase

Copilot

Pull Request Overview

This PR reorganizes NumPy’s CPU build options by replacing individual x86 features with consolidated microarchitecture groups and bumps the baseline to x86-64-v2. Key changes include the introduction of X86_V2, X86_V3, and X86_V4 feature groups, updated enum values and CPU feature detection logic, and corresponding updates in tests and build configurations.

Reviewed Changes

Copilot reviewed 10 out of 19 changed files in this pull request and generated 2 comments.


File	Description
numpy/_core/tests/test_cpu_features.py	Updates to feature groups and test definitions reflecting the new model.
numpy/_core/tests/test_cpu_dispatcher.py	Adjusts dispatcher tests to use new group names.
numpy/_core/src/common/npy_cpu_features.h	Revises enum definitions to include new CPU groups.
numpy/_core/src/common/npy_cpu_features.c	Updates CPU feature detection logic and mappings.
meson_cpu/x86/test_x86_v[2-4].c	Introduces new tests for the respective microarchitecture groups.
doc/source/reference/simd/gen_features.py	Removes deprecated code generation for CPU features documentation.
.github/workflows/linux_simd.yml	Modifies build flags and cpu-dispatch settings to integrate new groups.

Files not reviewed (9)

doc/release/upcoming_changes/28896.change.rst: Language not supported
doc/source/reference/simd/generated_tables/compilers-diff.inc: Language not supported
doc/source/reference/simd/generated_tables/cpu_features.inc: Language not supported
doc/source/reference/simd/log_example.txt: Language not supported
meson.options: Language not supported
meson_cpu/meson.build: Language not supported
meson_cpu/x86/meson.build: Language not supported
numpy/_core/meson.build: Language not supported
numpy/_core/src/umath/loops_autovec.dispatch.c.src: Language not supported

Comments suppressed due to low confidence (1)

numpy/_core/src/common/npy_cpu_features.c:515

Review the modified condition for AVX512 OS support; verify that incorporating the avx_os check correctly handles systems without AVX OS support without unintended side effects.

if (!avx512_os && avx_os) {

numpy/_core/src/common/npy_cpu_features.c

.github/workflows/linux_simd.yml

tacaswell · 2025-05-16T16:50:52Z

If I am understanding

build from source and change the default baseline to cpu-baseline=none.

correctly, then the limitation on older machines depends on how the binaries are built?

For any of the downstream packagers (Linux distros, conda-forge, etc.), that is their responsibility to sort out. For wheels, I think there is a case to go more aggressively newer and push users towards better packaging ecosystems if they need to support older chips.

seiko2plus · 2025-05-16T18:02:46Z

You're right - we're bumping the default cpu-baseline and users can change this. I agree. Downstream packagers need to decide their own compatibility targets but we will no longer need to provide SIMD kernels below x86-64-v2. See the updated documentation for clarification:

https://github.com/numpy/numpy/blob/729d61c84d087dccf17e4ef960391227aaa0c5b3/doc/source/reference/simd/build-options.rst#targeting-older-cpus

**IMPORTANT NOTE**: The default setting for `cpu-baseline` on x86 has been raised to the `x86-64-v2` microarchitecture. This can be changed to `cpu-baseline=none` during build time to support older CPUs, though manual SIMD optimizations for pre-2009 processors are no longer supported.

This patch reorganizes CPU build options by replacing individual x86 features with microarchitecture levels. This change aligns with the Google Highway requirements and common Linux distribution practices.

This patch:
- Removes all individual x86 features and replaces them with three microarchitecture levels (`X86_V2`, `X86_V3`, `X86_V4`) commonly used by Linux distributions
- Raises the baseline to microarchitecture level 2 (replacing `SSE3`) since all known x86 CPUs since 2009 support it. This improves performance and reduces binary size
- Updates documentation to reflect these changes and to fit the current meson build system
- Corrects the behavior of the `-` operator, which now excludes successor features that imply the excluded feature
- Adds redirection via meson for removed feature names to avoid breaking builds
- Removes compiler compatibility workarounds, so features like AVX512 without full mask operations will be considered unsupported rather than providing fallbacks

Detailed CPU features changes:
- Removes individual features (`SSE`, `SSE2`, `SSE3`, `SSSE3`, `SSE4_1`, `SSE4_2`, `POPCNT`) which are now part of the new group `X86_V2`
- Removes AMD legacy features (`XOP`, `FMA4`)
- Removes Xeon Phi support (`AVX512_KNL`, `AVX512_KNM`) which Intel has discontinued
- Removes individual features (`AVX`, `AVX2`, `FMA3`, `F16C`) which are now part of the new group `X86_V3`
- Removes individual features `AVX512F`, `AVX512CD` as a result of dropping Xeon Phi support
- Renames group `AVX512_SKX` to `X86_V4` to align with microarchitecture level naming
- Removes groups `AVX512_CLX` and `AVX512_CNL` (features available via `AVX512_ICL`)
- Updates `AVX512_ICL` to include features (`VAES`, `GFNI`, `VPCLMULQDQ`) for alignment with Highway

New Feature Group Hierarchy:
```
Name       | Implies    | Includes
-----------|------------|-----------------------------------------------------------
X86_V2     |            | SSE SSE2 SSE3 SSSE3 SSE4_1 SSE4_2 POPCNT CX16 LAHF
X86_V3     | X86_V2     | AVX AVX2 FMA3 BMI BMI2 LZCNT F16C MOVBE
X86_V4     | X86_V3     | AVX512F AVX512CD AVX512VL AVX512BW AVX512DQ
AVX512_ICL | X86_V4     | AVX512VBMI AVX512VBMI2 AVX512VNNI AVX512BITALG
           |            | AVX512VPOPCNTDQ AVX512IFMA VAES GFNI VPCLMULQDQ
AVX512_SPR | AVX512_ICL | AVX512FP16
```

These groups correspond to CPU generations:
- `X86_V2`: x86-64-v2 microarchitectures (CPUs since 2009)
- `X86_V3`: x86-64-v3 microarchitectures (CPUs since 2015)
- `X86_V4`: x86-64-v4 microarchitectures (AVX-512 capable CPUs)
- `AVX512_ICL`: Intel Ice Lake and similar CPUs
- `AVX512_SPR`: Intel Sapphire Rapids and newer CPUs

Note: On 32-bit x86, `cx16` is excluded from `X86_V2`.


seiko2plus · 2025-05-16T18:35:33Z

@tacaswell, I've updated the release note to be more clear about the CPU baseline change.

seiko2plus

Three cases require changes; the rest are backport notes and comments that may need your decision.

numpy/_core/meson.build

seiko2plus · 2025-05-18T07:10:02Z

numpy/_core/src/common/npy_cpu_features.c

+    npy__cpu_have[NPY_CPU_FEATURE_AVX512_KNL] = npy__cpu_have[NPY_CPU_FEATURE_AVX512F] &&
+                                                npy__cpu_have[NPY_CPU_FEATURE_AVX512CD] &&
+                                                npy__cpu_have[NPY_CPU_FEATURE_AVX512ER] &&
+                                                npy__cpu_have[NPY_CPU_FEATURE_AVX512PF];
+
+    npy__cpu_have[NPY_CPU_FEATURE_AVX512_KNM] = npy__cpu_have[NPY_CPU_FEATURE_AVX512_KNL] &&
+                                                npy__cpu_have[NPY_CPU_FEATURE_AVX5124FMAPS] &&
+                                                npy__cpu_have[NPY_CPU_FEATURE_AVX5124VNNIW] &&
+                                                npy__cpu_have[NPY_CPU_FEATURE_AVX512VPOPCNTDQ];
+
+    npy__cpu_have[NPY_CPU_FEATURE_AVX512_CLX] = npy__cpu_have[NPY_CPU_FEATURE_X86_V4] &&
+                                                npy__cpu_have[NPY_CPU_FEATURE_AVX512VNNI];
+
+    npy__cpu_have[NPY_CPU_FEATURE_AVX512_CNL] = npy__cpu_have[NPY_CPU_FEATURE_X86_V4] &&
+                                                npy__cpu_have[NPY_CPU_FEATURE_AVX512IFMA] &&
+                                                npy__cpu_have[NPY_CPU_FEATURE_AVX512VBMI];


Suggested change

npy__cpu_have[NPY_CPU_FEATURE_AVX512_KNL] = npy__cpu_have[NPY_CPU_FEATURE_AVX512F] &&
                                            npy__cpu_have[NPY_CPU_FEATURE_AVX512CD] &&
                                            npy__cpu_have[NPY_CPU_FEATURE_AVX512ER] &&
                                            npy__cpu_have[NPY_CPU_FEATURE_AVX512PF];

npy__cpu_have[NPY_CPU_FEATURE_AVX512_KNM] = npy__cpu_have[NPY_CPU_FEATURE_AVX512_KNL] &&
                                            npy__cpu_have[NPY_CPU_FEATURE_AVX5124FMAPS] &&
                                            npy__cpu_have[NPY_CPU_FEATURE_AVX5124VNNIW] &&
                                            npy__cpu_have[NPY_CPU_FEATURE_AVX512VPOPCNTDQ];

npy__cpu_have[NPY_CPU_FEATURE_AVX512_CLX] = npy__cpu_have[NPY_CPU_FEATURE_X86_V4] &&
                                            npy__cpu_have[NPY_CPU_FEATURE_AVX512VNNI];

npy__cpu_have[NPY_CPU_FEATURE_AVX512_CNL] = npy__cpu_have[NPY_CPU_FEATURE_X86_V4] &&
                                            npy__cpu_have[NPY_CPU_FEATURE_AVX512IFMA] &&
                                            npy__cpu_have[NPY_CPU_FEATURE_AVX512VBMI];

Not sure if we should drop legacy groups. I kept them in case there are projects that depend on them via dict __cpu_features__
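
For downstream projects that read this dict, a defensive lookup might look like the sketch below. It assumes a legacy name could be absent from the dict and falls back to the group that now contains it. `LEGACY_TO_GROUP` and `cpu_supports` are hypothetical names, and the mapping only restates what this PR's description says about where features moved.

```python
# Assumption: legacy name -> new group that now contains it,
# taken from the "Detailed CPU Feature Changes" list in this PR.
LEGACY_TO_GROUP = {
    "POPCNT": "X86_V2",
    "AVX2": "X86_V3",
    "AVX512_SKX": "X86_V4",
}


def cpu_supports(features, name):
    """Look up `name` in a __cpu_features__-style dict, tolerating removed keys."""
    if name in features:
        # The name is still reported directly, so trust its value.
        return bool(features[name])
    # Otherwise fall back to the group that absorbed the legacy name, if any.
    group = LEGACY_TO_GROUP.get(name)
    return bool(features.get(group, False)) if group else False


# A stand-in dict; real code would pass NumPy's own dict instead.
example = {"X86_V2": True, "X86_V3": True, "X86_V4": False}
print(cpu_supports(example, "AVX2"), cpu_supports(example, "AVX512_SKX"))
```

In real use the dict would come from NumPy itself, for example the private attribute `numpy._core._multiarray_umath.__cpu_features__`; relying on a private attribute is itself an assumption about future releases.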

.github/workflows/linux_simd.yml

seiko2plus · 2025-05-18T07:15:04Z

doc/source/reference/simd/build-options.rst

-     are not supported by the target CPU (raises Python runtime error).
+- ``cpu-baseline``: The minimum set of CPU features required to run the compiled NumPy.
+
+  * Default: ``min`` (provides compatibility across a wide range of platforms)


Suggested change

* Default: ``min`` (provides compatibility across a wide range of platforms)

* Default: ``min`` (provides compatibility across a wide range of platforms), see :ref:`special options <opt-special-options>` to check what ``min`` maps to on each architecture.

provides a ref for option min

seiko2plus · 2025-05-18T07:23:44Z

meson_cpu/meson.build

@@ -207,8 +226,17 @@ foreach opt_name, conf : parse_options
      endforeach
    else
      filterd = []
+      # filter out the features that are in the accumulate list 
+      # including any successor features


Not excluding successor features should be considered a bug, and the fix needs to be backported.
For example, with cpu-dispatch before this patch, to exclude AVX512 you had to exclude all successor features as well, otherwise it would not be disabled:

python -m build --wheel -Csetup-args=-Dcpu-dispatch="max -avx512f -avx512cd -avx512_knl -avx512_knm -avx512_skx -avx512_clx -avx512_cnl -avx512_icl -avx512_spr"

After this fix:

python -m build --wheel -Csetup-args=-Dcpu-dispatch="max -avx512f"
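
The corrected behavior can be sketched in a few lines of Python. This is only a model of the idea, not the meson implementation: `IMPLIES`, `implied_closure`, and `exclude` are hypothetical names, and the graph uses the new group names from this PR rather than the pre-patch names in the commands above.

```python
# Hypothetical "implies" graph using the group names introduced by this PR.
IMPLIES = {
    "X86_V2": [],
    "X86_V3": ["X86_V2"],
    "X86_V4": ["X86_V3"],
    "AVX512_ICL": ["X86_V4"],
    "AVX512_SPR": ["AVX512_ICL"],
}


def implied_closure(name):
    """Return every feature that `name` implies, directly or transitively."""
    seen = set()
    stack = [name]
    while stack:
        current = stack.pop()
        for parent in IMPLIES[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def exclude(selected, removed):
    """Drop `removed` and every selected feature that implies it (its successors)."""
    return [f for f in selected
            if f != removed and removed not in implied_closure(f)]


dispatch = ["X86_V3", "X86_V4", "AVX512_ICL", "AVX512_SPR"]
print(exclude(dispatch, "X86_V4"))  # ['X86_V3']
```

Removing `X86_V4` also removes `AVX512_ICL` and `AVX512_SPR` because both imply it, which is the behavior a single `-avx512f` is expected to have after the fix.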

seiko2plus · 2025-05-18T07:24:41Z

meson_cpu/x86/meson.build

-  test_code: files(source_root + '/numpy/distutils/checks/cpu_avx512_cnl.c')[0]
-)
+HWY_SSE4_FLAGS = ['-DHWY_WANT_SSE4', '-DHWY_DISABLE_PCLMUL_AES']
+# Use SSE for floating-point on x86-32 to ensure numeric consistency.


This fix needs to be backported too.

seiko2plus · 2025-05-18T07:28:47Z

meson_cpu/x86/meson.build

+  test_code: files(current_dir + '/test_x86_v4.c')[0],
+)
+if cpu_family == 'x86'
+  X86_V4.update(disable: 'not supported on x86-32')


This disable needs to be backported too. We should not generate AVX512 kernels on 32-bit systems.

seiko2plus · 2025-05-18T07:43:26Z

meson_cpu/x86/meson.build

-  AVX2.update(args: {'val': '/arch:AVX2', 'match': clear_arch})
-  AVX512_SKX.update(args: {'val': '/arch:AVX512', 'match': clear_arch})
+  X86_V3.update(args: {'val': '/arch:AVX2', 'match': clear_arch})
+  X86_V4.update(args: {'val': '/arch:AVX512', 'match': clear_arch})


By default, Highway considers AVX512 a broken platform on MSVC. We've managed to support AVX512 on MSVC through universal intrinsics. We have three options here:

1. Keep it as-is: This leads to building AVX2 kernels with /arch:AVX512, which gives the compiler the opportunity to optimize (hopefully). Worst case: we'll have duplicate kernels.

2. Disable AVX512 support entirely:

Suggested change

X86_V4.update(args: {'val': '/arch:AVX512', 'match': clear_arch})

X86_V4.update(disable: 'not supported by Highway')

3. Change the default behavior and force-enable AVX512 on MSVC, and let's see how deep the rabbit hole goes:

Suggested change

X86_V4.update(args: {'val': '/arch:AVX512', 'match': clear_arch})

X86_V4.update(args: [{'val': '/arch:AVX512', 'match': clear_arch}, '-DHWY_BROKEN_MSVC=0'])

I think I'd prefer (2) or (3). If it's really very broken, disabling it seems fine (and low-effort). If it's feasible and there is energy for this, fixing it up in Highway over time seems nice.

(1) seems worst: if it doesn't do much, I'd much rather take the binary size gains from disabling it completely.

We still haven't converted many kernels yet from universal intrinsics to Highway, so adding extra burden isn't a wise decision. Let's go for option (2) for now and investigate later which MSVC versions are compatible with Highway.

- Disable X86_V4 (AVX-512) for MSVC builds due to Highway incompatibility
- Add FIXME comment for future MSVC compatibility investigation
- Update SIMD CI workflow to reflect x86-64-v2 baseline
- Remove redundant test configurations
- Add missing X86_V4 support to unary complex loops

github-actions bot added the 25 - WIP label May 4, 2025

seiko2plus force-pushed the modulate_x86_features branch 9 times, most recently from e1ee011 to 1cd1a0e Compare May 14, 2025 16:42

seiko2plus changed the title ~~WIP: MAINT: Modulate dispatched x86 CPU features~~ ENH: Modulate dispatched x86 CPU features May 14, 2025

seiko2plus marked this pull request as ready for review May 14, 2025 17:09

r-devulap requested review from r-devulap and Copilot May 14, 2025 22:53

Copilot AI reviewed May 14, 2025


seiko2plus force-pushed the modulate_x86_features branch from 1cd1a0e to b959fb3 Compare May 15, 2025 12:27

seiko2plus added the component: SIMD and 36 - Build labels and removed the 25 - WIP label May 16, 2025

seiko2plus force-pushed the modulate_x86_features branch from b959fb3 to 729d61c Compare May 16, 2025 10:25

seiko2plus added 3 commits May 16, 2025 21:33

Fix: Clear FP status to prevent invalid left_shift warnings with x86-…

89eea69

…64-v2 baseline Prevents "invalid value encountered in left_shift" warnings on clang-cl when testing bit shifts with long types under x86-64-v2 baseline.

Set -mfpmath=sse on x86-32 for gcc/clang numeric consistency

c41541a

Force SSE-based floating-point on 32-bit x86 systems to fix inconsistent results between einsum and other math functions. Prevents test failures with int16 operations by avoiding the x87 FPU's extended precision.

seiko2plus force-pushed the modulate_x86_features branch from 729d61c to c41541a Compare May 16, 2025 18:33

seiko2plus mentioned this pull request May 17, 2025

ENH, SIMD: Initial implementation of Highway wrapper #28622

Merged
