Skip to content

GCC 15.1.1 compiles NumPy, but the tests segfault. #28991

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
charris opened this issue May 16, 2025 · 14 comments · Fixed by #29094
Closed

GCC 15.1.1 compiles NumPy, but the tests segfault. #28991

charris opened this issue May 16, 2025 · 14 comments · Fixed by #29094
Milestone

Comments

@charris
Copy link
Member

charris commented May 16, 2025

The failure is at _core/tests/test_half.py, line 54. I assume that is a problem with GCC 15.1.1, as it has had a rocky start. Has anyone else had problems? The problem doesn't seem to be in NumPy, it was compiling and running fine before I upgraded Fedora to 42.

@charris
Copy link
Member Author

charris commented May 17, 2025

The problem seems to be in boolean indexing.

@ngoldbaum
Copy link
Member

Can you share a gdb traceback?

@charris
Copy link
Member Author

charris commented May 18, 2025

Can you share a gdb traceback?

I was going to wait on the next GCC 15 version. Fedora (and Linus!) jumped the gun in using 15 in a release.

@NiLuJe
Copy link

NiLuJe commented May 25, 2025

I can reproduce segfaults on a small chunk of the testsuite (and, well, in production, which is what led me to double-check ;p) with GCC 15.1.0 over here, too, FWIW (Gentoo Linux, x64, on a zen4 CPU). Everything passes w/ GCC 14.x (right now, 14.3.0).

Will see about getting some traces in the next few days.

@thesamesam
Copy link
Contributor

thesamesam commented May 26, 2025

I was going to wait on the next GCC 15 version. Fedora (and Linus!) jumped the gun in using 15 in a release.

Fedora only shipped it briefly before the final release. The next GCC 15 release will be in a couple of months, then it goes to yearly (same schedule as usual). It's best if it turns out to be a GCC bug to get it reported before that next one. Otherwise, relying on the fix being out there takes way longer.

@thesamesam
Copy link
Contributor

I tried to reproduce this in the following environments:

  • on my regular Gentoo workstation using GCC trunk but with -O3 -march=znver2 -flto with our numpy-2.2.6 package's testsuite
  • in a clean Gentoo container running stable but with GCC 15 installed and in-use to build our numpy-2.5.5 package's testsuite
  • in a Fedora 42 container (fedora:42 with Docker) on a git clone of main using spin test
    and all of them worked.

@NiLuJe Feel free to file a bug on the Gentoo side so we can dig into it a bit there and see if we can come up with something to share on this side if there's nothing obvious we find here. If you do, please include the testsuite results when run via the ebuild (inc. build.log) and emerge --info. Thanks!

@NiLuJe
Copy link

NiLuJe commented May 28, 2025

  • on my regular Gentoo workstation using GCC trunk but with -O3 -march=znver2 -flto with our numpy-2.2.6 package's testsuite

That was an interesting observation!

I'm using -O3 -march=native (so ~ znver4)... and, indeed, it passes w/ znver2, it passes w/ znver3.... and it fails with znver4!

LTO doesn't affect the result, but, unsurprisingly (somewhat? I was under the impression that the buildsystem enforced O3 anyway?), -O3 does: it passes w/ -O2 -march=znver4.

I'll file that on the Gentoo side w/ the logs.

@NiLuJe
Copy link

NiLuJe commented May 28, 2025

I'll file that on the Gentoo side w/ the logs.

Here it is: https://bugs.gentoo.org/956770

@charris
Copy link
Member Author

charris commented May 28, 2025

I've found using -march=native triggers the error, as do -march=x86-64-v4 and -march=znver5. Omitting -march altogether works. For the record, I have an AMD Zen5 cpu.

I'm suspecting a conflict with either meson or the dispatcher. The lesson might be: don't use -march.

@thesamesam
Copy link
Contributor

thesamesam commented May 30, 2025

Summarising findings from the Gentoo side: it's a numpy bug.

Thread 1 "python3.13" received signal SIGSEGV, Segmentation fault.
0x00007ffff47df6d9 in npy_memchr (haystack=0x7fffe8835d07 "", needle=0 '\000', stride=1, size=514, psubloopsize=0x7fffffffa160, invert=1) at ../numpy-2.2.6/numpy/_core/src/multiarray/common.h:278
278                     if (v != 0) {
(gdb) bt
#0  0x00007ffff47df6d9 in npy_memchr (haystack=0x7fffe8835d07 "", needle=0 '\000', stride=1, size=514, psubloopsize=0x7fffffffa160, invert=1) at ../numpy-2.2.6/numpy/_core/src/multiarray/common.h:278
#1  array_boolean_subscript (self=self@entry=0x7fffec1c65b0, bmask=<optimized out>, order=order@entry=NPY_CORDER) at ../numpy-2.2.6/numpy/_core/src/multiarray/mapping.c:1022
[...]

Looking at npy_memchr, I see UBSAN gets suppressed at

#ifdef __clang__
(added for #21116).

(It actually looks quite similar to what subversion was doing: https://bugs.gentoo.org/950271).

Changing https://github.com/numpy/numpy/blob/main/numpy/_core/include/numpy/npy_cpu.h#L128 to set #define NPY_ALIGNMENT_REQUIRED 1 works for Kostadin who could hit it otherwise.

I recommend ripping all of that out and always assuming NPY_ALIGNMENT_REQUIRED. You'll run into random issues like this otherwise with increasing frequency.

@ngoldbaum
Copy link
Member

Thanks for the deep dive! Ping @seberg - it looks like you added the suppression in 2022.

@seberg
Copy link
Member

seberg commented May 30, 2025

🤷 this is ancient code, I just added the ignore because the variable is supposed to be set when we know unaligned access is supported and also it clearly was supported on all systems UBSAN ever ran on.

I don't care for keeping NPY_ALIGNEMENT_REQUIRED around (it probably makes a bit of a difference for structured dtypes only). It's just in one corner of NumPy and not used for a large chunk NumPy runs.

While I doubt it is important, I do think that this did/does make some sense here (no idea how much), in subversion, and I am very sure I also saw it in linux kernel code.
The pattern is 10+ years old in NumPy, and while I don't know if the define could be smarter, I doubt we will be the last project that runs into it.

@charris
Copy link
Member Author

charris commented May 30, 2025

Could be that gcc is using SIMD instructions that expect alignment.

@thesamesam
Copy link
Contributor

thesamesam commented May 30, 2025

Could be that gcc is using SIMD instructions that expect alignment.

It's precisely that, yeah. GCC will even peel for alignment now to handle misaligned workloads.

While I doubt it is important, I do think that this did/does make some sense here (no idea how much), in subversion, and I am very sure I also saw it in linux kernel code.

Right, it was absolutely a reasonable thing to do and it would've given a nice boost in the past. On the other hand, it's always a pain when you start debugging something and realise it's an #ifdef away from being legal ;)

It's just at odds with expecting compilers to autovectorise things these days. I recommend just dropping all of it, and if there's performance regressions, we can look into those if-and-when.

@charris charris added this to the 2.3.0 release milestone May 30, 2025
SoapGentoo added a commit to SoapGentoo/numpy that referenced this issue May 30, 2025
* This machinery requires strict-aliasing UB and isn't needed
  anymore with any GCC from the last 15 years.

Fixes: numpy#28991
SoapGentoo added a commit to SoapGentoo/numpy that referenced this issue May 30, 2025
* This machinery requires strict-aliasing UB and isn't needed
  anymore with any GCC from the last 15 years.

Fixes: numpy#28991
SoapGentoo added a commit to SoapGentoo/numpy that referenced this issue May 30, 2025
* This machinery requires strict-aliasing UB and isn't needed
  anymore with any GCC from the last 15 years.

Fixes: numpy#28991
SoapGentoo added a commit to SoapGentoo/numpy that referenced this issue May 30, 2025
* This machinery requires strict-aliasing UB and isn't needed
  anymore with any GCC from the last 15 years. This might also
  fix numpy#25004.

Fixes: numpy#28991
SoapGentoo added a commit to SoapGentoo/numpy that referenced this issue May 30, 2025
* This machinery requires strict-aliasing UB and isn't needed
  anymore with any GCC from the last 15 years. This might also
  fix numpy#25004.

Fixes: numpy#28991
SoapGentoo added a commit to SoapGentoo/numpy that referenced this issue Jun 3, 2025
* This machinery requires strict-aliasing UB and isn't needed
  anymore with any GCC from the last 15 years. This might also
  fix numpy#25004.

Fixes: numpy#28991
charris added a commit to charris/numpy that referenced this issue Jun 5, 2025
GCC 15 generates code that segfaults on newer hardware when running
tests. See numpygh-28991 for details. A fix is to always require alignment,
which can be done by setting `NPY_ALIGNMENT_REQUIRED = 1`.

We retain `NPY_ALIGNMENT_REQUIRED` here for downstream compatibility in
the 2.3.x release series, but do remove its use in our own code. The
associated macro `NPY_USE_UNALIGNED_ACCESS` could also be removed as it
is always 0, but that is left for another PR.
charris added a commit to charris/numpy that referenced this issue Jun 5, 2025
GCC 15 generates code that segfaults on newer hardware when running
tests. See numpygh-28991 for details. A fix is to always require alignment,
which can be done by setting `NPY_ALIGNMENT_REQUIRED = 1`.

We completely remove `NPY_ALIGNMENT_REQUIRED` here in order to test
downstream compatibility before the 2.4 release. It is not expected to
cause problems, but one never knows. The associated macro
`NPY_USE_UNALIGNED_ACCESS` could also be removed as it is always 0, but
that is left for another PR.
@charris charris modified the milestones: 2.3.0 release, 2.4.0 release Jun 6, 2025
SoapGentoo added a commit to SoapGentoo/numpy that referenced this issue Jun 8, 2025
* This machinery requires strict-aliasing UB and isn't needed
  anymore with any GCC from the last 15 years. This might also
  fix numpy#25004.

Fixes: numpy#28991
SoapGentoo added a commit to SoapGentoo/numpy that referenced this issue Jun 8, 2025
* This machinery requires strict-aliasing UB and isn't needed
  anymore with any GCC from the last 15 years. This might also
  fix numpy#25004.

Fixes: numpy#28991
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants