BUG: MacOS matmul FPE heisenbug seen on CI #28227

Open
ngoldbaum opened this issue Jan 24, 2025 · 6 comments

@ngoldbaum
Member

I can't find an open issue about this, but for a while now we have had infrequent failures on MacOS CI due to a heisenbug:

FAILED numpy/linalg/tests/test_linalg.py::TestQR::test_stacked_inputs[float32-outer_size2-size2] - RuntimeWarning: invalid value encountered in matmul

As far as I know, no one has been able to reproduce this outside of CI and no one has been able to reliably trigger it besides running the MacOS CI repeatedly.

Unfortunately, adding free-threaded CI is making this happen more often because we run CI on Mac runners more often.

I'm opening this issue to have a place to link to when I merge PRs with CI failing on this one test and hopefully to attract someone who can figure this out and fix it!
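For anyone who wants to try triggering this locally, here is a minimal sketch of a stress loop. It is not the actual test: the function name `count_spurious_fpes`, the shapes, and the trial count are assumptions chosen for illustration. It runs a stacked float32 QR followed by a matmul on finite inputs and counts how often the "invalid" floating-point flag fires.

```python
import numpy as np


def count_spurious_fpes(n_trials=200, outer_size=(3, 2), size=(4, 4), seed=0):
    """Count trials where matmul on finite QR factors reports an invalid value.

    The shapes and the trial count are illustrative assumptions, not the
    parameters used by the real test.
    """
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(n_trials):
        # Stack of finite random float32 matrices.
        a = rng.normal(size=outer_size + size).astype(np.float32)
        q, r = np.linalg.qr(a)
        try:
            # Promote the "invalid value" warning to an exception here only.
            with np.errstate(invalid="raise"):
                np.matmul(q, r)
        except FloatingPointError:
            failures += 1
    return failures


if __name__ == "__main__":
    print(count_spurious_fpes())
```

Since all inputs are finite, any nonzero count would point at a spurious flag rather than a real NaN. Given that nobody has reproduced this outside CI so far, a result of zero on a local machine would not be surprising.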

@ngoldbaum
Member Author

See e.g. this CI run as well as this PR from December.

@charris
Member

charris commented Jan 24, 2025

The MacOS failures are newish; occasional failures on Windows have been seen for years. That said, I have the impression that the Windows failures are less frequent now.

@seberg
Member

seberg commented Jan 24, 2025

Hmmm, I feel there was a time when I had occasional errors similar to this locally. But I don't think they were in matmul.
(I feel I might have tried diving a bit into it at some point, but not deep enough as it got weedy.)

I doubt it originates in NumPy. More likely it is something in the BLAS layer (possibly an FPE-invalid SIMD optimization).
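For context, the warning in the failing test is how NumPy reports that the floating-point "invalid" flag was set during an operation. The sketch below shows a genuine invalid operation (inf - inf) surfacing first as a RuntimeWarning and then, under stricter settings, as a FloatingPointError. It uses a plain subtraction rather than matmul or BLAS, so it only illustrates the reporting mechanism and does not reproduce the CI failure.

```python
import warnings

import numpy as np

inf = np.array([np.inf], dtype=np.float32)

# With invalid="warn", the invalid flag is reported as a RuntimeWarning.
with np.errstate(invalid="warn"), warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = inf - inf  # inf - inf is an invalid operation and yields nan

warned = [str(w.message) for w in caught if issubclass(w.category, RuntimeWarning)]

# With invalid="raise", the same flag becomes a FloatingPointError.
raised = False
try:
    with np.errstate(invalid="raise"):
        inf - inf
except FloatingPointError:
    raised = True

print(result, warned, raised)
```

In the CI failure the inputs are finite, so if the flag is set it would presumably come from an intermediate step in the underlying routine rather than from the data itself.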

@malciin

malciin commented Apr 10, 2025

It may be related to #28687. In that issue, RuntimeWarning: invalid value encountered in matmul is one of the warnings that can be easily reproduced.

@charris
Member

charris commented Apr 10, 2025

Used to be mostly on Microsoft, the Mac issue is fairly new. I've sometimes wondered if matmul was picking up the FPE from another test, but that raises the question: why is it always matmul?

@malciin

malciin commented Apr 23, 2025

A loose idea, but maybe it's because a random runner happened to have an M4 processor?
