BUG: BLAS setup can silently fail and not be reported in show_config() #24200
Thanks for the report @larsoner. That is probably simply expected right now, because SIMD support is incomplete in the Meson build. We can leave this open for a while, but I would recommend not investigating anything, because looking at anything performance-related for a Meson vs. a distutils build isn't all that useful right now.
FWIW locally I built using the Meson and distutils backends and performance was the same. But maybe I didn't configure options properly to see the slowdown... I also kind of expected this to just get sent over to OpenBLAS, so didn't think the SIMD / dispatch stuff would matter. So I thought perhaps an OpenBLAS version changed between a month ago and now and that had performance problems.
Oh wait, I missed that this was ~50x or so. If you rebuilt and it's both that much slower, maybe the build is silently using lapack-lite. We do have a whole bunch of logic in the build for BLAS detection.
No it's the opposite rather -- locally I rebuilt using Meson and distutils with my own OpenBLAS that was the same each time (0.3.21 or something?), and it's just as fast as the 1.25.rc wheel from SPNW (no regression): the only problematic case seems to be the 2.0.0dev0 wheel. Both the 1.25.rc (fast) and 2.0.0dev0 (slow) show that the OpenBLAS lib used is "OpenBLAS 0.3.23 with 1 thread" when I look using threadpoolctl on my machine (locally I limit it to 1 thread). But if something changed about how that was built or what version was actually used (despite being reported the same) in the last ~30 days, that's a good candidate!
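For readers following along, a minimal sketch of the kind of check being described (`threadpool_info()` / `threadpool_limits()` are threadpoolctl's public API; the exact dictionary keys printed are from memory and may vary by version):

```python
# Inspect which BLAS libraries NumPy has loaded and how many threads they use,
# then cap BLAS threads at 1 for timing, as described above.
from threadpoolctl import threadpool_info, threadpool_limits

for lib in threadpool_info():
    print(lib.get("internal_api"), lib.get("version"), lib.get("num_threads"))

with threadpool_limits(limits=1, user_api="blas"):
    pass  # run the matmul timings here with a single BLAS thread
```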
How are you locally limiting OpenBLAS to one thread?
Nevermind, this reproduces even when not limiting OpenBLAS to one thread.
As far as I can tell, both wheels use the same OpenBLAS. Each build in the scheduled wheel build log has artifacts. Perhaps someone could download them and find the point at which the slowdown started.
I can do it tomorrow morning EDT unless someone beats me to it (feel free!)
Okay, clearly I must have meant tomorrow, July 19th EDT :)
Actually it only took a couple of minutes to bisect using the wheel build artifacts.
Which I guess gets us this diff: https://github.com/numpy/numpy/compare/0e2af84..85385d6
... I am a bit worried about this Meson check (EDIT: potentially failing in the wheel build). It passing on my machine and failing in the wheel build (for some reason!) would be a possible explanation for the slowdown, because without CBLAS detected the build falls back to the slow path.
I stuck a
Aha, must have been gh-23984.
One guess: the probing code for CBLAS is probably missing the suffix, which we use in the wheels. That would explain why local testing doesn't pick it up. Maybe it would be good if compilation would just fail if some of these BLAS variables are set but then BLAS isn't found?
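A rough way to see the suffix difference from Python (an illustrative sketch only, not what the Meson probe does; the `64_` suffix and the use of threadpoolctl to locate the library are assumptions here):

```python
# Check whether the OpenBLAS that NumPy loaded exports the plain CBLAS symbol,
# the suffixed one used in ILP64 wheel builds, or both.
import ctypes
from threadpoolctl import threadpool_info

for lib in threadpool_info():
    if lib.get("internal_api") != "openblas":
        continue
    dll = ctypes.CDLL(lib["filepath"])
    for sym in ("cblas_ddot", "cblas_ddot64_"):
        print(sym, "present" if hasattr(dll, sym) else "missing")
```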
Yes, I agree. I've thought about it before, but somehow we've never done this. I suspect the reason is that we have/had a whole bunch of users who didn't care about performance and wanted numpy to build from source without much trouble. Making the build fail is basically going to increase our "numpy doesn't build" issue load by a lot, I think, but it'll avoid accidentally getting lapack-lite builds.
That makes sense, I typically use non-suffixed OpenBLAS or MKL builds.
So I think there are arguably three bugs / places for improvement here:
We can indeed make that differentiation if we want. We've never done it before. There are a bunch of subtleties though. For example, it is possible to build OpenBLAS without LAPACK. In that case what do we do when you only say "give me openblas"? Ignoring the details for now, I think we could go either way. The main two options are:
In both cases, if the user explicitly asks for a specific BLAS and it isn't found, the build fails.

Another change I think I already made is that
Let's make our own wheel builds more robust either way by explicitly selecting OpenBLAS, so that the build always fails if there's a detection issue.
This is a mess and doesn't attempt to fix the issues around numpygh-24200, but it should fix the issue of our wheels not using CBLAS. The suffix may be directly injected into the flags to get the define, and there is also a tangle between `HAVE_CBLAS` and `HAVE_LAPACK` where we only probe the first. I tried this locally by checking that *if* the define exists, redefining it to nonsense gives the slow version. (I cannot test the full thing without a prefixed BLAS locally, because setting the prefix will not avoid BLAS use in LAPACK, meaning you don't use CBLAS, but the BLAS symbols used by LAPACK are missing.)
FWIW, I think gh-24222 should fix the wheels. But if there is a larger fix that isn't complicated, that is better! Erroring out generously seems fine as long as the error is clear about how to work around things. Not sure weird setups (no CBLAS, but a full LAPACK with BLAS) are even worth the trouble in practice? (I would even be open to just dropping lapack-lite if I was sure it's not a big burden on even niche platforms to get a proper LAPACK.)
It was until recently, because Arch Linux built OpenBLAS like that. Luckily they finally fixed it, maybe 2 months ago. So you're probably correct here, no longer worth the trouble.
Me too, that code is more of a burden than helpful at this point (e.g., gh-24198 from a few days ago, older ones like gh-20079, and other issues like build warnings from the low-quality generated C code), and the 50x-slowdown type of behavior is very rarely wanted. Two thoughts:
We should look into this. I would hate for building numpy for Emscripten to become too much like building scipy. On the other hand, we don't want to be holding back the world; if it's a good idea to remove lapack-lite for other reasons, we'll just have to deal with it. And numpy is extremely simple compared to scipy, so maybe we won't need such complicated hacks.
#24222 was merged, which should close this issue. Or would you like to change the title and keep it open for further discussion and cleanup?
It would seem like it based on the issue title, but in my mind there is more that came out of the discussion here. I suggested above that there were three problems, and I think #24222 only takes care of the first one. The remaining two are points (2) and (3) above, namely 2) someone can say "I want fast BLAS" but get a build with slow BLAS (and @rgommers had some ideas for that), and 3) in the wheel build that was broken, show_config() did not report that anything was wrong. So my vote would be to keep this open until those two other things are fixed, or create new issues that outline those two problems.
Just to note, OpenBLAS was used in those wheels, only for the LAPACK functions (which use BLAS functions also!).
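A quick way to see that split in practice (an illustrative sketch only; sizes are arbitrary): a matmul exercises the CBLAS path, while a solve goes through LAPACK, so only the former would show the slowdown if CBLAS detection failed.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((1500, 1500))
b = rng.standard_normal((1500, 1500))

# Time one matmul (CBLAS path) and one solve (LAPACK path).
for name, func in [("a @ b (CBLAS path)", lambda: a @ b),
                   ("np.linalg.solve (LAPACK path)", lambda: np.linalg.solve(a, b))]:
    t0 = time.perf_counter()
    func()
    print(f"{name}: {time.perf_counter() - t0:.3f} s")
```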
Agreed this is a
Also improve the `show_config` output (fixes one of the issues discussed in numpygh-24200), and improve handling of BLAS related logic in `numpy/meson.build`. The `OPENBLAS64_` environment variable is only used in `numpy.distutils.system_info`, so remove it from the places where we no longer use `numpy.distutils`.
I'm taking along the fix for
Thanks @hoodmane. It won't become as complicated, I'm sure. In case we decide we want to move forward with this and it's a problem for Pyodide, we can perhaps just split that part off. That said, let's not worry about it now. I think the first step would only be to fail by default, and let users opt in to a slow BLAS if they don't care about performance for linalg functionality. So for Pyodide the only change would be to pass a flag opting into that.
Yeah I think in the medium term it will be better for us to also use better LAPACK/BLAS implementations. We'll get there eventually =)
I experimented a bit with lapack-lite. When you have any BLAS library on your system, it's pretty difficult to tell the build not to pick it up.
The build completes, and is not linked against a BLAS or LAPACK library:
numpy imports and basic operations work, even `np.linalg.cholesky`:

```python
>>> a = np.array([[1, 0], [0, 1]])
... b = np.array([[4, 1], [2, 2]])
... np.dot(a, b)
array([[4, 1],
       [2, 2]])
>>> a @ b
array([[4, 1],
       [2, 2]])
>>> a @ b.T
array([[4, 2],
       [1, 2]])
>>> A = np.array([[1,-2j],[2j,5]])
>>> np.linalg.cholesky(A)
array([[1.+0.j, 0.+0.j],
       [0.+2.j, 1.+0.j]])
```

However, trying to run the test suite fails, because the test suite imports `numpy.linalg.lapack_lite` directly:

```python
>>> np.test()
Traceback (most recent call last):
  Cell In[21], line 1
    np.test()
  File ~/code/numpy/numpy/_pytesttester.py:182 in __call__
    from numpy.testing import IS_PYPY
  File ~/code/numpy/numpy/testing/__init__.py:11
    from ._private.utils import *
  File ~/code/numpy/numpy/testing/_private/utils.py:25
    import numpy.linalg.lapack_lite
ImportError: /home/rgommers/code/numpy/numpy/linalg/lapack_lite.cpython-39-x86_64-linux-gnu.so: undefined symbol: dlapy3_
```
The following diff fixes the test suite, and it passes then:

```diff
--- a/numpy/testing/_private/utils.py
+++ b/numpy/testing/_private/utils.py
@@ -22,7 +22,6 @@
 from numpy.core import (
     intp, float32, empty, arange, array_repr, ndarray, isnat, array)
 from numpy import isfinite, isnan, isinf
-import numpy.linalg.lapack_lite
 from io import StringIO

@@ -54,7 +53,12 @@ class KnownFailureException(Exception):
 IS_PYPY = sys.implementation.name == 'pypy'
 IS_PYSTON = hasattr(sys, "pyston_version_info")
 HAS_REFCOUNT = getattr(sys, 'getrefcount', None) is not None and not IS_PYSTON
-HAS_LAPACK64 = numpy.linalg.lapack_lite._ilp64
+
+try:
+    import numpy.linalg.lapack_lite
+    HAS_LAPACK64 = numpy.linalg.lapack_lite._ilp64
+except ImportError:
+    HAS_LAPACK64 = False

 _OLD_PROMOTION = lambda: np._get_promotion_state() == 'legacy'
```

So I think I'll fix up the test suite like that and implement a switch to opt into lapack-lite in the build. Of course, there is more to clean up here if we're not importing lapack_lite.
Actually, I think I'll just clean it up straight away. I did a GitHub search and I don't find usage in the wild of the lapack_lite module.
This was broken with `undefined symbol: dlapy3_` because the test suite imports `linalg.lapack_lite` directly. See numpygh-24200 for more details.
Hmm, not entirely true unfortunately -
gh-24279 adds an option for that.
This caught some issues, but is also tripping folks up. The most annoying consequence is when a package has a transitive build-time dependency on numpy and ends up building it from source in an environment without BLAS/LAPACK installed. We'll be reverting to defaulting to allowing the lapack-lite fallback again.
Could we make the default "True if env var NUMPY_ALLOW_NOBLAS is set and False otherwise"? Then these wheel builders or whoever can set this env var. Apologies if someone thought of this and discussed it already...
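A sketch of the proposed default expressed as plain Python (the env var name comes from the comment above; where such a check would actually live in the build is not specified in this thread):

```python
import os

def default_allow_noblas() -> bool:
    # Allow a BLAS-less (lapack-lite) build only if the user opted in
    # via the environment; otherwise keep the fail-by-default behavior.
    return os.environ.get("NUMPY_ALLOW_NOBLAS", "") not in ("", "0", "false")
```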
That is possible, and we may end up doing that at some point.
Update on this: for 1.26.1 we keep the value of the default unchanged.
I just opened #25063 to revert the default to allowing builds without BLAS/LAPACK again. I decided against the environment variable for now; I really don't like it. The situation after gh-25063 will just go back to how it always was - with the difference that the warning about the slow fallback being used is much more prominent in the build log.
Describe the issue:
With the 2.0.0dev0 wheel on scientific-python-nightly-wheels there is a big slowdown for `@` / `dot` for us. First noticed a ~1.5x slowdown in MNE-Python CIs, then went through a bunch of stuff with @seberg on Discord (thanks for your patience!). Finally I think I created a minimal example. TL;DR: 18.6 ms on 1.25.0rc1 from just under a month ago, 911 ms on the latest 2.0.0dev0 on my machine.

I've tried to reproduce this on `main` on my machine as well by building it myself, setting dispatches, using Meson or not, etc., but have only ever managed to get the good/fast time.

Reproduce the code example:
Above
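The reporter's exact snippet isn't preserved in this rendering of the issue; a comparable timing check might look like the following (array size and repeat count are assumptions, not the original script):

```python
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 1000))

# Average matmul time over 10 calls, in milliseconds.
t = timeit("x @ x", globals={"x": x}, number=10) / 10
print(f"matmul: {t * 1e3:.1f} ms per call")
```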
Error message:
Runtime information:
Above
Context for the issue:
Above