This is a follow-up to gh-15050, which describes a performance issue on arm64 macOS due to having two OpenBLAS libraries present. In the arm64 wheels for SciPy 1.7.3, that was fixed by setting a single environment variable (`OPENBLAS_THREAD_TIMEOUT`) in `_distributor_init.py`.
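For reference, a minimal sketch of what such a `_distributor_init.py` workaround can look like (the exact value and the `setdefault` guard are assumptions here, not necessarily what the released wheels do):

```python
import os

# Tell OpenBLAS to put idle worker threads to sleep almost immediately instead
# of busy-waiting, which avoids the contention seen when two OpenBLAS copies
# (one from numpy, one from scipy) live in the same process.
# setdefault() keeps any value the user has already chosen.
os.environ.setdefault("OPENBLAS_THREAD_TIMEOUT", "1")
```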
That may not be the optimal long-term fix, and scikit-learn has seen a similar issue on Linux when two OpenMP runtimes are present. So this issue is for tracking another potential solution. Starting from #15050 (comment), the comment by @ogrisel is copied here:
> Opened MacPython/scipy-wheels#143 for the environment variable solution. I'd like to go with that for 1.7.3. To be discussed I'd say if we keep that or if we want to rebuild OpenBLAS for 1.8.0.

Indeed for a minor release, a minimal fix only for the macos m1 problem is best.
For the more medium-term plan, maybe we could do this series of changes more progressively. Here are some suggestions:
- Building OpenBLAS with `THREAD_TIMEOUT=1` for the wheels will probably help in the short to medium term, by making it possible to work around problems such as scikit-learn/scikit-learn#20642 (poor performance of `sklearn.cluster.KMeans` for numpy >= 1.19.0) when there are several threading runtimes linked to the same Python process.
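As an illustration of the "several threading runtimes in one process" situation, the loaded runtimes can be listed with threadpoolctl (a sketch only; threadpoolctl is assumed to be installed and is not part of the proposal above):

```python
from pprint import pprint

from threadpoolctl import threadpool_info

# Importing numpy and scipy pulls in their bundled BLAS libraries.
import numpy  # noqa: F401
import scipy.linalg  # noqa: F401

# One dict per detected runtime (OpenBLAS, OpenMP, MKL, ...).
# Two "openblas" entries means two copies are competing for the same cores.
pprint(threadpool_info())
```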
For the even longer term, we could plan changes to completely avoid redundant copies of OpenBLAS and OpenMP in the scipy stack. First, for OpenBLAS itself:
- Expose the Cython BLAS API in `numpy.linalg` the same way `scipy.linalg` does it (see the sketch after this list);
- Make scipy use the BLAS implementation of numpy and stop shipping it (while continuing to expose it via the Cython BLAS API replicated in scipy for backward-compat reasons). This means that PyPI will stop shipping two copies of OpenBLAS for the scipy stack, which should help with maintenance in the long run (and reduce the scipy wheel sizes a bit, which is always nice).
- This way numpy would be the official BLAS implementation provider for the full scipy stack.
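For concreteness, a sketch of the BLAS entry point SciPy exposes today on top of its bundled OpenBLAS; the proposal above would provide an equivalent (plus the Cython-level `cython_blas` API) from `numpy.linalg` instead:

```python
import numpy as np
from scipy.linalg.blas import dgemm  # dispatches to the OpenBLAS shipped in the scipy wheel

a = np.random.rand(3, 4)
b = np.random.rand(4, 2)

# C = alpha * A @ B, computed by the bundled BLAS.
c = dgemm(alpha=1.0, a=a, b=b)
print(c.shape)  # (3, 2)
```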
About the switch from native OpenBLAS threads to OpenMP: this would be nice for scikit-learn, but libgomp is still not fork-safe. conda-forge typically replaces libgomp with llvm-openmp by default for this reason. Note that on Linux, gcc is still the recommended compiler for conda-forge, which means that the library is built against libgomp but the OpenMP runtime is replaced afterwards.
Also note that, for some reason, conda-forge decided to use OpenBLAS native threads by default on Linux, while it's also possible to make it use OpenMP. See the 2020-07-17 entry of the conda-forge announcements. Maybe @isuruf can explain why LLVM OpenMP isn't used by default on Linux?
More details on OpenMP in conda-forge are at https://conda-forge.org/docs/maintainer/knowledge_base.html#openmp
Maybe the topic of using OpenMP in the scipy stack deserves a Scientific Python SPEC.
My main reply:
> This way numpy would be the official BLAS implementation provider for the full scipy stack.

This sounds less than ideal to me: we'd have to duplicate the Cython APIs, plus NumPy has far fewer BLAS/LAPACK needs than SciPy and is still on an older minimum required BLAS/LAPACK version than SciPy (that's why, for example, Accelerate is re-enabled in NumPy but cannot be in SciPy).
My first reaction is that shipping a standalone wheel would be preferable to this solution.
> Maybe the topic of using OpenMP in the scipy stack deserves a Scientific Python SPEC.

Yes, I think it does. And it's broader than just packaging; there's an important UX question there as well that deserves thinking about: are we continuing to default to single-threaded everywhere, except for BLAS/LAPACK calls?
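For context on that question: today SciPy defaults to serial execution and exposes parallelism as an explicit opt-in, e.g. via the `workers` argument (illustrative only):

```python
import numpy as np
import scipy.fft

x = np.random.rand(2**20)

y_default = scipy.fft.fft(x)              # default: single-threaded
y_parallel = scipy.fft.fft(x, workers=4)  # explicit opt-in to 4 worker threads
```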
And @ogrisel's reply:
> My first reaction is that shipping a standalone wheel would be preferable to this solution.

Fine with me as well.
> are we continuing to default to single-threaded everywhere, except for BLAS/LAPACK calls?

I cannot say for scipy but for scikit-learn we have already started to use OpenMP multithreaded Cython code by default for some components and are progressively generalizing this approach in the code base.
xref gh-10239 for the issues around OpenMP, and why we don't use it in SciPy.
For the main topic of this issue: shipping separate OpenBLAS and OpenMP wheels isn't great either, and not something I'd like us to be responsible for. But it's probably either that, or stay with the status quo - which perhaps isn't very healthy if scikit-learn continues to use more OpenMP in more of its API.
The issue is specific to PyPI/wheels, because any other binary install mechanism tends to have a proper package manager which can guarantee only a single OpenBLAS/OpenMP library is used. Shipping native dependencies like these as separate wheels is in general an unhealthy idea for multiple reasons, but if it's just limited to OpenBLAS and OpenMP and we can scope it to just the NumPy/SciPy stack for those libraries, then it may be okay to do.