BUG: ArgKmin64 on Windows with scipy 1.13rc1 or 1.14.dev times out #28625
Comments
We can make a PR and trigger the nightly build to see if something changed. |
Actually from looking at our outputs the NumPy version that ends up being installed is NumPy |
On the nightly build, we compile against the 2.0 indeed and we have the normal CI running on 1.26. But this is an interesting problem because it means that compiling against 2.0 and having 1.26 could trigger a huge regression. I might try this configuration locally. |
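To check which versions actually end up installed in a given environment, a small helper along these lines may be useful. It is only a sketch: the function name `installed_versions` is made up for illustration, and it reports the versions installed at runtime, not the NumPy version a wheel was compiled against.

```python
from importlib import metadata


def installed_versions(packages=("numpy", "scipy", "scikit-learn")):
    """Return {distribution name: installed version, or None if missing}.

    Hypothetical helper for illustration; it only reads package metadata
    and does not import the packages themselves.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


print(installed_versions())
```

Because it never imports NumPy or SciPy, this should be safe to run even in an environment where importing them fails or hangs.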
... we also have |
Arff this is on Windows actually. I'll have to go through the CI then. |
Yes it's a pain :) Or you could use https://developer.microsoft.com/en-us/windows/downloads/virtual-machines/ if you want (it works for about a month) |
@larsoner is this still happening in the MNE CI and do you have a stand-alone snippet reproducing the issue? I have a Windows VM so I could try to reproduce but I would welcome any help in order to put me on the right track. |
Not minimal by any means but I just used something like this to try to reproduce the same issue on a Windows 10 VM with some stable numpy and scipy releases:
This last one hits a SciPy import error (seen here in CIs just now) due to scipy/scipy#20268 about the C header size change. But going back to a SciPy that works:
things are still okay. I then told the CIs in MNE-Python to use |
OK thanks for the details, let me know if the issue comes back! |
😭 it's back -- from here the failing config is:
├☑ numpy 1.26.4 (OpenBLAS 0.3.23.dev with 2 threads)
├☑ scipy 1.14.0.dev0+577.d891e40
├☑ sklearn 1.5.dev0
Locally in my VM installing these and doing:
So I'm guessing it must have something to do with sklearn's internal use of some SciPy functions. Perhaps it's related to BLAS/LAPACK -- I know some infrastructure there has changed lately for SciPy... I also boiled it down to a one-liner for you @lesteve, hopefully it reproduces for you like it does for me:
python -c "import numpy as np; from sklearn.neighbors import LocalOutlierFactor; LocalOutlierFactor(metric='euclidean').fit_predict(np.zeros((20, 14000)))"
|
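Since the symptom is a hang rather than an error, it can help to run the reproducer in a child process with a timeout so that a hang is reported instead of blocking the terminal or the CI job. The sketch below does that with the standard library only; the helper name `run_with_timeout` and the 120 second limit in the usage note are illustrative assumptions, not something used in this thread.

```python
import subprocess
import sys

# The one-liner from the comment above, kept as a string so that it can be
# passed to a fresh interpreter via `python -c`.
REPRODUCER = (
    "import numpy as np; "
    "from sklearn.neighbors import LocalOutlierFactor; "
    "LocalOutlierFactor(metric='euclidean').fit_predict(np.zeros((20, 14000)))"
)


def run_with_timeout(code, timeout_s):
    """Run `code` in a fresh interpreter; return 'ok', 'error' or 'hang'.

    Hypothetical helper for illustration. 'hang' only means the child did
    not finish within `timeout_s` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            timeout=timeout_s,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return "hang"
    return "ok" if proc.returncode == 0 else "error"
```

Calling `run_with_timeout(REPRODUCER, timeout_s=120)` in the affected environment would then be expected to return `'hang'`, while a healthy environment should return `'ok'`.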
... and oddly enough if I reduce the dimensionality to something like |
I can reproduce the hang, the fact that it still happens with scipy-dev and scikit-learn 1.4.1.post1 would seem to point towards a scipy dev change, but I will try to put together a snippet only using scipy to be 100% sure. |
Setting I am wondering if this is not due to nested parallelism (OpenBLAS within OpenMP) in our neighbors code, not sure how this would be related to scipy-dev though. |
There was a bump to OpenBLAS 0.3.26 in scipy/scipy#20215, maybe that's it? |
Hmmm maybe hard to tell ... I am putting below what I learned so far, to be continued. Here is a simpler scikit-learn snippet reproducing the hang, this seems to be related to ArgKmin in pairwise distances reductions. cc @jeremiedbb and @jjerphan in case they have some insights into this.
from sklearn.metrics._pairwise_distances_reduction import ArgKmin
import numpy as np
import threadpoolctl
# Uncommenting the next line fixes it, a similar line with OpenMP fixes it as well I think
# threadpoolctl.threadpool_limits(limits=1, user_api='blas')
X = np.zeros((20, 14000))
ArgKmin.compute(X=X, Y=X, k=10, metric='euclidean')
The sklearn show_versions info.
|
Based on #28625 (comment), I am afraid that there is not much we can do apart from recommending using conda packages instead of wheels. #23574 is a similar issue in this regard, and its discussions provide some background on this situation. @larsoner: could you report the runtime of the stack which are loaded using |
maybe related to scipy/scipy#20271 |
In my VM:
And I can confirm |
I opened scipy/scipy#20294 to get insights from SciPy developers. I also realised the issue is in the scipy 1.13 release candidate.
This will always be the case with wheels, right? The OpenBLAS shipped with the numpy wheel and the OpenBLAS shipped with the scipy wheel will be different. This is a super common use case, we need to make sure it works ... |
Actually debugging a bit further it seems like this is due to OpenBLAS 0.3.26, I can reproduce the hang with conda-forge packages, see scipy/scipy#20294 (comment) I guess this will need to be reported to OpenBLAS, although putting together some kind of reproducer will be a bit of work. |
Hi. NumPy developer and OpenBLAS packager here. There should be no conflict using those versions of NumPy and SciPy since NumPy is building with the 64bit ILP interfaces. Using
|
Thank you for your help, @mattip. There are generally as many threads as physical cores. Here are some pointers:
|
On my failing CI I had explicitly set OPENBLAS_NUM_THREADS to 2 if that matters. I know it does for OpenBLAS but not sure if sklearn uses it in ArgKmin. Happy to test whatever might help in my VM. Maybe setting OMP_NUM_THREADS=2, since it looks like that would be used by both OpenBLAS and sklearn? |
Okay ran some tests using
This table lists only the combinations out of the 25 tested that hung:
I guess the TL;DR is that there are problems when |
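A grid of thread settings like this can be scripted so that each combination runs in its own process with a timeout. The sketch below is an assumption about how such a sweep might look, not the script used for the table above: it varies `OPENBLAS_NUM_THREADS` and `OMP_NUM_THREADS` (the two variables mentioned earlier in the thread), and the helper name, thread counts and timeout are placeholders.

```python
import itertools
import os
import subprocess
import sys


def find_hanging_combinations(code, openblas_threads, omp_threads, timeout_s):
    """Run `code` once per (OPENBLAS_NUM_THREADS, OMP_NUM_THREADS) pair.

    Hypothetical helper for illustration. Returns the list of pairs for
    which the child interpreter did not finish within `timeout_s` seconds.
    """
    hung = []
    for n_blas, n_omp in itertools.product(openblas_threads, omp_threads):
        # Set the variables in the child's environment so they are in
        # place before the child loads OpenBLAS or OpenMP.
        env = dict(
            os.environ,
            OPENBLAS_NUM_THREADS=str(n_blas),
            OMP_NUM_THREADS=str(n_omp),
        )
        try:
            subprocess.run(
                [sys.executable, "-c", code],
                env=env,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            hung.append((n_blas, n_omp))
    return hung
```

Passing the reproducer one-liner as `code` with, for example, five values for each variable would cover 25 combinations. Running every combination in a fresh process matters because these variables are generally read when the thread pools are initialised.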
In MNE-Python our Windows pip-pre job on Azure has started reliably timing out (and a second example):
Our code just calls the following (and hasn't been changed):
which eventually in the traceback points to the line:
scikit-learn/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py
Line 278 in e5ce4bc
18 hours ago all our tests passed in 40 minutes, then 3 hours ago it started failing 38% through the tests with a 70 minute timeout, and gets to the point only ~27 minutes into the build:
This suggests that the latest scientific-python-nightly-wheels upload of scikit-learn (and/or NumPy) 11 hours ago caused something in here to hang, so probably some recent PR to sklearn or NumPy is the culprit.
Not exactly a MWE -- I'm not on Windows at the moment but could switch at some point -- but maybe someone has an idea about why it's happening...?