
BUG: ArgKmin64 on Windows with scipy 1.13rc1 or 1.14.dev times out #28625


Closed
larsoner opened this issue Mar 13, 2024 · 24 comments · Fixed by #28692

@larsoner
Contributor

In MNE-Python our Windows pip-pre job on Azure has started reliably timing out (and a second example):

mne/preprocessing/tests/test_interpolate.py::test_find_centroid PASSED   [ 38%]
##[error]The Operation will be canceled. The next steps may not contain expected logs.
Fatal Python error: PyThreadState_Get: the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)
Python runtime state: initialized

...
Thread 0x000014fc (most recent call first):
  File "C:\hostedtoolcache\windows\Python\3.11.8\x64\Lib\site-packages\sklearn\metrics\_pairwise_distances_reduction\_dispatcher.py", line 278 in compute
  File "C:\hostedtoolcache\windows\Python\3.11.8\x64\Lib\site-packages\sklearn\neighbors\_base.py", line 850 in kneighbors
  File "C:\hostedtoolcache\windows\Python\3.11.8\x64\Lib\site-packages\sklearn\neighbors\_lof.py", line 291 in fit
  File "C:\hostedtoolcache\windows\Python\3.11.8\x64\Lib\site-packages\sklearn\base.py", line 1474 in wrapper
  File "C:\hostedtoolcache\windows\Python\3.11.8\x64\Lib\site-packages\sklearn\neighbors\_lof.py", line 256 in fit_predict
  File "D:\a\1\s\mne\preprocessing\_lof.py", line 89 in find_bad_channels_lof
  File "<decorator-gen-627>", line 12 in find_bad_channels_lof
  File "D:\a\1\s\mne\preprocessing\tests\test_lof.py", line 31 in test_lof
...

Our code just calls the following (and hasn't been changed):

    clf = LocalOutlierFactor(n_neighbors=n_neighbors, metric=metric)
    clf.fit_predict(data)

which, per the traceback above, eventually reaches compute() in sklearn\metrics\_pairwise_distances_reduction\_dispatcher.py (line 278).
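
For reference, a self-contained sketch of that call (the array shape is borrowed from the reproducer that appears later in this thread; the zeros data and the n_neighbors value are illustrative assumptions, not MNE's actual inputs):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Illustrative stand-in for MNE's data; per the traceback above, this call
# ends up in ArgKmin.compute on affected setups.
data = np.zeros((20, 14000))
clf = LocalOutlierFactor(n_neighbors=10, metric="euclidean")
labels = clf.fit_predict(data)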

18 hours ago all our tests passed in 40 minutes; then 3 hours ago the job started failing 38% of the way through the tests with a 70-minute timeout, a point it reaches only ~27 minutes into the build:

[screenshot: Azure pipeline test durations]

This suggests that the latest scientific-python-nightly-wheels upload of scikit-learn (and/or NumPy) 11 hours ago caused something in here to hang, so probably some recent PR to sklearn or NumPy is the culprit.

Not exactly an MWE -- I'm not on Windows at the moment but could switch at some point -- but maybe someone has an idea about why it's happening...?

@glemaitre
Member

We can make a PR and trigger the nightly build to see if something changed.

@larsoner
Contributor Author

Actually, from looking at our outputs, the NumPy version that ends up being installed is NumPy 1.26.4 (OpenBLAS 0.3.23.dev with 2 threads), so it may just be a NumPy 2.0 compat issue, in that you (presumably) compile your wheels against NumPy 2.0 dev or beta.

@glemaitre
Member

On the nightly build we do compile against NumPy 2.0, and the normal CI runs on 1.26. But this is an interesting problem, because it means that compiling against 2.0 while running against 1.26 could trigger a huge regression. I might try this configuration locally.

glemaitre removed the Needs Triage label Mar 13, 2024
@larsoner
Contributor Author

... we also have scipy 1.13.0.dev0+1500.8c5c4fa on that run if it matters (see the print config step from any of our failing CI runs), but I assume sklearn has its own code for computing pairwise distances rather than using something like cdist/pdist (otherwise it would be SciPy's problem!).

@glemaitre
Member

Arff this is on Windows actually. I'll have to go through the CI then.

@larsoner
Contributor Author

Yes it's a pain :) Or you could use https://developer.microsoft.com/en-us/windows/downloads/virtual-machines/ if you want (it works for about a month)

ogrisel added the cython label Mar 14, 2024
@lesteve
Member

lesteve commented Mar 18, 2024

@larsoner is this still happening in the MNE CI and do you have a stand-alone snippet reproducing the issue?

I have a Windows VM so I could try to reproduce but I would welcome any help in order to put me on the right track.

@larsoner
Contributor Author

Not minimal by any means but I just used something like this to try to reproduce the same issue on a Windows 10 VM with some stable numpy and scipy releases:

git clone git@github.com:/larsoner/mne-python.git
cd mne-python
git checkout -b check origin/check  # branch with the problematic test reenabled
pytest mne\preprocessing\tests\test_lof.py  # works
pip install --index-url "https://pypi.anaconda.org/scientific-python-nightly-wheels/simple" "scikit-learn>=1.5.dev0"
pytest mne\preprocessing\tests\test_lof.py  # still works!
pip install --index-url "https://pypi.anaconda.org/scientific-python-nightly-wheels/simple" "scipy>=1.14.0.dev0"

This last one hits a SciPy import error (seen here in CIs just now) due to scipy/scipy#20268 about the C header size change. But going back to a SciPy that works:

pip install "scipy==1.12.0"
pytest mne\preprocessing\tests\test_lof.py

things are still okay. I then told the CIs in MNE-Python to use scipy==1.12.0 and things are okay there, too. So I'm guessing it's probably no longer an issue, and thus closing -- but I'll reopen if it pops back up once the NumPy<->SciPy compat issue is sorted out and I re-enable SciPy dev!

@lesteve
Member

lesteve commented Mar 18, 2024

OK thanks for the details, let me know if the issue comes back!

@larsoner
Contributor Author

larsoner commented Mar 19, 2024

😭 it's back -- from here the failing config is:

├☑ numpy                1.26.4 (OpenBLAS 0.3.23.dev with 2 threads)
├☑ scipy                1.14.0.dev0+577.d891e40
├☑ sklearn              1.5.dev0

Locally in my VM, installing these and running:

$ pip install --extra-index-url "https://pypi.anaconda.org/scientific-python-nightly-wheels/simple" "scikit-learn>=1.5.dev0" "scipy>=1.14.0.dev0" "numpy==1.26.4"
$ pytest mne/preprocessing/tests/test_lof.py

1. I can reproduce the hang!
2. Doing pip install scipy==1.12.0 fixes it.
3. It doesn't matter which numpy I have installed; the same hang occurs with both 1.26.4 and 2.1.0.dev0.

So I'm guessing it must have something to do with sklearn's internal use of some SciPy functions. Perhaps it's related to BLAS/LAPACK -- I know some infrastructure there has changed lately for SciPy...

I also boiled it down to a one-liner for you @lesteve, hopefully it reproduces for you like it does for me:

python -c "import numpy as np; from sklearn.neighbors import LocalOutlierFactor; LocalOutlierFactor(metric='euclidean').fit_predict(np.zeros((20, 14000)))"

@larsoner
Contributor Author

... and oddly enough, if I reduce the dimensionality to something like (20, 1400), it does not hang. So there's something about the size being big enough, like (20, 14000), that triggers the issue (maybe you have "big" and "small" code paths?).

@lesteve
Member

lesteve commented Mar 19, 2024

I can reproduce the hang. The fact that it still happens with scipy-dev and scikit-learn 1.4.1.post1 would seem to point towards a scipy dev change, but I will try to put together a snippet using only scipy to be 100% sure.

@lesteve
Member

lesteve commented Mar 19, 2024

Setting OPENBLAS_NUM_THREADS=1 or OMP_NUM_THREADS=1 seems to get rid of the issue, maybe a reasonable work-around in the short term (using threadpoolctl in a single test is an alternative if setting an environment variable is not convenient).
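
A sketch of the threadpoolctl route for a single test (the test body here is a placeholder reproduction, not MNE's actual test):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from threadpoolctl import threadpool_limits

def test_lof_single_threaded_blas():
    # Limit only BLAS threads around the affected call; this mirrors
    # OPENBLAS_NUM_THREADS=1 without touching the environment.
    with threadpool_limits(limits=1, user_api="blas"):
        LocalOutlierFactor(metric="euclidean").fit_predict(np.zeros((20, 14000)))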

I am wondering if this is due to nested parallelism (OpenBLAS within OpenMP) in our neighbors code; not sure how this would be related to scipy-dev though.

@larsoner
Copy link
Contributor Author

There was a bump to OpenBLAS 0.3.26 in scipy/scipy#20215, maybe that's it?

@lesteve
Member

lesteve commented Mar 19, 2024

There was a bump to OpenBLAS 0.3.26 in scipy/scipy#20215, maybe that's it?

Hmmm maybe hard to tell ... I am putting below what I learned so far, to be continued.

Here is a simpler scikit-learn snippet reproducing the hang; it seems to be related to ArgKmin in the pairwise distances reductions. cc @jeremiedbb and @jjerphan in case they have some insights into this.

from sklearn.metrics._pairwise_distances_reduction import ArgKmin
import numpy as np
import threadpoolctl

# Uncommenting the next line fixes it, a similar line with OpenMP fixes it as well I think
# threadpoolctl.threadpool_limits(limits=1, user_api='blas')
X = np.zeros((20, 14000))
ArgKmin.compute(X=X, Y=X, k=10, metric='euclidean')
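
For completeness, the OpenMP-side line alluded to in the comment above would presumably be (untested assumption):

# Presumed OpenMP equivalent of the BLAS limit above (untested):
threadpoolctl.threadpool_limits(limits=1, user_api='openmp')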

The sklearn show_versions output:

System:
    python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:42:31) [MSC v.1937 64 bit (AMD64)]
executable: C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
      sklearn: 1.4.1.post1
          pip: 24.0
   setuptools: 69.2.0
        numpy: 1.26.4
        scipy: 1.14.0.dev0
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.3.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 4
         prefix: vcomp
       filepath: C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue\Lib\site-packages\sklearn\.libs\vcomp140.dll
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue\Lib\site-packages\numpy.libs\libopenblas64__v0.3.23-293-gc2f4bdbb-gcc_10_3_0-2bde3a66a51006b2b53eb373ff767a3f.dll
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue\Lib\site-packages\scipy.libs\libopenblas_v0.3.26-gcc_10_3_0-75ebbb8345f75277878db24d649d8b7e.dll
        version: 0.3.26
threading_layer: pthreads
   architecture: Haswell

@jjerphan
Member

jjerphan commented Mar 19, 2024

Based on #28625 (comment), libopenblas seems to be loaded twice. If this is the case, this leads to undefined behavior.

I am afraid that there is not much we can do apart from recommending using conda packages instead of wheels. #23574 is a similar issue in this regard, whose discussion provides some background on this situation.

@larsoner: could you report the runtimes that are loaded for this case, using threadpoolctl?

@jeremiedbb
Member

maybe related to scipy/scipy#20271

@larsoner
Contributor Author

In my VM:

$ python -c "import numpy; import scipy.linalg; import sklearn.neighbors; from threadpoolctl import threadpool_info; from pprint import pprint; pprint(threadpool_info())"
[{'architecture': 'Haswell',
  'filepath': 'C:\\Users\\tester\\mne-python\\1.6.1_0\\Lib\\site-packages\\numpy.libs\\libopenblas64__v0.3.23-293-gc2f4bdbb-gcc_10_3_0-2bde3a66a51006b2b53eb373ff767a3f.dll',
  'internal_api': 'openblas',
  'num_threads': 4,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.23.dev'},
 {'architecture': 'Haswell',
  'filepath': 'C:\\Users\\tester\\mne-python\\1.6.1_0\\Lib\\site-packages\\scipy.libs\\libopenblas_v0.3.26-gcc_10_3_0-75ebbb8345f75277878db24d649d8b7e.dll',
  'internal_api': 'openblas',
  'num_threads': 4,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.26'},
 {'filepath': 'C:\\Users\\tester\\mne-python\\1.6.1_0\\Lib\\site-packages\\sklearn\\.libs\\vcomp140.dll',
  'internal_api': 'openmp',
  'num_threads': 4,
  'prefix': 'vcomp',
  'user_api': 'openmp',
  'version': None}]

Azure:

[{'architecture': 'SkylakeX',
  'filepath': 'C:\\hostedtoolcache\\windows\\Python\\3.11.8\\x64\\Lib\\site-packages\\scipy.libs\\libopenblas_v0.3.26-gcc_10_3_0-75ebbb8345f75277878db24d649d8b7e.dll',
  'internal_api': 'openblas',
  'num_threads': 2,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.26'},
 {'filepath': 'C:\\hostedtoolcache\\windows\\Python\\3.11.8\\x64\\Lib\\site-packages\\sklearn\\.libs\\vcomp140.dll',
  'internal_api': 'openmp',
  'num_threads': 2,
  'prefix': 'vcomp',
  'user_api': 'openmp',
  'version': None}]

And I can confirm OPENBLAS_NUM_THREADS=1 fixes it locally at least.
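
One caveat with applying that workaround from within Python: OpenBLAS reads OPENBLAS_NUM_THREADS when the library is first loaded, so the variable has to be set before NumPy/SciPy are imported. A sketch (the conftest.py placement is just one option):

# conftest.py (sketch): pin OpenBLAS to one thread. This must execute
# before the first `import numpy` / `import scipy` loads OpenBLAS.
import os

os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")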

@lesteve
Member

lesteve commented Mar 20, 2024

I opened scipy/scipy#20294 to get insights from SciPy developers. I also realised the issue is present with the scipy 1.13 release candidate.

Based on #28625 (comment), libopenblas seems to be loaded twice. If this is the case, this leads to undefined behavior.

I am afraid that there is not much we can do apart from recommending using conda packages instead of wheels.

This will always be the case with wheels, right? The OpenBLAS shipped with the numpy wheel and the OpenBLAS shipped with the scipy wheel will be different. This is a super common use case; we need to make sure it works ...

lesteve changed the title from "BUG: ArgKmin64 on Windows with NumPy 2.0 times out" to "BUG: ArgKmin64 on Windows with scipy 1.13rc1 or 1.14.dev times out" Mar 20, 2024
@lesteve
Member

lesteve commented Mar 20, 2024

Actually, debugging a bit further, it seems this is due to OpenBLAS 0.3.26: I can reproduce the hang with conda-forge packages, see scipy/scipy#20294 (comment).

I guess this will need to be reported to OpenBLAS, although putting together some kind of reproducer will be a bit of work.

@mattip

mattip commented Mar 20, 2024

Hi. NumPy developer and OpenBLAS packager here. There should be no conflict using those versions of NumPy and SciPy, since NumPy builds with the 64-bit (ILP64) interfaces. The fact that OPENBLAS_NUM_THREADS=1 helps points to some problem with threading. Do you know how many concurrent tasks are created when calling the reproducer?

python -c "import numpy as np; from sklearn.neighbors import LocalOutlierFactor; \
     LocalOutlierFactor(metric='euclidean').fit_predict(np.zeros((20, 14000)))"

@jjerphan
Member

jjerphan commented Mar 20, 2024

Thank you for your help, @mattip.

There are generally as many threads as physical cores. Here are some pointers:

  • # Not using less, not using more.
    self.chunks_n_threads = min(
        self.Y_n_chunks if self.execute_in_parallel_on_Y else self.X_n_chunks,
        self.effective_n_threads,
    )

  • self.effective_n_threads = _openmp_effective_n_threads()

  • cpdef _openmp_effective_n_threads(n_threads=None, only_physical_cores=True):
        """Determine the effective number of threads to be used for OpenMP calls

        - For ``n_threads = None``,
          - if the ``OMP_NUM_THREADS`` environment variable is set, return
            ``openmp.omp_get_max_threads()``
          - otherwise, return the minimum between ``openmp.omp_get_max_threads()``
            and the number of cpus, taking cgroups quotas into account. Cgroups
            quotas can typically be set by tools such as Docker.
          The result of ``omp_get_max_threads`` can be influenced by environment
          variable ``OMP_NUM_THREADS`` or at runtime by ``omp_set_num_threads``.

        - For ``n_threads > 0``, return this as the maximal number of threads for
          parallel OpenMP calls.

        - For ``n_threads < 0``, return the maximal number of threads minus
          ``|n_threads + 1|``. In particular ``n_threads = -1`` will use as many
          threads as there are available cores on the machine.

        - Raise a ValueError for ``n_threads = 0``.

        Passing the `only_physical_cores=False` flag makes it possible to use extra
        threads for SMT/HyperThreading logical cores. It has been empirically
        observed that using as many threads as available SMT cores can slightly
        improve the performance in some cases, but can severely degrade
        performance other times. Therefore it is recommended to use
        `only_physical_cores=True` unless an empirical study has been conducted to
        assess the impact of SMT on a case-by-case basis (using various input data
        shapes, in particular small data shapes).

        If scikit-learn is built without OpenMP support, always return 1.
        """
        if n_threads == 0:
            raise ValueError("n_threads = 0 is invalid")

        if not SKLEARN_OPENMP_PARALLELISM_ENABLED:
            # OpenMP disabled at build-time => sequential mode
            return 1

        if os.getenv("OMP_NUM_THREADS"):
            # Fall back to user provided number of threads making it possible
            # to exceed the number of cpus.
            max_n_threads = omp_get_max_threads()
        else:
            try:
                n_cpus = _CPU_COUNTS[only_physical_cores]
            except KeyError:
                n_cpus = cpu_count(only_physical_cores=only_physical_cores)
                _CPU_COUNTS[only_physical_cores] = n_cpus
            max_n_threads = min(omp_get_max_threads(), n_cpus)

        if n_threads is None:
            return max_n_threads
        elif n_threads < 0:
            return max(1, max_n_threads + n_threads + 1)

        return n_threads
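
To see the concrete numbers these code paths produce on a given machine, something like the following should work (note: _openmp_helpers is a private scikit-learn module, so this import path is a debugging aid, not a stable API):

# Debugging sketch: print the OpenMP thread count sklearn computes, plus
# the thread count of each loaded BLAS/OpenMP runtime.
from sklearn.utils._openmp_helpers import _openmp_effective_n_threads
from threadpoolctl import threadpool_info

print("effective OpenMP threads:", _openmp_effective_n_threads())
for lib in threadpool_info():
    print(lib["prefix"], lib.get("version"), "num_threads =", lib["num_threads"])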

@larsoner
Contributor Author

On my failing CI I had explicitly set OPENBLAS_NUM_THREADS to 2, if that matters. I know OpenBLAS honors it, but I'm not sure whether sklearn uses it in ArgKmin. Happy to test whatever might help in my VM. Maybe setting OMP_NUM_THREADS=2, since it looks like that would be used by both OpenBLAS and sklearn?

@larsoner
Contributor Author

larsoner commented Mar 20, 2024

Okay, I ran some tests varying OPENBLAS_NUM_THREADS over range(1, 6) and OMP_NUM_THREADS over range(1, 6) simultaneously, to mess with OpenBLAS and sklearn parallelization, respectively (I think!), using commands of the form:

OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 python -c "import numpy as np; from sklearn.neighbors import LocalOutlierFactor; LocalOutlierFactor(metric='euclidean').fit_predict(np.zeros((20, 14000)))"

This table lists only the combinations out of the 25 tested that hung:

OPENBLAS_NUM_THREADS (OpenBLAS)    OMP_NUM_THREADS (sklearn)
2                                  2
3                                  3
4                                  2
4                                  3
4                                  4
4                                  5
5                                  2
5                                  3
5                                  4
5                                  5

I guess the TL;DR is that there are problems when OPENBLAS_NUM_THREADS == OMP_NUM_THREADS != 1, or when OPENBLAS_NUM_THREADS >= 4 and OMP_NUM_THREADS >= 2 (I didn't test beyond 5 exhaustively, but spot checks at 7 and 8 also hung).
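
For anyone who wants to rerun the sweep, a sketch of how it can be scripted (treating anything over a 60-second timeout as a hang is an arbitrary choice):

# Sketch: sweep OPENBLAS_NUM_THREADS x OMP_NUM_THREADS over 1..5 and flag
# combinations where the reproducer exceeds the timeout.
import itertools
import os
import subprocess
import sys

REPRO = (
    "import numpy as np; from sklearn.neighbors import LocalOutlierFactor; "
    "LocalOutlierFactor(metric='euclidean').fit_predict(np.zeros((20, 14000)))"
)

for blas, omp in itertools.product(range(1, 6), range(1, 6)):
    env = {**os.environ,
           "OPENBLAS_NUM_THREADS": str(blas),
           "OMP_NUM_THREADS": str(omp)}
    try:
        subprocess.run([sys.executable, "-c", REPRO], env=env, timeout=60)
        status = "ok"
    except subprocess.TimeoutExpired:
        status = "HANG"
    print(f"OPENBLAS_NUM_THREADS={blas} OMP_NUM_THREADS={omp}: {status}")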
