Tests test_theil_sen_parallel and test_multi_output_classification_partial_fit_parallelism hang on Windows #12263
Comments
Thank you for reporting! That's not great :-/ |
Could you also install ...? It is weird that the test passes on AppVeyor if you see a deterministic failure on your machine. |
Here is what I see:
|
It looks like you are missing a part of the trace in |
@tomMoral Yes, my bad. I have updated the earlier comment to include the said line. |
How many workers do you have here? Are you on a server with more than 127 cores? |
The machine has Intel Xeon E5-2630 v3, with 2 sockets, 8 cores per socket, HT: 2. The task manager reports 16 cores, and 32 logical cores. When the test run hangs I do indeed see over 100 python processes in the task manager. |
@oleksandr-pavlyk could you clarify whether this is resolved in master after c81e255? |
I am starting a rackspace VM under windows to give it a try. |
I have tried on a big Dual Intel Xeon E5-2660 v3 running Windows 2012 and scikit-learn 0.20.0 installed from pip in a dedicated conda env exactly as reported by @oleksandr-pavlyk, and I could not reproduce the hangs. I repeated the pytest commands more than 20 times and each completed successfully in less than 2s, and I could see in the task manager that many cores were used. I also launched the full test suite successfully. I also tried to build scikit-learn master and launch the full test suite twice, and it completed without any hang. There are failing doctests under Windows but this is expected (the way the default dtype of integer arrays is printed in doctest output is not necessarily consistent with the output under other platforms, but this is harmless). |
I can still see the hang with a local build from current master (4e81949). Upon execution of the test, there are 130 idle Python processes shown in the task manager and the execution never terminates. Pressing Ctrl+C terminates all 130 processes, which is an improvement over an earlier version of joblib. Is there a way for me to access prebuilt wheels, or other binaries that you used to check whether the problem is fixed? Thanks! |
If you built the master branch then the problem is still there. This is weird because I could not reproduce the issue on either 0.20.0 or master.
Our CI uploads the wheels built on the master branch for 64-bit Python 3.7 and 32-bit Python 2.7 to the following Rackspace blob storage: ... But they should yield the same result as the one you observed by building the master branch yourself. |
Thanks @ogrisel. I downloaded and installed |
retagging see #12548 (comment) |
Could you please run the following on your machine:
from sklearn.externals.joblib import cpu_count
print(cpu_count())
I will try investigating that in the coming days. |
@tomMoral, here it is:
|
Ok, so the problem is caused by the number of CPUs that cpu_count reports. I opened an issue in the loky issue tracker. But it is still strange that you report that many CPUs. Could you also run:
echo %NUMBER_OF_PROCESSORS%
python -c "import multiprocessing as mp; print('mp:', mp.cpu_count())" |
Multiprocessing also reports 128:
(skl-0.21.dev0) tmp>echo %NUMBER_OF_PROCESSORS%
16
(skl-0.21.dev0) tmp>python -c "import multiprocessing as mp; print('mp:', mp.cpu_count())"
mp: 128 |
So there might be a bug in the way Python is introspecting the system to detect the number of processors on this machine. Actually on Python 3, what matters is os.cpu_count(). |
Yes,
|
This might be this bug: https://bugs.python.org/issue30581 It should be fixed in Python 3.7.1. |
Perhaps @anton-malakhov might have an idea for why the CPU count is misreported on this machine. My wish, though, is that this does not cause a hang. |
Here is the fix for bpo-30581: https://github.com/python/cpython/pull/2934/files The hang itself is caused by https://github.com/tomMoral/loky/issues/192, but it would not be triggered if Python did not report a wrong number of CPUs. |
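As a quick diagnostic, one can compare the CPU counts reported by different sources on the affected machine; this is a minimal sketch for illustration, not a snippet from the thread:
# Diagnostic sketch: on a machine hit by bpo-30581 these values disagree with
# the real number of logical CPUs (32 here, while Python reports 128).
import os
import multiprocessing as mp

print("os.cpu_count()              :", os.cpu_count())
print("multiprocessing.cpu_count() :", mp.cpu_count())
print("NUMBER_OF_PROCESSORS        :", os.environ.get("NUMBER_OF_PROCESSORS"))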
@ogrisel By the way,
|
@ogrisel but more cores would still trigger the bug, even with python fixed, right? EC2 has c5.18xlarge with 72 cores for $0.70/h. |
Note that this problem only exists on Windows, so big instances are more around $5-10 per hour. But this should still be fixed in loky. |
@tomMoral fair point ;) |
I volunteer to test fixes for free :) |
On a version of Python with the fix for bpo-30581, ... By that time loky will hopefully be fixed :) |
Actually no, you are right, 64 cores is enough. I misread the above conversation. |
Would using a non-loky joblib.Parallel backend be a sufficient workaround here? |
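For illustration, here is a minimal sketch of such a workaround; it is not taken from the thread, assumes the vendored sklearn.externals.joblib of 0.20, and uses TheilSenRegressor as a stand-in for the affected estimator:
# Workaround sketch: run the parallel section with the threading backend so no
# loky worker processes are spawned (their count derives from the mis-detected
# CPU number and triggers the hang).
import numpy as np
from sklearn.externals.joblib import parallel_backend
from sklearn.linear_model import TheilSenRegressor

rng = np.random.RandomState(0)
X = rng.randn(60, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.01 * rng.randn(60)

with parallel_backend("threading"):
    TheilSenRegressor(n_jobs=2, random_state=0).fit(X, y)
Whether threads are acceptable depends on the workload; the point is only that avoiding process-based workers sidesteps the start-up of more than 100 processes.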
We got bitten by another one of these parallel tests, namely ... At the same time, the test ... passes. The difference between them is that the hanging one sets ... If you think it is a good idea to be explicit about the number of jobs in the test suite (say ... |
Sure, please open a PR with explicit n_jobs in the tests.
|
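For illustration only, a sketch of the kind of change being proposed; the estimator, data, and n_jobs value below are assumptions, not the actual test code:
# Hypothetical test-style snippet: pin n_jobs to a small explicit value instead
# of n_jobs=-1, so the worker count never depends on the (mis)detected CPU count.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
Y = rng.randint(0, 2, size=(100, 3))

clf = MultiOutputClassifier(SGDClassifier(max_iter=5, tol=1e-3), n_jobs=2)  # was n_jobs=-1
classes = [np.unique(Y[:, i]) for i in range(Y.shape[1])]
clf.partial_fit(X, Y, classes=classes)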
This change is to work around the hang scikit-learn#12263 afflicting Windows on machines with > 62 hyperthreads.
It seems this issue is fixed? Closing, please feel free to reopen if it's still an issue. |
Description
Install scikit-learn 0.20.0 on Windows from pip in a dedicated conda environment.
Steps/Code to Reproduce
The following two individual test runs hang (never finish, and remain uninterruptible):
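The exact commands were lost from this report; the following is a guessed reconstruction based on the test names in the title (module paths and pytest flags are assumptions), driven through pytest.main so it stays in Python:
# Hypothetical reconstruction of the two hanging test invocations.
import pytest

pytest.main(["--pyargs", "sklearn.linear_model.tests.test_theil_sen",
             "-k", "test_theil_sen_parallel"])
pytest.main(["--pyargs", "sklearn.tests.test_multioutput",
             "-k", "test_multi_output_classification_partial_fit_parallelism"])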
Expected Results
Expecting them to pass as they do on Linux, or be skipped in the distribution.
Actual Results
Tests hang both in pip-installed scikit-learn and in scikit-learn installed via conda itself.
Reproduced on "Windows Server 2012 R2 Standard".
Versions
@ogrisel