Tests test_theil_sen_parallel and test_multi_output_classification_partial_fit_parallelism hang on Windows #12263

Closed
oleksandr-pavlyk opened this issue Oct 3, 2018 · 35 comments

@oleksandr-pavlyk
Contributor

Description

Install scikit-learn 0.20.0 on Windows as follows:

conda create -n hang python=3.6.5
conda activate hang
pip install scikit-learn==0.20.0
pip install pytest

Steps/Code to Reproduce

The following two individual test runs hang (they never finish and remain uninterruptible):

pytest -v --pyargs sklearn.tests.test_multioutput::test_multi_output_classification_partial_fit_parallelism
pytest -v --pyargs sklearn.linear_model.tests.test_theil_sen::test_theil_sen_parallel

Expected Results

Expecting them to pass as they do on Linux, or to be skipped in the distribution.

Actual Results

Tests hang both with the pip-installed scikit-learn and with scikit-learn installed via conda.

Reproduced on "Windows Server 2012 R2 Standard".

Versions

>pip list
Package        Version
-------------- ---------
atomicwrites   1.2.1
attrs          18.2.0
certifi        2018.8.24
colorama       0.3.9
more-itertools 4.3.0
numpy          1.15.2
pip            10.0.1
pluggy         0.7.1
py             1.6.0
pytest         3.8.2
scikit-learn   0.20.0
scipy          1.1.0
setuptools     40.2.0
six            1.11.0
wheel          0.31.1
wincertstore   0.2

@ogrisel

@amueller
Member

amueller commented Oct 3, 2018

Thank you for reporting! That's not great :-/

@jnothman jnothman added this to the 0.20.1 milestone Oct 4, 2018
@tomMoral
Contributor

tomMoral commented Oct 5, 2018

Could you also install pytest-timeout in your env and add the options -s --timeout=10 to your pytest call? This way we will have more info on why this is stuck.

It is weird that the test passes on AppVeyor if you see a deterministic failure on your machine.

@oleksandr-pavlyk
Contributor Author

oleksandr-pavlyk commented Oct 5, 2018

Here is what I see:

>pytest -s --timeout=10 -v --pyargs sklearn.linear_model.tests.test_theil_sen::test_theil_sen_parallel

Exception in thread QueueManagerThread:

  File "<<elided>>\lib\site-packages\sklearn\externals\joblib\externals\loky\process_executor.py", line 617, in _queue_management_worker
    ready = wait(readers + worker_sentinels)
  File "<<elided>>\lib\multiprocessing\connection.py", line 859, in wait
  File "<<elided>>\lib\multiprocessing\connection.py", line 791, in _exhaustive_wait
    res = _winapi.WaitForMultipleObjects(L, False, timeout)
ValueError: need at most 63 handles, got a sequence of length 129

<< lots of warnings suppressed >>

+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++

~~~~~~~~~~~~~~~~~~~~~~ Stack of QueueFeederThread (22640) ~~~~~~~~~~~~~~~~~~~~~~

  File "<<elided>>\lib\site-packages\sklearn\externals\joblib\externals\loky\backend\queues.py", line 138, in _feed
    nwait()
  File "<<elided>>\lib\threading.py", line 295, in wait
    waiter.acquire()

~~~~~~~~~~~~~~~~~~~~~~~~~ Stack of MainThread (24148) ~~~~~~~~~~~~~~~~~~~~~~~~~~

    testfunction(**testargs)
  File "<elided>\lib\site-packages\sklearn\linear_model\tests\test_theil_sen.py", line 267, in test_theil_sen_parallel
    max_subpopulation=2e3).fit(X, y)
  File "<elided>\site-packages\sklearn\linear_model\theil_sen.py", line 390, in fit
    for job in range(n_jobs))
  File "<<elided>>\lib\site-packages\sklearn\externals\joblib\parallel.py", line 996, in __call__
    self.retrieve()
  File "<<elided>>\lib\site-packages\sklearn\externals\joblib\parallel.py", line 899, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "<<elided>>\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 517, in wrap_future_result
    return future.result(timeout=timeout)
  File "<<elided>>\lib\concurrent\futures\_base.py", line 427, in result
    self._condition.wait(timeout)
  File "<<elided>>\lib\threading.py", line 295, in wait
    waiter.acquire()

+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++

@tomMoral
Contributor

tomMoral commented Oct 5, 2018

It looks like part of the trace in QueueManagerThread is missing.
Can you confirm that the full trace is visible, or update it? (The first line should show the _queue_management_worker function.)

@oleksandr-pavlyk
Contributor Author

@tomMoral Yes, my bad. I have updated the earlier comment to include that line.

@tomMoral
Contributor

tomMoral commented Oct 5, 2018

How many workers do you have here? Are you on a server with more than 127 cores?

@oleksandr-pavlyk
Contributor Author

oleksandr-pavlyk commented Oct 5, 2018

The machine has an Intel Xeon E5-2630 v3, with 2 sockets, 8 cores per socket, and hyper-threading (2 threads per core). Task Manager reports 16 cores and 32 logical processors.

When the test run hangs I do indeed see over 100 Python processes in Task Manager.

@jnothman
Member

@oleksandr-pavlyk could you clarify whether this is resolved in master after c81e255?

@ogrisel
Member

ogrisel commented Nov 12, 2018

I am starting a rackspace VM under windows to give it a try.

@ogrisel
Member

ogrisel commented Nov 12, 2018

I have tried on a big dual Intel Xeon E5-2660 v3 running Windows 2012, with scikit-learn 0.20.0 installed from pip in a dedicated conda env exactly as reported by @oleksandr-pavlyk, and I could not reproduce the hangs. I repeated the pytest commands more than 20 times and each completed successfully in less than 2s, and I could see in Task Manager that many cores were used. I also launched the full test suite successfully.

I also tried building scikit-learn master and launching the full test suite twice, and it completed without any hang. There are failing doctests under Windows, but this is expected (the way the default dtype of integer arrays is printed in doctest output is not always consistent with other platforms, which is harmless).

@oleksandr-pavlyk
Contributor Author

@ogrisel @jnothman Please allow me a moment to build it and run the test. I will report back in about one hour (the expected build + test run time on our server).

@oleksandr-pavlyk
Contributor Author

I can still see the hang with a local build from current master (4e81949).

Upon execution of

pytest -v --pyargs sklearn.tests.test_multioutput::test_multi_output_classification_partial_fit_parallelism

There are 130 idle Python processes shown in Task Manager and the execution never terminates.

Pressing Ctrl+C terminates all 130 processes, which is an improvement over an earlier version of joblib.

Is there a way for me to access prebuilt wheels, or other binaries that you used to check whether the problem is fixed?

Thanks!

@ogrisel
Member

ogrisel commented Nov 13, 2018

If you built the master branch then the problem is still there. This is weird because I could not reproduce the issue, either on 0.20.0 or on master.

Is there a way for me to access prebuilt wheels, or other binaries that you used to check whether the problem is fixed?

Our CI uploads the wheels built on the master branch for 64 bit Python 3.7 and 32 bit Python 2.7 on the following rackspace blob storage:
http://bc1c050a9b3a9f5bf3f5-965058070168eb2f1b8f5a23def30e14.r96.cf2.rackcdn.com/

But they should yield the same result as the one you observed by building the master branch yourself.

@oleksandr-pavlyk
Contributor Author

Thanks @ogrisel. I downloaded and installed scikit_learn-0.21.dev0+20181114100605-cp37-cp37m-win_amd64.whl and can still reproduce the hang.

@amueller amueller modified the milestones: 0.20.1, 0.21 Nov 20, 2018
@amueller
Member

Retagging; see #12548 (comment).

@tomMoral
Contributor

Could you please run the following on your machine:

from sklearn.externals.joblib import cpu_count
print(cpu_count())

I will try investigating that in the coming days.

@oleksandr-pavlyk
Contributor Author

@tomMoral, here it is:

(skl-0.21.dev0) tmp>ipython
Python 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from sklearn.externals.joblib import cpu_count
   ...: print(cpu_count())
   ...:
128

@tomMoral
Contributor

OK, so the problem is caused by loky's implementation of wait, which relies on _winapi.WaitForMultipleObjects and cannot wait on more than MAXIMUM_WAIT_OBJECTS=64 objects.

I opened an issue in the loky tracker (tomMoral/loky#192).
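
For illustration only (this is not necessarily how loky ended up fixing it): the usual way around that WinAPI limit is to split the handles into chunks of at most 63 (CPython's wrapper reserves one of the 64 slots, which is why the error above says "need at most 63 handles") and poll each chunk in turn. A minimal Windows-only sketch with a hypothetical wait_in_chunks helper:

import time
import _winapi  # Windows only; same module as in the traceback above

MAX_HANDLES_PER_CALL = 63  # _winapi.WaitForMultipleObjects rejects more

def wait_in_chunks(handles, timeout_ms):
    """Return one signaled handle, or None on timeout. Abandoned-mutex
    handling and wakeup fairness are deliberately omitted from this sketch."""
    chunks = [handles[i:i + MAX_HANDLES_PER_CALL]
              for i in range(0, len(handles), MAX_HANDLES_PER_CALL)]
    deadline = time.monotonic() + timeout_ms / 1000.0
    while time.monotonic() < deadline:
        for chunk in chunks:
            # Short wait on each chunk; any result other than WAIT_TIMEOUT
            # means the handle at index (res - WAIT_OBJECT_0) is signaled.
            res = _winapi.WaitForMultipleObjects(chunk, False, 10)
            if res != _winapi.WAIT_TIMEOUT:
                return chunk[res - _winapi.WAIT_OBJECT_0]
    return None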

But it is still strange that you report 32 cores on your machine but get cpu_count() == 128. There seems to be another issue here. Could you please report the results of the following commands?

echo %NUMBER_OF_PROCESSORS%
python -c "import multiprocessing as mp; print('mp:', mp.cpu_count())"

@oleksandr-pavlyk
Contributor Author

Multiprocessing also reports 128:

(skl-0.21.dev0) tmp>echo %NUMBER_OF_PROCESSORS%
16

(skl-0.21.dev0) tmp>python -c "import multiprocessing as mp; print('mp:', mp.cpu_count())"
mp: 128

@ogrisel
Member

ogrisel commented Nov 20, 2018

So there might be a bug in the way Python is introspecting the system to detect the number of processors on this machine.

Actually, on Python 3 what matters for loky / joblib is os.cpu_count(). I don't know if it relies on the same underlying implementation as multiprocessing.

@oleksandr-pavlyk
Contributor Author

Yes, os.cpu_count() also returns 128.

In [2]: import os

In [3]: os.cpu_count()
Out[3]: 128

@ogrisel
Member

ogrisel commented Nov 20, 2018

This might be this bug:

https://bugs.python.org/issue30581

It should be fixed in Python 3.7.1.

@oleksandr-pavlyk
Contributor Author

Perhaps @anton-malakhov might have an idea for why os.cpu_count() is returning this value.

My wish, though, is that this does not cause a hang.

@ogrisel
Member

ogrisel commented Nov 20, 2018

Here is the fix for bpo-30581:

https://github.com/python/cpython/pull/2934/files

The hang itself is caused by https://github.com/tomMoral/loky/issues/192 but it would not be triggered if Python did not report a wrong number of CPUs.

@oleksandr-pavlyk
Contributor Author

@ogrisel By the way, os.cpu_count() returns 128 in Python 3.7.1:

(skl-0.21.dev0) tmp>ipython
Python 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import os

In [2]: os.cpu_count()
Out[2]: 128

@amueller
Member

@ogrisel but more cores would still trigger the bug, even with python fixed, right? EC2 has c5.18xlarge with 72 cores for $0.70/h.

@tomMoral
Contributor

but more cores would still trigger the bug, even with python fixed, right? EC2 has c5.18xlarge with 72 cores for $0.70/h.

Note that this problem only exists on Windows, so big instances are more like $5-10 per hour.

But this should still be fixed in loky.

@amueller
Member

@tomMoral fair point ;)

@oleksandr-pavlyk
Contributor Author

I volunteer to test fixes for free :)

@ogrisel
Member

ogrisel commented Nov 20, 2018

but more cores would still trigger the bug, even with python fixed, right? EC2 has c5.18xlarge with 72 cores for $0.70/h.

On a version of Python with the fix for cpu_count, this bug in loky can only be triggered with 128 cores or more. It's hard to find a machine like this in 2018.

By that time loky will be fixed hopefully :)

@ogrisel
Member

ogrisel commented Nov 20, 2018

Actually no, you are right, 64 cores is enough. I misread the above conversation.

@jnothman
Member

jnothman commented Apr 9, 2019

Would using a non-loky joblib.parallel backend be a sufficient workaround here?
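
For anyone who wants to try that suggestion, here is a minimal sketch, assuming the parallel_backend context manager exposed by the joblib vendored in scikit-learn 0.20 (whether this fully avoids the hang is exactly the open question above):

from sklearn.externals.joblib import parallel_backend
from sklearn.datasets import make_regression
from sklearn.linear_model import TheilSenRegressor

X, y = make_regression(n_samples=200, n_features=2, random_state=0)

# Force the threading backend instead of the default loky-based one, so that
# no per-worker process handles are passed to _winapi.WaitForMultipleObjects.
with parallel_backend("threading"):
    TheilSenRegressor(n_jobs=-1, max_subpopulation=2e3).fit(X, y)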

@jnothman jnothman mentioned this issue Apr 9, 2019
@jnothman jnothman modified the milestones: 0.21, 0.21.1 Apr 9, 2019
@oleksandr-pavlyk
Contributor Author

We got bitten by another one of these parallel tests, namely sklearn/decomposition/tests/test_dict_learning.py::test_dict_learning_reconstruction_parallel.

At the same time, the test sklearn/decomposition/tests/test_dict_learning.py::test_sparse_coder_parallel_mmap runs fine.

The difference between them is that the hanging one sets n_jobs=-1, while the working one uses an explicit n_jobs=2.

If you think it is a good idea to be explicit about the number of jobs in the test suite (say n_jobs of 4 or less), I would open a PR. Please let me know.
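
To make the proposal above concrete, a hypothetical sketch of the kind of change being suggested (the names and values are illustrative, not the actual test code): use a small, explicit n_jobs so the worker count no longer depends on how many logical CPUs the host reports.

import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.RandomState(0)
dictionary = rng.randn(10, 8)  # 10 atoms, 8 features
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)  # unit-norm atoms
X = rng.randn(5, 8)

# Explicit, small n_jobs (as in the test that passes) instead of n_jobs=-1,
# which would spawn one worker per reported CPU -- 128 on the machine above.
codes = SparseCoder(dictionary=dictionary, n_jobs=2).transform(X)
print(codes.shape)  # (5, 10)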

@jnothman
Member

jnothman commented Apr 14, 2019 via email

oleksandr-pavlyk added a commit to oleksandr-pavlyk/scikit-learn that referenced this issue Apr 15, 2019: "This change is to work around the hang scikit-learn#12263 afflicting Windows on machines with > 62 hyperthreads."
agramfort pushed a commit that referenced this issue Apr 15, 2019 (same commit message)
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this issue Apr 25, 2019 (same commit message)
xhluca pushed a commit to xhluca/scikit-learn that referenced this issue Apr 28, 2019 (same commit message)
@rth rth modified the milestones: 0.21.1, 0.21.2 May 18, 2019
@jnothman jnothman modified the milestones: 0.21.2, 0.21.3 May 25, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this issue Jul 12, 2019 (same commit message)
@jnothman jnothman modified the milestones: 0.21.3, 0.23 Oct 31, 2019
@adrinjalali
Member

It seems this issue is fixed. Closing; please feel free to reopen if it's still an issue.
