gh-105829: Fix concurrent.futures.ProcessPoolExecutor deadlock #108513
Conversation
If the management thread does not clear the wakeup pipe fast enough, the wakeup code will block while holding the shutdown lock, causing a deadlock. python#105829
This fixes issue python#105829. The _ExecutorManagerThread wake-up code could deadlock if the wake-up pipe filled up and blocked. The relevant code looked like this:

```python
class _ThreadWakeup:
    def wakeup(self):
        if not self._closed:
            self._writer.send_bytes(b"")

    def clear(self):
        if not self._closed:
            while self._reader.poll():
                self._reader.recv_bytes()


class ProcessPoolExecutor(_base.Executor):
    def submit(self, fn, /, *args, **kwargs):
        with self._shutdown_lock:
            ...
            # Wake up queue management thread
            self._executor_manager_thread_wakeup.wakeup()


class _ExecutorManagerThread(threading.Thread):
    def wait_result_broken_or_wakeup(self):
        ...
        with self.shutdown_lock:
            self.thread_wakeup.clear()
```

The shutdown_lock must be taken for both reads and writes of the wake-up pipe. If a read or a write of the pipe blocks, the code will deadlock. It looks like reads can't block (a poll() is done before doing any reads), but writes have no protection against blocking. If the _ExecutorManagerThread cannot keep up and clear the wake-up pipe, it will fill up and block. This seems to have been rather easy to trigger in the real world once the number of tasks exceeds 100000 or so.

With this change we make the writes to the wake-up pipe non-blocking. If the write would block, we simply skip it. This should be OK because both reader and writer must hold the shutdown_lock when accessing the pipe: if the pipe is full, the reader cannot be reading it concurrently, and it will eventually wake up due to the data already in the pipe, so nothing more needs to be sent.
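To illustrate the non-blocking-write idea, here is a minimal, Unix-only sketch (not the exact patch; the class name is made up, and a later revision of this PR takes a different approach by removing the lock around clear() instead):

```python
import os
from multiprocessing.connection import Pipe


class _NonBlockingThreadWakeup:
    """Sketch only: skip the wake-up write when the pipe is full."""

    def __init__(self):
        self._closed = False
        self._reader, self._writer = Pipe(duplex=False)
        # Put the write end in non-blocking mode so wakeup() can never
        # stall while the caller holds the shutdown lock.
        os.set_blocking(self._writer.fileno(), False)

    def wakeup(self):
        if not self._closed:
            try:
                self._writer.send_bytes(b"")
            except BlockingIOError:
                # Pipe is full: the reader already has unread wake-up
                # data and will wake up anyway, so skipping is safe.
                pass

    def clear(self):
        if not self._closed:
            while self._reader.poll():
                self._reader.recv_bytes()
```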
Most changes to Python require a NEWS entry. Please add it using the blurb_it web app or the blurb command-line tool.
Try to reduce the size of the pipe to make the test faster. The solution only works on Unix-like platforms; we fall back on a hardcoded size for other platforms.
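For reference, shrinking a pipe like this is only possible where F_SETPIPE_SZ exists (Linux). A hedged sketch of the idea; the helper name and its defaults are hypothetical, not the test's actual code:

```python
def try_shrink_pipe(fd, requested=4096, fallback=65536):
    """Best-effort reduction of a pipe's capacity.

    Returns the size the caller should assume: the real capacity on Linux
    (the kernel rounds the request up to at least one page), or the
    hardcoded fallback on platforms without F_SETPIPE_SZ.
    """
    try:
        import fcntl  # not available on Windows
        return fcntl.fcntl(fd, fcntl.F_SETPIPE_SZ, requested)
    except (ImportError, AttributeError, OSError):
        return fallback
```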
Yes, the test confirms that deadlocks do not happen using either solution. The benefit of pitrou's version is that we also remove the lock contention for reads/writes of the wakeup pipe. The submitting main thread can still potentially block in that version, but as soon as the manager thread wakes up it will clear the wakeup pipe, and the writer should be able to submit several thousand more tasks before it could potentially block again (16k tasks for the default Linux 64k pipe). Since there are already other cases in the submit code that could block, I don't think this is an issue. The important thing is that the blocking won't cause a deadlock in this case. The bigger issue might be if we can't say for sure that removing the lock on the reader is safe. Both @pitrou and I have looked at the code and we believe it is safe to do, but that is no guarantee. It would be really nice if there were some sort of stress test that could be used to really hammer this code for a while to increase confidence in this type of change, but I haven't found anything like that in the tests. Is there something like that available somewhere?
For those looking for a temporary fix, here is a drop-in replacement of
@gciaberi - yes, I'd wondered why a `Queue` was used for this, pumping empty strings down feels hacky compared to a bounded semaphore.
Chris, feel free to try on your own. Given how this queue is used inside the module, it might not be technically possible to replace it with a semaphore.
On 9 September 2023 at 11:33:24 GMT+02:00, Chris Withers ***@***.***> wrote:
…
@gciaberi - yes, I'd wondered why a `Queue` was used for this, pumping empty strings down feels hacky compared to a bounded semaphore. @pitrou / @elfstrom - how do you both feel about switching `ProcessPoolExecutor` to use a `BoundedSemaphore` instead of a queue? @sharvil: is there any chance you could try this `BoundedSemaphore` approach instead and see if it fixes your issues?
My assumption about the current design is that a pipe was chosen for the wakeup mechanism because it allows the management thread to use select (connection.wait) to wait on the FDs of the result queue, the worker process termination sentinels, and the wakeup events at the same time, and wake when any one of these events occurs. A semaphore is not compatible with this type of wait mechanism as far as I know. It might be possible to use a semaphore anyway by converting the other events to trigger the semaphore as well, but that seems like it could be complicated and lead to unintended side effects and problematic corner-case behaviour. I would be very hesitant to make such a large change in the implementation strategy of this module given the inherent complexity of the problem and the relatively low amount of testing available.
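As a hedged illustration of that point (the function and the stand-in names here are hypothetical, not the executor's real code), multiprocessing.connection.wait can watch all of these objects in a single blocking call, which is what makes a pipe-based wake-up convenient:

```python
from multiprocessing import Pipe, connection

# Hypothetical stand-in for the executor's real wake-up pipe.
wakeup_reader, wakeup_writer = Pipe(duplex=False)


def wait_for_any_event(result_reader, worker_sentinels, timeout=None):
    """Block until a result arrives, a worker dies, or a wake-up is sent."""
    waitables = [result_reader, wakeup_reader, *worker_sentinels]
    return connection.wait(waitables, timeout=timeout)
```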
Thanks @elfstrom for taking the time to explain it more thoroughly (I'm on a mobile phone). Yes, the need to wait on multiple synchronization objects, and do it cross-platform, limits our options here. It's not obvious whether there's a way to make semaphores work for this.
Alternatives to waiting on multiple objects at once, such as a semi-busy wait loop, are undesirable for multiple reasons.
…adlock

This reverts the previous fix and instead opts to remove the locking completely when clearing the wakeup pipe. We can do this because clear() and close() are both called from the same thread and nowhere else.

In this version of the fix, the call to ProcessPoolExecutor.submit can still block on the wakeup pipe if it happens to fill up. This should not be an issue, as there are already other cases where the submit call can block, and if the wakeup pipe is full it implies there are already a lot of work items queued up.

Co-authored-by: Antoine Pitrou <antoine@python.org>
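In other words, only the manager thread ever reads or closes the wake-up pipe, so its clear() call no longer needs shutdown_lock. A hedged sketch of that shape (reusing the names from the snippet in the PR description, not the merged diff itself):

```python
import threading


class _ExecutorManagerThread(threading.Thread):
    # Sketch: the reader side of the wake-up pipe is touched only by this
    # thread, so clearing it does not need shutdown_lock.  submit() may
    # still block briefly on a full pipe, but it can no longer deadlock,
    # because the reader is never locked out while a writer is blocked.
    def wait_result_broken_or_wakeup(self):
        ...
        self.thread_wakeup.clear()  # previously: with self.shutdown_lock:
```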
@elfstrom It's entirely fine :-) However, you will also need to merge from main and fix the conflicts now.
Co-authored-by: Thomas Moreau <thomas.moreau.2010@gmail.com>
Change test strategy. We now force the main thread to block during the wake-up call by mocking the wake-up object and artificially limiting it to a single wake-up before blocking. This allows us to reduce some timeouts and the number of tasks, and lower the total runtime of the test. It should also guarantee a blocking main thread on all platforms, regardless of any pipe buffer sizes. The drawback is that the test is now a bit opinionated about how we fix this issue (i.e. just making the wake-up pipe non-blocking would not satisfy this test even though it is a valid fix for the issue).
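A hedged sketch of that kind of test double (the class name and its knobs are made up for illustration and are not the test's actual helper):

```python
import threading


class BlockingWakeup:
    """Wraps a real wake-up object and blocks the caller after `limit`
    wake-ups, simulating a wake-up pipe that has filled up."""

    def __init__(self, real_wakeup, limit=1):
        self._real = real_wakeup
        self._limit = limit
        self._count = 0
        self.unblock = threading.Event()  # the test sets this to release submit()

    def wakeup(self):
        self._count += 1
        if self._count > self._limit:
            self.unblock.wait()  # emulate a blocking write on a full pipe
        self._real.wakeup()

    def __getattr__(self, name):
        # Delegate everything else (clear(), close(), ...) to the real object.
        return getattr(self._real, name)
```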
@cjw296 Is there something more that I need to do to get this ready for merge?
Sorry, @elfstrom and @cjw296, I could not cleanly backport this to |
Sorry, @elfstrom and @cjw296, I could not cleanly backport this to |
@elfstrom - are you able to do the backport PRs? (3.11 being the most pressing! :-) )
@elfstrom - also, could you have a look at these
It looks like flakiness to me:

```python
def test_future_times_out(self):
    """Test ``futures.as_completed`` timing out before
    completing it's final future."""
    already_completed = {CANCELLED_AND_NOTIFIED_FUTURE,
                         EXCEPTION_FUTURE,
                         SUCCESSFUL_FUTURE}

    for timeout in (0, 0.01):
        with self.subTest(timeout):
            future = self.executor.submit(time.sleep, 0.1)
            completed_futures = set()
            try:
                for f in futures.as_completed(
                    already_completed | {future},
                    timeout
                ):
                    completed_futures.add(f)
            except futures.TimeoutError:
                pass

            # Check that ``future`` wasn't completed.
            self.assertEqual(completed_futures, already_completed)
```
It looks like the future completes when the test expects it not to. The future is just
I will look into the backports tomorrow if I get some spare time.
GH-109783 is a backport of this pull request to the 3.11 branch.
GH-109784 is a backport of this pull request to the 3.12 branch.
…adlock (pythonGH-108513)

This fixes issue pythonGH-105829.

(cherry picked from commit 405b063)

Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Chris Withers <chris@withers.org>
Co-authored-by: Thomas Moreau <thomas.moreau.2010@gmail.com>
Co-authored-by: elfstrom <elfstrom@users.noreply.github.com>
…GH-108513) (#109783) This fixes issue GH-105829, #105829 (cherry picked from commit 405b063)
@elfstrom - thanks for the backports!
Well, you've done an amazing job with this issue; if you're feeling super keen, you could always make these tests more reliable 😃
The added test is unstable: it fails randomly. See issue gh-109917.
I confirm that the test is unstable: issue gh-109565. Can someone have a look?
@vstinner - that test wasn't added in this PR, though? The only test added here was
…ython#108513)

This fixes issue python#105829.

Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Chris Withers <chris@withers.org>
Co-authored-by: Thomas Moreau <thomas.moreau.2010@gmail.com>
…GH-108513) (#109784) This fixes issue GH-105829, #105829 (cherry picked from commit 405b063)
This fixes issue #105829
The _ExecutorManagerThread wake-up code could deadlock if the wake-up pipe filled up and blocked.
Please review this very carefully; I'm not 100% convinced I understand all the details here. I'm mostly submitting this as a proof-of-concept test/fix due to the slow progress on the issue. Hopefully this can serve as a starting point at least.
concurrent.futures.ProcessPoolExecutor pool deadlocks when submitting many tasks #105829