bpo-22393: Fix deadlock from pool worker death without communication #16103
base: main
Conversation
…ueue; adds test for issue22393/issue38084.
This looks good to me, simply a few remarks:
Also pinging @tomMoral
For mine, I think this fix seems more elegant than #10441, but the tests in that PR seem to have more coverage. I personally prefer to just have the task fail, and the pool continue. The current behaviour is that the broken worker is immediately replaced and other work continues, but if you wait on the failed task then it will never complete. Now it does complete (with a failure), which means robust code can re-queue it if appropriate. I don't see any reason to tear down the entire pool. A few comments on the PR incoming.
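A hedged, caller-side sketch of what that recovery can look like (the worker function, pool size, and error handling here are purely illustrative and assume a Pool with this fix applied):

    import multiprocessing
    import os

    def flaky(x):
        # Simulate a worker dying without communicating back to the pool.
        if x == 3:
            os._exit(1)
        return x * x

    if __name__ == '__main__':
        with multiprocessing.Pool(processes=2) as pool:
            pending = {x: pool.apply_async(flaky, (x,)) for x in range(5)}
            for x, res in pending.items():
                try:
                    print(x, res.get(timeout=30))
                except Exception as exc:
                    # With the fix, the failed task completes with an error
                    # instead of hanging forever, so the caller can decide
                    # whether to re-queue it or give up.
                    print(x, 'failed:', exc)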
Lib/multiprocessing/pool.py (Outdated)
worker.join()
cleaned = True
if pid in job_assignments:
Suggested change, replacing the `if pid in job_assignments:` check:

    job = job_assignments.pop(pid, None)
    if job:
        outqueue.put((job, i, (False, RuntimeError("Worker died"))))
And some additional simplification below, of course.
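For orientation, a hedged sketch of how the cleanup in _join_exited_workers might read with that suggestion folded in; the signature and the exact shape of the job_assignments bookkeeping (here pid -> (job, i)) are assumptions for illustration, not lifted verbatim from the patch:

    def _join_exited_workers(self, job_assignments, outqueue):
        # Hypothetical sketch, not the literal diff.
        cleaned = False
        for idx in reversed(range(len(self._pool))):
            worker = self._pool[idx]
            if worker.exitcode is not None:
                pid = worker.pid
                worker.join()
                cleaned = True
                # If the dead worker had claimed a job, report that job as
                # failed so any caller waiting on it gets an answer.
                assignment = job_assignments.pop(pid, None)
                if assignment is not None:
                    job, i = assignment
                    outqueue.put((job, i, (False, RuntimeError("Worker died"))))
                del self._pool[idx]
        return cleaned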
Here is a batch of comments.
I have to say that I like this solution as it is the most robust way of handling this (a kind of scheduler). But it also comes with more complexity and increased communication needs -> more chances for deadlocks.
One of the main arguments for the fail-on-error design is that there is no way to know in the main process whether the worker that died held a lock on one of the communication queues. In that situation, the only way to recover the system and avoid a deadlock is to kill the Pool
and re-spawn one.
    job_assignments[value] = job
else:
    try:
        cache[job]._set(i, (task_info, value))
Why don't you remove the job from job_assignments here? It would avoid unnecessary operations when a worker exits gracefully.
Co-Authored-By: Steve Dower <steve.dower@microsoft.com>
Additional tests would certainly be a good idea.
# Issue22393: test fix of indefinite hang caused by worker processes
# exiting abruptly (such as via os._exit()) without communicating
# back to the pool at all.
prog = (
This can be written much more clearly using a multi-line string. See for example a very similar case in test_shared_memory_cleaned_after_process_termination in this file.
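A hedged sketch of the multi-line form being suggested; the body of the test program is illustrative (a pool whose single worker dies via os._exit()), not the exact program from this PR:

    prog = '''if 1:
        import multiprocessing, os

        def f(x):
            os._exit(1)  # die without telling the pool anything

        if __name__ == "__main__":
            # (assumes a start method under which f can be found in the
            # child of a -c program, e.g. fork)
            with multiprocessing.Pool(1) as pool:
                # Pre-fix this wait never finishes; the timeout on the
                # enclosing subprocess call would then catch the hang.
                pool.apply_async(f, (0,)).wait()
        '''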
# Only if there is a regression will this ever trigger a
# subprocess.TimeoutExpired.
completed_process = subprocess.run(
    [sys.executable, '-E', '-S', '-O', '-c', prog],
The '-O' flag probably shouldn't be used here, but '-S' and '-E' seem fine.
Also, consider calling test.support.script_utils.interpreter_requires_environment(), and only use the '-E' flag if that returns False, as done by the other Python script-running utils in test.support.script_utils. Or just use test.support.script_utils.run_python_until_end() instead of subprocess.run().
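For reference, a rough sketch of that second option. In current CPython these helpers live in test.support.script_helper (the script_utils naming above may be a slip), so treat the exact import path, and the placeholder prog, as assumptions:

    from test.support.script_helper import run_python_until_end

    prog = "import sys; sys.exit(0)"  # placeholder test program

    # run_python_until_end builds the command line (handling the -E /
    # isolated-mode decision based on whether the interpreter requires
    # environment variables), runs it to completion, and returns a
    # (rc, out, err) result plus the command line that was used.
    res, cmd_line = run_python_until_end('-S', '-c', prog)
    if res.rc != 0:
        res.fail(cmd_line)  # raises AssertionError including stdout/stderr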
@applio, I'm not sure where this one is at, but I believe there are some comments that still need to be addressed. I don't know if it's waiting on anything else, but it would probably be nice to get this merged.
Closing and re-opening to re-trigger CI.
This missed the boat for inclusion in Python 3.9, which accepts security fixes only as of today.
The following commit authors need to sign the Contributor License Agreement:
Adds tracking of which worker process in the pool takes which job from the queue.
When a worker process dies without communication, its task/job is also lost. By tracking which job each worker took off the job queue, the parent process can, upon detecting the death, put an item on the result queue indicating the failure of that job.
In case of a future regression, the supplied test runs in a subprocess constrained by a timeout, so an indefinite hang cannot stall the test suite.
https://bugs.python.org/issue22393
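For context, a minimal reproducer of the hang described in bpo-22393 (illustrative; the worker function and pool size are arbitrary). Without the fix the map call never returns, because the result for the lost task never arrives; with the fix that task is reported as failed, so the call raises instead of hanging:

    import multiprocessing
    import os

    def sometimes_dies(i):
        if i == 5:
            os._exit(1)  # the worker vanishes without telling the pool anything
        return i

    if __name__ == '__main__':
        with multiprocessing.Pool(processes=4) as pool:
            print(pool.map(sometimes_dies, range(10)))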