bpo-22393: Fix multiprocessing.Pool hangs if a worker process dies unexpectedly #10441

Open
wants to merge 16 commits into main

Conversation


@oesteban oesteban commented Nov 9, 2018

This PR fixes issue 22393.

Three new unit tests have been added.

https://bugs.python.org/issue22393
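
For context, a minimal reproduction of the hang this PR addresses (my own sketch, not part of the PR; behavior as described in bpo-22393):

import os
from multiprocessing import Pool

def crash(_):
    # Die abruptly, bypassing the worker's normal exception handling.
    os._exit(1)

if __name__ == '__main__':
    with Pool(2) as pool:
        # Before the fix, this call never returns: the tasks were handed to
        # workers that died, and the pool keeps waiting for their results.
        pool.map(crash, range(4))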

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day for our records to be updated.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

oesteban added a commit to oesteban/nipype that referenced this pull request Nov 9, 2018
This PR relates to nipy#2700, and should fix the problem
underlying nipy#2548.

I first considered adding a control thread that monitors
the `Pool` of workers, but that would add a large overhead,
keeping track of PIDs and polling very often.

Just adding the core file of [bpo-22393](python/cpython#10441)
should fix nipy#2548
Contributor

@effigies effigies left a comment


Just a couple comments, pending review from the cpython devs.

@oesteban
Author

oesteban commented Dec 3, 2018

Hi @pitrou (or anyone with a say), can you give us a hint about the fate of this PR (even if you honestly think it does not have a very promising future)?

Thanks

Member

@pitrou pitrou left a comment


Sorry for the delay @oesteban. I've made a couple of comments; you might want to address them.

Also, it seems you'll need to merge/rebase from master and fix any conflicts.

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests, along with any other requests in other reviews from core developers, that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

And if you don't make the requested changes, you will be poked with soft cushions!

@oesteban
Author

I have made the requested changes; please review again

@bedevere-bot

Thanks for making the requested changes!

@pitrou: please review the changes made to this pull request.

@oesteban
Author

oesteban commented Feb 7, 2019

Pinging @pitrou, at least to know whether the changes point in the right direction.

@pitrou pitrou changed the title bpo-22393: FIX multiprocessing.Pool hangs if a worker process dies unexpectedly bpo-22393: Fix multiprocessing.Pool hangs if a worker process dies unexpectedly Feb 7, 2019
@pitrou
Member

pitrou commented Feb 7, 2019

Sorry, will take a look again. Also @pablogsal, you may be interested in this.

@oesteban
Author

bumping up!

@oesteban
Author

oesteban commented May 22, 2019

Are there any plans for deprecating multiprocessing? Otherwise, I think this bug should be addressed.

If the proposed fix is not the right way of fixing it, please let me know. I'll resolve the conflicts only once I know there is interest in doing so.

Thanks very much

@pitrou
Member

pitrou commented May 22, 2019

@pierreglaser @tomMoral Would you like to take a look at this?

@pierreglaser
Contributor

Yes, I can have a look.

@tomMoral
Contributor

tomMoral commented May 23, 2019

I'll have a look too.

@oesteban
Author

@pitrou thanks for the prompt response!

Contributor

@pierreglaser pierreglaser left a comment


Here is a first review. @tomMoral's one should land sometime next week :)


class BrokenProcessPool(RuntimeError):
    """
    Raised when a process in a ProcessPoolExecutor terminated abruptly
Contributor


Maybe avoid using the terms ProcessPoolExecutor and future, which are objects of the concurrent.futures package, not the multiprocessing package.

util.debug('terminate pool entering')
is_broken = BROKEN in (task_handler._state,
                       worker_handler._state,
                       result_handler._state)

worker_handler._state = TERMINATE
Contributor


No need to use the _worker_state_lock here? And in other places where _worker_handler._state is manipulated?
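
For illustration only, the pattern being suggested might look like the fragment below; the name _worker_state_lock comes from this review comment, but where the lock lives is an assumption, not the PR's actual code:

# Hypothetical sketch: take the lock around every write of the handler's
# _state so the terminate path cannot race with the handler threads.
with _worker_state_lock:
    worker_handler._state = TERMINATE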

    util.debug('helping task handler/workers to finish')
    cls._help_stuff_finish(inqueue, task_handler, len(pool))
else:
    util.debug('finishing BROKEN process pool')
Contributor


What happens here if the task_handler is blocked, but we do not run _help_stuff_finish?


err = BrokenProcessPool(
    'A worker in the pool terminated abruptly.')
# Exhaust MapResult with errors
Contributor


This also applies to ApplyResult, right?

err = BrokenProcessPool(
    'A worker in the pool terminated abruptly.')
# Exhaust MapResult with errors
for i, cache_ent in list(self._cache.items()):
Contributor


Out of curiosity, is there any reason why we iterate over a list of self._cache?

@pablogsal
Member

There are multiple tests being added that use sleep to synchronize processes (in particular, they assume the processes will have entered the relevant code by the time the sleep finishes). This is very unreliable and will most certainly fail on the slowest buildbots.

Please try to add some synchronization to the tests to make them more deterministic.
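
For illustration (not part of the original comment), a sleep-based wait can usually be replaced with an explicit multiprocessing.Event passed through the pool initializer, which makes the test deterministic; the names below are hypothetical:

import os
import multiprocessing as mp

started = None

def init(evt):
    # Runs in each worker; stash the shared event in the worker's globals.
    global started
    started = evt

def crash():
    started.set()   # signal the test that the worker really entered the task
    os._exit(1)     # then die abruptly, as the test scenario requires

if __name__ == '__main__':
    evt = mp.Event()
    with mp.Pool(1, initializer=init, initargs=(evt,)) as pool:
        pool.apply_async(crash)
        # Deterministic wait instead of time.sleep(): proceed as soon as the
        # worker has signalled, with a generous timeout for slow buildbots.
        assert evt.wait(timeout=30)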

@tomMoral
Contributor

Note that this PR, while improving the current state of multiprocessing.Pool, is not a full solution, as it is still very easy to deadlock by calling sys.exit(0) in the submitted function:

import sys
from multiprocessing import Pool

pool = Pool(2)
pool.apply(sys.exit, (0,))

or at unpickling time:

import sys
from multiprocessing import Pool


class Failure:
    def __reduce__(self):
        return sys.exit, (0, )


pool = Pool(2)
pool.apply(id, (Failure(),))

Also, many other problems exist with multiprocessing.Pool, as you can easily deadlock it (choose from a failure to serialize/deserialize, flooding the queue with many tasks and one failure, or segfaulting with bad timing). I did some work to try to make it fault tolerant (see class _ReusablePool in this branch), but some design choices in the communication process make it tricky to fix all possible deadlocks/interpreter freezes.

Maybe a more stable solution would be to change Pool to rely on a concurrent.futures executor for the parallel computations (which is now far more stable IMO), keeping only the Pool API, so the maintenance burden is reduced to a single implementation of the parallel pool of workers.
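
For comparison (an illustration of mine, not from the comment above), concurrent.futures already detects an abruptly dying worker and fails fast with BrokenProcessPool instead of hanging:

import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def crash():
    # Kill the worker outright; note that sys.exit() would be caught by the
    # executor's worker loop and returned to the future as a SystemExit.
    os._exit(1)

if __name__ == '__main__':
    with ProcessPoolExecutor(2) as executor:
        future = executor.submit(crash)
        try:
            future.result()
        except BrokenProcessPool:
            print('executor noticed the dead worker instead of hanging')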


This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Apr 13, 2025
Labels
awaiting change review, stale (Stale PR or inactive for long period of time), topic-multiprocessing
Projects
None yet
10 participants