Add terminate_workers to ProcessPoolExecutor #128041

Closed
csm10495 opened this issue Dec 17, 2024 · 11 comments
Labels
stdlib (Python modules in the Lib dir) · topic-multiprocessing · type-feature (A feature request or enhancement)

Comments

@csm10495
Contributor

csm10495 commented Dec 17, 2024

Feature or enhancement

Proposal:

This is an interpretation of the feature request in https://discuss.python.org/t/cancel-running-work-in-processpoolexecutor/58605/1. It would be a way to stop all the workers running in a ProcessPoolExecutor:

from concurrent.futures import ProcessPoolExecutor

p = ProcessPoolExecutor()
# use p

# I know I want p to die at this point, no matter what
p.terminate_workers()

Previously, the way to do this was to loop through the ._processes of the ProcessPoolExecutor; preferably, this should be possible without accessing implementation details.
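
For context, a minimal sketch of that old workaround, assuming the private `_processes` attribute (a pid → `multiprocessing.Process` mapping that is an implementation detail and may change):

```python
from concurrent.futures import ProcessPoolExecutor
import time

if __name__ == "__main__":
    executor = ProcessPoolExecutor(max_workers=2)
    executor.submit(time.sleep, 3600)  # a task that effectively never finishes

    # Reach into the private pid -> multiprocessing.Process mapping and
    # terminate each worker directly. None of this is public API.
    for process in list(executor._processes.values()):
        process.terminate()
```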

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

https://discuss.python.org/t/cancel-running-work-in-processpoolexecutor/58605/1

Linked PRs

@csm10495 added the type-feature (A feature request or enhancement) label Dec 17, 2024
@picnixz added the stdlib (Python modules in the Lib dir) label Dec 17, 2024
csm10495 added a commit to csm10495/cpython that referenced this issue Dec 17, 2024
Provides a way to forcefully stop all the workers in the pool

Typically this would be used as a last resort to stop all workers when they cannot be shut down / joined in the expected way
gpshead pushed a commit that referenced this issue Mar 3, 2025
…essPoolExecutor (GH-128043)

This adds two new methods to `concurrent.futures`' `ProcessPoolExecutor`:
- **`terminate_workers()`**: forcefully terminates worker processes using `Process.terminate()`
- **`kill_workers()`**: forcefully kills worker processes using `Process.kill()`

These methods give users a direct way to stop worker processes without calling `shutdown()` or relying on implementation details, addressing situations where immediate termination is needed.

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Commit-message-mostly-authored-by: Claude Sonnet 3.7 (because why not -greg)
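
For illustration, a hypothetical usage sketch of the API described in the commit above (assumes Python 3.14+, where these methods exist; on POSIX, `Process.terminate()` sends SIGTERM and `Process.kill()` sends SIGKILL):

```python
from concurrent.futures import ProcessPoolExecutor
import time

if __name__ == "__main__":
    executor = ProcessPoolExecutor(max_workers=2)
    executor.submit(time.sleep, 3600)  # simulate a hung task

    # Forcefully stop every worker process; the executor should not be
    # used to submit new work after this.
    executor.terminate_workers()
    # executor.kill_workers() does the same, but via Process.kill().
```
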
@gpshead added the 3.14 (new features, bugs and security fixes) label Mar 3, 2025
@gpshead
Member

gpshead commented Mar 3, 2025

thanks for the feature contribution!

@gpshead gpshead closed this as completed Mar 3, 2025
@picnixz
Member

picnixz commented Mar 3, 2025

Re-opening because we have some buildbots failing (I'm not sure whether the macOS failure is related, but the other one seems to be):

https://github.com/python/cpython/actions/runs/13631446353/job/38099943353#step:11:686
https://buildbot.python.org/#/builders/568/builds/8398

@csm10495
Contributor Author

csm10495 commented Mar 3, 2025

Opened #130812 to try to fix the transient failures. I can't seem to add the skip-news label, but it should get that.

colesbury added a commit to colesbury/cpython that referenced this issue Mar 4, 2025
…ethods to ProcessPoolExecutor (pythonGH-128043)"

The test_concurrent_futures.test_process_pool test is failing in CI.

This reverts commit f97e409.
colesbury added a commit that referenced this issue Mar 4, 2025
… to ProcessPoolExecutor (GH-128043)" (#130838)

The test_concurrent_futures.test_process_pool test is failing in CI.

This reverts commit f97e409.
@colesbury
Contributor

@csm10495 - I reverted #128043 due to CI failures. Please open a new PR with the changes and fixes when you have a chance. I can help with testing if you'd like.

@csm10495
Contributor Author

csm10495 commented Mar 4, 2025

Thanks, @colesbury! I'll dig more to figure out the issue before opening a fresh PR.

@picnixz removed the 3.14 (new features, bugs and security fixes) label Mar 4, 2025
@csm10495
Contributor Author

csm10495 commented Mar 4, 2025

Current issue:

test_force_shutdown_workers_stops_pool (test.test_concurrent_futures.test_process_pool.ProcessPoolForkserverProcessPoolExecutorTest.test_force_shutdown_workers_stops_pool) ... 0.07s Warning -- reap_children() reaped child process 33328
...

0:00:09 load avg: 2.32 [1/1/1] test.test_concurrent_futures.test_process_pool failed (env changed)

== Tests result: ENV CHANGED ==

1 test altered the execution environment (env changed):
    test.test_concurrent_futures.test_process_pool

Total duration: 9.0 sec
Total tests: run=84 skipped=9
Total test files: run=1/1 env_changed=1
Result: ENV CHANGED

Interestingly enough, it was suggested that commenting out

            self.assertRaises(RuntimeError, executor.submit, time.sleep, 0)

makes the test pass. It does... most of the time. It still transiently fails the same way, with a leaked child process (though much more rarely).

So I guess the timing change caused by that line makes the failure more likely, but we're still dealing with a race condition.

@csm10495
Contributor Author

csm10495 commented Mar 4, 2025

I think I figured it out. Will open a PR in a few minutes.

csm10495 added a commit to csm10495/cpython that referenced this issue Mar 4, 2025
…o ProcessPoolExecutor

Add some fixes to the tests so they no longer fail transiently
@csm10495
Contributor Author

csm10495 commented Mar 4, 2025

Opened a new PR: #130849. I believe the transient issues are resolved.

@gpshead
Member

gpshead commented Mar 4, 2025

FWIW, on the original PR, while I suspected we might have flakiness issues... I was apparently way underestimating that. I should've run the buildbot tests before merging. Sorry for the churn! :)

gpshead pushed a commit that referenced this issue Mar 5, 2025
…essPoolExecutor (GH-130849)

This adds two new methods to `concurrent.futures`' `ProcessPoolExecutor`:
- **`terminate_workers()`**: forcefully terminates worker processes using `Process.terminate()`
- **`kill_workers()`**: forcefully kills worker processes using `Process.kill()`

These methods give users a direct way to stop worker processes without calling `shutdown()` or relying on implementation details, addressing situations where immediate termination is needed.

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Sam Gross @colesbury
Commit-message-mostly-authored-by: Claude Sonnet 3.7 (because why not -greg)
@gpshead
Member

gpshead commented Mar 5, 2025

merged again with the improvements. 🤞🏾

@csm10495
Contributor Author

csm10495 commented Mar 5, 2025

Tiny related fix: #130900 (docs only)

duaneg added a commit to duaneg/cpython that referenced this issue Mar 19, 2025
This bug is caused by race conditions in the poll implementations (which are
called by join/wait) where if multiple threads try to reap the dead process
only one "wins" and gets the exit code, while the others get an error.

In the forkserver implementation the losing thread(s) set the code to an error,
possibly overwriting the correct code set by the winning thread. This is
relatively easy to fix: we can just take a lock before waiting for the process,
since at that point we know the call should not block.

In the fork and spawn implementations the losers of the race return before the
exit code is set, meaning the process may still report itself as alive after
join returns. Fixing this is trickier as we have to support a mixture of
blocking and non-blocking calls to poll, and we cannot have the latter waiting
to take a lock held by the former.

The approach taken is to split the blocking and non-blocking call variants. The
non-blocking variant does its work with the lock held: since it won't block
this should be safe. The blocking variant releases the lock before making the
blocking operating system call. It then retakes the lock and either sets the
code if it wins or waits for a potentially racing thread to do so otherwise.

If a non-blocking call is racing with the unlocked part of a blocking call it
may still "lose" the race, and return None instead of the exit code, even
though the process is dead. However, as the process could be alive at the time
the call is made but die immediately afterwards, this situation should already
be handled by correctly written code.

To verify the behaviour a test is added which reliably triggers failures for
all three implementations. A work-around for this bug in a test added for
pythongh-128041 is also reverted.
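
For readers following along, an illustrative, simplified sketch of the locking pattern described in the commit message above (hypothetical class and method names, POSIX-only via `os.waitpid`; this is not the actual CPython implementation):

```python
import os
import threading

class _PopenSketch:
    """Simplified model of a popen-style process handle."""

    def __init__(self, pid):
        self.pid = pid
        self.returncode = None
        self._lock = threading.Lock()

    def poll_nonblocking(self):
        # Non-blocking variant: do all the work while holding the lock,
        # which is safe because os.WNOHANG never blocks.
        with self._lock:
            if self.returncode is None:
                try:
                    pid, status = os.waitpid(self.pid, os.WNOHANG)
                except ChildProcessError:
                    # Lost a race with the unlocked part of a blocking call;
                    # may return None even though the child is dead.
                    return self.returncode
                if pid == self.pid:
                    self.returncode = os.waitstatus_to_exitcode(status)
            return self.returncode

    def poll_blocking(self):
        # Blocking variant: release the lock around the blocking OS call so
        # non-blocking callers are never stuck behind it.
        with self._lock:
            if self.returncode is not None:
                return self.returncode
        try:
            pid, status = os.waitpid(self.pid, 0)
        except ChildProcessError:
            # A racing thread reaped the child first and sets the code.
            pid, status = None, None
        with self._lock:
            if self.returncode is None and pid == self.pid:
                self.returncode = os.waitstatus_to_exitcode(status)
            return self.returncode
```
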
duaneg added a commit to duaneg/cpython that referenced this issue Mar 19, 2025
seehwan pushed a commit to seehwan/cpython that referenced this issue Apr 16, 2025
…o ProcessPoolExecutor (pythonGH-128043)

seehwan pushed a commit to seehwan/cpython that referenced this issue Apr 16, 2025
…ethods to ProcessPoolExecutor (pythonGH-128043)" (python#130838)

The test_concurrent_futures.test_process_pool test is failing in CI.

This reverts commit f97e409.
seehwan pushed a commit to seehwan/cpython that referenced this issue Apr 16, 2025
…o ProcessPoolExecutor (pythonGH-130849)


4 participants