gh-132969: Fix error/hang when shutdown(wait=False) and task exited abnormally #133222

Merged
31 commits merged into python:main on Jun 10, 2025

ogbiggles
Contributor

@ogbiggles ogbiggles commented Apr 30, 2025

When shutdown() is called with wait=False, the executor thread keeps running even after the ProcessPoolExecutor's state has been reset. If the executor thread then comes across a worker that has died, it tries to replenish the pool of worker processes, which results in an error and a potential hang. The fix makes _adjust_process_count() return without doing anything if the ProcessPoolExecutor's state has been reset.

Added unit tests to validate two scenarios:
max_workers < num_tasks (exception)
max_workers > num_tasks (exception + hang)
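
The guard described above can be sketched with a toy model. This is a minimal illustration and not the CPython patch: `ToyPool`, `_spawn`, and `worker_died` are hypothetical names, and representing the reset state as `self._processes = None` is an assumption made for the sketch. No real processes are started.

```python
class ToyPool:
    """Hypothetical stand-in for an executor; not the CPython implementation."""

    def __init__(self, max_workers):
        self._max_workers = max_workers
        self._processes = {}   # worker id -> placeholder for a process handle
        self._next_id = 0
        self._adjust_process_count()

    def _spawn(self):
        # Placeholder for starting a worker process.
        self._processes[self._next_id] = object()
        self._next_id += 1

    def _adjust_process_count(self):
        # Guard: once shutdown() has reset the state there is no pool left
        # to replenish, so return without doing anything.
        if self._processes is None:
            return
        while len(self._processes) < self._max_workers:
            self._spawn()

    def worker_died(self, worker_id):
        # Assumed to be called by the still-running management thread
        # when it notices that a worker exited abnormally.
        if self._processes is not None:
            self._processes.pop(worker_id, None)
        self._adjust_process_count()

    def shutdown(self):
        # Assumption for this sketch: shutdown resets the state to None.
        self._processes = None


pool = ToyPool(max_workers=2)
pool.worker_died(0)   # before shutdown: the dead worker is replaced
pool.shutdown()
pool.worker_died(1)   # after shutdown: the guard turns this into a no-op
```

Without the early return in this sketch, the second `worker_died` call would reach `len(self._processes)` with `None` and raise `TypeError`, which is the toy analogue of the error described in the PR.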

@bedevere-app

bedevere-app bot commented Apr 30, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@python-cla-bot

python-cla-bot bot commented Apr 30, 2025

All commit authors signed the Contributor License Agreement.


@encukou
Member

encukou commented May 5, 2025

@hugovk: If this is indeed a blocker for the beta, +1 for me -- it shouldn't break anything.
For a proper review, I'd need more time to familiarize myself with this part of the code. (@gpshead, should I?)

But as a week-old issue that affects 3.12+, I think it's OK to defer it.

@itamaro: Did you intend this to block the beta?

@itamaro
Contributor

itamaro commented May 5, 2025

Did you intend this to block the beta?

Should have said it here, not only in Discord :)
I wasn't sure whether this should block the release, so I added the tag to make sure someone at least takes a look, considering beta 1 is imminent!

@hugovk
Member

hugovk commented May 5, 2025

Thanks all, it's been looked at and sounds like it can wait, so let's defer it.

@YvesDup
Contributor

YvesDup commented May 13, 2025

LGTM

@ogbiggles
Contributor Author

LGTM

Thank you @YvesDup for the thorough review of my first PR, much appreciated.

Member

@picnixz picnixz left a comment

I'm not very fond of a long NEWS, especially with the "A combination of ...", but it's the full CHANGELOG, so why not.

@YvesDup
Contributor

YvesDup commented May 22, 2025

@ogbiggles Would you apply the last suggestions?

@tabrezm

tabrezm commented Jun 3, 2025

Does this fix supersede this one?

@ogbiggles
Contributor Author

Does this fix supersede this one?

The two address different issues; the referenced PR will not prevent the error that I am fixing in this one.

@encukou encukou added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jun 9, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @encukou for commit 4f46875 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F133222%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jun 9, 2025
Member

@encukou encukou left a comment

Thank you! I started the buildbots; if they like the PR I'll merge it.

(Don't be alarmed if there are failures; not all buildbots are stable.)

@encukou encukou merged commit 598aa7c into python:main Jun 10, 2025
123 of 128 checks passed
@encukou encukou added needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels Jun 10, 2025
@miss-islington-app

Thanks @ogbiggles for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖

@miss-islington-app

Thanks @ogbiggles for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.14.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jun 10, 2025
…ited abnormally (pythonGH-133222)

(cherry picked from commit 598aa7c)

Co-authored-by: Ajay Kamdar <140011370+ogbiggles@users.noreply.github.com>
@bedevere-app

bedevere-app bot commented Jun 10, 2025

GH-135343 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Jun 10, 2025
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jun 10, 2025
…ited abnormally (pythonGH-133222)

(cherry picked from commit 598aa7c)

Co-authored-by: Ajay Kamdar <140011370+ogbiggles@users.noreply.github.com>
@bedevere-app

bedevere-app bot commented Jun 10, 2025

GH-135344 is a backport of this pull request to the 3.14 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.14 bugs and security fixes label Jun 10, 2025
9 participants