
gh-130895: fix multiprocessing.Process join/wait/poll races #131440

Open: 2 commits to merge into base branch main

@duaneg duaneg commented Mar 19, 2025

This bug is caused by race conditions in the poll implementations (which are called by join/wait): if multiple threads try to reap the dead process, only one "wins" and gets the exit code, while the others get an error.
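A minimal sketch of the one-shot behaviour behind the race, not the code from this PR. `FakeChild` and `try_reap` are hypothetical names; the fake stands in for an operating system child whose status can be collected only once (as with `os.waitpid`, which raises `ChildProcessError` once the child has been reaped).

```python
class FakeChild:
    """Hypothetical stand-in for an exited child whose status can be collected once."""

    def __init__(self, exit_status):
        self._exit_status = exit_status
        self._reaped = False

    def wait(self):
        # Like os.waitpid(): waiting on an already-reaped child fails.
        if self._reaped:
            raise ChildProcessError("child already reaped")
        self._reaped = True
        return self._exit_status


def try_reap(child):
    """Return the exit status, or None if another caller reaped the child first."""
    try:
        return child.wait()
    except ChildProcessError:
        return None


child = FakeChild(exit_status=0)
winner = try_reap(child)  # the first caller collects the status
loser = try_reap(child)   # the second caller finds nothing left to collect
```

Here the two callers run one after the other for clarity; in the real bug they are concurrent threads, and the loser's result is what goes wrong.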

In the forkserver implementation the losing thread(s) set the exit code to an error value, possibly overwriting the correct code set by the winning thread. This is relatively easy to fix: we can just take a lock before waiting for the process, since at that point we know the call should not block.
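The lock-then-recheck idea can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: `SketchPopen`, `finish_poll` and `read_once` are hypothetical names, the status reader is assumed to succeed once and fail afterwards, and `255` is only an illustrative error value.

```python
import threading


class SketchPopen:
    """Hypothetical, simplified popen object; not the real forkserver class."""

    def __init__(self, read_status):
        self._read_status = read_status  # assumed: succeeds once, then raises EOFError
        self._lock = threading.Lock()
        self.returncode = None

    def finish_poll(self):
        """Call only once the child is known to have exited, so reading cannot block."""
        with self._lock:
            # Re-check under the lock: a racing thread may already have stored the code.
            if self.returncode is None:
                try:
                    self.returncode = self._read_status()
                except (OSError, EOFError):
                    self.returncode = 255  # illustrative error value
        return self.returncode


statuses = [7]


def read_once():
    # One-shot status source: the second read has nothing left to return.
    if not statuses:
        raise EOFError("status already consumed")
    return statuses.pop()


popen = SketchPopen(read_once)
results = []


def worker():
    results.append(popen.finish_poll())


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the read happens under the lock and only while `returncode` is still unset, a losing thread never reaches the error path and so cannot overwrite the winner's value.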

In the fork and spawn implementations the losers of the race return before the exit code is set, meaning the process may still report itself as alive after join returns. Fixing this is trickier, as we have to support a mixture of blocking and non-blocking calls to poll, and we cannot have the latter waiting to take a lock held by the former.

The approach taken is to split the blocking and non-blocking call variants. The non-blocking variant does its work with the lock held: since it won't block, this should be safe. The blocking variant releases the lock before making the blocking operating system call. It then retakes the lock and either sets the exit code if it won the race, or otherwise waits for the racing thread that did win to set it.
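One way to picture the split, as a sketch under assumptions rather than the PR's actual code: `FakeChild` simulates a child that can be reaped once, and `SketchPopen.poll` / `SketchPopen.wait` are hypothetical names for the non-blocking and blocking variants. A `threading.Condition` serves as both the lock and the means for a loser to wait for the winner.

```python
import threading


class FakeChild:
    """Hypothetical child process: exits once, and can be reaped once."""

    def __init__(self):
        self._exited = threading.Event()
        self._guard = threading.Lock()
        self._status = None
        self._reaped = False

    def exit(self, status):
        self._status = status
        self._exited.set()

    def wait(self, block):
        if block:
            self._exited.wait()
        with self._guard:
            if self._reaped:
                raise ChildProcessError("child already reaped")
            if not self._exited.is_set():
                return None  # still running (non-blocking call only)
            self._reaped = True
            return self._status


class SketchPopen:
    def __init__(self, child):
        self._child = child
        self._cond = threading.Condition()
        self.returncode = None

    def poll(self):
        # Non-blocking variant: all work is done with the lock held.
        with self._cond:
            if self.returncode is None:
                try:
                    status = self._child.wait(block=False)
                except ChildProcessError:
                    status = None  # lost to a blocking caller that has not stored the code yet
                if status is not None:
                    self.returncode = status
                    self._cond.notify_all()
            return self.returncode

    def wait(self):
        # Blocking variant: the lock is not held across the blocking call.
        with self._cond:
            if self.returncode is not None:
                return self.returncode
        try:
            status = self._child.wait(block=True)
        except ChildProcessError:
            status = None  # another thread won the race
        with self._cond:
            if status is not None:
                self.returncode = status
                self._cond.notify_all()
            else:
                while self.returncode is None:
                    self._cond.wait()  # wait for the winner to store the code
            return self.returncode


child = FakeChild()
popen = SketchPopen(child)
results = []
threads = [threading.Thread(target=lambda: results.append(popen.wait())) for _ in range(4)]
for t in threads:
    t.start()
child.exit(5)
for t in threads:
    t.join()
```

Every blocking caller returns only after the exit code is stored, so none of them can observe the process as still alive once `wait` has returned.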

If a non-blocking call is racing with the unlocked part of a blocking call it may still "lose" the race and return None instead of the exit code, even though the process is dead. However, a process can always be alive at the time the call is made and die immediately afterwards, so correctly written code should already handle this situation.
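As an illustration of what "correctly written" calling code might look like, here is a small sketch that treats `None` as "not known yet" and retries until a deadline. `wait_for_exit_code` is a hypothetical helper, and `poll` is any zero-argument callable returning an exit code or `None`; neither is part of multiprocessing.

```python
import time


def wait_for_exit_code(poll, timeout=5.0, interval=0.01):
    """Call poll() until it returns an exit code or the timeout expires.

    Returns the exit code, or None if none was reported in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        code = poll()
        if code is not None:
            return code
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Simulated non-blocking poll that loses the race twice before the code appears.
answers = iter([None, None, 0])
code = wait_for_exit_code(lambda: next(answers))
```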

To verify the behaviour, a test is added which reliably triggers failures for all three implementations without the fix. A work-around for this bug in a test added for gh-128041 is also reverted.

@duaneg duaneg requested a review from gpshead as a code owner March 19, 2025 02:06

ghost commented Mar 19, 2025

All commit authors signed the Contributor License Agreement.


bedevere-app bot commented Mar 19, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.
