Skip to content

bpo-38323, asyncio: Fix MultiLoopChildWatcher race condition #26536

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

bpo-38323, asyncio: Fix MultiLoopChildWatcher race condition #26536

wants to merge 1 commit into from

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Jun 4, 2021

Fix a race condition in asyncio MultiLoopChildWatcher: call the loop
add_signal_handler() rather than calling directly signal.signal(), so
signal.set_wakeup_fd() is used to wake up the event loop when the
SIGCHLD signal is received. Previously, the event loop could block
forever if the signal was received while the loop was idle. The
signal was only proceed when another event woke up the event loop.

https://bugs.python.org/issue38323

Fix a race condition in asyncio MultiLoopChildWatcher: call the loop
add_signal_handler() rather than calling directly signal.signal(), so
signal.set_wakeup_fd() is used to wake up the event loop when the
SIGCHLD signal is received. Previously, the event loop could block
forever if the signal was received while the loop was idle. The
signal was only proceed when another event woke up the event loop.
@cjerdonek
Copy link
Member

@vstinner This removes the purpose of having a separate MultiLoopChildWatcher class and makes it no longer match its name. It makes it the same as SafeChildWatcher. If you want to go this route, you are better off deprecating / removing the class in favor of SafeChildWatcher.

@vstinner
Copy link
Member Author

vstinner commented Jun 4, 2021

Without this PR, test_asyncio is less than 1 minute on my laptop (8 logical CPUs, 4 physical CPUs) with the command:

./python -u -m test test_asyncio -m test.test_asyncio.test_subprocess.SubprocessMultiLoopWatcherTests.test_cancel_make_subprocess_transport_exec -F --timeout=10.0 -j20

With this PR, the test no longer hangs. I ran it for 30 minutes (8332 runs).

I marked this PR as a draft since this PR changes the MultiLoopChildWatcher design. Currently, MultiLoopChildWatcher documentation says:

A watcher that doesn't require running loop in the main thread.

With this change, MultiLoopChildWatcher looks like a clone of SafeChildWatcher.

IMHO MultiLoopChildWatcher is broken by design. You must connect the signal handler to an event loop to be able to wake up the event loop when a signal is received, using signal.set_wakeup_fd().

At least, test_cancel_make_subprocess_transport_exec() is wrong: it expects that the event loop will be awaken when child process completes, wheras SIGCHLD currently cannot wake up the event loop, since nothing links the signal handler and the event loop. Sometimes, the Python signal handler is called by luck only because the signal is received while the event loop has pending events which wake up the event loop which indirectly calls the Python signal handler in Python/ceval.c.

Moreover, as I already wrote 2 years ago, MultiLoopChildWatcher is incompatible with the PEP 475 design: https://bugs.python.org/issue38323#msg355159 A signal handler cannot "wake up" Python code if it doesn't raise an exception.

In short, I suggest to remove MultiLoopChildWatcher, rather than trying to fix it.

@1st1 @asvetlov @aeros @cjerdonek: What do you think?

@pablogsal
Copy link
Member

Having a race condition in the test suite is raising so many false positives and is just a pain for automated systems and people watching the buildbots. We must prioritize fixing this issue.

@cjerdonek
Copy link
Member

cjerdonek commented Jun 4, 2021

I'm okay with removing it, but I think Andrew should weigh in since I think the class was his idea / implementation.

One approach would be to deprecate the class and skip all of its tests. That would address the buildbot issue and without changing the behavior of MultiLoopChildWatcher in a fundamental way.

@vstinner
Copy link
Member Author

vstinner commented Jun 4, 2021

I'm okay with removing it, but I think Andrew should weigh in since I think the class was his idea / implementation.

This bug affects 3.9 and 3.10 branches. I don't think that we can remove a class in a stable branche.

I propose to merge this fix (or a similar fix) in all branches, deprecate the class, and only remove it later (2 releases later).

If we fix the class, it becomes less urgent to remove it.

@pablogsal
Copy link
Member

I propose to merge this fix (or a similar fix) in all branches, deprecate the class, and only remove it later (2 releases later).

I am confused, why is this PR backwards compatible if this is now adding the requirement that "A watcher that requires running loop in the main thread" while before that was not a requirement?

@vstinner
Copy link
Member Author

vstinner commented Jun 4, 2021

I am confused, why is this PR backwards compatible if this is now adding the requirement that "A watcher that requires running loop in the main thread" while before that was not a requirement?

I'm no longer working on/using asyncio, I cannot say what is the best trade-off:

  • (A) Change MultiLoopChildWatcher contract (require a event loop running in the main thread).
  • (B) Remove the class: this is way more likely to break backward compatibility than (A). I don't think that it's acceptable for the 3.9 branch.
  • (C) Do nothing: keep the race condition, but skip unit tests.

We can take a different decisions depending on the branch. (C) is acceptable for 3.9 branch. (A) and (B) can be considered for 3.10 since 3.10.0 final is not released yet.

I don't recall if it's important that the event loop runs in the main thread, or if it works if the event loop runs in another thread. Does someone know? The answer matters a lot :-)

@cjerdonek
Copy link
Member

cjerdonek commented Jun 4, 2021

I am confused, why is this PR backwards compatible if this is now adding the requirement that "A watcher that requires running loop in the main thread" while before that was not a requirement?

I agree. In my (edited) comment above, I suggested the following approach:

One approach would be to deprecate the class and skip all of its tests. That would address the buildbot issue and without changing the behavior of MultiLoopChildWatcher in a fundamental way.

@pablogsal
Copy link
Member

Well, meanwhile we discuss this issue, I am going to open a PR to just remove the test so we can backport it

@pablogsal
Copy link
Member

Well, meanwhile we discuss this issue, I am going to open a PR to just remove the test so we can backport it

Apparently this is not enough, other tests have the same problem:

#26541 (comment)

@pablogsal
Copy link
Member

Seems that we either need to remove all tests or proceed with VIctor's PR

@cjerdonek
Copy link
Member

Apparently this is not enough, other tests have the same problem:

Correct, you need to skip or remove all the tests of that class. Can you skip them without removing them?

@cjerdonek
Copy link
Member

@pablogsal
Copy link
Member

Correct, you need to skip or remove all the tests of that class. Can you skip them without removing them?

Right, what I was saying is that I was more comfortable skipping/removing one test, but removing all of them sounds quite suboptinmal.

@pablogsal
Copy link
Member

pablogsal commented Jun 4, 2021

Ok, I opened #26542 for the skip of the class. Let's go with that meanwhile we decide what to do in general, we can reactivate the tests if we decide to fix/deprecate the class.

@pablogsal
Copy link
Member

@vstinner I'm landing I'm landing #26542 as a temporary solution meanwhile we decide what to do with this PR. Could you also remove the skip in this PR so the tests are reactivated once is merged?

Copy link
Member

@cjerdonek cjerdonek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest converting this PR into a draft that deprecates MultiLoopChildWatcher as I think that is something more likely to be approved, or else make a separate PR that does that.

@shreyanavigyan
Copy link
Contributor

How about installing the handler for SIGCHLD for all event loops? I've implemented that in my fork and running,

./python -u -m test test_asyncio -m SubprocessMultiLoopWatcherTests -F --timeout=20.0 -j20

works really well. The patch does the following -

  1. Creates a list that will keep track of the event loops where the handler has been installed by MultiLoopChildWatcher.
  2. In attach_loop and add_child_handler, the handler is installed for the loop using add_signal_handler and the event loop instance is also added to the list if it is not already done.
  3. In close, the remove_signal_handler is called on every loop and the loop is removed from the list.

I've committed the changes in my fork - shreyanavigyan@90c0e81.

I've prepared the patch by keeping this phrase in mind - "We have to support multiple event loops. That's why MultiLoopChildWatcher is used instead of SafeChildWatcher". I don't know if there's any other thing I might be missing.

@cjerdonek
Copy link
Member

@shreyanavigyan I don't think what you've done will work for a few different reasons. For example, you are changing add_child_handler() to call attach_loop(), so attach_loop() is no longer guaranteed to be called only from the main thread. Also, you changed attach_loop() to call loop.add_signal_handler(), but loop.add_signal_handler() can only be called from the main thread.

@cjerdonek
Copy link
Member

@shreyanavigyan PS - if you have code to suggest, it's better to file a separate PR and discuss it there.

@shreyanavigyan
Copy link
Contributor

@cjerdonek I've opened a new PR as you suggested - #26574

@vstinner
Copy link
Member Author

https://bugs.python.org/issue38323 issue was fixed by skipping the test (GH-26542), I close my PR.

@vstinner vstinner closed this Jun 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants