-
-
Notifications
You must be signed in to change notification settings - Fork 31.8k
bpo-38323, asyncio: Fix MultiLoopChildWatcher race condition #26536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fix a race condition in asyncio MultiLoopChildWatcher: call the loop add_signal_handler() rather than calling directly signal.signal(), so signal.set_wakeup_fd() is used to wake up the event loop when the SIGCHLD signal is received. Previously, the event loop could block forever if the signal was received while the loop was idle. The signal was only proceed when another event woke up the event loop.
@vstinner This removes the purpose of having a separate |
Without this PR, test_asyncio is less than 1 minute on my laptop (8 logical CPUs, 4 physical CPUs) with the command:
With this PR, the test no longer hangs. I ran it for 30 minutes (8332 runs). I marked this PR as a draft since this PR changes the MultiLoopChildWatcher design. Currently, MultiLoopChildWatcher documentation says:
With this change, MultiLoopChildWatcher looks like a clone of SafeChildWatcher. IMHO MultiLoopChildWatcher is broken by design. You must connect the signal handler to an event loop to be able to wake up the event loop when a signal is received, using signal.set_wakeup_fd(). At least, test_cancel_make_subprocess_transport_exec() is wrong: it expects that the event loop will be awaken when child process completes, wheras SIGCHLD currently cannot wake up the event loop, since nothing links the signal handler and the event loop. Sometimes, the Python signal handler is called by luck only because the signal is received while the event loop has pending events which wake up the event loop which indirectly calls the Python signal handler in Moreover, as I already wrote 2 years ago, MultiLoopChildWatcher is incompatible with the PEP 475 design: https://bugs.python.org/issue38323#msg355159 A signal handler cannot "wake up" Python code if it doesn't raise an exception. In short, I suggest to remove MultiLoopChildWatcher, rather than trying to fix it. @1st1 @asvetlov @aeros @cjerdonek: What do you think? |
Having a race condition in the test suite is raising so many false positives and is just a pain for automated systems and people watching the buildbots. We must prioritize fixing this issue. |
I'm okay with removing it, but I think Andrew should weigh in since I think the class was his idea / implementation. One approach would be to deprecate the class and skip all of its tests. That would address the buildbot issue and without changing the behavior of |
This bug affects 3.9 and 3.10 branches. I don't think that we can remove a class in a stable branche. I propose to merge this fix (or a similar fix) in all branches, deprecate the class, and only remove it later (2 releases later). If we fix the class, it becomes less urgent to remove it. |
I am confused, why is this PR backwards compatible if this is now adding the requirement that "A watcher that requires running loop in the main thread" while before that was not a requirement? |
I'm no longer working on/using asyncio, I cannot say what is the best trade-off:
We can take a different decisions depending on the branch. (C) is acceptable for 3.9 branch. (A) and (B) can be considered for 3.10 since 3.10.0 final is not released yet. I don't recall if it's important that the event loop runs in the main thread, or if it works if the event loop runs in another thread. Does someone know? The answer matters a lot :-) |
I agree. In my (edited) comment above, I suggested the following approach:
|
Well, meanwhile we discuss this issue, I am going to open a PR to just remove the test so we can backport it |
Apparently this is not enough, other tests have the same problem: |
Seems that we either need to remove all tests or proceed with VIctor's PR |
Correct, you need to skip or remove all the tests of that class. Can you skip them without removing them? |
You can skip classes, FYI: https://docs.python.org/3/library/unittest.html#skipping-tests-and-expected-failures |
Right, what I was saying is that I was more comfortable skipping/removing one test, but removing all of them sounds quite suboptinmal. |
Ok, I opened #26542 for the skip of the class. Let's go with that meanwhile we decide what to do in general, we can reactivate the tests if we decide to fix/deprecate the class. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest converting this PR into a draft that deprecates MultiLoopChildWatcher
as I think that is something more likely to be approved, or else make a separate PR that does that.
How about installing the handler for SIGCHLD for all event loops? I've implemented that in my fork and running,
works really well. The patch does the following -
I've committed the changes in my fork - shreyanavigyan@90c0e81. I've prepared the patch by keeping this phrase in mind - "We have to support multiple event loops. That's why MultiLoopChildWatcher is used instead of SafeChildWatcher". I don't know if there's any other thing I might be missing. |
@shreyanavigyan I don't think what you've done will work for a few different reasons. For example, you are changing |
@shreyanavigyan PS - if you have code to suggest, it's better to file a separate PR and discuss it there. |
@cjerdonek I've opened a new PR as you suggested - #26574 |
https://bugs.python.org/issue38323 issue was fixed by skipping the test (GH-26542), I close my PR. |
Fix a race condition in asyncio MultiLoopChildWatcher: call the loop
add_signal_handler() rather than calling directly signal.signal(), so
signal.set_wakeup_fd() is used to wake up the event loop when the
SIGCHLD signal is received. Previously, the event loop could block
forever if the signal was received while the loop was idle. The
signal was only proceed when another event woke up the event loop.
https://bugs.python.org/issue38323