-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
py/scheduler,rp2: Avoid scheduler race conditions when GIL disabled, fix "TinyUSB callback can't recurse" error #15448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
py/scheduler,rp2: Avoid scheduler race conditions when GIL disabled, fix "TinyUSB callback can't recurse" error #15448
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #15448 +/- ##
=======================================
Coverage 98.43% 98.44%
=======================================
Files 163 163
Lines 21294 21296 +2
=======================================
+ Hits 20960 20964 +4
+ Misses 334 332 -2 ☔ View full report in Codecov by Sentry. |
Code size report:
|
215791a
to
c902051
Compare
I guess that was introduced in bcbdee2 ? Prior to that commit, tinyusb was guarded so it ran only on core 0. |
I think that commit set up the conditions for the bug. TinyUSB still only runs on CPU0 as of this commit because that's where the scheduler runs it. However, that commit introduces a different bug - waiting for data in places where the scheduler doesn't run will hang. Commit a00c9d5 fixed that by re-introducing a TinyUSB task hook, but without the CPU check. |
Oh wait, I have this wrong don't I? mp_handle_pending() can run on either CPU... I didn't realise this... (The bug still doesn't happen in that commit because the scheduler has a lock, but I think that this fix is incomplete as it doesn't handle the case where mp_usb_task is scheduled to run, the scheduler runs on CPU1, and simultaneously is polled from code running on CPU0...) |
Correct. Eg called from the VM itself.
Are you saying this PR is not correct ("this fix is incomplete"), or are you referring to the commit a00c9d5 not being fully correct? |
This PR isn't a complete fix. I think there's still a race where:
The simple fix is to always guard the scheduled mp_usb_task to not run on CPU1, but because mp_usbd_task is run on-demand then there needs to be a way to re-schedule mp_usbd_task to run later, and keep retrying until it runs on CPU0. I have a fix for this here now, but I'm not sure about it (also requires a scheduler change). Or, add proper locking around mp_usbd_task so it can run on either CPU. |
I've pushed two commits which are a possible fix for this, but I don't like it much. I think the smart fix here might be to revert bcbdee2 and re-introduce the dedicated TinyUSB task hook for CPU0, instead of calling TinyUSB via the scheduler. If the performance hit of that change is high then can keep the DCD event handler function and use it to set a flag that we check before calling into TinyUSB. |
We need to think about this in depth, because all other bare-metal ports that use TinyUSB need to do the same thing. |
Agree about thinking it through. Currently we have rp2 and samd using the scheduler approach, and all of the other ports using a task hook. The queue-via-scheduler approach should work fine on all the single core bare metal ports. It's only made complex in the dual core case where it's possible for both CPUs to be trying to call into the TinyUSB task at once. Will think about it some more. Maybe adding another recursive mutex on rp2 is the way to go. |
@projectgus and @jimmo and I discussed this and decided to fix it by only allowing scheduled functions to run on the main thread. See related #8838 and pybricks@b196c78 |
I've implemented this "scheduler only on the main thread" change and it's working but I don't think it's a good solution after all. See this branch: It's a problem for ports with GIL and multiple threads (i.e. ESP32). The
Going to take another look at this next week. @jimmo mentioned the idea of having a flag so some scheduler nodes only run on the main thread, this might be the best compromise. Otherwise maybe implementing a mutex for TinyUSB (with some caution around ensuring race conditions can't lose TinyUSB events despite the mutex). |
Hmmm, OK, I'll also have a think about this. We do need to try and reduce the latency of scheduled callbacks, because they are being used more and more for background workloads. |
76c025b
to
a0e1d30
Compare
Finally got a version of this fix I'm happy with, after @dpgeorge pointed out that this problem only exists where GIL is disabled. (An aside, I also noticed that there is no user-facing documentation for the GIL at all. I made a note to submit something.) |
a0e1d30
to
4732969
Compare
@dlech This may be of interest, although I'm unsure if its equivalent to pybricks@b196c78 or not (specifically, I don't know if pybricks enables the GIL or not). |
Yeah, we are using that patch on the Unix port with the GIL enabled. So it looks like this isn't quite the same. The problems were were trying to solve are A) Unix signal handlers should always run on the main thread, like in CPython and B) if there is an unhandled exception in any background task, we want it to stop the whole program by throwing in the main thread rather than throwing in a random thread. |
@dlech That makes sense, thanks for the extra detail. |
OK, so PyBrick's use case is more about functionality, rather than fixing a bug. And because it's on the unix port there's less concern about loss of performance / increase latency of scheduled code. I have tested this latest PR on rp2 and it fixes the original report in #15390. |
4732969
to
54df502
Compare
54df502
to
cf8d5f4
Compare
Otherwise it's very difficult to reason about thread safety in a scheduler callback, as it can run at any time on any thread - including racing against any bytecode operation on any thread. This work was funded through GitHub Sponsors. Signed-off-by: Angus Gratton <angus@redyak.com.au>
If GIL is disabled then there's threat of a race condition if some other code specifically requests USB processing (i.e. to unblock stdio), while a scheduled TinyUSB callback is already running on another thread. Relies on the change in the parent commit, where scheduler is restricted to main thread if GIL is disabled. Fixes micropython#15390 - "TinyUSB callback can't recurse" exceptions on rp2 when using _thread module and USB serial I/O. Adds a unit test for stdin functioning correctly in threads (fails on rp2 port without this fix). This work was funded through GitHub Sponsors. Signed-off-by: Angus Gratton <angus@redyak.com.au>
cf8d5f4
to
5d8878b
Compare
This turned out to be a relatively small patch, which is great. Thank you! |
Summary
Looks like there was a bug since v1.22 when using rp2 threads, where either CPU may poll CDC input and trigger the TinyUSB task. TinyUSB could run on both CPUs concurrently, which may have lead to some incorrect behaviour.
The race started triggering an exception when runtime USB support was added, and a check was added for the USB task recursing on itself from a Python handler function. This check can be incorrectly triggered by the race (as the flag that indicates the TinyUSB task is running is set by the other CPU).
The race is most commonly triggered when working from the interactive REPL, even a minimal running thread can trigger it. This commit adds a test case that triggers it in a different way (polling stdin from a thread).
The root cause is that the scheduler on threaded ports without GIL can run on any thread, and creates potential for race conditions that are hard to avoid. So the first fix is to run all scheduler callbacks on the main thread if GIL disabled.
The secondary fix is not to run the TinyUSB task when polled from another task, if the GIL is disabled. Instead we schedule the TinyUSB task to run on the main thread.
Closes #15390.
Trade-offs and Alternatives
Could make the scheduler always only run on the main thread, but with the GIL this has quite an impact on scheduler latency - and isn't necessary for correct behaviour. There is a branch linked from a comment below that mostly works around this, but it's a more complex change (and will still have higher latency than running the callback immediately in whatever thread is currently holding the GIL).
Testing
MICROPY_HW_ENABLE_USB_RUNTIME_DEVICE
set, verified no longer raises an exception.This work was funded through GitHub Sponsors.