py/scheduler,rp2: Avoid scheduler race conditions when GIL disabled, fix "TinyUSB callback can't recurse" error #15448

projectgus · 2024-07-11T02:44:00Z

Summary

Looks like there was a bug since v1.22 when using rp2 threads, where either CPU may poll CDC input and trigger the TinyUSB task. TinyUSB could run on both CPUs concurrently, which may have lead to some incorrect behaviour.

The race started triggering an exception when runtime USB support was added, and a check was added for the USB task recursing on itself from a Python handler function. This check can be incorrectly triggered by the race (as the flag that indicates the TinyUSB task is running is set by the other CPU).

The race is most commonly triggered when working from the interactive REPL, even a minimal running thread can trigger it. This commit adds a test case that triggers it in a different way (polling stdin from a thread).

The root cause is that the scheduler on threaded ports without GIL can run on any thread, and creates potential for race conditions that are hard to avoid. So the first fix is to run all scheduler callbacks on the main thread if GIL disabled.

The secondary fix is not to run the TinyUSB task when polled from another task, if the GIL is disabled. Instead we schedule the TinyUSB task to run on the main thread.

Closes #15390.

Trade-offs and Alternatives

Could make the scheduler always only run on the main thread, but with the GIL this has quite an impact on scheduler latency - and isn't necessary for correct behaviour. There is a branch linked from a comment below that mostly works around this, but it's a more complex change (and will still have higher latency than running the callback immediately in whatever thread is currently holding the GIL).

Testing

Ran all thread tests, including the new test case, on rp2 port with and without MICROPY_HW_ENABLE_USB_RUNTIME_DEVICE set, verified no longer raises an exception.
Ran all thread tests on esp32 port, verified no regression when GIL enabled.

This work was funded through GitHub Sponsors.

ports/rp2/mphalport.h

codecov · 2024-07-11T02:47:23Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.44%. Comparing base (451ba1c) to head (5d8878b).
Report is 2 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #15448   +/-   ##
=======================================
  Coverage   98.43%   98.44%           
=======================================
  Files         163      163           
  Lines       21294    21296    +2     
=======================================
+ Hits        20960    20964    +4     
+ Misses        334      332    -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

github-actions · 2024-07-11T02:50:35Z

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:   +16 +0.002% standard
      stm32:    +0 +0.000% PYBV10
     mimxrt:    +0 +0.000% TEENSY40
        rp2:   +40 +0.004% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS

dpgeorge · 2024-07-11T04:45:21Z

Looks like there was a long standing underlying bug here when using rp2 threads, where either CPU may poll CDC input and trigger the TinyUSB task. TinyUSB could run on both CPUs concurrently, which may have lead to some incorrect behaviour.

I guess that was introduced in bcbdee2 ? Prior to that commit, tinyusb was guarded so it ran only on core 0.

projectgus · 2024-07-11T04:57:56Z

I guess that was introduced in bcbdee2 ? Prior to that commit, tinyusb was guarded so it ran only on core 0.

I think that commit set up the conditions for the bug. TinyUSB still only runs on CPU0 as of this commit because that's where the scheduler runs it. However, that commit introduces a different bug - waiting for data in places where the scheduler doesn't run will hang. Commit a00c9d5 fixed that by re-introducing a TinyUSB task hook, but without the CPU check.

projectgus · 2024-07-11T05:01:42Z

TinyUSB still only runs on CPU0 as of this commit because that's where the scheduler runs it.

Oh wait, I have this wrong don't I? mp_handle_pending() can run on either CPU... I didn't realise this...

(The bug still doesn't happen in that commit because the scheduler has a lock, but I think that this fix is incomplete as it doesn't handle the case where mp_usb_task is scheduled to run, the scheduler runs on CPU1, and simultaneously is polled from code running on CPU0...)

dpgeorge · 2024-07-11T05:31:55Z

mp_handle_pending() can run on either CPU

Correct. Eg called from the VM itself.

The bug still doesn't happen in that commit because the scheduler has a lock, but I think that this fix is incomplete as it doesn't handle the case where mp_usb_task is scheduled to run, the scheduler runs on CPU1, and simultaneously is polled from code running on CPU0...)

Are you saying this PR is not correct ("this fix is incomplete"), or are you referring to the commit a00c9d5 not being fully correct?

projectgus · 2024-07-11T05:44:38Z

Are you saying this PR is not correct ("this fix is incomplete")

This PR isn't a complete fix. I think there's still a race where:

CPU0 is calling poll_cdc_interfaces() which runs mp_usbd_task (allowed by the CPU0 check)
CPU1 runs the scheduler and this runs mp_usbd_task

The simple fix is to always guard the scheduled mp_usb_task to not run on CPU1, but because mp_usbd_task is run on-demand then there needs to be a way to re-schedule mp_usbd_task to run later, and keep retrying until it runs on CPU0. I have a fix for this here now, but I'm not sure about it (also requires a scheduler change).

Or, add proper locking around mp_usbd_task so it can run on either CPU.

projectgus · 2024-07-11T06:03:05Z

I've pushed two commits which are a possible fix for this, but I don't like it much.

I think the smart fix here might be to revert bcbdee2 and re-introduce the dedicated TinyUSB task hook for CPU0, instead of calling TinyUSB via the scheduler. If the performance hit of that change is high then can keep the DCD event handler function and use it to set a flag that we check before calling into TinyUSB.

dpgeorge · 2024-07-11T06:22:55Z

We need to think about this in depth, because all other bare-metal ports that use TinyUSB need to do the same thing.

projectgus · 2024-07-11T06:32:16Z

We need to think about this in depth, because all other bare-metal ports that use TinyUSB need to do the same thing.

Agree about thinking it through. Currently we have rp2 and samd using the scheduler approach, and all of the other ports using a task hook.

The queue-via-scheduler approach should work fine on all the single core bare metal ports. It's only made complex in the dual core case where it's possible for both CPUs to be trying to call into the TinyUSB task at once.

Will think about it some more. Maybe adding another recursive mutex on rp2 is the way to go.

dpgeorge · 2024-08-02T01:50:18Z

@projectgus and @jimmo and I discussed this and decided to fix it by only allowing scheduled functions to run on the main thread. See related #8838 and pybricks@b196c78

projectgus · 2024-09-05T04:33:51Z

decided to fix it by only allowing scheduled functions to run on the main thread

I've implemented this "scheduler only on the main thread" change and it's working but I don't think it's a good solution after all. See this branch:
https://github.com/projectgus/micropython/commits/bugfix/rp2_tinyusb_recursive_mainthread/

It's a problem for ports with GIL and multiple threads (i.e. ESP32). The stress_schedule.py test really falls over, and I think real workloads will also suffer:

vm.c:1358 only bounces the GIL if the scheduler is idle, not pending, which is a problem if only main thread can run the scheduler and another thread is running. Changing this to also bounce the GIL if pending on a non-main thread (similar to the pybricks patch) helps but it's still quite slow because the GIL only bounces every MICROPY_PY_THREAD_GIL_VM_DIVISOR branches. Therefore scheduled function latency can be quite high if another thread is running.
Replaced that with a change in the linked branch, where scheduler.c bounces the GIL instead of running the scheduler on a non-main threads. This improves things a lot, but it's still higher latency because now the main thread has to take the GIL, then execute another iteration of the main VM loop, and only then it handles the scheduled operation. Whereas before the non-main thread would have run the scheduler immediately.

Going to take another look at this next week. @jimmo mentioned the idea of having a flag so some scheduler nodes only run on the main thread, this might be the best compromise. Otherwise maybe implementing a mutex for TinyUSB (with some caution around ensuring race conditions can't lose TinyUSB events despite the mutex).

dpgeorge · 2024-09-06T00:30:23Z

I've implemented this "scheduler only on the main thread" change and it's working but I don't think it's a good solution after all.

Hmmm, OK, I'll also have a think about this. We do need to try and reduce the latency of scheduled callbacks, because they are being used more and more for background workloads.

projectgus · 2024-09-10T08:07:40Z

Finally got a version of this fix I'm happy with, after @dpgeorge pointed out that this problem only exists where GIL is disabled.

(An aside, I also noticed that there is no user-facing documentation for the GIL at all. I made a note to submit something.)

projectgus · 2024-09-10T08:40:31Z

@dlech This may be of interest, although I'm unsure if its equivalent to pybricks@b196c78 or not (specifically, I don't know if pybricks enables the GIL or not).

dlech · 2024-09-10T13:35:46Z

Yeah, we are using that patch on the Unix port with the GIL enabled. So it looks like this isn't quite the same. The problems were were trying to solve are A) Unix signal handlers should always run on the main thread, like in CPython and B) if there is an unhandled exception in any background task, we want it to stop the whole program by throwing in the main thread rather than throwing in a random thread.

projectgus · 2024-09-11T03:58:33Z

@dlech That makes sense, thanks for the extra detail.

dpgeorge · 2024-09-11T05:18:24Z

OK, so PyBrick's use case is more about functionality, rather than fixing a bug. And because it's on the unix port there's less concern about loss of performance / increase latency of scheduled code.

I have tested this latest PR on rp2 and it fixes the original report in #15390.

py/scheduler.c

shared/tinyusb/mp_usbd_runtime.c

tests/thread/thread_stdin.py

Otherwise it's very difficult to reason about thread safety in a scheduler callback, as it can run at any time on any thread - including racing against any bytecode operation on any thread. This work was funded through GitHub Sponsors. Signed-off-by: Angus Gratton <angus@redyak.com.au>

If GIL is disabled then there's threat of a race condition if some other code specifically requests USB processing (i.e. to unblock stdio), while a scheduled TinyUSB callback is already running on another thread. Relies on the change in the parent commit, where scheduler is restricted to main thread if GIL is disabled. Fixes micropython#15390 - "TinyUSB callback can't recurse" exceptions on rp2 when using _thread module and USB serial I/O. Adds a unit test for stdin functioning correctly in threads (fails on rp2 port without this fix). This work was funded through GitHub Sponsors. Signed-off-by: Angus Gratton <angus@redyak.com.au>

dpgeorge · 2024-09-19T03:29:25Z

This turned out to be a relatively small patch, which is great. Thank you!

projectgus added port-rp2 shared Relates to shared/ directory in source labels Jul 11, 2024

projectgus commented Jul 11, 2024

View reviewed changes

ports/rp2/mphalport.h Outdated Show resolved Hide resolved

projectgus force-pushed the bugfix/rp2_tinyusb_recurse branch 3 times, most recently from 215791a to c902051 Compare July 11, 2024 04:41

projectgus added this to the release-1.24.0 milestone Sep 3, 2024

projectgus added the py-core Relates to py/ directory in source label Sep 10, 2024

projectgus changed the title ~~usb,rp2: Fix "TinyUSB callback can't recurse" error when using threads.~~ py/scheduler,rp2: Fix scheduler race conditions when GIL disabled, fix "TinyUSB callback can't recurse" error Sep 10, 2024

projectgus changed the title ~~py/scheduler,rp2: Fix scheduler race conditions when GIL disabled, fix "TinyUSB callback can't recurse" error~~ py/scheduler,rp2: Avoid scheduler race conditions when GIL disabled, fix "TinyUSB callback can't recurse" error Sep 10, 2024

projectgus force-pushed the bugfix/rp2_tinyusb_recurse branch from 76c025b to a0e1d30 Compare September 10, 2024 08:04

projectgus force-pushed the bugfix/rp2_tinyusb_recurse branch from a0e1d30 to 4732969 Compare September 10, 2024 08:08

projectgus requested a review from dpgeorge September 10, 2024 08:09

dpgeorge reviewed Sep 11, 2024

View reviewed changes

projectgus force-pushed the bugfix/rp2_tinyusb_recurse branch from 4732969 to 54df502 Compare September 11, 2024 07:01

projectgus requested a review from dpgeorge September 11, 2024 07:03

projectgus force-pushed the bugfix/rp2_tinyusb_recurse branch from 54df502 to cf8d5f4 Compare September 17, 2024 01:09

dpgeorge approved these changes Sep 18, 2024

View reviewed changes

projectgus added 2 commits September 19, 2024 13:17

dpgeorge force-pushed the bugfix/rp2_tinyusb_recurse branch from cf8d5f4 to 5d8878b Compare September 19, 2024 03:28

dpgeorge merged commit 5d8878b into micropython:master Sep 19, 2024
63 checks passed

projectgus deleted the bugfix/rp2_tinyusb_recurse branch September 25, 2024 03:46

Uh oh!

py/scheduler,rp2: Avoid scheduler race conditions when GIL disabled, fix "TinyUSB callback can't recurse" error #15448

py/scheduler,rp2: Avoid scheduler race conditions when GIL disabled, fix "TinyUSB callback can't recurse" error #15448

Uh oh!

Conversation

projectgus commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Trade-offs and Alternatives

Testing

Uh oh!

Uh oh!

codecov bot commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions bot commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dpgeorge commented Jul 11, 2024

Uh oh!

projectgus commented Jul 11, 2024

Uh oh!

projectgus commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dpgeorge commented Jul 11, 2024

Uh oh!

projectgus commented Jul 11, 2024

Uh oh!

projectgus commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dpgeorge commented Jul 11, 2024

Uh oh!

projectgus commented Jul 11, 2024

Uh oh!

dpgeorge commented Aug 2, 2024

Uh oh!

projectgus commented Sep 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dpgeorge commented Sep 6, 2024

Uh oh!

projectgus commented Sep 10, 2024

Uh oh!

projectgus commented Sep 10, 2024

Uh oh!

dlech commented Sep 10, 2024

Uh oh!

projectgus commented Sep 11, 2024

Uh oh!

dpgeorge commented Sep 11, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dpgeorge commented Sep 19, 2024

Uh oh!

Uh oh!

projectgus commented Jul 11, 2024 •

edited

Loading

codecov bot commented Jul 11, 2024 •

edited

Loading

github-actions bot commented Jul 11, 2024 •

edited

Loading

projectgus commented Jul 11, 2024 •

edited

Loading

projectgus commented Jul 11, 2024 •

edited

Loading

projectgus commented Sep 5, 2024 •

edited

Loading