-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
rp2: Fix hang triggered by timing of short sleeps and soft timer events #13329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rp2: Fix hang triggered by timing of short sleeps and soft timer events #13329
Conversation
@dpgeorge Marked as Draft because I don't have a Pico-W so need help to test the soft timer changes thoroughly (I did some tests with custom C code.) |
ports/rp2/mphalport.c
Outdated
hardware_alarm_set_callback(MICROPY_HW_SOFT_TIMER_ALARM_NUM, (void *)soft_timer_handler); | ||
// soft timer IRQ handler has to run at PendSV priority. adjusting priority here so we don't | ||
// need to trigger PendSV the timer IRQ. | ||
NVIC_SetPriority(TIMER_IRQ_0_IRQn + MICROPY_HW_SOFT_TIMER_ALARM_NUM, PICO_LOWEST_IRQ_PRIORITY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A weird thing here: the real PendSV IRQ priority depends on whether the network support is compiled in. It's set here in modnetwork.c:
https://github.com/micropython/micropython/blob/master/ports/rp2/mpnetworkport.c#L69
Practically I don't think this matters too much as (AFAIK) it's not used for anything else on rp2 otherwise, but it's a bit odd so I wanted to point it out (which is why I've done it this way here as well.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine. But the comment on lines 306-307 needs rewriting, I don't understand it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dpgeorge Right, it had at least one typo in it to boot. PTAL!
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #13329 +/- ##
=======================================
Coverage 98.36% 98.36%
=======================================
Files 159 159
Lines 21088 21088
=======================================
Hits 20743 20743
Misses 345 345 ☔ View full report in Codecov by Sentry. |
ports/rp2/mphalport.c
Outdated
hardware_alarm_set_callback(MICROPY_HW_SOFT_TIMER_ALARM_NUM, (void *)soft_timer_handler); | ||
// soft timer IRQ handler has to run at PendSV priority. adjusting priority here so we don't | ||
// need to trigger PendSV the timer IRQ. | ||
NVIC_SetPriority(TIMER_IRQ_0_IRQn + MICROPY_HW_SOFT_TIMER_ALARM_NUM, PICO_LOWEST_IRQ_PRIORITY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine. But the comment on lines 306-307 needs rewriting, I don't understand it!
35245a7
to
8af2410
Compare
@dpgeorge Updated, PTAL again! :) |
8af2410
to
f55f520
Compare
Hmm, there's a test (extmod/vfs_lfs_mtime.py) that's started failing with this PR. Investigating now... |
I also see that failure. It's probably because the delay at the start of |
748ef55
to
a7e2484
Compare
Unfortunately this fails on a Pico W running the |
@dpgeorge Ah, damn. I have a Pico W on order but it might not get here until late next week, but I can order another one with faster shipping if this is a blocker. If you're sure it's a race or a deadlock here then another option would be to occupy another hardware alarm interrupt (0 & 1 both still unused) and use that one for |
Yeah, I was thinking that. Would need to use both interrupts, in case the separate cores both need to do the wait at the same time.
I'm going to add/adjust the thread tests so they can run on rp2 properly. That should give a way to test this on a standard Pico. Don't worry about getting this done quickly. It can wait until next week. It needs thorough testing. |
a7e2484
to
2923d54
Compare
@dpgeorge Thanks for adding those I think this is fixed now (failing tests The fix for |
2923d54
to
2926001
Compare
The alarm pool on rp2 pico-sdk stops triggering at times, until an additional interrupt is received to wake the CPU from a WFE state. In systems without regular interrupts this can hang indefinitely. See #12873, and also linked pico-sdk issue raspberrypi/pico-sdk#1552.
This PR removes most MicroPython dependencies on the pico-sdk alarm pool, as a fix for the readily reproducible versions of this bug:
pendsv_suspend()
&pendsv_resume()
is made safe to run from their core, as these functions protect access to the soft timer data structures.There are two remaining places that MicroPython uses the pico-sdk alarm pool on rp2:
machine.Timer()
uses the default alarm pool. A future PR is planned to refactor this to use soft timer, but this requires deeper soft timer changes to support microsecond resolution.multicore_lockout_start_blocking()
which adds an alarmat_end_of_time
. This modifies the alarm pool - but with a timer that never expires, so unlikely to trigger the bug. There is also a USB workaround timeouthw_enumeration_fix_wait_se0()
that creates a short-lived timeout using the alarm pool.After
machine.Timer()
is refactored then a fix for the second point is to build pico-sdk withPICO_TIME_DEFAULT_ALARM_POOL_DISABLED=1
. This converts the other timeouts into busy-waits inside the SDK, and will presumably save some code size as alarm pool code will no longer be compiled into MicroPython. This is not possible yet, asmachine.Timer()
needs an alarm pool and pico-sdk doesn't provide a way to statically initialise a new alarm pool (there isalarm_pool_create()
, but this calls pico-sdk's malloc to allocate memory for the alarm pool... non-functional commit for this is at projectgus@9c8ddd8)Closes #12873 (Rationale: although the alarm pool is still used in the places described, the access patterns that made it easy to hang the rp2 in a WFE state should no longer be triggered unless
machine.Timer()
is used a lot for very short delays.)Future work
machine.Timer()
to use theseThis work was funded through GitHub Sponsors.