Skip to content

rp2: Fix hang triggered by timing of short sleeps and soft timer events #13329

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

projectgus
Copy link
Contributor

@projectgus projectgus commented Jan 3, 2024

The alarm pool on rp2 pico-sdk stops triggering at times, until an additional interrupt is received to wake the CPU from a WFE state. In systems without regular interrupts this can hang indefinitely. See #12873, and also linked pico-sdk issue raspberrypi/pico-sdk#1552.

This PR removes most MicroPython dependencies on the pico-sdk alarm pool, as a fix for the readily reproducible versions of this bug:

  • Soft timer is refactored to use a standalone hardware timer alarm. To support this, pendsv_suspend() & pendsv_resume() is made safe to run from their core, as these functions protect access to the soft timer data structures.
  • Functions for sleeping & waiting for events are refactored to use the soft timer instead of pico-sdk functions.

There are two remaining places that MicroPython uses the pico-sdk alarm pool on rp2:

  • machine.Timer() uses the default alarm pool. A future PR is planned to refactor this to use soft timer, but this requires deeper soft timer changes to support microsecond resolution.
  • pico-sdk invokes alarm pool functions internally. Notably multicore_lockout_start_blocking() which adds an alarm at_end_of_time. This modifies the alarm pool - but with a timer that never expires, so unlikely to trigger the bug. There is also a USB workaround timeout hw_enumeration_fix_wait_se0() that creates a short-lived timeout using the alarm pool.

After machine.Timer() is refactored then a fix for the second point is to build pico-sdk with PICO_TIME_DEFAULT_ALARM_POOL_DISABLED=1. This converts the other timeouts into busy-waits inside the SDK, and will presumably save some code size as alarm pool code will no longer be compiled into MicroPython. This is not possible yet, as machine.Timer() needs an alarm pool and pico-sdk doesn't provide a way to statically initialise a new alarm pool (there is alarm_pool_create(), but this calls pico-sdk's malloc to allocate memory for the alarm pool... non-functional commit for this is at projectgus@9c8ddd8)

Closes #12873 (Rationale: although the alarm pool is still used in the places described, the access patterns that made it easy to hang the rp2 in a WFE state should no longer be triggered unless machine.Timer() is used a lot for very short delays.)

Future work

  1. Refactor soft_timer to support 64-bit microsecond timeouts
  2. Refactor machine.Timer() to use these
  3. Disable default alarm pool in the pico-sdk configuration.

This work was funded through GitHub Sponsors.

@projectgus
Copy link
Contributor Author

projectgus commented Jan 3, 2024

@dpgeorge Marked as Draft because I don't have a Pico-W so need help to test the soft timer changes thoroughly (I did some tests with custom C code.)

hardware_alarm_set_callback(MICROPY_HW_SOFT_TIMER_ALARM_NUM, (void *)soft_timer_handler);
// soft timer IRQ handler has to run at PendSV priority. adjusting priority here so we don't
// need to trigger PendSV the timer IRQ.
NVIC_SetPriority(TIMER_IRQ_0_IRQn + MICROPY_HW_SOFT_TIMER_ALARM_NUM, PICO_LOWEST_IRQ_PRIORITY);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A weird thing here: the real PendSV IRQ priority depends on whether the network support is compiled in. It's set here in modnetwork.c:
https://github.com/micropython/micropython/blob/master/ports/rp2/mpnetworkport.c#L69

Practically I don't think this matters too much as (AFAIK) it's not used for anything else on rp2 otherwise, but it's a bit odd so I wanted to point it out (which is why I've done it this way here as well.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine. But the comment on lines 306-307 needs rewriting, I don't understand it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dpgeorge Right, it had at least one typo in it to boot. PTAL!

Copy link

codecov bot commented Jan 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2ed976f) 98.36% compared to head (2923d54) 98.36%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #13329   +/-   ##
=======================================
  Coverage   98.36%   98.36%           
=======================================
  Files         159      159           
  Lines       21088    21088           
=======================================
  Hits        20743    20743           
  Misses        345      345           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

hardware_alarm_set_callback(MICROPY_HW_SOFT_TIMER_ALARM_NUM, (void *)soft_timer_handler);
// soft timer IRQ handler has to run at PendSV priority. adjusting priority here so we don't
// need to trigger PendSV the timer IRQ.
NVIC_SetPriority(TIMER_IRQ_0_IRQn + MICROPY_HW_SOFT_TIMER_ALARM_NUM, PICO_LOWEST_IRQ_PRIORITY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine. But the comment on lines 306-307 needs rewriting, I don't understand it!

@projectgus projectgus force-pushed the bugfix/remove_alarm_pool branch from 35245a7 to 8af2410 Compare January 3, 2024 05:29
@projectgus
Copy link
Contributor Author

@dpgeorge Updated, PTAL again! :)

@projectgus projectgus marked this pull request as ready for review January 3, 2024 05:31
@projectgus projectgus force-pushed the bugfix/remove_alarm_pool branch from 8af2410 to f55f520 Compare January 3, 2024 06:13
@projectgus projectgus marked this pull request as draft January 3, 2024 06:16
@projectgus
Copy link
Contributor Author

Hmm, there's a test (extmod/vfs_lfs_mtime.py) that's started failing with this PR. Investigating now...

@dpgeorge
Copy link
Member

dpgeorge commented Jan 3, 2024

there's a test (extmod/vfs_lfs_mtime.py) that's started failing with this PR

I also see that failure. It's probably because the delay at the start of mp_hal_time_ns_set_from_rtc() that was changed in this PR is no longer long enough.

@projectgus projectgus force-pushed the bugfix/remove_alarm_pool branch 2 times, most recently from 748ef55 to a7e2484 Compare January 3, 2024 06:58
@projectgus projectgus marked this pull request as ready for review January 3, 2024 07:06
@dpgeorge
Copy link
Member

dpgeorge commented Jan 3, 2024

Unfortunately this fails on a Pico W running the extmod/select_poll_eintr.py test. It looks like it's due to races in the soft timer between core0 and core1, due to mp_wfe_or_timeout() being implemented using the soft timer. I guess soft timer needs to have a mutex and/or use a proper atomic section. That needs further investigation.

@projectgus
Copy link
Contributor Author

projectgus commented Jan 3, 2024

@dpgeorge Ah, damn. I have a Pico W on order but it might not get here until late next week, but I can order another one with faster shipping if this is a blocker.

If you're sure it's a race or a deadlock here then another option would be to occupy another hardware alarm interrupt (0 & 1 both still unused) and use that one for mp_wfe_or_timeout(). I can make some time to implement this if it's useful, although obviously I can't properly test it yet...

@dpgeorge
Copy link
Member

dpgeorge commented Jan 3, 2024

If you're sure it's a race or a deadlock here then another option would be to occupy another hardware alarm interrupt (0 & 1 both still unused) and use that one for mp_wfe_or_timeout()

Yeah, I was thinking that. Would need to use both interrupts, in case the separate cores both need to do the wait at the same time.

I can make some time to implement this if it's useful, although obviously I can't properly test it yet...

I'm going to add/adjust the thread tests so they can run on rp2 properly. That should give a way to test this on a standard Pico.

Don't worry about getting this done quickly. It can wait until next week. It needs thorough testing.

@dpgeorge
Copy link
Member

dpgeorge commented Jan 4, 2024

I added thread tests in #13351. Running the test suite on a Pico (via ./run-tests.py --target rp2) should now pick up all known bugs with threading, including those already fixed by #13312.

@projectgus projectgus force-pushed the bugfix/remove_alarm_pool branch from a7e2484 to 2923d54 Compare January 10, 2024 00:47
@projectgus
Copy link
Contributor Author

@dpgeorge Thanks for adding those _thread tests! My Pico W turned up as well, so I was able to run tests on it also.

I think this is fixed now (failing tests builtin_pow3_intbig ssl_sslcontext_ciphers thread_ident1 which all seem to be unrelated/known failures.)

The fix for pendsv_suspend() uses a recursive mutex (@jimmo's favourite), as it seemed like the least complex way to achieve two slightly unrelated goals: atomic increment/decrement of the pendsv suspend counter, and mutual exclusive access to soft timer internals while pendsv is suspended. See the commit message for more of an explanation.

@dpgeorge dpgeorge added this to the release-1.24.0 milestone Mar 25, 2024
@dpgeorge dpgeorge closed this May 31, 2024
@dpgeorge dpgeorge force-pushed the bugfix/remove_alarm_pool branch from 2923d54 to 2926001 Compare May 31, 2024 06:50
@dpgeorge
Copy link
Member

Sorry, I messed up rebasing this on latest master, and force pushing (I pushed master instead of the branch... so this PR disappeared!).

Rebased and merged in 74fb42a through 3af006e

@projectgus projectgus deleted the bugfix/remove_alarm_pool branch June 4, 2024 00:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pico W: sleep_us causes execution to hang
2 participants