stm32: Remove preemptive keyboard interrupt via PendSV #15988

dpgeorge · 2024-10-10T01:45:51Z

Summary

Since the very beginning, the stm32 port (first called stm, then stmhal now stm32) has had a special keyboard interrupt feature which works by using PendSV to break out of any running code.

This preemptive ctrl-C was added long ago in commit 01156d5

The stm32 port still uses that ancient code, and current does this:

If ctrl-C is received on UART or USB then mp_sched_keyboard_interrupt() is called (like all other ports) to set a flag for the VM to see, and then the VM (or any loop calling mp_handle_pending(true)) will eventually handle the KeyboardInterrupt exception, raising it via NLR.
If another ctrl-C is received while the existing scheduled keyboard interrupt is still pending (ie the VM has not yet processed it) then a special hard NLR jump will activate, that preempts the calling code. Within the PendSV interrupt the stack is adjusted and an NLR jump is made to the most recent nlr_push() location. This is like a normal NLR except it is called from an interrupt context and completely annihilates the code that was interrupted by the IRQ.

The reason for the preemptive interrupt was to handle ctrl-C properly before the VM was able to handle it. Then when the esp8266 port came along, it needed a more port-agnostic way to handle ctrl-C, and so the mechanism was added to the VM and runtime to be able to check for pending interrupts. Then the stm32 port was updated to use this mechanism, with a fallback to the old preemptive way if a second ctrl-C was received (without the first one being processed).

This preemptive NLR jump is problematic because it can interrupt long-running instructions (eg store multiple, usually used at the end of a function to restore registers and return). If such an instruction is interrupted the CPU remembers that with some flags, and can resume the long-running instruction when the interrupt finishes. But the preemptive NLR does a long jump to different code at thread level and so the long-running interrupt is never resumed. This leads to a CPU fault.

This fault has been previously reported in #3807 and #3842 (see also #294). I was never able to reproduce the issues reported there, but finally (after 6 years!) I can reproduce it.

The bad behaviour of preemptive NLR jump has become apparent since #15486. If you run the test suite over and over again on any stm32 board, it will eventually crash the board (see testing section below).

The point is, a skipped test now soft resets the board and so the board must run boot.py again. The test runner may then interrupt the execution of boot.py with the double-ctrl-C that it sends (in tools/pyboard.py:enter_raw_repl()) in order to get the board into a known good state for the next test. If the timing is right, this can trigger the preemptive PendSV in an unfortunate location and hard fault the board.

The fix in this PR is to just remove the preemptive NLR jump feature. No other port has this feature and it's not needed, ctrl-C works very well on those ports. Preemptive NLR jump is a very dangerous thing and is obviously buggy.

See this comment here which suggests what's done in this PR: #3842 (comment)

Testing

I have been using this to trigger the fault:

while true; do ./run-tests.py --target pyboard -d extmod stress float micropython basics misc --exclude extreme_exc --exclude mtime --exclude pow3_intbig --exclude recursive_iternext  --exclude float_parse --exclude math_constants --exclude vfs_blockdev_invalid || break; done

That can trigger the error on PYBv1.x, but it happens more regularly on PYBD-SF2/6.

With this PR, the board no longer hard faults. (But it does show a new issue, it can still interrupt boot.py with a single ctrl-C, and that will be fixed in a follow-up PR.)

Trade-offs and Alternatives

Could clear the CPU state for the long-running instruction as suggested in #3842. But it's much simpler to just remove this code, which can have other issues as per #294.

dpgeorge · 2024-10-10T01:46:55Z

@iabdalkader @dlech are you using this preemptive PendSV interrupt feature?

github-actions · 2024-10-10T01:53:30Z

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:   -80 -0.020% PYBV10[incl -4(bss)]
     mimxrt:    +0 +0.000% TEENSY40
        rp2:    +0 +0.000% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

dlech · 2024-10-10T02:15:23Z

you using this preemptive PendSV interrupt feature?

Nope, we use mp_sched_keyboard_interrupt().

andrewleech · 2024-10-10T02:24:25Z

That's a sneaky problem to track down @dpgeorge nice work!

I've noticed that mpremote/pyboard.py do a double ctrl-c

micropython/tools/mpremote/mpremote/transport_serial.py

Line 124 in 197becb

self.serial.write(b"\r\x03\x03") # ctrl-C twice: interrupt any running program

was this originally done to use this hard interrupt? Maybe after sending the ctrl-c a slight timeout after sending would be warranted now as it could take a little longer for the vm to respond? Though as you say, if it's working well on other ports already it's probably not worth worrying about!

Regardless, this change improves consistency with other ports and simplifies behavior so I'm all for it!

dpgeorge · 2024-10-10T03:05:56Z

I've noticed that mpremote/pyboard.py do a double ctrl-c
was this originally done to use this hard interrupt?

Yes it was. Because originally a single ctrl-C would trigger the preemptive NLR jump, but then when softer interrupt handling was added, pyboard.py was updated to issue a double ctrl-c so that it could stop anything on stm32.

Implementation of generic ctrl-C handling and change stm32 to try "soft" interrupt first: 124df6f

Change pyboard.py to issue a double ctrl-C (most relevant to your question): bc1488a

IMO we no longer need the double ctrl-C, we should remove it.

dpgeorge · 2024-10-10T03:07:36Z

ports/stm32/boards/common_isr_ram/common_isr.ld

@@ -14,7 +14,7 @@
       flash (in main.c) along with the isr_vector above.
    */
    . = ALIGN(4);
-    *(.text.pendsv_kbd_intr)
+    *(.text.mp_sched_keyboard_interrupt)


@andrewleech it's curious that you only needed to put pendsv_kbd_intr in RAM. That function calls mp_sched_keyboard_interrupt so I wonder why you didn't need to put the latter in RAM as well?

Oh that is interesting... maybe neither of them actually need to be in ram... or more likely my testing didn't catch the scenario where ctrl-c was being handled by the UART isr at the exact moment flash was locked. Just before or just after it wouldn't matter.
I think I added that function to the ram list based just on review of things called from the isr function but forgot to follow it down the call stack to subsequent functions.

OK, thanks for confirming. I'll leave this as it is in this PR because it makes sense to have this function in RAM.

projectgus

This looks good to me! Rationale makes sense, and the other ports have been doing without it just fine.

Apart from the very legitimate problems with CPU register state, the idea of randomly jumping out of slow-running or stuck C code seems convenient on the one hand but terrifying on the other - too many opportunities for hardware and software to wind up left in some invalid state with no cleanup path.

dpgeorge · 2024-10-10T05:38:04Z

But it does show a new issue, it can still interrupt boot.py with a single ctrl-C, and that will be fixed in a follow-up PR.

See #15990 for full description of this problem, and a fix for that.

dpgeorge · 2024-10-10T05:53:40Z

IMO we no longer need the double ctrl-C, we should remove it.

See #15991 for that.

Since the very beginning, the stm32 port (first called stm, then stmhal now stm32) has had a special keyboard interrupt feature which works by using PendSV to break out of any running code. This preemptive ctrl-C was added long ago in commit 01156d5. The stm32 port still uses that code, and current does this: - If ctrl-C is received on UART or USB then `mp_sched_keyboard_interrupt()` is called (like all other ports) to set a flag for the VM to see, and then the VM (or any loop calling `mp_handle_pending(true)`) will eventually handle the `KeyboardInterrupt` exception, raising it via NLR. - If another ctrl-C is received while the existing scheduled keyboard interrupt is still pending (ie the VM has not yet processed it) then a special hard NLR jump will activate, that preempts the calling code. Within the PendSV interrupt the stack is adjusted and an NLR jump is made to the most recent `nlr_push()` location. This is like a normal NLR except it is called from an interrupt context and completely annihilates the code that was interrupted by the IRQ. The reason for the preemptive interrupt was to handle ctrl-C before the VM was able to handle it. Eventually a mechanism (that's in use today by all ports) was added to the VM and runtime to be able to check for pending interrupts. Then the stm32 port was updated to use this mechanism, with a fallback to the old preemptive way if a second ctrl-C was received (without the first one being processed). This preemptive NLR jump is problematic because it can interrupt long-running instructions (eg store multiple, usually used at the end of a function to restore registers and return). If such an instruction is interrupted the CPU remembers that with some flags, and can resume the long-running instruction when the interrupt finishes. But the preemptive NLR does a long jump to different code at thread level and so the long-running interrupt is never resumed. This leads to a CPU fault. This fault has been previously reported in issues micropython#3807 and micropython#3842 (see also issue micropython#294). It's now possible to easily reproduce this problem, since commit 69c25ea. Running the test suite over and over again on any stm32 board will eventually crash the board (it can happen on a PYBv1.x, but it happens more regularly on PYBD-SF2/6). The point is, a skipped test now soft resets the board and so the board must run `boot.py` again. The test runner may then interrupt the execution of `boot.py` with the double-ctrl-C that it sends (in `tools/pyboard.py`, `enter_raw_repl()`) in order to get the board into a known good state for the next test. If the timing is right, this can trigger the preemptive PendSV in an unfortunate location and hard fault the board. The fix in this commit is to just remove the preemptive NLR jump feature. No other port has this feature and it's not needed, ctrl-C works very well on those ports. Preemptive NLR jump is a very dangerous thing (eg it may interrupt and break out of an external SPI flash operation when reading code from a filesystem) and is obviously buggy. With this commit, stm32 borads no longer hard fault when running the test suite (but it does leave an issue, the tests can still interrupt `boot.py` with a single ctrl-C; that will be fixed separately). An alternative to this commit would be to clear the CPU state for the long-running instruction as suggested in issue micropython#3842. But it's much simpler to just remove this code, which is now unnecessary and can have other problems as per issue micropython#294. Signed-off-by: Damien George <damien@micropython.org>

Following the same change to the stm32 port. Signed-off-by: Damien George <damien@micropython.org>

iabdalkader · 2024-10-15T07:13:38Z

This fault has been previously reported in #3807 and #3842 (see also #294). I was never able to reproduce the issues reported there, but finally (after 6 years!) I can reproduce it.
@iabdalkader @dlech are you using this preemptive PendSV interrupt feature?

Sorry for the late reply. As a workaround for this issue, I've been clearing the stacked LDM/STM, but I've recently switched to the new VM abort feature and no longer need the hard-pendsv.

iabdalkader · 2024-10-15T07:15:09Z

ports/renesas-ra/pendsv.c

-        "ldr r0, [r1]\n"
-        "cmp r0, #0\n"
-        "beq .no_obj\n"
-        #if defined(PENDSV_DEBUG)


Note the comment just above this, about the debug layout has been invalid and now outdated. Glad to see the PENDSV_DEBUG removed, it should be removed from all ports.

It is now removed from all ports (although I see there's still a few stray -DPENDSV_DEBUG remaining in Makefile's that should be removed).

As for the comments for the stack layout: feel free to open a PR to fix those.

As for the comments for the stack layout: feel free to open a PR to fix those.

I think we should remove, it was never correct; the stack layout doesn't change with different builds, the registers are stacked by the hardware so the layout is always the same. I did some extensive testing here:

#315 (comment)

The hardfault reported in #315 was due to this very same issue, but back then (10 years ago) we assumed, incorrectly, that the stack layout changed for debug builds. This was the reason PENDSV_DEBUG was introduced, and it somehow made debugging made work again, for a while at least.

dpgeorge · 2024-10-15T11:04:38Z

As a workaround for this issue, I've been clearing the stacked LDM/STM, but I've recently switched to the new VM abort feature and no longer need the hard-pendsv.

OK, great, thanks for confirming.

dpgeorge requested a review from projectgus October 10, 2024 01:46

dpgeorge commented Oct 10, 2024

View reviewed changes

projectgus approved these changes Oct 10, 2024

View reviewed changes

dpgeorge mentioned this pull request Oct 10, 2024

tools: Only issue a single Ctrl-C when entering raw REPL. #15991

Merged

dpgeorge added the port-stm32 label Oct 10, 2024

dpgeorge added this to the release-1.24.0 milestone Oct 10, 2024

dpgeorge added 2 commits October 15, 2024 10:32

renesas-ra/pendsv: Remove preemptive keyboard interrupt via PendSV.

1223fa9

Following the same change to the stm32 port. Signed-off-by: Damien George <damien@micropython.org>

dpgeorge force-pushed the stm32-remove-pendsv-kbd-intr branch from 28a5c39 to 1223fa9 Compare October 14, 2024 23:33

dpgeorge merged commit 1223fa9 into micropython:master Oct 14, 2024
11 checks passed

dpgeorge deleted the stm32-remove-pendsv-kbd-intr branch October 14, 2024 23:42

iabdalkader reviewed Oct 15, 2024

View reviewed changes

Uh oh!

stm32: Remove preemptive keyboard interrupt via PendSV #15988

stm32: Remove preemptive keyboard interrupt via PendSV #15988

Uh oh!

Conversation

dpgeorge commented Oct 10, 2024

Summary

Testing

Trade-offs and Alternatives

Uh oh!

dpgeorge commented Oct 10, 2024

Uh oh!

github-actions bot commented Oct 10, 2024

Uh oh!

dlech commented Oct 10, 2024

Uh oh!

andrewleech commented Oct 10, 2024

Uh oh!

dpgeorge commented Oct 10, 2024

Uh oh!

dpgeorge Oct 10, 2024

Choose a reason for hiding this comment

Uh oh!

andrewleech Oct 11, 2024

Choose a reason for hiding this comment

Uh oh!

dpgeorge Oct 11, 2024

Choose a reason for hiding this comment

Uh oh!

projectgus left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dpgeorge commented Oct 10, 2024

Uh oh!

dpgeorge commented Oct 10, 2024

Uh oh!

Uh oh!

iabdalkader commented Oct 15, 2024

Uh oh!

iabdalkader Oct 15, 2024

Choose a reason for hiding this comment

Uh oh!

dpgeorge Oct 15, 2024

Choose a reason for hiding this comment

Uh oh!

iabdalkader Oct 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dpgeorge commented Oct 15, 2024

Uh oh!

Uh oh!

projectgus left a comment •

edited

Loading

iabdalkader Oct 15, 2024 •

edited

Loading