Skip to content

stm32: Remove preemptive keyboard interrupt via PendSV #15988

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 14, 2024

Conversation

dpgeorge
Copy link
Member

Summary

Since the very beginning, the stm32 port (first called stm, then stmhal now stm32) has had a special keyboard interrupt feature which works by using PendSV to break out of any running code.

This preemptive ctrl-C was added long ago in commit 01156d5

The stm32 port still uses that ancient code, and current does this:

  • If ctrl-C is received on UART or USB then mp_sched_keyboard_interrupt() is called (like all other ports) to set a flag for the VM to see, and then the VM (or any loop calling mp_handle_pending(true)) will eventually handle the KeyboardInterrupt exception, raising it via NLR.
  • If another ctrl-C is received while the existing scheduled keyboard interrupt is still pending (ie the VM has not yet processed it) then a special hard NLR jump will activate, that preempts the calling code. Within the PendSV interrupt the stack is adjusted and an NLR jump is made to the most recent nlr_push() location. This is like a normal NLR except it is called from an interrupt context and completely annihilates the code that was interrupted by the IRQ.

The reason for the preemptive interrupt was to handle ctrl-C properly before the VM was able to handle it. Then when the esp8266 port came along, it needed a more port-agnostic way to handle ctrl-C, and so the mechanism was added to the VM and runtime to be able to check for pending interrupts. Then the stm32 port was updated to use this mechanism, with a fallback to the old preemptive way if a second ctrl-C was received (without the first one being processed).

This preemptive NLR jump is problematic because it can interrupt long-running instructions (eg store multiple, usually used at the end of a function to restore registers and return). If such an instruction is interrupted the CPU remembers that with some flags, and can resume the long-running instruction when the interrupt finishes. But the preemptive NLR does a long jump to different code at thread level and so the long-running interrupt is never resumed. This leads to a CPU fault.

This fault has been previously reported in #3807 and #3842 (see also #294). I was never able to reproduce the issues reported there, but finally (after 6 years!) I can reproduce it.

The bad behaviour of preemptive NLR jump has become apparent since #15486. If you run the test suite over and over again on any stm32 board, it will eventually crash the board (see testing section below).

The point is, a skipped test now soft resets the board and so the board must run boot.py again. The test runner may then interrupt the execution of boot.py with the double-ctrl-C that it sends (in tools/pyboard.py:enter_raw_repl()) in order to get the board into a known good state for the next test. If the timing is right, this can trigger the preemptive PendSV in an unfortunate location and hard fault the board.

The fix in this PR is to just remove the preemptive NLR jump feature. No other port has this feature and it's not needed, ctrl-C works very well on those ports. Preemptive NLR jump is a very dangerous thing and is obviously buggy.

See this comment here which suggests what's done in this PR: #3842 (comment)

Testing

I have been using this to trigger the fault:

while true; do ./run-tests.py --target pyboard -d extmod stress float micropython basics misc --exclude extreme_exc --exclude mtime --exclude pow3_intbig --exclude recursive_iternext  --exclude float_parse --exclude math_constants --exclude vfs_blockdev_invalid || break; done

That can trigger the error on PYBv1.x, but it happens more regularly on PYBD-SF2/6.

With this PR, the board no longer hard faults. (But it does show a new issue, it can still interrupt boot.py with a single ctrl-C, and that will be fixed in a follow-up PR.)

Trade-offs and Alternatives

Could clear the CPU state for the long-running instruction as suggested in #3842. But it's much simpler to just remove this code, which can have other issues as per #294.

@dpgeorge dpgeorge requested a review from projectgus October 10, 2024 01:46
@dpgeorge
Copy link
Member Author

@iabdalkader @dlech are you using this preemptive PendSV interrupt feature?

Copy link

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:   -80 -0.020% PYBV10[incl -4(bss)]
     mimxrt:    +0 +0.000% TEENSY40
        rp2:    +0 +0.000% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

@dlech
Copy link
Contributor

dlech commented Oct 10, 2024

you using this preemptive PendSV interrupt feature?

Nope, we use mp_sched_keyboard_interrupt().

@andrewleech
Copy link
Contributor

That's a sneaky problem to track down @dpgeorge nice work!

I've noticed that mpremote/pyboard.py do a double ctrl-c

self.serial.write(b"\r\x03\x03") # ctrl-C twice: interrupt any running program
was this originally done to use this hard interrupt? Maybe after sending the ctrl-c a slight timeout after sending would be warranted now as it could take a little longer for the vm to respond? Though as you say, if it's working well on other ports already it's probably not worth worrying about!

Regardless, this change improves consistency with other ports and simplifies behavior so I'm all for it!

@dpgeorge
Copy link
Member Author

I've noticed that mpremote/pyboard.py do a double ctrl-c
was this originally done to use this hard interrupt?

Yes it was. Because originally a single ctrl-C would trigger the preemptive NLR jump, but then when softer interrupt handling was added, pyboard.py was updated to issue a double ctrl-c so that it could stop anything on stm32.

Implementation of generic ctrl-C handling and change stm32 to try "soft" interrupt first: 124df6f

Change pyboard.py to issue a double ctrl-C (most relevant to your question): bc1488a

IMO we no longer need the double ctrl-C, we should remove it.

@@ -14,7 +14,7 @@
flash (in main.c) along with the isr_vector above.
*/
. = ALIGN(4);
*(.text.pendsv_kbd_intr)
*(.text.mp_sched_keyboard_interrupt)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewleech it's curious that you only needed to put pendsv_kbd_intr in RAM. That function calls mp_sched_keyboard_interrupt so I wonder why you didn't need to put the latter in RAM as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that is interesting... maybe neither of them actually need to be in ram... or more likely my testing didn't catch the scenario where ctrl-c was being handled by the UART isr at the exact moment flash was locked. Just before or just after it wouldn't matter.
I think I added that function to the ram list based just on review of things called from the isr function but forgot to follow it down the call stack to subsequent functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks for confirming. I'll leave this as it is in this PR because it makes sense to have this function in RAM.

Copy link
Contributor

@projectgus projectgus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! Rationale makes sense, and the other ports have been doing without it just fine.

Apart from the very legitimate problems with CPU register state, the idea of randomly jumping out of slow-running or stuck C code seems convenient on the one hand but terrifying on the other - too many opportunities for hardware and software to wind up left in some invalid state with no cleanup path.

@dpgeorge
Copy link
Member Author

But it does show a new issue, it can still interrupt boot.py with a single ctrl-C, and that will be fixed in a follow-up PR.

See #15990 for full description of this problem, and a fix for that.

@dpgeorge
Copy link
Member Author

IMO we no longer need the double ctrl-C, we should remove it.

See #15991 for that.

@dpgeorge dpgeorge added this to the release-1.24.0 milestone Oct 10, 2024
Since the very beginning, the stm32 port (first called stm, then stmhal now
stm32) has had a special keyboard interrupt feature which works by using
PendSV to break out of any running code.  This preemptive ctrl-C was added
long ago in commit 01156d5.

The stm32 port still uses that code, and current does this:

- If ctrl-C is received on UART or USB then `mp_sched_keyboard_interrupt()`
  is called (like all other ports) to set a flag for the VM to see, and
  then the VM (or any loop calling `mp_handle_pending(true)`) will
  eventually handle the `KeyboardInterrupt` exception, raising it via NLR.

- If another ctrl-C is received while the existing scheduled keyboard
  interrupt is still pending (ie the VM has not yet processed it) then a
  special hard NLR jump will activate, that preempts the calling code.
  Within the PendSV interrupt the stack is adjusted and an NLR jump is made
  to the most recent `nlr_push()` location.  This is like a normal NLR
  except it is called from an interrupt context and completely annihilates
  the code that was interrupted by the IRQ.

The reason for the preemptive interrupt was to handle ctrl-C before the VM
was able to handle it.  Eventually a mechanism (that's in use today by all
ports) was added to the VM and runtime to be able to check for pending
interrupts.  Then the stm32 port was updated to use this mechanism, with a
fallback to the old preemptive way if a second ctrl-C was received (without
the first one being processed).

This preemptive NLR jump is problematic because it can interrupt
long-running instructions (eg store multiple, usually used at the end of a
function to restore registers and return).  If such an instruction is
interrupted the CPU remembers that with some flags, and can resume the
long-running instruction when the interrupt finishes.  But the preemptive
NLR does a long jump to different code at thread level and so the
long-running interrupt is never resumed.  This leads to a CPU fault.

This fault has been previously reported in issues micropython#3807 and micropython#3842 (see also
issue micropython#294).  It's now possible to easily reproduce this problem, since
commit 69c25ea.  Running the test suite
over and over again on any stm32 board will eventually crash the board (it
can happen on a PYBv1.x, but it happens more regularly on PYBD-SF2/6).

The point is, a skipped test now soft resets the board and so the board
must run `boot.py` again.  The test runner may then interrupt the execution
of `boot.py` with the double-ctrl-C that it sends (in `tools/pyboard.py`,
`enter_raw_repl()`) in order to get the board into a known good state for
the next test.  If the timing is right, this can trigger the preemptive
PendSV in an unfortunate location and hard fault the board.

The fix in this commit is to just remove the preemptive NLR jump feature.
No other port has this feature and it's not needed, ctrl-C works very well
on those ports.  Preemptive NLR jump is a very dangerous thing (eg it may
interrupt and break out of an external SPI flash operation when reading
code from a filesystem) and is obviously buggy.

With this commit, stm32 borads no longer hard fault when running the test
suite (but it does leave an issue, the tests can still interrupt `boot.py`
with a single ctrl-C; that will be fixed separately).

An alternative to this commit would be to clear the CPU state for the
long-running instruction as suggested in issue micropython#3842.  But it's much
simpler to just remove this code, which is now unnecessary and can have
other problems as per issue micropython#294.

Signed-off-by: Damien George <damien@micropython.org>
Following the same change to the stm32 port.

Signed-off-by: Damien George <damien@micropython.org>
@dpgeorge dpgeorge force-pushed the stm32-remove-pendsv-kbd-intr branch from 28a5c39 to 1223fa9 Compare October 14, 2024 23:33
@dpgeorge dpgeorge merged commit 1223fa9 into micropython:master Oct 14, 2024
11 checks passed
@dpgeorge dpgeorge deleted the stm32-remove-pendsv-kbd-intr branch October 14, 2024 23:42
@iabdalkader
Copy link
Contributor

This fault has been previously reported in #3807 and #3842 (see also #294). I was never able to reproduce the issues reported there, but finally (after 6 years!) I can reproduce it.
@iabdalkader @dlech are you using this preemptive PendSV interrupt feature?

Sorry for the late reply. As a workaround for this issue, I've been clearing the stacked LDM/STM, but I've recently switched to the new VM abort feature and no longer need the hard-pendsv.

"ldr r0, [r1]\n"
"cmp r0, #0\n"
"beq .no_obj\n"
#if defined(PENDSV_DEBUG)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the comment just above this, about the debug layout has been invalid and now outdated. Glad to see the PENDSV_DEBUG removed, it should be removed from all ports.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is now removed from all ports (although I see there's still a few stray -DPENDSV_DEBUG remaining in Makefile's that should be removed).

As for the comments for the stack layout: feel free to open a PR to fix those.

Copy link
Contributor

@iabdalkader iabdalkader Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the comments for the stack layout: feel free to open a PR to fix those.

I think we should remove, it was never correct; the stack layout doesn't change with different builds, the registers are stacked by the hardware so the layout is always the same. I did some extensive testing here:

#315 (comment)

The hardfault reported in #315 was due to this very same issue, but back then (10 years ago) we assumed, incorrectly, that the stack layout changed for debug builds. This was the reason PENDSV_DEBUG was introduced, and it somehow made debugging made work again, for a while at least.

@dpgeorge
Copy link
Member Author

As a workaround for this issue, I've been clearing the stacked LDM/STM, but I've recently switched to the new VM abort feature and no longer need the hard-pendsv.

OK, great, thanks for confirming.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants