-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
stm32: Remove preemptive keyboard interrupt via PendSV #15988
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stm32: Remove preemptive keyboard interrupt via PendSV #15988
Conversation
@iabdalkader @dlech are you using this preemptive PendSV interrupt feature? |
Code size report:
|
Nope, we use |
That's a sneaky problem to track down @dpgeorge nice work! I've noticed that mpremote/pyboard.py do a double ctrl-c
Regardless, this change improves consistency with other ports and simplifies behavior so I'm all for it! |
Yes it was. Because originally a single ctrl-C would trigger the preemptive NLR jump, but then when softer interrupt handling was added, Implementation of generic ctrl-C handling and change Change IMO we no longer need the double ctrl-C, we should remove it. |
@@ -14,7 +14,7 @@ | |||
flash (in main.c) along with the isr_vector above. | |||
*/ | |||
. = ALIGN(4); | |||
*(.text.pendsv_kbd_intr) | |||
*(.text.mp_sched_keyboard_interrupt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewleech it's curious that you only needed to put pendsv_kbd_intr
in RAM. That function calls mp_sched_keyboard_interrupt
so I wonder why you didn't need to put the latter in RAM as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that is interesting... maybe neither of them actually need to be in ram... or more likely my testing didn't catch the scenario where ctrl-c was being handled by the UART isr at the exact moment flash was locked. Just before or just after it wouldn't matter.
I think I added that function to the ram list based just on review of things called from the isr function but forgot to follow it down the call stack to subsequent functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks for confirming. I'll leave this as it is in this PR because it makes sense to have this function in RAM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me! Rationale makes sense, and the other ports have been doing without it just fine.
Apart from the very legitimate problems with CPU register state, the idea of randomly jumping out of slow-running or stuck C code seems convenient on the one hand but terrifying on the other - too many opportunities for hardware and software to wind up left in some invalid state with no cleanup path.
See #15990 for full description of this problem, and a fix for that. |
See #15991 for that. |
Since the very beginning, the stm32 port (first called stm, then stmhal now stm32) has had a special keyboard interrupt feature which works by using PendSV to break out of any running code. This preemptive ctrl-C was added long ago in commit 01156d5. The stm32 port still uses that code, and current does this: - If ctrl-C is received on UART or USB then `mp_sched_keyboard_interrupt()` is called (like all other ports) to set a flag for the VM to see, and then the VM (or any loop calling `mp_handle_pending(true)`) will eventually handle the `KeyboardInterrupt` exception, raising it via NLR. - If another ctrl-C is received while the existing scheduled keyboard interrupt is still pending (ie the VM has not yet processed it) then a special hard NLR jump will activate, that preempts the calling code. Within the PendSV interrupt the stack is adjusted and an NLR jump is made to the most recent `nlr_push()` location. This is like a normal NLR except it is called from an interrupt context and completely annihilates the code that was interrupted by the IRQ. The reason for the preemptive interrupt was to handle ctrl-C before the VM was able to handle it. Eventually a mechanism (that's in use today by all ports) was added to the VM and runtime to be able to check for pending interrupts. Then the stm32 port was updated to use this mechanism, with a fallback to the old preemptive way if a second ctrl-C was received (without the first one being processed). This preemptive NLR jump is problematic because it can interrupt long-running instructions (eg store multiple, usually used at the end of a function to restore registers and return). If such an instruction is interrupted the CPU remembers that with some flags, and can resume the long-running instruction when the interrupt finishes. But the preemptive NLR does a long jump to different code at thread level and so the long-running interrupt is never resumed. This leads to a CPU fault. This fault has been previously reported in issues micropython#3807 and micropython#3842 (see also issue micropython#294). It's now possible to easily reproduce this problem, since commit 69c25ea. Running the test suite over and over again on any stm32 board will eventually crash the board (it can happen on a PYBv1.x, but it happens more regularly on PYBD-SF2/6). The point is, a skipped test now soft resets the board and so the board must run `boot.py` again. The test runner may then interrupt the execution of `boot.py` with the double-ctrl-C that it sends (in `tools/pyboard.py`, `enter_raw_repl()`) in order to get the board into a known good state for the next test. If the timing is right, this can trigger the preemptive PendSV in an unfortunate location and hard fault the board. The fix in this commit is to just remove the preemptive NLR jump feature. No other port has this feature and it's not needed, ctrl-C works very well on those ports. Preemptive NLR jump is a very dangerous thing (eg it may interrupt and break out of an external SPI flash operation when reading code from a filesystem) and is obviously buggy. With this commit, stm32 borads no longer hard fault when running the test suite (but it does leave an issue, the tests can still interrupt `boot.py` with a single ctrl-C; that will be fixed separately). An alternative to this commit would be to clear the CPU state for the long-running instruction as suggested in issue micropython#3842. But it's much simpler to just remove this code, which is now unnecessary and can have other problems as per issue micropython#294. Signed-off-by: Damien George <damien@micropython.org>
Following the same change to the stm32 port. Signed-off-by: Damien George <damien@micropython.org>
28a5c39
to
1223fa9
Compare
Sorry for the late reply. As a workaround for this issue, I've been clearing the stacked LDM/STM, but I've recently switched to the new VM abort feature and no longer need the hard-pendsv. |
"ldr r0, [r1]\n" | ||
"cmp r0, #0\n" | ||
"beq .no_obj\n" | ||
#if defined(PENDSV_DEBUG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the comment just above this, about the debug layout has been invalid and now outdated. Glad to see the PENDSV_DEBUG
removed, it should be removed from all ports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is now removed from all ports (although I see there's still a few stray -DPENDSV_DEBUG
remaining in Makefile's that should be removed).
As for the comments for the stack layout: feel free to open a PR to fix those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the comments for the stack layout: feel free to open a PR to fix those.
I think we should remove, it was never correct; the stack layout doesn't change with different builds, the registers are stacked by the hardware so the layout is always the same. I did some extensive testing here:
The hardfault reported in #315 was due to this very same issue, but back then (10 years ago) we assumed, incorrectly, that the stack layout changed for debug builds. This was the reason PENDSV_DEBUG
was introduced, and it somehow made debugging made work again, for a while at least.
OK, great, thanks for confirming. |
Summary
Since the very beginning, the
stm32
port (first calledstm
, thenstmhal
nowstm32
) has had a special keyboard interrupt feature which works by using PendSV to break out of any running code.This preemptive ctrl-C was added long ago in commit 01156d5
The
stm32
port still uses that ancient code, and current does this:mp_sched_keyboard_interrupt()
is called (like all other ports) to set a flag for the VM to see, and then the VM (or any loop callingmp_handle_pending(true)
) will eventually handle theKeyboardInterrupt
exception, raising it via NLR.nlr_push()
location. This is like a normal NLR except it is called from an interrupt context and completely annihilates the code that was interrupted by the IRQ.The reason for the preemptive interrupt was to handle ctrl-C properly before the VM was able to handle it. Then when the esp8266 port came along, it needed a more port-agnostic way to handle ctrl-C, and so the mechanism was added to the VM and runtime to be able to check for pending interrupts. Then the
stm32
port was updated to use this mechanism, with a fallback to the old preemptive way if a second ctrl-C was received (without the first one being processed).This preemptive NLR jump is problematic because it can interrupt long-running instructions (eg store multiple, usually used at the end of a function to restore registers and return). If such an instruction is interrupted the CPU remembers that with some flags, and can resume the long-running instruction when the interrupt finishes. But the preemptive NLR does a long jump to different code at thread level and so the long-running interrupt is never resumed. This leads to a CPU fault.
This fault has been previously reported in #3807 and #3842 (see also #294). I was never able to reproduce the issues reported there, but finally (after 6 years!) I can reproduce it.
The bad behaviour of preemptive NLR jump has become apparent since #15486. If you run the test suite over and over again on any stm32 board, it will eventually crash the board (see testing section below).
The point is, a skipped test now soft resets the board and so the board must run
boot.py
again. The test runner may then interrupt the execution ofboot.py
with the double-ctrl-C that it sends (intools/pyboard.py:enter_raw_repl()
) in order to get the board into a known good state for the next test. If the timing is right, this can trigger the preemptive PendSV in an unfortunate location and hard fault the board.The fix in this PR is to just remove the preemptive NLR jump feature. No other port has this feature and it's not needed, ctrl-C works very well on those ports. Preemptive NLR jump is a very dangerous thing and is obviously buggy.
See this comment here which suggests what's done in this PR: #3842 (comment)
Testing
I have been using this to trigger the fault:
That can trigger the error on PYBv1.x, but it happens more regularly on PYBD-SF2/6.
With this PR, the board no longer hard faults. (But it does show a new issue, it can still interrupt
boot.py
with a single ctrl-C, and that will be fixed in a follow-up PR.)Trade-offs and Alternatives
Could clear the CPU state for the long-running instruction as suggested in #3842. But it's much simpler to just remove this code, which can have other issues as per #294.