py/nlrthumb: Add returns_twice attribute to nlr_push #3889

glennrub · 2018-06-22T22:04:29Z

After the change of using mp_load_method_protected in
7241d90 it looks like the code
change triggered an issue with nlr_push when using -flto. How
related this specific commit is to the issue is unclear. At least
it is the commit that has been tracked down to be the first which
made the issue visible.

The reason for the behaviour is unknown and is only triggered
when using LTO. The strange thing about the issue is that it is
not consistent accross build targets. The code have been tested
on both cortex-m0 (nrf51822) and cortex-m4 (nrf52832/nrf52840),
and it only fails on cortex-m4.

Adding the attribute "returns_twice" on the declaration of
nlr_push seems to resolve the issue seen on cortex-m4.

Toolchain used, arm-none-eabi-gcc 7.2.1.

Also, a thanks to @aykevl for proposing to try this attribute.

@aykevl

After the change of using mp_load_method_protected in 7241d90 it looks like the code change triggered an issue with nlr_push when using -flto. How related this specific commit is to the issue is unclear. At least it is the commit that has been tracked down to be the first which made the issue visible. The reason for the behaviour is unknown and is only triggered when using LTO. The strange thing about the issue is that it is not consistent accross build targets. The code have been tested on both cortex-m0 (nrf51822) and cortex-m4 (nrf52832/nrf52840), and it only fails on cortex-m4. Adding the attribute "returns_twice" on the declaration of nlr_push seems to resolve the issue seen on cortex-m4. Toolchain used, arm-none-eabi-gcc 7.2.1. Also, a thanks to @aykevl for proposing to try this attribute.

aykevl · 2018-06-23T00:44:27Z

py/nlrthumb.c

@@ -36,7 +36,7 @@
 // For reference, arm/thumb callee save regs are:
 //      r4-r11, r13=sp

-__attribute__((naked)) unsigned int nlr_push(nlr_buf_t *nlr) {
+__attribute__((naked)) __attribute__((returns_twice)) unsigned int nlr_push(nlr_buf_t *nlr) {


Nitpick: you can merge these attributes with __attribute__((naked,returns_twice)).

aykevl · 2018-06-23T00:53:44Z

For some background as to why this is needed, see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49243
Basically, it appears to avoids inlining or interprocedural optimizations which will produce incorrect code when the inlined function returns twice (something that is very uncommon and unexpected by the optimizers). I think we only got lucky so far because most ports don't use LTO.

Also see #3526, which may be the same bug (miscompilation with LTO in gcc 7 on Cortex-M4).

dpgeorge · 2018-06-23T13:14:05Z

Thanks for raising this issue. I did see kind-of-similar behaviour on SPARC where gcc would treat the built-in setjmp specially and if one renamed setjmp to something else (but with an equivalent implementation) then that would break uPy. Dissasembly there showed the issue and the fix was to simply save and restore more registers in the custom setjmp/longjmp handler.

I think it's important to properly understand what is going on here with the Cortex-M4 and LTO. What is the difference in the disassembly with and without using the returns_twice attribute?

Also, wouldn't it make more sense to declare this attribute in nlr.h, so all the callers of nlr_push can see that it it'll return twice and so forbid the relevant optimisations?

glennrub · 2018-06-24T16:51:52Z

Thanks for the feedback and comments. I did an attempt on declaring the attribute in nlr.h and that gave me also a failing result.

This give me 3 compilation scenarios:

working - attribute defined in nlrthumb.c
failing - old/current code
failing - attribute defined in nlr.h

For reference the code used is an upmerge of tralamazza/master at 187cb09 with micropython/master at 34344a4

As a note, the nrl_push, nlr_pop and nlr_push_tail are equal among the 3 different builds with one exception, that the nlr.h declared attribute one got an entry of mp_call_function_1_protected in between nlr_pop and nlr_push_tail making the address to nlr_push_tail having a different address (mp_call_function_1_protected is not explicitly seen in the current/old code and the one with this PR, only in the compilation with the attribute in nlr.h). Besides from this one offset, in one compilation, there are no difference on these functions between the builds.
Below is the disassembly of these functions:

00001b30 <nlr_push>:
    1b30:       60c4            str     r4, [r0, #12]
    1b32:       6105            str     r5, [r0, #16]
    1b34:       6146            str     r6, [r0, #20]
    1b36:       6187            str     r7, [r0, #24]
    1b38:       f8c0 801c       str.w   r8, [r0, #28]
    1b3c:       f8c0 9020       str.w   r9, [r0, #32]
    1b40:       f8c0 a024       str.w   sl, [r0, #36]   ; 0x24
    1b44:       f8c0 b028       str.w   fp, [r0, #40]   ; 0x28
    1b48:       f8c0 d02c       str.w   sp, [r0, #44]   ; 0x2c
    1b4c:       f8c0 e008       str.w   lr, [r0, #8]
    1b50:       f000 b808       b.w     1b64 <nlr_push_tail>

00001b54 <nlr_pop>:
    1b54:       4b02            ldr     r3, [pc, #8]    ; (1b60 <nlr_pop+0xc>)
    1b56:       691a            ldr     r2, [r3, #16]
    1b58:       6812            ldr     r2, [r2, #0]
    1b5a:       611a            str     r2, [r3, #16]
    1b5c:       4770            bx      lr
    1b5e:       bf00            nop
    1b60:       00d4            lsls    r4, r2, #3
    1b62:       2000            movs    r0, #0

00001b64 <nlr_push_tail>:
    1b64:       4b02            ldr     r3, [pc, #8]    ; (1b70 <nlr_push_tail+0xc>)
    1b66:       691a            ldr     r2, [r3, #16]
    1b68:       6002            str     r2, [r0, #0]
    1b6a:       6118            str     r0, [r3, #16]
    1b6c:       2000            movs    r0, #0
    1b6e:       4770            bx      lr
    1b70:       00d4            lsls    r4, r2, #3
    1b72:       2000            movs    r0, #0

Next, an user of these functions. I selected the mp_load_method_protected as this is where the issue was discovered. However, i also see other changes in the binary, like gc_collect_end is different between the 3 builds.

Working - attribute in nlrthumb.c:

00005ec4 <mp_load_method_protected>:
    5ec4:       b510            push    {r4, lr}
    5ec6:       b090            sub     sp, #64 ; 0x40
    5ec8:       e9cd 0100       strd    r0, r1, [sp]
    5ecc:       a804            add     r0, sp, #16
    5ece:       e9cd 2302       strd    r2, r3, [sp, #8]
    5ed2:       f7fb fe2d       bl      1b30 <nlr_push>
    5ed6:       b940            cbnz    r0, 5eea <mp_load_method_protected+0x26>
    5ed8:       e9dd 1201       ldrd    r1, r2, [sp, #4]
    5edc:       9800            ldr     r0, [sp, #0]
    5ede:       f7ff fd03       bl      58e8 <mp_load_method_maybe>
    5ee2:       f7fb fe37       bl      1b54 <nlr_pop>
    5ee6:       b010            add     sp, #64 ; 0x40
    5ee8:       bd10            pop     {r4, pc}
    5eea:       9b03            ldr     r3, [sp, #12]
    5eec:       2b00            cmp     r3, #0
    5eee:       d1fa            bne.n   5ee6 <mp_load_method_protected+0x22>
    5ef0:       9c05            ldr     r4, [sp, #20]
    5ef2:       4904            ldr     r1, [pc, #16]   ; (5f04 <mp_load_method_protected+0x40>)
    5ef4:       6820            ldr     r0, [r4, #0]
    5ef6:       f00b ff28       bl      11d4a <mp_obj_is_subclass_fast>
    5efa:       2800            cmp     r0, #0
    5efc:       d1f3            bne.n   5ee6 <mp_load_method_protected+0x22>
    5efe:       4620            mov     r0, r4
    5f00:       f7fc f9b7       bl      2272 <nlr_jump>
    5f04:       c6a4            stmia   r6!, {r2, r5, r7}
    5f06:       0001            movs    r1, r0

Failing - old/current code:

00005ebc <mp_load_method_protected>:
    5ebc:       b510            push    {r4, lr}
    5ebe:       b08c            sub     sp, #48 ; 0x30
    5ec0:       4604            mov     r4, r0
    5ec2:       4668            mov     r0, sp
    5ec4:       f7fb fe34       bl      1b30 <nlr_push>
    5ec8:       b930            cbnz    r0, 5ed8 <mp_load_method_protected+0x1c>
    5eca:       4620            mov     r0, r4
    5ecc:       f7ff fd08       bl      58e0 <mp_load_method_maybe>
    5ed0:       f7fb fe40       bl      1b54 <nlr_pop>
    5ed4:       b00c            add     sp, #48 ; 0x30
    5ed6:       bd10            pop     {r4, pc}
    5ed8:       2b00            cmp     r3, #0
    5eda:       d1fb            bne.n   5ed4 <mp_load_method_protected+0x18>
    5edc:       9c01            ldr     r4, [sp, #4]
    5ede:       4904            ldr     r1, [pc, #16]   ; (5ef0 <mp_load_method_protected+0x34>)
    5ee0:       6820            ldr     r0, [r4, #0]
    5ee2:       f00b ff0c       bl      11cfe <mp_obj_is_subclass_fast>
    5ee6:       2800            cmp     r0, #0
    5ee8:       d1f4            bne.n   5ed4 <mp_load_method_protected+0x18>
    5eea:       4620            mov     r0, r4
    5eec:       f7fc f9c1       bl      2272 <nlr_jump>
    5ef0:       c6e4            stmia   r6!, {r2, r5, r6, r7}
    5ef2:       0001            movs    r1, r0

Failing - attribute in nlr.h:

00005efc <mp_load_method_protected>:
    5efc:       b510            push    {r4, lr}
    5efe:       b08c            sub     sp, #48 ; 0x30
    5f00:       4604            mov     r4, r0
    5f02:       4668            mov     r0, sp
    5f04:       f7fb fe14       bl      1b30 <nlr_push>
    5f08:       b930            cbnz    r0, 5f18 <mp_load_method_protected+0x1c>
    5f0a:       4620            mov     r0, r4
    5f0c:       f7ff fcfc       bl      5908 <mp_load_method_maybe>
    5f10:       f7fb fe20       bl      1b54 <nlr_pop>
    5f14:       b00c            add     sp, #48 ; 0x30
    5f16:       bd10            pop     {r4, pc}
    5f18:       2b00            cmp     r3, #0
    5f1a:       d1fb            bne.n   5f14 <mp_load_method_protected+0x18>
    5f1c:       9c01            ldr     r4, [sp, #4]
    5f1e:       4904            ldr     r1, [pc, #16]   ; (5f30 <mp_load_method_protected+0x34>)
    5f20:       6820            ldr     r0, [r4, #0]
    5f22:       f00b ff14       bl      11d4e <mp_obj_is_subclass_fast>
    5f26:       2800            cmp     r0, #0
    5f28:       d1f4            bne.n   5f14 <mp_load_method_protected+0x18>
    5f2a:       4620            mov     r0, r4
    5f2c:       f7fc f9b5       bl      229a <nlr_jump>
    5f30:       c684            stmia   r6!, {r2, r7}
    5f32:       0001            movs    r1, r0

glennrub · 2018-06-24T17:26:18Z

@aykevl, you might be right that its the same as reported in #3526. I tested the patch proposed in that reference and it seems to solve the issue without the patch in this PR.

@dpgeorge Should we close this PR and go for the "clobbered register" patch instead?

For comparison, i'll add the disassembly of the mp_load_method_protected using the "clobbered register" proposal:

00005ec4 <mp_load_method_protected>:
    5ec4:       b5f0            push    {r4, r5, r6, r7, lr}
    5ec6:       b08d            sub     sp, #52 ; 0x34
    5ec8:       4605            mov     r5, r0
    5eca:       4668            mov     r0, sp
    5ecc:       460e            mov     r6, r1
    5ece:       4617            mov     r7, r2
    5ed0:       461c            mov     r4, r3
    5ed2:       f7fb fe2d       bl      1b30 <nlr_push>
    5ed6:       b940            cbnz    r0, 5eea <mp_load_method_protected+0x26>
    5ed8:       463a            mov     r2, r7
    5eda:       4631            mov     r1, r6
    5edc:       4628            mov     r0, r5
    5ede:       f7ff fd03       bl      58e8 <mp_load_method_maybe>
    5ee2:       f7fb fe37       bl      1b54 <nlr_pop>
    5ee6:       b00d            add     sp, #52 ; 0x34
    5ee8:       bdf0            pop     {r4, r5, r6, r7, pc}
    5eea:       2c00            cmp     r4, #0
    5eec:       d1fb            bne.n   5ee6 <mp_load_method_protected+0x22>
    5eee:       9c01            ldr     r4, [sp, #4]
    5ef0:       4904            ldr     r1, [pc, #16]   ; (5f04 <mp_load_method_protected+0x40>)
    5ef2:       6820            ldr     r0, [r4, #0]
    5ef4:       f00b ff0f       bl      11d16 <mp_obj_is_subclass_fast>
    5ef8:       2800            cmp     r0, #0
    5efa:       d1f4            bne.n   5ee6 <mp_load_method_protected+0x22>
    5efc:       4620            mov     r0, r4
    5efe:       f7fc f9b8       bl      2272 <nlr_jump>
    5f02:       bf00            nop
    5f04:       c6fc            stmia   r6!, {r2, r3, r4, r5, r6, r7}
    5f06:       0001            movs    r1, r0

The nlr_push, nlr_pop and nlr_push_tail did not change from what is posted earlier.

aykevl · 2018-06-24T23:10:52Z

Maybe we're on the wrong track. Maybe the simplest option is to simply disable LTO for this file only (nlrthumb.c) as it has given a lot of problems, similar to how gc.c uses -O3. It could be something else that happened to fix this bug, like a missing clobbered register or a GCC bug. I'm certainly a bit lost after I learned that putting it on the declaration results in yet another version of the output while breaking it again.

Nonetheless I think the attribute should be added because it is intended for this purpose (double returns, setjmp like functions).

dpgeorge · 2018-06-25T07:28:52Z

Thanks @glennrub for the disassembly. It's obvious what the problem is: the failing mp_load_method_protected() doesn't save the caller-save registers r0-r3 when it calls nlr_push(). It does save them with the returns-twice attribute in nlrthumb.c, and with the clobbered register approach.

So the question is, why does it not save these registers in the other cases? I guess because LTO is being too optimistic and assuming that nlr_push() does not touch these caller-save registers.

Should we close this PR and go for the "clobbered register" patch instead?

IIRC, from previous discussions it was concluded that a clobbered register list cannot appear here because it's a naked function.

Maybe the simplest option is to simply disable LTO for this file only

If you can do that I think that's the best option at this point.

Nonetheless I think the attribute should be added because it is intended for this purpose (double returns, setjmp like functions).

If we do that it should be added to the header nlr.h, and probably only for gcc builds. Also it would be good to see that this actually did something rather than nothing.

dhalbert · 2018-06-25T12:55:50Z

IIRC, from previous discussions it was concluded that a clobbered register list cannot appear here because it's a naked function.

The gcc documentation says that, but it actually works anyway, and we've been using that fix as I mentioned in #3526 in CircuitPython for several months: adafruit#500.

Maybe the simplest option is to simply disable LTO for this file only

I'm not sure that would work, because LTO is a multiple file thing, but it's worth trying. (Maybe all the files have to be LTO or not).

stinos · 2018-06-25T13:56:14Z

Not particularily easy: https://stackoverflow.com/questions/49018738/gcc-lto-limit-scope-of-optimization

aykevl · 2018-06-25T15:04:56Z

In my experience LTO and non-LTO object files can simply be linked together without problems.

glennrub · 2018-07-02T17:01:30Z

Looks like adding -fno-lto on this line also does the trick:
https://github.com/micropython/micropython/blob/master/py/py.mk#L302

Some size comparisons on nrf52832:
144868 -Os (current non-working)
144916 -Os -fno-lto
144812 __attribute__(returns_twice) in nlrthumb.c
144900 clobbered registers in nlrthumb.c

Could adding -fno-lto in py.mk be an acceptable solution?

For completeness, this is the generated assembly if -fno-lto is provided as the only change:

00004f08 <mp_load_method_protected>:
    4f08:       b5f0            push    {r4, r5, r6, r7, lr}
    4f0a:       b08d            sub     sp, #52 ; 0x34
    4f0c:       4605            mov     r5, r0
    4f0e:       4668            mov     r0, sp
    4f10:       460e            mov     r6, r1
    4f12:       4617            mov     r7, r2
    4f14:       461c            mov     r4, r3
    4f16:       f015 fe85       bl      1ac24 <nlr_push>
    4f1a:       b940            cbnz    r0, 4f2e <mp_load_method_protected+0x26>
    4f1c:       463a            mov     r2, r7
    4f1e:       4631            mov     r1, r6
    4f20:       4628            mov     r0, r5
    4f22:       f7ff ff53       bl      4dcc <mp_load_method_maybe>
    4f26:       f7fd f885       bl      2034 <nlr_pop>
    4f2a:       b00d            add     sp, #52 ; 0x34
    4f2c:       bdf0            pop     {r4, r5, r6, r7, pc}
    4f2e:       2c00            cmp     r4, #0
    4f30:       d1fb            bne.n   4f2a <mp_load_method_protected+0x22>
    4f32:       9c01            ldr     r4, [sp, #4]
    4f34:       4904            ldr     r1, [pc, #16]   ; (4f48 <mp_load_method_protected+0x40>)
    4f36:       6820            ldr     r0, [r4, #0]
    4f38:       f00c fec1       bl      11cbe <mp_obj_is_subclass_fast>
    4f3c:       2800            cmp     r0, #0
    4f3e:       d1f4            bne.n   4f2a <mp_load_method_protected+0x22>
    4f40:       4620            mov     r0, r4
    4f42:       f015 fe81       bl      1ac48 <nlr_jump>
    4f46:       bf00            nop
    4f48:       c708            stmia   r7!, {r3}
    4f4a:       0001            movs    r1, r0

source: https://gcc.gnu.org/wiki/LinkTimeOptimizationFAQ
As a workaround you may use -fno-lto when compiling the files that contain the asm statements.

aykevl · 2018-07-02T18:10:52Z

Yes, that's exactly what I was thinking about.

dpgeorge · 2018-07-03T07:42:12Z

Could adding -fno-lto in py.mk be an acceptable solution?

I think this is the right solution but should be done on a per-port basis. Mainly because -fno-lto may not be accepted by non-gcc compilers.

You should be able to add this to the port's Makefile:

$(PY_BUILD)/nlrthumb.o: CFLAGS += -fno-lto

Does that work?

glennrub · 2018-07-03T16:58:54Z

Yes, that worked. I guess keeping it in the port specific Makefile is the most neat way of handling this.
I will close this PR as this will be handled by the nrf-port specifically.

Thanks to all for the help and guidance.

dpgeorge · 2018-07-04T02:38:28Z

I guess keeping it in the port specific Makefile is the most neat way of handling this.
I will close this PR as this will be handled by the nrf-port specifically.

Ok, sounds good. We should keep in mind this issue for future use of LTO (it's not yet used by stm32, but maybe one day...).

aykevl reviewed Jun 23, 2018

View reviewed changes

glennrub closed this Jul 3, 2018

glennrub mentioned this pull request Jul 3, 2018

nrf: Compile nlr objects with -fno-lto flag tralamazza/micropython#212

Merged

dhalbert mentioned this pull request Dec 21, 2022

Use returns_twice attribute to preserve regs in nlrthumb nlr_push() adafruit/circuitpython#7368

Merged

Uh oh!

py/nlrthumb: Add returns_twice attribute to nlr_push #3889

py/nlrthumb: Add returns_twice attribute to nlr_push #3889

Uh oh!

Conversation

glennrub commented Jun 22, 2018

Uh oh!

aykevl Jun 23, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aykevl commented Jun 23, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dpgeorge commented Jun 23, 2018

Uh oh!

glennrub commented Jun 24, 2018

Uh oh!

glennrub commented Jun 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aykevl commented Jun 24, 2018

Uh oh!

dpgeorge commented Jun 25, 2018

Uh oh!

dhalbert commented Jun 25, 2018

Uh oh!

stinos commented Jun 25, 2018

Uh oh!

aykevl commented Jun 25, 2018

Uh oh!

glennrub commented Jul 2, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aykevl commented Jul 2, 2018

Uh oh!

dpgeorge commented Jul 3, 2018

Uh oh!

glennrub commented Jul 3, 2018

Uh oh!

dpgeorge commented Jul 4, 2018

Uh oh!

Uh oh!

aykevl Jun 23, 2018 •

edited

Loading

aykevl commented Jun 23, 2018 •

edited

Loading

glennrub commented Jun 24, 2018 •

edited

Loading

glennrub commented Jul 2, 2018 •

edited

Loading