py/mpconfig.h: Finer-grained super-opt setting. #12644

jimmo · 2023-10-10T02:55:42Z

This allows -O3 to be enabled on a per-function basis, rather than whole-file as it is currently implemented with CSUPEROPT. This avoids having to waste code size on functions that don't need -O3.

Interestingly I found that adding -O2 or -O3 to mp_execute_bytecode both made it smaller and slower (i.e. the exact opposite of expected) on PYBV11, so mp_execute_bytecode does not have the attribute. On x86/x64 or other architectures though this is likely to not be the case.

This PR on PYBV11 (+300 bytes).

bm_chaos.py                    357.96 ->     365.55 :      +7.59 =  +2.120% (+/-0.01%)
bm_fannkuch.py                  74.78 ->      74.70 :      -0.08 =  -0.107% (+/-0.01%)
bm_fft.py                     2336.14 ->    2400.24 :     +64.10 =  +2.744% (+/-0.00%)
bm_float.py                   5681.08 ->    5834.14 :    +153.06 =  +2.694% (+/-0.02%)
bm_hexiom.py                    45.28 ->      47.96 :      +2.68 =  +5.919% (+/-0.00%)
bm_nqueens.py                 4227.99 ->    4305.10 :     +77.11 =  +1.824% (+/-0.00%)
bm_pidigits.py                 644.35 ->     646.20 :      +1.85 =  +0.287% (+/-0.42%)
core_import_mpy_multi.py       421.17 ->     508.94 :     +87.77 = +20.840% (+/-0.01%)
core_import_mpy_single.py       63.15 ->      78.39 :     +15.24 = +24.133% (+/-0.01%)
core_qstr.py                    64.10 ->      94.50 :     +30.40 = +47.426% (+/-0.00%)
core_yield_from.py             353.04 ->     347.02 :      -6.02 =  -1.705% (+/-0.01%)
misc_aes.py                    406.02 ->     414.08 :      +8.06 =  +1.985% (+/-0.01%)
misc_mandel.py                3048.78 ->    3222.61 :    +173.83 =  +5.702% (+/-0.01%)
misc_pystone.py               2419.68 ->    2568.70 :    +149.02 =  +6.159% (+/-0.00%)
misc_raytrace.py               375.09 ->     380.06 :      +4.97 =  +1.325% (+/-0.01%)

Also clang doesn't support the optimise attribute. I don't have a good workaround to suggest for eg Unix, webassembly.

So there's a bit of tuning and other things to consider. Maybe a good compromise for now is just to add to map.c and qstr.c and leave the old CSUPEROPT in place. But either way I think this shows that there is some performance to be gained here for very little code size.

For reference, this is what adding -O2 to the whole PYBV11 build does (at a cost of +60kiB):

bm_chaos.py                    357.96 ->     402.29 :     +44.33 = +12.384% (+/-0.00%)
bm_fannkuch.py                  74.78 ->      80.75 :      +5.97 =  +7.983% (+/-0.01%)
bm_fft.py                     2336.14 ->    2502.37 :    +166.23 =  +7.116% (+/-0.00%)
bm_float.py                   5681.08 ->    6366.30 :    +685.22 = +12.061% (+/-0.02%)
bm_hexiom.py                    45.28 ->      50.52 :      +5.24 = +11.572% (+/-0.00%)
bm_nqueens.py                 4227.99 ->    4548.46 :    +320.47 =  +7.580% (+/-0.00%)
bm_pidigits.py                 644.35 ->     751.15 :    +106.80 = +16.575% (+/-0.44%)
core_import_mpy_multi.py       421.17 ->     529.33 :    +108.16 = +25.681% (+/-0.00%)
core_import_mpy_single.py       63.15 ->      81.54 :     +18.39 = +29.121% (+/-0.01%)
core_qstr.py                    64.10 ->      94.82 :     +30.72 = +47.925% (+/-0.02%)
core_yield_from.py             353.04 ->     367.50 :     +14.46 =  +4.096% (+/-0.01%)
misc_aes.py                    406.02 ->     442.25 :     +36.23 =  +8.923% (+/-0.01%)
misc_mandel.py                3048.78 ->    3308.63 :    +259.85 =  +8.523% (+/-0.01%)
misc_pystone.py               2419.68 ->    2674.53 :    +254.85 = +10.532% (+/-0.00%)
misc_raytrace.py               375.09 ->     409.72 :     +34.63 =  +9.232% (+/-0.01%)

-O3 isn't much different (but much more code size). This shows an upper bound on how much can be gained by selectively optimising functions.

This work was funded through GitHub Sponsors.

github-actions · 2023-10-10T03:04:19Z

Code size report:

   bare-arm:  +392 +0.690% 
minimal x86: +1976 +1.056% 
   unix x64:  -344 -0.043% standard[incl -16(data)]
      stm32:  +112 +0.028% PYBV10
     mimxrt:   -80 -0.022% TEENSY40
        rp2:  +856 +0.258% RPI_PICO
       samd:   +12 +0.005% ADAFRUIT_ITSYBITSY_M4_EXPRESS

codecov · 2023-10-10T03:11:56Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (307ecc5) 98.36% compared to head (f008756) 98.39%.

❗ Current head f008756 differs from pull request most recent head 62fbbe0. Consider uploading reports for the commit 62fbbe0 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #12644      +/-   ##
==========================================
+ Coverage   98.36%   98.39%   +0.02%     
==========================================
  Files         159      158       -1     
  Lines       21088    20945     -143     
==========================================
- Hits        20743    20608     -135     
+ Misses        345      337       -8

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

projectgus · 2023-10-17T05:29:47Z

Good idea digging into these! I was curious what's going on here so I got the output from make V=1 and compiled the relevant source files with -fopt-info, so gcc will tell us what optimizations it has applied...

Extra optimizations from `MICROPY_ENABLE_SUPEROPT`

map.c

../../py/map.c:133:24: optimized:  Inlining get_hash_alloc_greater_or_equal_to/627 into mp_map_rehash/632.
../../py/map.c:194:13: optimized: unswitching loop 2 on 'if' with condition: compare_only_ptrs_100 != 0
../../py/map.c:255:5: optimized: unswitching outer loop 1 on 'if' with condition: compare_only_ptrs_100 != 0

gc.c

../../py/gc.c:1156:5: optimized:  Inlining gc_free/644 into gc_realloc/646.
../../py/gc.c:1018:9: optimized:  Inlining gc_free/644 into gc_realloc/646.

Optimizations only performed without MICROPY_ENABLE_SUPEROPT

../../py/gc.c:595:21: optimized:  Inlining gc_get_ptr/638 into gc_collect_root/639.
optimized:  Inlined gc_free.part.0/670 into gc_free/644 which now has time 118.446599 and size 46, net change of +0.

qstr.c

../../py/qstr.c:184:23: optimized:  Inlining qstr_compute_hash/629 into qstr_find_strn/636.

One optimization seems to be slightly different with MICROPY_ENABLE_SUPEROPT. With:

optimized:  Inlined qstr_from_strn.part.0/663 into qstr_from_strn/638 which now has time 52.401703 and size 81, net change of +0.

Without:

../../py/qstr.c:260:13: optimized:  Inlined qstr_add/635 into qstr_from_strn.part.0/662 which now has time 89.713901 and size 114, net change of +0.

Click for even more output

❯ arm-none-eabi-gcc -DMICROPY_VFS_FAT=1 -DMICROPY_VFS_LFS2=1 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DLFS2_NO_MALLOC -DLFS2_NO_DEBUG -DLFS2_NO_WARN -DLFS2_NO_ERROR -DLFS2_NO_ASSERT -I. -I../.. -Ibuild-PYBV11 -I../../lib/cmsis/inc -I../../lib/stm32lib/CMSIS/STM32F4xx/Include -I../../lib/stm32lib/STM32F4xx_HAL_Driver/Inc -Iusbdev/core/inc -Iusbdev/class/inc -Ilwip_inc -Wall -Wpointer-arith -Werror -Wdouble-promotion -Wfloat-conversion -std=gnu99 -nostdlib  -DSTM32F405xx -DUSE_FULL_LL_DRIVER -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mtune=cortex-m4 -mcpu=cortex-m4 -Os -DNDEBUG -Iboards/PYBV11 -DSTM32_HAL_H='<stm32f4xx_hal.h>' -DMBOOT_VTOR=0x08000000 -DMICROPY_HW_VTOR=0x08000000 -DMICROPY_FLOAT_IMPL=MICROPY_FLOAT_IMPL_FLOAT -fsingle-precision-constant -fdata-sections -ffunction-sections -g   -DMICROPY_ROM_TEXT_COMPRESSION=1 -DMICROPY_ENABLE_SUPEROPT=1 -DMICROPY_QSTR_EXTRA_POOL=mp_qstr_frozen_const_pool -DMICROPY_MODULE_FROZEN_MPY -DMICROPY_MODULE_FROZEN_STR -c -MD -o build-PYBV11/py/map.o ../../py/map.c -fopt-info
../../py/map.c:307:26: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:266:22: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:246:9: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:225:14: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:177:20: optimized:  Inlining mp_obj_is_obj/601 into mp_map_lookup/633.
../../py/map.c:175:13: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:133:24: optimized:  Inlining get_hash_alloc_greater_or_equal_to/627 into mp_map_rehash/632.
../../py/map.c:424:13: optimized:  Inlined mp_set_slot_is_filled.isra/655 into mp_set_remove_first/637 which now has time 124.756833 and size 33, net change of +0.
../../py/map.c:338:18: optimized:  Inlined get_hash_alloc_greater_or_equal_to/627 into mp_set_rehash/635 which now has time 233.412141 and size 39, net change of +0.
../../py/map.c:92:20: optimized: sinking common stores to map_6(D)->table
../../py/map.c:194:13: optimized: unswitching loop 2 on 'if' with condition: compare_only_ptrs_100 != 0
../../py/map.c:255:5: optimized: unswitching outer loop 1 on 'if' with condition: compare_only_ptrs_100 != 0
../../py/map.c:432:33: optimized: sinking common stores to *_30

❯ arm-none-eabi-gcc -DMICROPY_VFS_FAT=1 -DMICROPY_VFS_LFS2=1 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DLFS2_NO_MALLOC -DLFS2_NO_DEBUG -DLFS2_NO_WARN -DLFS2_NO_ERROR -DLFS2_NO_ASSERT -I. -I../.. -Ibuild-PYBV11 -I../../lib/cmsis/inc -I../../lib/stm32lib/CMSIS/STM32F4xx/Include -I../../lib/stm32lib/STM32F4xx_HAL_Driver/Inc -Iusbdev/core/inc -Iusbdev/class/inc -Ilwip_inc -Wall -Wpointer-arith -Werror -Wdouble-promotion -Wfloat-conversion -std=gnu99 -nostdlib  -DSTM32F405xx -DUSE_FULL_LL_DRIVER -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mtune=cortex-m4 -mcpu=cortex-m4 -Os -DNDEBUG -Iboards/PYBV11 -DSTM32_HAL_H='<stm32f4xx_hal.h>' -DMBOOT_VTOR=0x08000000 -DMICROPY_HW_VTOR=0x08000000 -DMICROPY_FLOAT_IMPL=MICROPY_FLOAT_IMPL_FLOAT -fsingle-precision-constant -fdata-sections -ffunction-sections -g   -DMICROPY_ROM_TEXT_COMPRESSION=1 -DMICROPY_ENABLE_SUPEROPT=0 -DMICROPY_QSTR_EXTRA_POOL=mp_qstr_frozen_const_pool -DMICROPY_MODULE_FROZEN_MPY -DMICROPY_MODULE_FROZEN_STR -c -MD -o build-PYBV11/py/map.o ../../py/map.c -fopt-info
../../py/map.c:307:26: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:266:22: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:246:9: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:225:14: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:177:20: optimized:  Inlining mp_obj_is_obj/601 into mp_map_lookup/633.
../../py/map.c:175:13: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:424:13: optimized:  Inlined mp_set_slot_is_filled.isra/654 into mp_set_remove_first/637 which now has time 124.756833 and size 33, net change of +0.
../../py/map.c:92:20: optimized: sinking common stores to map_6(D)->table
../../py/map.c:432:33: optimized: sinking common stores to *_30

❯ arm-none-eabi-gcc -DMICROPY_VFS_FAT=1 -DMICROPY_VFS_LFS2=1 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DLFS2_NO_MALLOC -DLFS2_NO_DEBUG -DLFS2_NO_WARN -DLFS2_NO_ERROR -DLFS2_NO_ASSERT -I. -I../.. -Ibuild-PYBV11 -I../../lib/cmsis/inc -I../../lib/stm32lib/CMSIS/STM32F4xx/Include -I../../lib/stm32lib/STM32F4xx_HAL_Driver/Inc -Iusbdev/core/inc -Iusbdev/class/inc -Ilwip_inc -Wall -Wpointer-arith -Werror -Wdouble-promotion -Wfloat-conversion -std=gnu99 -nostdlib  -DSTM32F405xx -DUSE_FULL_LL_DRIVER -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mtune=cortex-m4 -mcpu=cortex-m4 -Os -DNDEBUG -Iboards/PYBV11 -DSTM32_HAL_H='<stm32f4xx_hal.h>' -DMBOOT_VTOR=0x08000000 -DMICROPY_HW_VTOR=0x08000000 -DMICROPY_FLOAT_IMPL=MICROPY_FLOAT_IMPL_FLOAT -fsingle-precision-constant -fdata-sections -ffunction-sections -g   -DMICROPY_ROM_TEXT_COMPRESSION=1 -DMICROPY_ENABLE_SUPEROPT=1 -DMICROPY_QSTR_EXTRA_POOL=mp_qstr_frozen_const_pool -DMICROPY_MODULE_FROZEN_MPY -DMICROPY_MODULE_FROZEN_STR -c -MD -o build-PYBV11/py/gc.o ../../py/gc.c -fopt-info
../../py/gc.c:1156:5: optimized:  Inlining gc_free/644 into gc_realloc/646.
../../py/gc.c:1018:9: optimized:  Inlining gc_free/644 into gc_realloc/646.
../../py/gc.c:194:5: optimized:  Inlined gc_setup_area.constprop/673 into gc_init/630 which now has time 37.500000 and size 30, net change of +0.
../../py/gc.c:620:5: optimized:  Inlined gc_deal_with_stack_overflow/635 into gc_collect_end/640 which now has time 1243.779463 and size 31, net change of +0.

❯ arm-none-eabi-gcc -DMICROPY_VFS_FAT=1 -DMICROPY_VFS_LFS2=1 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DLFS2_NO_MALLOC -DLFS2_NO_DEBUG -DLFS2_NO_WARN -DLFS2_NO_ERROR -DLFS2_NO_ASSERT -I. -I../.. -Ibuild-PYBV11 -I../../lib/cmsis/inc -I../../lib/stm32lib/CMSIS/STM32F4xx/Include -I../../lib/stm32lib/STM32F4xx_HAL_Driver/Inc -Iusbdev/core/inc -Iusbdev/class/inc -Ilwip_inc -Wall -Wpointer-arith -Werror -Wdouble-promotion -Wfloat-conversion -std=gnu99 -nostdlib  -DSTM32F405xx -DUSE_FULL_LL_DRIVER -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mtune=cortex-m4 -mcpu=cortex-m4 -Os -DNDEBUG -Iboards/PYBV11 -DSTM32_HAL_H='<stm32f4xx_hal.h>' -DMBOOT_VTOR=0x08000000 -DMICROPY_HW_VTOR=0x08000000 -DMICROPY_FLOAT_IMPL=MICROPY_FLOAT_IMPL_FLOAT -fsingle-precision-constant -fdata-sections -ffunction-sections -g   -DMICROPY_ROM_TEXT_COMPRESSION=1 -DMICROPY_ENABLE_SUPEROPT=0 -DMICROPY_QSTR_EXTRA_POOL=mp_qstr_frozen_const_pool -DMICROPY_MODULE_FROZEN_MPY -DMICROPY_MODULE_FROZEN_STR -c -MD -o build-PYBV11/py/gc.o ../../py/gc.c -fopt-info
../../py/gc.c:595:21: optimized:  Inlining gc_get_ptr/638 into gc_collect_root/639.
../../py/gc.c:194:5: optimized:  Inlined gc_setup_area.constprop/671 into gc_init/630 which now has time 37.500000 and size 30, net change of +0.
../../py/gc.c:620:5: optimized:  Inlined gc_deal_with_stack_overflow/635 into gc_collect_end/640 which now has time 1243.779463 and size 31, net change of +0.
optimized:  Inlined gc_free.part.0/670 into gc_free/644 which now has time 118.446599 and size 46, net change of +0.

❯ arm-none-eabi-gcc -DMICROPY_VFS_FAT=1 -DMICROPY_VFS_LFS2=1 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DLFS2_NO_MALLOC -DLFS2_NO_DEBUG -DLFS2_NO_WARN -DLFS2_NO_ERROR -DLFS2_NO_ASSERT -I. -I../.. -Ibuild-PYBV11 -I../../lib/cmsis/inc -I../../lib/stm32lib/CMSIS/STM32F4xx/Include -I../../lib/stm32lib/STM32F4xx_HAL_Driver/Inc -Iusbdev/core/inc -Iusbdev/class/inc -Ilwip_inc -Wall -Wpointer-arith -Werror -Wdouble-promotion -Wfloat-conversion -std=gnu99 -nostdlib  -DSTM32F405xx -DUSE_FULL_LL_DRIVER -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mtune=cortex-m4 -mcpu=cortex-m4 -Os -DNDEBUG -Iboards/PYBV11 -DSTM32_HAL_H='<stm32f4xx_hal.h>' -DMBOOT_VTOR=0x08000000 -DMICROPY_HW_VTOR=0x08000000 -DMICROPY_FLOAT_IMPL=MICROPY_FLOAT_IMPL_FLOAT -fsingle-precision-constant -fdata-sections -ffunction-sections -g   -DMICROPY_ROM_TEXT_COMPRESSION=1 -DMICROPY_ENABLE_SUPEROPT=1 -DMICROPY_QSTR_EXTRA_POOL=mp_qstr_frozen_const_pool -DMICROPY_MODULE_FROZEN_MPY -DMICROPY_MODULE_FROZEN_STR -c -MD -o build-PYBV11/py/qstr.o ../../py/qstr.c -fopt-info
../../py/qstr.c:184:23: optimized:  Inlining qstr_compute_hash/629 into qstr_find_strn/636.
../../py/qstr.c:213:49: optimized:   Inlining MP_COMPRESSED_ROM_TEXT/600 into qstr_from_strn/638 (always_inline).
../../py/qstr.c:374:32: optimized:  Inlined find_uncompressed_string/646 into mp_decompress_rom_string/647 which now has time 722.557576 and size 42, net change of +0.
optimized:  Inlined qstr_from_strn.part.0/663 into qstr_from_strn/638 which now has time 52.401703 and size 81, net change of +0.

❯ arm-none-eabi-gcc -DMICROPY_VFS_FAT=1 -DMICROPY_VFS_LFS2=1 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DLFS2_NO_MALLOC -DLFS2_NO_DEBUG -DLFS2_NO_WARN -DLFS2_NO_ERROR -DLFS2_NO_ASSERT -I. -I../.. -Ibuild-PYBV11 -I../../lib/cmsis/inc -I../../lib/stm32lib/CMSIS/STM32F4xx/Include -I../../lib/stm32lib/STM32F4xx_HAL_Driver/Inc -Iusbdev/core/inc -Iusbdev/class/inc -Ilwip_inc -Wall -Wpointer-arith -Werror -Wdouble-promotion -Wfloat-conversion -std=gnu99 -nostdlib  -DSTM32F405xx -DUSE_FULL_LL_DRIVER -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mtune=cortex-m4 -mcpu=cortex-m4 -Os -DNDEBUG -Iboards/PYBV11 -DSTM32_HAL_H='<stm32f4xx_hal.h>' -DMBOOT_VTOR=0x08000000 -DMICROPY_HW_VTOR=0x08000000 -DMICROPY_FLOAT_IMPL=MICROPY_FLOAT_IMPL_FLOAT -fsingle-precision-constant -fdata-sections -ffunction-sections -g   -DMICROPY_ROM_TEXT_COMPRESSION=1 -DMICROPY_ENABLE_SUPEROPT=0 -DMICROPY_QSTR_EXTRA_POOL=mp_qstr_frozen_const_pool -DMICROPY_MODULE_FROZEN_MPY -DMICROPY_MODULE_FROZEN_STR -c -MD -o build-PYBV11/py/qstr.o ../../py/qstr.c -fopt-info
../../py/qstr.c:213:49: optimized:   Inlining MP_COMPRESSED_ROM_TEXT/600 into qstr_from_strn/638 (always_inline).
../../py/qstr.c:374:32: optimized:  Inlined find_uncompressed_string/646 into mp_decompress_rom_string/647 which now has time 722.557576 and size 42, net change of +0.
../../py/qstr.c:260:13: optimized:  Inlined qstr_add/635 into qstr_from_strn.part.0/662 which now has time 89.713901 and size 114, net change of +0.

The summary seems to be: more aggressively inlining, and in map.c it applied Loop Unswitching in two places.

I was a little skeptical that loop unswitching would be much help on microcontrollers... However although __attribute__((optimize("O3", "no-unswitch-loops"))) shrunk code size by 24 bytes, it also made some perf tests slower (on PYBV11, caches disabled. Some tests got faster, so this may all be noise, but overall slightly slower.)

Also clang doesn't support the optimise attribute. I don't have a good workaround to suggest for eg Unix, webassembly.

FWIW extra inlining in webassembly may be counter productive, so maybe this doesn't matter that much...?

dpgeorge · 2024-01-16T22:12:59Z

Enabling this on rp2, firmware for RPI_PICO is +1032 and performance change is:

diff of scores (higher is better)
N=100 M=100                     perf0 ->      perf1         diff      diff% (error%)
bm_chaos.py                    212.07 ->     215.91 :      +3.84 =  +1.811% (+/-0.06%)
bm_fannkuch.py                  55.56 ->      56.62 :      +1.06 =  +1.908% (+/-0.01%)
bm_fft.py                     1490.14 ->    1539.10 :     +48.96 =  +3.286% (+/-0.01%)
bm_float.py                   2420.81 ->    2364.14 :     -56.67 =  -2.341% (+/-0.07%)
bm_hexiom.py                    33.42 ->      35.81 :      +2.39 =  +7.151% (+/-0.06%)
bm_nqueens.py                 2543.84 ->    2744.60 :    +200.76 =  +7.892% (+/-0.05%)
bm_pidigits.py                 401.16 ->     405.13 :      +3.97 =  +0.990% (+/-0.02%)
bm_wordcount.py                 37.15 ->      41.79 :      +4.64 = +12.490% (+/-0.03%)
core_import_mpy_multi.py       331.67 ->     347.03 :     +15.36 =  +4.631% (+/-0.08%)
core_import_mpy_single.py       64.74 ->      67.44 :      +2.70 =  +4.171% (+/-0.16%)
core_locals.py                  28.14 ->      30.66 :      +2.52 =  +8.955% (+/-0.00%)
core_qstr.py                   137.27 ->     134.84 :      -2.43 =  -1.770% (+/-0.05%)
core_str.py                     19.02 ->      20.24 :      +1.22 =  +6.414% (+/-0.02%)
core_yield_from.py             215.23 ->     222.67 :      +7.44 =  +3.457% (+/-0.01%)
misc_aes.py                    301.82 ->     316.70 :     +14.88 =  +4.930% (+/-0.02%)
misc_mandel.py                1684.51 ->    1672.65 :     -11.86 =  -0.704% (+/-0.03%)
misc_pystone.py               1399.46 ->    1515.15 :    +115.69 =  +8.267% (+/-0.05%)
misc_raytrace.py               236.69 ->     241.34 :      +4.65 =  +1.965% (+/-0.05%)
viper_call0.py                 331.10 ->     331.12 :      +0.02 =  +0.006% (+/-0.00%)
viper_call1a.py                323.40 ->     323.41 :      +0.01 =  +0.003% (+/-0.00%)
viper_call1b.py                241.04 ->     241.05 :      +0.01 =  +0.004% (+/-0.00%)
viper_call1c.py                242.91 ->     242.92 :      +0.01 =  +0.004% (+/-0.00%)
viper_call2a.py                318.45 ->     318.47 :      +0.02 =  +0.006% (+/-0.00%)
viper_call2b.py                211.29 ->     211.30 :      +0.01 =  +0.005% (+/-0.00%)

That's really good! (Note that rp2 has never used the SUPEROPT option because it's cmake.)

dpgeorge · 2024-01-16T22:20:54Z

@jimmo I think it would be simpler to configure this at the C level only (ie not in the makefile). There's no real need to configure it at the makefile level. Doing it at the C level in py/mpconfig.h means cmake-based ports will work the same. It also means we can easily add further fine-grained options like MICROPY_ENABLE_SUPEROPT_MPZ to optimize certain parts of the runtime when the port has space, or even MICROPY_ENABLE_SUPEROPT_LEVEL_2 to have a set of levels that can be turned up to get more performance.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

It's unused outside of mpz.c. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

This will be replaced with a function attribute approach, configured via mpconfig. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

It's useful to provide a way for a port/board to customise an individual function, but no point cluttering up mpconfig.h. These wrap macros are now defined in terms of stastandardised levels (O3+ram, O3+mayberam, O3, maybeO3) which are defined in mpconfig.h. This is what most ports/boards should configure instead. Currently only level 1 and 2 are used, and the various functions have been assigned levels to match the way esp32 currently overrides them. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

This replaces the previous CSUPEROPT used for gc.o, however because it's applied just to the required functions it leads to a code size saving. Defaults to level 3 (i.e. apply `-O3`, but don't place in RAM). Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

This currently uses level 4 (not enabled by default), so should be a no-op change, but now a board with spare flash can opt-into it. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

Provides a default implementation of a macro that will enable `-O3`, and enable this by default on level 1, 2, and 3. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

Instead of applying to individual functions, configure the levels instead. This should be a no-op change -- the wrap functions map to level 1 and 2 in the same way as the current rules. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

bare-arm, minimal, and stm32-on-CM0 used to disable `CSUPEROPT`. This re-instates this behavior by disabling `MICROPY_APPLY_COMPILER_OPTIMISATIONS` instead. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

jimmo · 2024-01-25T04:42:56Z

I have re-worked this from scratch.

"hot" functions can be annotated with a per-function MICROPY_WRAP_FOO (as before, except now gc_alloc, gc_free, and gc_realloc are included, as well as the whole mpz public API, with the assumption that the mpz internal methods are inline/static and so will inherit this)
The default definition of these MICROPY_WRAP_FOO macros moves to place-of-use (i.e. immediately above the function), and they're defined in terms of four optimisation levels. This avoids cluttering mpconfig.h with lots of these.
Ideally ports/boards don't configure the per-function macros, instead they configure the four levels.
These four levels have default definitions in mpconfig.h.
CSUPEROPT is replaced with the gcc-attribute based approach, and applied to the first three levels by default.
Additionally, ESP32 adds the IRAM_ATTR to level 1 (and level 2 depending on which family).

cc @dhalbert @tannewt this may be RTYI.

jimmo · 2024-01-25T04:45:34Z

However, the results are still counter-intuitive (as suggested in the first comment in this PR). As before, adding -O3 (or -O2) seems to be actually harmful.

The best results I get on pybv11 are when MICROPY_APPLY_COMPILER_OPTIMISATIONS is disabled:

$ ./run-perfbench.py -s ~/mpy/perf/superopt/baseline-pybv11 ~/mpy/perf/superopt/noopt-pybv11 
diff of scores (higher is better)
N=100 M=100                /home/jimmo/mpy/perf/superopt/baseline-pybv11 -> /home/jimmo/mpy/perf/superopt/noopt-pybv11         diff      diff% (error%)
bm_chaos.py                    352.34 ->     361.75 :      +9.41 =  +2.671% (+/-0.00%)
bm_fannkuch.py                  75.08 ->      76.59 :      +1.51 =  +2.011% (+/-0.01%)
bm_fft.py                     2346.26 ->    2374.90 :     +28.64 =  +1.221% (+/-0.00%)
bm_float.py                   5724.38 ->    5970.45 :    +246.07 =  +4.299% (+/-0.03%)
bm_hexiom.py                    46.65 ->      47.22 :      +0.57 =  +1.222% (+/-0.00%)
bm_nqueens.py                 4224.78 ->    4364.56 :    +139.78 =  +3.309% (+/-0.00%)
bm_pidigits.py                 648.53 ->     642.07 :      -6.46 =  -0.996% (+/-0.27%)
bm_wordcount.py                 47.61 ->      47.39 :      -0.22 =  -0.462% (+/-0.01%)
core_import_mpy_multi.py       604.12 ->     611.12 :      +7.00 =  +1.159% (+/-0.00%)
core_import_mpy_single.py       99.28 ->     100.58 :      +1.30 =  +1.309% (+/-0.01%)
core_locals.py                  40.87 ->      41.50 :      +0.63 =  +1.541% (+/-0.01%)
core_qstr.py                   205.10 ->     206.82 :      +1.72 =  +0.839% (+/-0.00%)
core_str.py                     27.20 ->      27.65 :      +0.45 =  +1.654% (+/-0.00%)
core_yield_from.py             354.68 ->     346.56 :      -8.12 =  -2.289% (+/-0.00%)
misc_aes.py                    406.64 ->     410.91 :      +4.27 =  +1.050% (+/-0.01%)
misc_mandel.py                3018.46 ->    3139.44 :    +120.98 =  +4.008% (+/-0.00%)
misc_pystone.py               2355.08 ->    2375.38 :     +20.30 =  +0.862% (+/-0.00%)
misc_raytrace.py               368.25 ->     379.26 :     +11.01 =  +2.990% (+/-0.01%)

More investigation required...

jimmo · 2024-01-29T23:21:45Z

The next step here would be to apply a similar idea to functions that need to be attributed for correctness (i.e. code that must be in RAM). ESP8266 does this in a port-specific way with MP_FASTCODE (and applies it to the MICROPY_WRAP_MP_SCHED_* wrappers). (I think this is more along the lines of what @tannewt and @dhalbert were interested in).

jimmo · 2024-01-29T23:23:01Z

Another open question is, if you apply the optimise attribute to a function, does that "propagate" to e.g. a called function (in particular, if that function is static inline).

tannewt · 2024-01-30T17:44:47Z

More investigation required...

I suspect that caches play a big role for any MCU that has them. Especially if they are in front of XIP flash. The miss cost is quite high in that case.

We have a few annotations in CP designed to fill up 32k ITCM on the iMX RT to make room in the cache for "maybe" used code. My long term plan is to make the compiler and linker smarter using performance information. Maintaining per-function annotations will be more time consuming than relying on performance guided optimization (PGO).

projectgus · 2024-03-07T23:49:25Z

This is an automated heads-up that we've just merged a Pull Request
that removes the STATIC macro from MicroPython's C API.

See #13763

A search suggests this PR might apply the STATIC macro to some C code. If it
does, then next time you rebase the PR (or merge from master) then you should
please replace all the STATIC keywords with static.

Although this is an automated message, feel free to @-reply to me directly if
you have any questions about this.

dpgeorge added the py-core Relates to py/ directory in source label Oct 11, 2023

projectgus mentioned this pull request Oct 17, 2023

py/py.mk: Build some core C files as single compilation units. #12714

Closed

jimmo added 9 commits January 25, 2024 14:01

py/mpz: Use consistent comment style. Remove unused code.

fec85d8

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

py/mpz: Remove mpz_set from public API.

ec5b80b

It's unused outside of mpz.c. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

py/py.mk: Remove support for CSUPEROPT.

26418c0

This will be replaced with a function attribute approach, configured via mpconfig. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

py/mpz: Add MICROPY_WRAP macros for public mpz API.

d93cbae

This currently uses level 4 (not enabled by default), so should be a no-op change, but now a board with spare flash can opt-into it. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

py/mpconfig.h: Allow per-function super optimisations.

f3cc364

Provides a default implementation of a macro that will enable `-O3`, and enable this by default on level 1, 2, and 3. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

{bare-arm,minimal,stm32}/mpconfigport.h: Disable super-optimisations.

62fbbe0

bare-arm, minimal, and stm32-on-CM0 used to disable `CSUPEROPT`. This re-instates this behavior by disabling `MICROPY_APPLY_COMPILER_OPTIMISATIONS` instead. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

jimmo force-pushed the superopt-per-function branch from f008756 to 62fbbe0 Compare January 25, 2024 04:36

jimmo changed the title ~~py/mkrules.mk: Finer-grained super-opt setting.~~ py/mpconfig.h: Finer-grained super-opt setting. Jan 25, 2024

Gadgetoid mentioned this pull request Feb 29, 2024

global: Remove the STATIC macro. #13763

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

py/mpconfig.h: Finer-grained super-opt setting. #12644

py/mpconfig.h: Finer-grained super-opt setting. #12644

jimmo commented Oct 10, 2023

github-actions bot commented Oct 10, 2023 •

edited

Loading

codecov bot commented Oct 10, 2023 •

edited

Loading

projectgus commented Oct 17, 2023 •

edited

Loading

map.c

gc.c

qstr.c

dpgeorge commented Jan 16, 2024

dpgeorge commented Jan 16, 2024

jimmo commented Jan 25, 2024

jimmo commented Jan 25, 2024

jimmo commented Jan 29, 2024

jimmo commented Jan 29, 2024

tannewt commented Jan 30, 2024

projectgus commented Mar 7, 2024

py/mpconfig.h: Finer-grained super-opt setting. #12644

Are you sure you want to change the base?

py/mpconfig.h: Finer-grained super-opt setting. #12644

Conversation

jimmo commented Oct 10, 2023

github-actions bot commented Oct 10, 2023 • edited Loading

codecov bot commented Oct 10, 2023 • edited Loading

Codecov Report

projectgus commented Oct 17, 2023 • edited Loading

map.c

gc.c

qstr.c

dpgeorge commented Jan 16, 2024

dpgeorge commented Jan 16, 2024

jimmo commented Jan 25, 2024

jimmo commented Jan 25, 2024

jimmo commented Jan 29, 2024

jimmo commented Jan 29, 2024

tannewt commented Jan 30, 2024

projectgus commented Mar 7, 2024

github-actions bot commented Oct 10, 2023 •

edited

Loading

codecov bot commented Oct 10, 2023 •

edited

Loading

projectgus commented Oct 17, 2023 •

edited

Loading