Skip to content

py/mpconfig.h: Finer-grained super-opt setting. #12644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

jimmo
Copy link
Member

@jimmo jimmo commented Oct 10, 2023

Split out from #10758 (comment)

This allows -O3 to be enabled on a per-function basis, rather than whole-file as it is currently implemented with CSUPEROPT. This avoids having to waste code size on functions that don't need -O3.

Interestingly I found that adding -O2 or -O3 to mp_execute_bytecode both made it smaller and slower (i.e. the exact opposite of expected) on PYBV11, so mp_execute_bytecode does not have the attribute. On x86/x64 or other architectures though this is likely to not be the case.

This PR on PYBV11 (+300 bytes).

bm_chaos.py                    357.96 ->     365.55 :      +7.59 =  +2.120% (+/-0.01%)
bm_fannkuch.py                  74.78 ->      74.70 :      -0.08 =  -0.107% (+/-0.01%)
bm_fft.py                     2336.14 ->    2400.24 :     +64.10 =  +2.744% (+/-0.00%)
bm_float.py                   5681.08 ->    5834.14 :    +153.06 =  +2.694% (+/-0.02%)
bm_hexiom.py                    45.28 ->      47.96 :      +2.68 =  +5.919% (+/-0.00%)
bm_nqueens.py                 4227.99 ->    4305.10 :     +77.11 =  +1.824% (+/-0.00%)
bm_pidigits.py                 644.35 ->     646.20 :      +1.85 =  +0.287% (+/-0.42%)
core_import_mpy_multi.py       421.17 ->     508.94 :     +87.77 = +20.840% (+/-0.01%)
core_import_mpy_single.py       63.15 ->      78.39 :     +15.24 = +24.133% (+/-0.01%)
core_qstr.py                    64.10 ->      94.50 :     +30.40 = +47.426% (+/-0.00%)
core_yield_from.py             353.04 ->     347.02 :      -6.02 =  -1.705% (+/-0.01%)
misc_aes.py                    406.02 ->     414.08 :      +8.06 =  +1.985% (+/-0.01%)
misc_mandel.py                3048.78 ->    3222.61 :    +173.83 =  +5.702% (+/-0.01%)
misc_pystone.py               2419.68 ->    2568.70 :    +149.02 =  +6.159% (+/-0.00%)
misc_raytrace.py               375.09 ->     380.06 :      +4.97 =  +1.325% (+/-0.01%)

Also clang doesn't support the optimise attribute. I don't have a good workaround to suggest for eg Unix, webassembly.

So there's a bit of tuning and other things to consider. Maybe a good compromise for now is just to add to map.c and qstr.c and leave the old CSUPEROPT in place. But either way I think this shows that there is some performance to be gained here for very little code size.

For reference, this is what adding -O2 to the whole PYBV11 build does (at a cost of +60kiB):

bm_chaos.py                    357.96 ->     402.29 :     +44.33 = +12.384% (+/-0.00%)
bm_fannkuch.py                  74.78 ->      80.75 :      +5.97 =  +7.983% (+/-0.01%)
bm_fft.py                     2336.14 ->    2502.37 :    +166.23 =  +7.116% (+/-0.00%)
bm_float.py                   5681.08 ->    6366.30 :    +685.22 = +12.061% (+/-0.02%)
bm_hexiom.py                    45.28 ->      50.52 :      +5.24 = +11.572% (+/-0.00%)
bm_nqueens.py                 4227.99 ->    4548.46 :    +320.47 =  +7.580% (+/-0.00%)
bm_pidigits.py                 644.35 ->     751.15 :    +106.80 = +16.575% (+/-0.44%)
core_import_mpy_multi.py       421.17 ->     529.33 :    +108.16 = +25.681% (+/-0.00%)
core_import_mpy_single.py       63.15 ->      81.54 :     +18.39 = +29.121% (+/-0.01%)
core_qstr.py                    64.10 ->      94.82 :     +30.72 = +47.925% (+/-0.02%)
core_yield_from.py             353.04 ->     367.50 :     +14.46 =  +4.096% (+/-0.01%)
misc_aes.py                    406.02 ->     442.25 :     +36.23 =  +8.923% (+/-0.01%)
misc_mandel.py                3048.78 ->    3308.63 :    +259.85 =  +8.523% (+/-0.01%)
misc_pystone.py               2419.68 ->    2674.53 :    +254.85 = +10.532% (+/-0.00%)
misc_raytrace.py               375.09 ->     409.72 :     +34.63 =  +9.232% (+/-0.01%)

-O3 isn't much different (but much more code size). This shows an upper bound on how much can be gained by selectively optimising functions.

This work was funded through GitHub Sponsors.

@github-actions
Copy link

github-actions bot commented Oct 10, 2023

Code size report:

   bare-arm:  +392 +0.690% 
minimal x86: +1976 +1.056% 
   unix x64:  -344 -0.043% standard[incl -16(data)]
      stm32:  +112 +0.028% PYBV10
     mimxrt:   -80 -0.022% TEENSY40
        rp2:  +856 +0.258% RPI_PICO
       samd:   +12 +0.005% ADAFRUIT_ITSYBITSY_M4_EXPRESS

@codecov
Copy link

codecov bot commented Oct 10, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (307ecc5) 98.36% compared to head (f008756) 98.39%.

❗ Current head f008756 differs from pull request most recent head 62fbbe0. Consider uploading reports for the commit 62fbbe0 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12644      +/-   ##
==========================================
+ Coverage   98.36%   98.39%   +0.02%     
==========================================
  Files         159      158       -1     
  Lines       21088    20945     -143     
==========================================
- Hits        20743    20608     -135     
+ Misses        345      337       -8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label Oct 11, 2023
@projectgus
Copy link
Contributor

projectgus commented Oct 17, 2023

Good idea digging into these! I was curious what's going on here so I got the output from make V=1 and compiled the relevant source files with -fopt-info, so gcc will tell us what optimizations it has applied...

Extra optimizations from `MICROPY_ENABLE_SUPEROPT`

map.c

../../py/map.c:133:24: optimized:  Inlining get_hash_alloc_greater_or_equal_to/627 into mp_map_rehash/632.
../../py/map.c:194:13: optimized: unswitching loop 2 on 'if' with condition: compare_only_ptrs_100 != 0
../../py/map.c:255:5: optimized: unswitching outer loop 1 on 'if' with condition: compare_only_ptrs_100 != 0

gc.c

../../py/gc.c:1156:5: optimized:  Inlining gc_free/644 into gc_realloc/646.
../../py/gc.c:1018:9: optimized:  Inlining gc_free/644 into gc_realloc/646.

Optimizations only performed without MICROPY_ENABLE_SUPEROPT

../../py/gc.c:595:21: optimized:  Inlining gc_get_ptr/638 into gc_collect_root/639.
optimized:  Inlined gc_free.part.0/670 into gc_free/644 which now has time 118.446599 and size 46, net change of +0.

qstr.c

../../py/qstr.c:184:23: optimized:  Inlining qstr_compute_hash/629 into qstr_find_strn/636.

One optimization seems to be slightly different with MICROPY_ENABLE_SUPEROPT. With:

optimized:  Inlined qstr_from_strn.part.0/663 into qstr_from_strn/638 which now has time 52.401703 and size 81, net change of +0.

Without:

../../py/qstr.c:260:13: optimized:  Inlined qstr_add/635 into qstr_from_strn.part.0/662 which now has time 89.713901 and size 114, net change of +0.
Click for even more output
❯ arm-none-eabi-gcc -DMICROPY_VFS_FAT=1 -DMICROPY_VFS_LFS2=1 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DLFS2_NO_MALLOC -DLFS2_NO_DEBUG -DLFS2_NO_WARN -DLFS2_NO_ERROR -DLFS2_NO_ASSERT -I. -I../.. -Ibuild-PYBV11 -I../../lib/cmsis/inc -I../../lib/stm32lib/CMSIS/STM32F4xx/Include -I../../lib/stm32lib/STM32F4xx_HAL_Driver/Inc -Iusbdev/core/inc -Iusbdev/class/inc -Ilwip_inc -Wall -Wpointer-arith -Werror -Wdouble-promotion -Wfloat-conversion -std=gnu99 -nostdlib  -DSTM32F405xx -DUSE_FULL_LL_DRIVER -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mtune=cortex-m4 -mcpu=cortex-m4 -Os -DNDEBUG -Iboards/PYBV11 -DSTM32_HAL_H='<stm32f4xx_hal.h>' -DMBOOT_VTOR=0x08000000 -DMICROPY_HW_VTOR=0x08000000 -DMICROPY_FLOAT_IMPL=MICROPY_FLOAT_IMPL_FLOAT -fsingle-precision-constant -fdata-sections -ffunction-sections -g   -DMICROPY_ROM_TEXT_COMPRESSION=1 -DMICROPY_ENABLE_SUPEROPT=1 -DMICROPY_QSTR_EXTRA_POOL=mp_qstr_frozen_const_pool -DMICROPY_MODULE_FROZEN_MPY -DMICROPY_MODULE_FROZEN_STR -c -MD -o build-PYBV11/py/map.o ../../py/map.c -fopt-info
../../py/map.c:307:26: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:266:22: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:246:9: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:225:14: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:177:20: optimized:  Inlining mp_obj_is_obj/601 into mp_map_lookup/633.
../../py/map.c:175:13: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:133:24: optimized:  Inlining get_hash_alloc_greater_or_equal_to/627 into mp_map_rehash/632.
../../py/map.c:424:13: optimized:  Inlined mp_set_slot_is_filled.isra/655 into mp_set_remove_first/637 which now has time 124.756833 and size 33, net change of +0.
../../py/map.c:338:18: optimized:  Inlined get_hash_alloc_greater_or_equal_to/627 into mp_set_rehash/635 which now has time 233.412141 and size 39, net change of +0.
../../py/map.c:92:20: optimized: sinking common stores to map_6(D)->table
../../py/map.c:194:13: optimized: unswitching loop 2 on 'if' with condition: compare_only_ptrs_100 != 0
../../py/map.c:255:5: optimized: unswitching outer loop 1 on 'if' with condition: compare_only_ptrs_100 != 0
../../py/map.c:432:33: optimized: sinking common stores to *_30

❯ arm-none-eabi-gcc -DMICROPY_VFS_FAT=1 -DMICROPY_VFS_LFS2=1 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DLFS2_NO_MALLOC -DLFS2_NO_DEBUG -DLFS2_NO_WARN -DLFS2_NO_ERROR -DLFS2_NO_ASSERT -I. -I../.. -Ibuild-PYBV11 -I../../lib/cmsis/inc -I../../lib/stm32lib/CMSIS/STM32F4xx/Include -I../../lib/stm32lib/STM32F4xx_HAL_Driver/Inc -Iusbdev/core/inc -Iusbdev/class/inc -Ilwip_inc -Wall -Wpointer-arith -Werror -Wdouble-promotion -Wfloat-conversion -std=gnu99 -nostdlib  -DSTM32F405xx -DUSE_FULL_LL_DRIVER -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mtune=cortex-m4 -mcpu=cortex-m4 -Os -DNDEBUG -Iboards/PYBV11 -DSTM32_HAL_H='<stm32f4xx_hal.h>' -DMBOOT_VTOR=0x08000000 -DMICROPY_HW_VTOR=0x08000000 -DMICROPY_FLOAT_IMPL=MICROPY_FLOAT_IMPL_FLOAT -fsingle-precision-constant -fdata-sections -ffunction-sections -g   -DMICROPY_ROM_TEXT_COMPRESSION=1 -DMICROPY_ENABLE_SUPEROPT=0 -DMICROPY_QSTR_EXTRA_POOL=mp_qstr_frozen_const_pool -DMICROPY_MODULE_FROZEN_MPY -DMICROPY_MODULE_FROZEN_STR -c -MD -o build-PYBV11/py/map.o ../../py/map.c -fopt-info
../../py/map.c:307:26: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:266:22: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:246:9: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:225:14: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:177:20: optimized:  Inlining mp_obj_is_obj/601 into mp_map_lookup/633.
../../py/map.c:175:13: optimized:  Inlining mp_obj_is_qstr/599 into mp_map_lookup/633.
../../py/map.c:424:13: optimized:  Inlined mp_set_slot_is_filled.isra/654 into mp_set_remove_first/637 which now has time 124.756833 and size 33, net change of +0.
../../py/map.c:92:20: optimized: sinking common stores to map_6(D)->table
../../py/map.c:432:33: optimized: sinking common stores to *_30

❯ arm-none-eabi-gcc -DMICROPY_VFS_FAT=1 -DMICROPY_VFS_LFS2=1 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DLFS2_NO_MALLOC -DLFS2_NO_DEBUG -DLFS2_NO_WARN -DLFS2_NO_ERROR -DLFS2_NO_ASSERT -I. -I../.. -Ibuild-PYBV11 -I../../lib/cmsis/inc -I../../lib/stm32lib/CMSIS/STM32F4xx/Include -I../../lib/stm32lib/STM32F4xx_HAL_Driver/Inc -Iusbdev/core/inc -Iusbdev/class/inc -Ilwip_inc -Wall -Wpointer-arith -Werror -Wdouble-promotion -Wfloat-conversion -std=gnu99 -nostdlib  -DSTM32F405xx -DUSE_FULL_LL_DRIVER -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mtune=cortex-m4 -mcpu=cortex-m4 -Os -DNDEBUG -Iboards/PYBV11 -DSTM32_HAL_H='<stm32f4xx_hal.h>' -DMBOOT_VTOR=0x08000000 -DMICROPY_HW_VTOR=0x08000000 -DMICROPY_FLOAT_IMPL=MICROPY_FLOAT_IMPL_FLOAT -fsingle-precision-constant -fdata-sections -ffunction-sections -g   -DMICROPY_ROM_TEXT_COMPRESSION=1 -DMICROPY_ENABLE_SUPEROPT=1 -DMICROPY_QSTR_EXTRA_POOL=mp_qstr_frozen_const_pool -DMICROPY_MODULE_FROZEN_MPY -DMICROPY_MODULE_FROZEN_STR -c -MD -o build-PYBV11/py/gc.o ../../py/gc.c -fopt-info
../../py/gc.c:1156:5: optimized:  Inlining gc_free/644 into gc_realloc/646.
../../py/gc.c:1018:9: optimized:  Inlining gc_free/644 into gc_realloc/646.
../../py/gc.c:194:5: optimized:  Inlined gc_setup_area.constprop/673 into gc_init/630 which now has time 37.500000 and size 30, net change of +0.
../../py/gc.c:620:5: optimized:  Inlined gc_deal_with_stack_overflow/635 into gc_collect_end/640 which now has time 1243.779463 and size 31, net change of +0.

❯ arm-none-eabi-gcc -DMICROPY_VFS_FAT=1 -DMICROPY_VFS_LFS2=1 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DLFS2_NO_MALLOC -DLFS2_NO_DEBUG -DLFS2_NO_WARN -DLFS2_NO_ERROR -DLFS2_NO_ASSERT -I. -I../.. -Ibuild-PYBV11 -I../../lib/cmsis/inc -I../../lib/stm32lib/CMSIS/STM32F4xx/Include -I../../lib/stm32lib/STM32F4xx_HAL_Driver/Inc -Iusbdev/core/inc -Iusbdev/class/inc -Ilwip_inc -Wall -Wpointer-arith -Werror -Wdouble-promotion -Wfloat-conversion -std=gnu99 -nostdlib  -DSTM32F405xx -DUSE_FULL_LL_DRIVER -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mtune=cortex-m4 -mcpu=cortex-m4 -Os -DNDEBUG -Iboards/PYBV11 -DSTM32_HAL_H='<stm32f4xx_hal.h>' -DMBOOT_VTOR=0x08000000 -DMICROPY_HW_VTOR=0x08000000 -DMICROPY_FLOAT_IMPL=MICROPY_FLOAT_IMPL_FLOAT -fsingle-precision-constant -fdata-sections -ffunction-sections -g   -DMICROPY_ROM_TEXT_COMPRESSION=1 -DMICROPY_ENABLE_SUPEROPT=0 -DMICROPY_QSTR_EXTRA_POOL=mp_qstr_frozen_const_pool -DMICROPY_MODULE_FROZEN_MPY -DMICROPY_MODULE_FROZEN_STR -c -MD -o build-PYBV11/py/gc.o ../../py/gc.c -fopt-info
../../py/gc.c:595:21: optimized:  Inlining gc_get_ptr/638 into gc_collect_root/639.
../../py/gc.c:194:5: optimized:  Inlined gc_setup_area.constprop/671 into gc_init/630 which now has time 37.500000 and size 30, net change of +0.
../../py/gc.c:620:5: optimized:  Inlined gc_deal_with_stack_overflow/635 into gc_collect_end/640 which now has time 1243.779463 and size 31, net change of +0.
optimized:  Inlined gc_free.part.0/670 into gc_free/644 which now has time 118.446599 and size 46, net change of +0.

❯ arm-none-eabi-gcc -DMICROPY_VFS_FAT=1 -DMICROPY_VFS_LFS2=1 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DLFS2_NO_MALLOC -DLFS2_NO_DEBUG -DLFS2_NO_WARN -DLFS2_NO_ERROR -DLFS2_NO_ASSERT -I. -I../.. -Ibuild-PYBV11 -I../../lib/cmsis/inc -I../../lib/stm32lib/CMSIS/STM32F4xx/Include -I../../lib/stm32lib/STM32F4xx_HAL_Driver/Inc -Iusbdev/core/inc -Iusbdev/class/inc -Ilwip_inc -Wall -Wpointer-arith -Werror -Wdouble-promotion -Wfloat-conversion -std=gnu99 -nostdlib  -DSTM32F405xx -DUSE_FULL_LL_DRIVER -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mtune=cortex-m4 -mcpu=cortex-m4 -Os -DNDEBUG -Iboards/PYBV11 -DSTM32_HAL_H='<stm32f4xx_hal.h>' -DMBOOT_VTOR=0x08000000 -DMICROPY_HW_VTOR=0x08000000 -DMICROPY_FLOAT_IMPL=MICROPY_FLOAT_IMPL_FLOAT -fsingle-precision-constant -fdata-sections -ffunction-sections -g   -DMICROPY_ROM_TEXT_COMPRESSION=1 -DMICROPY_ENABLE_SUPEROPT=1 -DMICROPY_QSTR_EXTRA_POOL=mp_qstr_frozen_const_pool -DMICROPY_MODULE_FROZEN_MPY -DMICROPY_MODULE_FROZEN_STR -c -MD -o build-PYBV11/py/qstr.o ../../py/qstr.c -fopt-info
../../py/qstr.c:184:23: optimized:  Inlining qstr_compute_hash/629 into qstr_find_strn/636.
../../py/qstr.c:213:49: optimized:   Inlining MP_COMPRESSED_ROM_TEXT/600 into qstr_from_strn/638 (always_inline).
../../py/qstr.c:374:32: optimized:  Inlined find_uncompressed_string/646 into mp_decompress_rom_string/647 which now has time 722.557576 and size 42, net change of +0.
optimized:  Inlined qstr_from_strn.part.0/663 into qstr_from_strn/638 which now has time 52.401703 and size 81, net change of +0.

❯ arm-none-eabi-gcc -DMICROPY_VFS_FAT=1 -DMICROPY_VFS_LFS2=1 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DLFS2_NO_MALLOC -DLFS2_NO_DEBUG -DLFS2_NO_WARN -DLFS2_NO_ERROR -DLFS2_NO_ASSERT -I. -I../.. -Ibuild-PYBV11 -I../../lib/cmsis/inc -I../../lib/stm32lib/CMSIS/STM32F4xx/Include -I../../lib/stm32lib/STM32F4xx_HAL_Driver/Inc -Iusbdev/core/inc -Iusbdev/class/inc -Ilwip_inc -Wall -Wpointer-arith -Werror -Wdouble-promotion -Wfloat-conversion -std=gnu99 -nostdlib  -DSTM32F405xx -DUSE_FULL_LL_DRIVER -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mtune=cortex-m4 -mcpu=cortex-m4 -Os -DNDEBUG -Iboards/PYBV11 -DSTM32_HAL_H='<stm32f4xx_hal.h>' -DMBOOT_VTOR=0x08000000 -DMICROPY_HW_VTOR=0x08000000 -DMICROPY_FLOAT_IMPL=MICROPY_FLOAT_IMPL_FLOAT -fsingle-precision-constant -fdata-sections -ffunction-sections -g   -DMICROPY_ROM_TEXT_COMPRESSION=1 -DMICROPY_ENABLE_SUPEROPT=0 -DMICROPY_QSTR_EXTRA_POOL=mp_qstr_frozen_const_pool -DMICROPY_MODULE_FROZEN_MPY -DMICROPY_MODULE_FROZEN_STR -c -MD -o build-PYBV11/py/qstr.o ../../py/qstr.c -fopt-info
../../py/qstr.c:213:49: optimized:   Inlining MP_COMPRESSED_ROM_TEXT/600 into qstr_from_strn/638 (always_inline).
../../py/qstr.c:374:32: optimized:  Inlined find_uncompressed_string/646 into mp_decompress_rom_string/647 which now has time 722.557576 and size 42, net change of +0.
../../py/qstr.c:260:13: optimized:  Inlined qstr_add/635 into qstr_from_strn.part.0/662 which now has time 89.713901 and size 114, net change of +0.

The summary seems to be: more aggressively inlining, and in map.c it applied Loop Unswitching in two places.

I was a little skeptical that loop unswitching would be much help on microcontrollers... However although __attribute__((optimize("O3", "no-unswitch-loops"))) shrunk code size by 24 bytes, it also made some perf tests slower (on PYBV11, caches disabled. Some tests got faster, so this may all be noise, but overall slightly slower.)

Also clang doesn't support the optimise attribute. I don't have a good workaround to suggest for eg Unix, webassembly.

FWIW extra inlining in webassembly may be counter productive, so maybe this doesn't matter that much...?

@dpgeorge
Copy link
Member

Enabling this on rp2, firmware for RPI_PICO is +1032 and performance change is:

diff of scores (higher is better)
N=100 M=100                     perf0 ->      perf1         diff      diff% (error%)
bm_chaos.py                    212.07 ->     215.91 :      +3.84 =  +1.811% (+/-0.06%)
bm_fannkuch.py                  55.56 ->      56.62 :      +1.06 =  +1.908% (+/-0.01%)
bm_fft.py                     1490.14 ->    1539.10 :     +48.96 =  +3.286% (+/-0.01%)
bm_float.py                   2420.81 ->    2364.14 :     -56.67 =  -2.341% (+/-0.07%)
bm_hexiom.py                    33.42 ->      35.81 :      +2.39 =  +7.151% (+/-0.06%)
bm_nqueens.py                 2543.84 ->    2744.60 :    +200.76 =  +7.892% (+/-0.05%)
bm_pidigits.py                 401.16 ->     405.13 :      +3.97 =  +0.990% (+/-0.02%)
bm_wordcount.py                 37.15 ->      41.79 :      +4.64 = +12.490% (+/-0.03%)
core_import_mpy_multi.py       331.67 ->     347.03 :     +15.36 =  +4.631% (+/-0.08%)
core_import_mpy_single.py       64.74 ->      67.44 :      +2.70 =  +4.171% (+/-0.16%)
core_locals.py                  28.14 ->      30.66 :      +2.52 =  +8.955% (+/-0.00%)
core_qstr.py                   137.27 ->     134.84 :      -2.43 =  -1.770% (+/-0.05%)
core_str.py                     19.02 ->      20.24 :      +1.22 =  +6.414% (+/-0.02%)
core_yield_from.py             215.23 ->     222.67 :      +7.44 =  +3.457% (+/-0.01%)
misc_aes.py                    301.82 ->     316.70 :     +14.88 =  +4.930% (+/-0.02%)
misc_mandel.py                1684.51 ->    1672.65 :     -11.86 =  -0.704% (+/-0.03%)
misc_pystone.py               1399.46 ->    1515.15 :    +115.69 =  +8.267% (+/-0.05%)
misc_raytrace.py               236.69 ->     241.34 :      +4.65 =  +1.965% (+/-0.05%)
viper_call0.py                 331.10 ->     331.12 :      +0.02 =  +0.006% (+/-0.00%)
viper_call1a.py                323.40 ->     323.41 :      +0.01 =  +0.003% (+/-0.00%)
viper_call1b.py                241.04 ->     241.05 :      +0.01 =  +0.004% (+/-0.00%)
viper_call1c.py                242.91 ->     242.92 :      +0.01 =  +0.004% (+/-0.00%)
viper_call2a.py                318.45 ->     318.47 :      +0.02 =  +0.006% (+/-0.00%)
viper_call2b.py                211.29 ->     211.30 :      +0.01 =  +0.005% (+/-0.00%)

That's really good! (Note that rp2 has never used the SUPEROPT option because it's cmake.)

@dpgeorge
Copy link
Member

@jimmo I think it would be simpler to configure this at the C level only (ie not in the makefile). There's no real need to configure it at the makefile level. Doing it at the C level in py/mpconfig.h means cmake-based ports will work the same. It also means we can easily add further fine-grained options like MICROPY_ENABLE_SUPEROPT_MPZ to optimize certain parts of the runtime when the port has space, or even MICROPY_ENABLE_SUPEROPT_LEVEL_2 to have a set of levels that can be turned up to get more performance.

jimmo added 9 commits January 25, 2024 14:01
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
It's unused outside of mpz.c.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This will be replaced with a function attribute approach, configured via
mpconfig.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
It's useful to provide a way for a port/board to customise an individual
function, but no point cluttering up mpconfig.h.

These wrap macros are now defined in terms of stastandardised levels
(O3+ram, O3+mayberam, O3, maybeO3) which are defined in mpconfig.h. This
is what most ports/boards should configure instead.

Currently only level 1 and 2 are used, and the various functions have
been assigned levels to match the way esp32 currently overrides them.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This replaces the previous CSUPEROPT used for gc.o, however because it's
applied just to the required functions it leads to a code size saving.

Defaults to level 3 (i.e. apply `-O3`, but don't place in RAM).

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This currently uses level 4 (not enabled by default), so should be a
no-op change, but now a board with spare flash can opt-into it.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Provides a default implementation of a macro that will enable `-O3`,
and enable this by default on level 1, 2, and 3.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Instead of applying to individual functions, configure the levels
instead.

This should be a no-op change -- the wrap functions map to level 1 and 2
in the same way as the current rules.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
bare-arm, minimal, and stm32-on-CM0 used to disable `CSUPEROPT`. This
re-instates this behavior by disabling
`MICROPY_APPLY_COMPILER_OPTIMISATIONS` instead.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@jimmo jimmo force-pushed the superopt-per-function branch from f008756 to 62fbbe0 Compare January 25, 2024 04:36
@jimmo jimmo changed the title py/mkrules.mk: Finer-grained super-opt setting. py/mpconfig.h: Finer-grained super-opt setting. Jan 25, 2024
@jimmo
Copy link
Member Author

jimmo commented Jan 25, 2024

I have re-worked this from scratch.

  • "hot" functions can be annotated with a per-function MICROPY_WRAP_FOO (as before, except now gc_alloc, gc_free, and gc_realloc are included, as well as the whole mpz public API, with the assumption that the mpz internal methods are inline/static and so will inherit this)
  • The default definition of these MICROPY_WRAP_FOO macros moves to place-of-use (i.e. immediately above the function), and they're defined in terms of four optimisation levels. This avoids cluttering mpconfig.h with lots of these.
  • Ideally ports/boards don't configure the per-function macros, instead they configure the four levels.
  • These four levels have default definitions in mpconfig.h.
  • CSUPEROPT is replaced with the gcc-attribute based approach, and applied to the first three levels by default.
  • Additionally, ESP32 adds the IRAM_ATTR to level 1 (and level 2 depending on which family).

cc @dhalbert @tannewt this may be RTYI.

@jimmo
Copy link
Member Author

jimmo commented Jan 25, 2024

However, the results are still counter-intuitive (as suggested in the first comment in this PR). As before, adding -O3 (or -O2) seems to be actually harmful.

The best results I get on pybv11 are when MICROPY_APPLY_COMPILER_OPTIMISATIONS is disabled:

$ ./run-perfbench.py -s ~/mpy/perf/superopt/baseline-pybv11 ~/mpy/perf/superopt/noopt-pybv11 
diff of scores (higher is better)
N=100 M=100                /home/jimmo/mpy/perf/superopt/baseline-pybv11 -> /home/jimmo/mpy/perf/superopt/noopt-pybv11         diff      diff% (error%)
bm_chaos.py                    352.34 ->     361.75 :      +9.41 =  +2.671% (+/-0.00%)
bm_fannkuch.py                  75.08 ->      76.59 :      +1.51 =  +2.011% (+/-0.01%)
bm_fft.py                     2346.26 ->    2374.90 :     +28.64 =  +1.221% (+/-0.00%)
bm_float.py                   5724.38 ->    5970.45 :    +246.07 =  +4.299% (+/-0.03%)
bm_hexiom.py                    46.65 ->      47.22 :      +0.57 =  +1.222% (+/-0.00%)
bm_nqueens.py                 4224.78 ->    4364.56 :    +139.78 =  +3.309% (+/-0.00%)
bm_pidigits.py                 648.53 ->     642.07 :      -6.46 =  -0.996% (+/-0.27%)
bm_wordcount.py                 47.61 ->      47.39 :      -0.22 =  -0.462% (+/-0.01%)
core_import_mpy_multi.py       604.12 ->     611.12 :      +7.00 =  +1.159% (+/-0.00%)
core_import_mpy_single.py       99.28 ->     100.58 :      +1.30 =  +1.309% (+/-0.01%)
core_locals.py                  40.87 ->      41.50 :      +0.63 =  +1.541% (+/-0.01%)
core_qstr.py                   205.10 ->     206.82 :      +1.72 =  +0.839% (+/-0.00%)
core_str.py                     27.20 ->      27.65 :      +0.45 =  +1.654% (+/-0.00%)
core_yield_from.py             354.68 ->     346.56 :      -8.12 =  -2.289% (+/-0.00%)
misc_aes.py                    406.64 ->     410.91 :      +4.27 =  +1.050% (+/-0.01%)
misc_mandel.py                3018.46 ->    3139.44 :    +120.98 =  +4.008% (+/-0.00%)
misc_pystone.py               2355.08 ->    2375.38 :     +20.30 =  +0.862% (+/-0.00%)
misc_raytrace.py               368.25 ->     379.26 :     +11.01 =  +2.990% (+/-0.01%)

More investigation required...

@jimmo
Copy link
Member Author

jimmo commented Jan 29, 2024

The next step here would be to apply a similar idea to functions that need to be attributed for correctness (i.e. code that must be in RAM). ESP8266 does this in a port-specific way with MP_FASTCODE (and applies it to the MICROPY_WRAP_MP_SCHED_* wrappers). (I think this is more along the lines of what @tannewt and @dhalbert were interested in).

@jimmo
Copy link
Member Author

jimmo commented Jan 29, 2024

Another open question is, if you apply the optimise attribute to a function, does that "propagate" to e.g. a called function (in particular, if that function is static inline).

@tannewt
Copy link

tannewt commented Jan 30, 2024

More investigation required...

I suspect that caches play a big role for any MCU that has them. Especially if they are in front of XIP flash. The miss cost is quite high in that case.

We have a few annotations in CP designed to fill up 32k ITCM on the iMX RT to make room in the cache for "maybe" used code. My long term plan is to make the compiler and linker smarter using performance information. Maintaining per-function annotations will be more time consuming than relying on performance guided optimization (PGO).

@projectgus
Copy link
Contributor

This is an automated heads-up that we've just merged a Pull Request
that removes the STATIC macro from MicroPython's C API.

See #13763

A search suggests this PR might apply the STATIC macro to some C code. If it
does, then next time you rebase the PR (or merge from master) then you should
please replace all the STATIC keywords with static.

Although this is an automated message, feel free to @-reply to me directly if
you have any questions about this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
py-core Relates to py/ directory in source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants