py/mpconfig.h: Finer-grained super-opt setting. #12644
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files (coverage diff, master vs #12644):

| Metric | master | #12644 | +/- |
|---|---|---|---|
| Coverage | 98.36% | 98.39% | +0.02% |
| Files | 159 | 158 | -1 |
| Lines | 21088 | 20945 | -143 |
| Hits | 20743 | 20608 | -135 |
| Misses | 345 | 337 | -8 |
Good idea digging into these! I was curious what's going on here so I got the output from
Extra optimizations from `MICROPY_ENABLE_SUPEROPT`
map.c
gc.c
Optimizations only performed without
qstr.c
One optimization seems to be slightly different with
Without:
The summary seems to be: more aggressively inlining, and in I was a little skeptical that loop unswitching would be much help on microcontrollers... However although
FWIW extra inlining in webassembly may be counterproductive, so maybe this doesn't matter that much...?
Enabling this on rp2, firmware for RPI_PICO is +1032 and performance change is:
That's really good! (Note that rp2 has never used the SUPEROPT option because it's cmake.)
@jimmo I think it would be simpler to configure this at the C level only (i.e. not in the makefile). There's no real need to configure it at the makefile level. Doing it at the C level in
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
It's unused outside of mpz.c. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This will be replaced with a function attribute approach, configured via mpconfig. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
It's useful to provide a way for a port/board to customise an individual function, but no point cluttering up mpconfig.h. These wrap macros are now defined in terms of standardised levels (O3+ram, O3+mayberam, O3, maybeO3) which are defined in mpconfig.h. This is what most ports/boards should configure instead. Currently only levels 1 and 2 are used, and the various functions have been assigned levels to match the way esp32 currently overrides them. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This replaces the previous CSUPEROPT used for gc.o, however because it's applied just to the required functions it leads to a code size saving. Defaults to level 3 (i.e. apply `-O3`, but don't place in RAM). Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This currently uses level 4 (not enabled by default), so should be a no-op change, but now a board with spare flash can opt-into it. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Provides a default implementation of a macro that will enable `-O3`, and enable this by default on level 1, 2, and 3. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Instead of applying to individual functions, configure the levels instead. This should be a no-op change -- the wrap functions map to level 1 and 2 in the same way as the current rules. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
bare-arm, minimal, and stm32-on-CM0 used to disable `CSUPEROPT`. This re-instates this behavior by disabling `MICROPY_APPLY_COMPILER_OPTIMISATIONS` instead. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
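A rough sketch of how the level scheme described in these commits could be wired up. Every `EXAMPLE_*` name below is hypothetical and only mirrors the idea (levels 1 to 3 get `-O3` by default, level 4 is opt-in, and one switch disables everything). The actual macros in `mpconfig.h` may look different, and RAM placement for levels 1 and 2 is port-specific, so it is left out here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; these are not the macros from this PR. */

/* Master switch, similar in spirit to MICROPY_APPLY_COMPILER_OPTIMISATIONS. */
#ifndef EXAMPLE_APPLY_COMPILER_OPTIMISATIONS
#define EXAMPLE_APPLY_COMPILER_OPTIMISATIONS (1)
#endif

/* GCC accepts optimize("O3") per function. clang does not support it,
 * so the macro expands to nothing there. */
#if EXAMPLE_APPLY_COMPILER_OPTIMISATIONS && defined(__GNUC__) && !defined(__clang__)
#define EXAMPLE_ATTR_O3 __attribute__((optimize("O3")))
#else
#define EXAMPLE_ATTR_O3
#endif

/* Levels 1-3 get -O3 by default. A real port would also add RAM placement
 * to levels 1 and 2 (assumption: omitted because it is port-specific). */
#define EXAMPLE_PERF_LEVEL_1 EXAMPLE_ATTR_O3
#define EXAMPLE_PERF_LEVEL_2 EXAMPLE_ATTR_O3
#define EXAMPLE_PERF_LEVEL_3 EXAMPLE_ATTR_O3

/* Level 4 ("maybe O3") is a no-op unless a board opts in. */
#ifndef EXAMPLE_PERF_LEVEL_4
#define EXAMPLE_PERF_LEVEL_4
#endif

/* A hot helper tagged at level 3: optimised, but not placed in RAM. */
EXAMPLE_PERF_LEVEL_3
size_t example_count_marked(const uint8_t *table, size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (table[i] != 0) {
            n++;
        }
    }
    return n;
}

/* A less critical helper tagged at level 4: default flags unless enabled. */
EXAMPLE_PERF_LEVEL_4
uint32_t example_sum(const uint8_t *data, size_t len) {
    uint32_t total = 0;
    for (size_t i = 0; i < len; i++) {
        total += data[i];
    }
    return total;
}
```

In this sketch a board with spare flash could define `EXAMPLE_PERF_LEVEL_4` as `EXAMPLE_ATTR_O3` before this point, and defining `EXAMPLE_APPLY_COMPILER_OPTIMISATIONS` as `(0)` turns every level into a no-op.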
I have re-worked this from scratch.
However, the results are still counter-intuitive (as suggested in the first comment in this PR). As before, adding The best results I get on pybv11 are when
More investigation required...
The next step here would be to apply a similar idea to functions that need to be attributed for correctness (i.e. code that must be in RAM). ESP8266 does this in a port-specific way with
Another open question: if you apply the optimise attribute to a function, does that "propagate" to a called function (in particular, one that is static inline)?
I suspect that caches play a big role for any MCU that has them, especially if they are in front of XIP flash. The miss cost is quite high in that case. We have a few annotations in CP designed to fill up 32k ITCM on the iMX RT to make room in the cache for "maybe" used code. My long term plan is to make the compiler and linker smarter using performance information. Maintaining per-function annotations will be more time consuming than relying on profile-guided optimization (PGO).
This is an automated heads-up that we've just merged a Pull Request See #13763 A search suggests this PR might apply the STATIC macro to some C code. If it Although this is an automated message, feel free to @-reply to me directly if
Split out from #10758 (comment)
This allows `-O3` to be enabled on a per-function basis, rather than whole-file as it is currently implemented with `CSUPEROPT`. This avoids having to waste code size on functions that don't need `-O3`.
Interestingly I found that adding `-O2` or `-O3` to `mp_execute_bytecode` both made it smaller and slower (i.e. the exact opposite of expected) on PYBV11, so `mp_execute_bytecode` does not have the attribute. On x86/x64 or other architectures though this is likely to not be the case.
This PR on PYBV11 (+300 bytes).
Also clang doesn't support the optimise attribute. I don't have a good workaround to suggest for e.g. Unix or webassembly.
So there's a bit of tuning and other things to consider. Maybe a good compromise for now is just to add to map.c and qstr.c and leave the old CSUPEROPT in place. But either way I think this shows that there is some performance to be gained here for very little code size.
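To make the per-function approach concrete, here is a minimal sketch assuming GCC's `optimize` function attribute. `EXAMPLE_HOT`, `example_hash` and `example_is_empty` are hypothetical names, not code from this PR or from qstr.c. On clang the macro expands to nothing, so the function simply keeps the file's flags, which matches the lack of a workaround noted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro, not from the PR. clang lacks the optimize attribute,
 * so it is only applied when building with real GCC. */
#if defined(__GNUC__) && !defined(__clang__)
#define EXAMPLE_HOT __attribute__((optimize("O3")))
#else
#define EXAMPLE_HOT
#endif

/* Hot path: under GCC this one function is compiled at -O3 even when the
 * rest of the file is built with -Os. */
EXAMPLE_HOT
uint32_t example_hash(const uint8_t *data, size_t len) {
    uint32_t hash = 5381;
    for (size_t i = 0; i < len; i++) {
        hash = (hash * 33u) ^ data[i];
    }
    return hash;
}

/* Cold path: no attribute, so it keeps the file's default flags and does
 * not pay the -O3 code-size cost. */
int example_is_empty(const uint8_t *data, size_t len) {
    (void)data;
    return len == 0;
}
```

Compared with a whole-file flag, only `example_hash` is affected here, which is where the code-size saving would come from. Whether a given function actually gets faster still has to be measured per target, as the `mp_execute_bytecode` result shows.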
For reference, this is what adding `-O2` to the whole PYBV11 build does (at a cost of +60kiB):
`-O3` isn't much different (but much more code size). This shows an upper bound on how much can be gained by selectively optimising functions.
This work was funded through GitHub Sponsors.