
py/qstr: qstr performance improvements. #10758


Closed
jimmo wants to merge 6 commits

Conversation

jimmo (Member) commented Feb 15, 2023

This (WIP) PR includes four changes to qstr behavior:

  • Allow setting MICROPY_QSTR_BYTES_IN_HASH to zero, and make this the default. This gives a significant code size saving (~3.5kiB on PYBv11) due to removing two bytes per qstr. It also translates to a RAM saving for any strings interned at runtime, and it frees up a size_t field in mp_obj_str_t, which we could use in the future as the buffer for short string data (in particular for slices of bytes this might be a win).

  • Add a mechanism to speed up map lookups by using the qstr ID rather than computing a hash for the string. This is even faster than using the previous qstr hash because it saves mapping the qstr ID to its hash. This isn't entirely straightforward and has an interesting degenerate case because maps could contain both str and qstr keys -- see the commit and code comments for details, and the sketch after this list.

  • Split the built-in qstrs (formerly base+frozen) into (mpy+base+frozen) and sort the base and frozen pools. This allows a much faster lookup of qstrs via binary search, which is important for .mpy loading, .py parsing, and string operations that need to check whether a result can be de-duped. This is the same idea introduced in py: Faster qstr search #6896 by @amirgon, but it side-steps the issue of sorting-by-hash now that hashes are removed, and the implementation is simplified to be a smaller overall change.
    In the future we could also sort the mpy qstrs.

  • Enable CSUPEROPT for the sensitive parts of map.c and qstr.c (mp_map_lookup, find_qstr, qstr_add, qstr_strncmp, qstr_find_strn, qstr_from_str, qstr_from_strn, qstr_hash, qstr_len, qstr_str, qstr_data). This is worth about 3% (and more on some tests) for a cost of 536 bytes. (Note: this only applies to the Make-based ports.) I'm not sure this is worth it. A performance diff for PYBv11 is included at the end.
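
A minimal sketch of the map-lookup idea from the second bullet above. The types, field names and probing scheme here are invented for illustration and are not the PR's actual py/map.c code:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t qstr_id_t;

typedef struct {
    qstr_id_t key;   // 0 means "empty slot" in this toy example
    void *value;
} entry_t;

typedef struct {
    size_t alloc;            // number of slots, a power of two, never completely full
    int all_keys_are_qstrs;  // cleared as soon as a non-interned str key is stored
    entry_t *table;
} map_t;

void *map_lookup_by_qstr(const map_t *map, qstr_id_t key) {
    if (!map->all_keys_are_qstrs) {
        // Degenerate case: str keys hash by content and qstr keys hash by ID,
        // so the two schemes disagree and we fall back to a linear scan.
        for (size_t i = 0; i < map->alloc; ++i) {
            if (map->table[i].key == key) {
                return map->table[i].value;
            }
        }
        return NULL;
    }
    // Fast path: the qstr ID is already a unique hash that costs nothing to compute.
    size_t mask = map->alloc - 1;
    for (size_t i = key & mask;; i = (i + 1) & mask) {
        if (map->table[i].key == 0) {
            return NULL;  // reached an empty slot: key not present
        }
        if (map->table[i].key == key) {
            return map->table[i].value;
        }
    }
}
```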

Overall this PR currently comes in at -2248 bytes on PYBV11. The performance diff on PYBV11 is:

diff of scores (higher is better)
N=100 M=100                /home/jimmo/mpy/perf/qstr-no-hash-pybv11-baseline -> /home/jimmo/mpy/perf/qstr-no-hash-pybv11-enabled-optmini         diff      diff% (error%)
bm_chaos.py                    350.03 ->     368.10 :     +18.07 =  +5.162% (+/-0.01%)
bm_fannkuch.py                  75.41 ->      75.71 :      +0.30 =  +0.398% (+/-0.00%)
bm_fft.py                     2298.91 ->    2308.17 :      +9.26 =  +0.403% (+/-0.00%)
bm_float.py                   5669.53 ->    5884.83 :    +215.30 =  +3.797% (+/-0.02%)
bm_hexiom.py                    40.46 ->      49.80 :      +9.34 = +23.085% (+/-0.00%)
bm_nqueens.py                 4204.21 ->    4300.81 :     +96.60 =  +2.298% (+/-0.00%)
bm_pidigits.py                 627.46 ->     630.21 :      +2.75 =  +0.438% (+/-0.18%)
core_import_mpy_multi.py       509.75 ->     730.32 :    +220.57 = +43.270% (+/-0.00%)
core_import_mpy_single.py       64.49 ->     115.07 :     +50.58 = +78.431% (+/-0.01%)
core_qstr.py                    64.34 ->     284.60 :    +220.26 = +342.338% (+/-0.02%)
core_yield_from.py             351.54 ->     352.18 :      +0.64 =  +0.182% (+/-0.00%)
misc_aes.py                    404.98 ->     404.46 :      -0.52 =  -0.128% (+/-0.00%)
misc_mandel.py                3068.33 ->    3184.00 :    +115.67 =  +3.770% (+/-0.01%)
misc_pystone.py               2273.83 ->    2467.94 :    +194.11 =  +8.537% (+/-0.01%)
misc_raytrace.py               365.51 ->     384.82 :     +19.31 =  +5.283% (+/-0.01%)

Note that the hexiom score here is misleading -- it gets a significant boost (~13%) from sorted qstrs because the baseline happened to hit the pathological case with map lookup caching, where two hot keys (len and already_done) were exactly 128 apart and therefore shared the same cache entry. After sorting the qstrs, len happened to move to a different index. Something similar (~7%) applies to pystone.

On Linux 64-bit:

$ ./run-perfbench.py -s ~/mpy/perf/qstr-no-hash-linux-baseline ~/qstr-no-hash-linux-enabled-optmini
diff of scores (higher is better)
N=2000 M=2000              /home/jimmo/mpy/perf/qstr-no-hash-linux-baseline -> /home/jimmo/qstr-no-hash-linux-enabled-optmini         diff      diff% (error%)
bm_chaos.py                  11692.38 ->   12032.34 :    +339.96 =  +2.908% (+/-3.98%)
bm_fannkuch.py                  62.77 ->      62.50 :      -0.27 =  -0.430% (+/-1.27%)
bm_fft.py                    99832.29 ->  100255.67 :    +423.38 =  +0.424% (+/-4.08%)
bm_float.py                 241110.28 ->  239482.59 :   -1627.69 =  -0.675% (+/-2.14%)
bm_hexiom.py                   605.51 ->     624.37 :     +18.86 =  +3.115% (+/-1.12%)
bm_nqueens.py               222137.51 ->  220873.17 :   -1264.34 =  -0.569% (+/-1.61%)
bm_pidigits.py                6127.69 ->    6138.97 :     +11.28 =  +0.184% (+/-0.18%)
core_import_mpy_multi.py     26046.34 ->   27012.52 :    +966.18 =  +3.709% (+/-2.29%)
core_import_mpy_single.py     2783.01 ->    3152.68 :    +369.67 = +13.283% (+/-10.22%)
core_qstr.py                  6011.98 ->   13206.13 :   +7194.15 = +119.664% (+/-18.54%)
core_yield_from.py           10740.92 ->   10531.59 :    -209.33 =  -1.949% (+/-1.88%)
misc_aes.py                  16423.26 ->   16575.42 :    +152.16 =  +0.926% (+/-2.18%)
misc_mandel.py              118315.24 ->  126364.00 :   +8048.76 =  +6.803% (+/-3.28%)
misc_pystone.py              86531.57 ->   87296.57 :    +765.00 =  +0.884% (+/-1.41%)
misc_raytrace.py             20013.78 ->   20214.44 :    +200.66 =  +1.003% (+/-1.52%)

And rp2040 (Pico)

diff of scores (higher is better)
N=100 M=100                /home/jimmo/mpy/perf/qstr-no-hash-rp2-baseline -> /home/jimmo/mpy/perf/qstr-no-hash-rp2-enabled-optmini         diff      diff% (error%)
bm_chaos.py                    212.85 ->     213.42 :      +0.57 =  +0.268% (+/-0.04%)
bm_fannkuch.py                  51.73 ->      54.58 :      +2.85 =  +5.509% (+/-0.08%)
bm_fft.py                     1400.41 ->    1472.96 :     +72.55 =  +5.181% (+/-0.02%)
bm_float.py                   3757.68 ->    3691.18 :     -66.50 =  -1.770% (+/-0.06%)
bm_hexiom.py                    27.58 ->      28.40 :      +0.82 =  +2.973% (+/-0.03%)
bm_nqueens.py                 2681.04 ->    2638.48 :     -42.56 =  -1.587% (+/-0.04%)
bm_pidigits.py                 398.56 ->     399.45 :      +0.89 =  +0.223% (+/-0.03%)
core_import_mpy_multi.py       280.13 ->     354.09 :     +73.96 = +26.402% (+/-0.08%)
core_import_mpy_single.py       40.83 ->      62.75 :     +21.92 = +53.686% (+/-0.39%)
core_qstr.py                    52.20 ->     158.95 :    +106.75 = +204.502% (+/-0.10%)
core_yield_from.py             217.72 ->     214.77 :      -2.95 =  -1.355% (+/-0.01%)
misc_aes.py                    283.68 ->     294.86 :     +11.18 =  +3.941% (+/-0.04%)
misc_mandel.py                1599.78 ->    1851.54 :    +251.76 = +15.737% (+/-0.06%)
misc_pystone.py               1443.18 ->    1493.99 :     +50.81 =  +3.521% (+/-0.08%)
misc_raytrace.py               227.18 ->     226.88 :      -0.30 =  -0.132% (+/-0.05%)

ESP32 does not work well and will need further investigation. I think this is much more than the usual code-moves-around perf diff we see on ESP32.

$ ./run-perfbench.py -s ~/mpy/perf/qstr-no-hash-esp32-baseline ~/mpy/perf/qstr-no-hash-esp32-enabled-optmini
diff of scores (higher is better)
N=100 M=100                /home/jimmo/mpy/perf/qstr-no-hash-esp32-baseline -> /home/jimmo/mpy/perf/qstr-no-hash-esp32-enabled-optmini         diff      diff% (error%)
bm_chaos.py                    282.11 ->     269.05 :     -13.06 =  -4.629% (+/-0.07%)
bm_fannkuch.py                  71.03 ->      70.96 :      -0.07 =  -0.099% (+/-0.01%)
bm_fft.py                     2170.10 ->    2143.92 :     -26.18 =  -1.206% (+/-0.00%)
bm_float.py                   4390.34 ->    3663.39 :    -726.95 = -16.558% (+/-0.03%)
bm_hexiom.py                    35.78 ->      32.65 :      -3.13 =  -8.748% (+/-0.02%)
bm_nqueens.py                 3125.77 ->    2071.28 :   -1054.49 = -33.735% (+/-0.88%)
bm_pidigits.py                 637.25 ->     626.40 :     -10.85 =  -1.703% (+/-0.20%)
core_import_mpy_multi.py       264.24 ->     271.65 :      +7.41 =  +2.804% (+/-0.05%)
core_import_mpy_single.py       43.99 ->      57.01 :     +13.02 = +29.598% (+/-0.06%)
core_qstr.py                    58.05 ->     139.16 :     +81.11 = +139.724% (+/-0.06%)
core_yield_from.py             106.47 ->     106.46 :      -0.01 =  -0.009% (+/-0.01%)
misc_aes.py                    307.53 ->     307.78 :      +0.25 =  +0.081% (+/-0.01%)
misc_mandel.py                2832.35 ->    2762.71 :     -69.64 =  -2.459% (+/-0.01%)
misc_pystone.py               1665.84 ->    1756.67 :     +90.83 =  +5.453% (+/-0.01%)
misc_raytrace.py               314.52 ->     271.82 :     -42.70 = -13.576% (+/-0.00%)

Performance diff for map_opt.c / qstr_opt.c on PYBv11:

diff of scores (higher is better)
N=100 M=100                /home/jimmo/mpy/perf/qstr-no-hash-pybv11-baseline -> /home/jimmo/mpy/perf/qstr-no-hash-pybv11-baseline-optmini         diff      diff% (error%)
bm_chaos.py                    350.03 ->     357.92 :      +7.89 =  +2.254% (+/-0.00%)
bm_fannkuch.py                  75.41 ->      75.48 :      +0.07 =  +0.093% (+/-0.01%)
bm_fft.py                     2298.91 ->    2279.87 :     -19.04 =  -0.828% (+/-0.00%)
bm_float.py                   5669.53 ->    5764.12 :     +94.59 =  +1.668% (+/-0.02%)
bm_hexiom.py                    40.46 ->      44.65 :      +4.19 = +10.356% (+/-0.00%)
bm_nqueens.py                 4204.21 ->    4304.28 :    +100.07 =  +2.380% (+/-0.00%)
bm_pidigits.py                 627.46 ->     625.31 :      -2.15 =  -0.343% (+/-0.16%)
core_import_mpy_multi.py       509.75 ->     599.04 :     +89.29 = +17.516% (+/-0.00%)
core_import_mpy_single.py       64.49 ->      80.66 :     +16.17 = +25.074% (+/-0.01%)
core_qstr.py                    64.34 ->      94.37 :     +30.03 = +46.674% (+/-0.00%)
core_yield_from.py             351.54 ->     352.29 :      +0.75 =  +0.213% (+/-0.00%)
misc_aes.py                    404.98 ->     404.75 :      -0.23 =  -0.057% (+/-0.01%)
misc_mandel.py                3068.33 ->    3079.79 :     +11.46 =  +0.373% (+/-0.00%)
misc_pystone.py               2273.83 ->    2433.54 :    +159.71 =  +7.024% (+/-0.01%)
misc_raytrace.py               365.51 ->     373.92 :      +8.41 =  +2.301% (+/-0.00%)

github-actions bot commented Feb 15, 2023

Code size report:

   bare-arm:  +384 +0.678% 
minimal x86: +1389 +0.745% [incl +416(data)]
   unix x64: -1832 -0.229% standard[incl +128(data)]
      stm32: -3136 -0.800% PYBV10
     mimxrt: -2688 -0.744% TEENSY40
        rp2: -1088 -0.332% RPI_PICO
       samd:  -872 -0.334% ADAFRUIT_ITSYBITSY_M4_EXPRESS

jimmo force-pushed the qstr-no-hash-fastmap-sorted branch from 5b85ee3 to bbf98c8 on February 15, 2023 12:43
jimmo (Member, Author) commented Feb 15, 2023

The code size diff for bare-arm is due to:

baseline 56624
hash=0 56188  (-436)
hash=0 + map 56240 (+52) 
hash=0 + map + sorted 57040 (+800)
hash=0 + map + sorted + opt 57040 (+0)

The reason the sorted change costs so much is that:

  • It adds the overhead of the extra qstr pool instance
  • It adds a bunch of fixed qstrs (mostly dunder operators) that were not previously included in the bare-arm build but now need to be forced into the first pool so that their id < 256. There are other ways around this (e.g. change const byte mp_binary_op_method_name to const uint16_t mp_binary_op_method_name).
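
For illustration of that last point (not the actual change), widening the operator-name table trades one byte per entry for removing the id < 256 constraint; the names below are invented, the real table is `mp_binary_op_method_name`:

```c
#include <stdint.h>

// A byte-wide table of qstr IDs forces every listed qstr (the dunder
// operator names) to be allocated an ID below 256, i.e. into the first pool:
static const uint8_t op_name_qstrs_u8[] = {12, 47, 201};     // dummy qstr IDs

// A 16-bit table lifts that restriction at the cost of one extra byte per
// entry, so those qstrs could then live in a later (sorted) pool:
static const uint16_t op_name_qstrs_u16[] = {12, 47, 1201};  // dummy qstr IDs
```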

tannewt commented Apr 25, 2023

Would this allow us to have discontinuous qstr numbering? Computing qstr numbering prior to preprocessing across all files would make qstr numbers the same across all builds. That'd make ccache work better and result in many object files being identical across boards.

jimmo (Member, Author) commented Apr 26, 2023

> Would this allow us to have discontinuous qstr numbering? Computing qstr numbering prior to preprocessing across all files would make qstr numbers the same across all builds. That'd make ccache work better and result in many object files being identical across boards.

That's an interesting idea! Making ccache work would certainly be a big improvement.

I'm not sure anything prevents you from doing that right now though -- i.e. you're allowed to have "unused" qstr IDs at the cost of the wasted slots (but no wasted space in the actual data chunks). But this PR would allow you to at least not pay the cost of the hash for the unused slots (you still pay for length+pointer).

The other thing you could do is break up the QSTRs into contiguous ranges, and have each pool contain a contiguous range (but you'll need to make some small changes so the pools know their ranges rather than just continuing from the previous one).
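
A rough sketch of that last suggestion, with invented names (the real qstr_pool_t in py/qstr.c differs): each pool records the first ID it covers, so an ID can be resolved even if the overall numbering has gaps between pools.

```c
#include <stddef.h>

// Invented structure for illustration only.
typedef struct pool {
    const struct pool *prev;      // previously allocated pool
    size_t first_id;              // explicit start of this pool's ID range
    size_t len;                   // number of entries in this pool
    const char *const *strings;   // the interned string data
} pool_t;

// Resolve an ID by walking the pool chain; IDs that fall between pool ranges
// are simply unused ("discontinuous") slots that cost nothing to store.
static const char *pool_lookup(const pool_t *last, size_t id) {
    for (const pool_t *p = last; p != NULL; p = p->prev) {
        if (id >= p->first_id && id < p->first_id + p->len) {
            return p->strings[id - p->first_id];
        }
    }
    return NULL;  // no pool claims this ID
}
```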

dpgeorge added the py-core (Relates to py/ directory in source) label on May 8, 2023
dpgeorge added this to the release-1.22.0 milestone on Oct 6, 2023
jimmo force-pushed the qstr-no-hash-fastmap-sorted branch from bbf98c8 to 9ed31ac on October 9, 2023 02:38
jimmo (Member, Author) commented Oct 9, 2023

I've removed the map.c & qstr.c split into optimised parts. Instead I'm going to investigate doing this via compiler attributes (i.e. a macro that expands to apply whatever the CSUPEROPT setting is to a given function), likely gcc-only for now.
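
One possible shape for that macro (an assumption about the approach, not the eventual implementation), shown applied to a hypothetical hot helper:

```c
#include <stddef.h>

// GCC-only: raise the optimisation level for individual hot functions even
// when the rest of the translation unit is built with -Os. The macro name is
// invented; in practice it would expand based on the port's CSUPEROPT setting.
#if defined(__GNUC__) && !defined(__clang__)
#define MP_HOT_FUNCTION __attribute__((optimize("O2")))
#else
#define MP_HOT_FUNCTION
#endif

// Hypothetical example: a byte-compare helper compiled at O2.
MP_HOT_FUNCTION static int hot_memcmp(const unsigned char *a, const unsigned char *b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            return a[i] - b[i];
        }
    }
    return 0;
}
```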

jimmo force-pushed the qstr-no-hash-fastmap-sorted branch from 9ed31ac to 179ee10 on October 9, 2023 03:11
jimmo added 6 commits October 9, 2023 17:56
This disables using qstr hashes altogether, which saves RAM and flash
(two bytes per interned string on a typical build) as well as code size.
On PYBV11 this is worth over 3k flash.

qstr comparison will now be done just by length then data. This affects
qstr_find_strn although this has a negligible performance impact as, for a
given comparison, the length and first character will ~usually be
different anyway.

String hashing (e.g. builtin `hash()` and map.c) now need to compute the
hash dynamically, and for the map case this does come at a performance
cost.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
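
A small sketch of what this commit describes, with assumed helper names rather than the actual py/qstr.c code: without a stored hash, interning rejects candidates by length and then by bytes, and string hashing is computed on demand when a map or builtin `hash()` needs it.

```c
#include <stddef.h>
#include <string.h>

// Assumed helper: a stored qstr matches only if both length and data match;
// most candidates fail on the length or the first byte, so losing the stored
// hash costs little here.
static int qstr_entry_matches(const char *entry, size_t entry_len,
                              const char *str, size_t str_len) {
    return entry_len == str_len && memcmp(entry, str, str_len) == 0;
}

// With no stored hash, map.c and builtin hash() compute the string hash on
// demand (a djb2-xor loop in the style of qstr_compute_hash; constants and
// masking are simplified for this sketch).
static size_t str_hash_on_demand(const unsigned char *data, size_t len) {
    size_t hash = 5381;
    for (size_t i = 0; i < len; ++i) {
        hash = (hash * 33) ^ data[i];
    }
    return hash;
}
```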
mp_map_lookup now uses the qstr ID directly as the hash. This has the
benefit of being a very good hash (guaranteed unique) and free to compute.
As mp_map_lookup is a very hot function (e.g. regular dictionaries as well
as class locals and globals), this has a significant performance benefit.

However, we also need to deal with the case where a map might contain a
mixture of str (hashed the regular way) and qstr keys (hashed with ID),
so lookup needs to detect this case and fall back to a linear search.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This provides a significant performance boost for qstr_find_strn, which is
called a lot during parsing and loading of .mpy files, as well as
interning of string objects (which happens in most string methods that
return new strings).

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
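
A minimal sketch of the binary search over a sorted pool that this enables (assumed structure and names, not the commit's actual code):

```c
#include <stddef.h>
#include <string.h>

// Invented for illustration: a pool whose entries are sorted by string
// content, so membership checks are O(log n) instead of a linear scan.
typedef struct {
    size_t len;
    const char *const *strings;  // sorted, NUL-terminated entries
} sorted_pool_t;

// Return the index of str (length str_len) in the pool, or (size_t)-1 if it
// is not interned there.
static size_t sorted_pool_find(const sorted_pool_t *pool, const char *str, size_t str_len) {
    size_t lo = 0, hi = pool->len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strncmp(pool->strings[mid], str, str_len);
        if (cmp == 0 && pool->strings[mid][str_len] != '\0') {
            cmp = 1;  // pool entry starts with str but is longer
        }
        if (cmp < 0) {
            lo = mid + 1;
        } else if (cmp > 0) {
            hi = mid;
        } else {
            return mid;  // exact match
        }
    }
    return (size_t)-1;
}
```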
These are part of the .mpy ABI and avoid needing to duplicate string
data for QSTRs known to already be in the firmware.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This test depends on the order in which qstrs are stored in ROM, which
affects the order in which `dir()` will probe the object to see what it
supports. Because of the lazy-loading in asyncio/__init__.py, if it
tries to access e.g. `wait_for_ms` before `funcs` then it will import `funcs`,
making `funcs` succeed later. But the other way around, `funcs` will
initially not be found.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
jimmo force-pushed the qstr-no-hash-fastmap-sorted branch from 179ee10 to 8e28db8 on October 9, 2023 06:57
@@ -220,6 +220,66 @@
"values",
"write",
"zip",
# Additional QSTRs that must have index <255 (these are not part of
# the .mpy compatibility list though).
"__bool__",
Member:

Why are these added here, don't they increase the size of ports that don't use these qstrs? Eg bare-arm and minimal.

Member Author:

Because these qstrs must have index <255 (e.g. the operator code assumes that the qstr for the operator name fits in a byte), they cannot go into a sorted pool and must be put in the unsorted pool at the start.

makeqstrdata.py only appends these here if they appear in the required qstr list from the firmware.

Member:

But in a comment above you said:

> It adds a bunch of fixed qstrs (mostly dunder operators) that were not previously included in the bare-arm build but now need to be forced into the first pool so that their id < 256.

So... is it possible to not include these in the build if they aren't used? (Mainly I'm interested in getting the bare-arm size diff down.)

jimmo (Member, Author) commented Oct 13, 2023

I added some tests for general string performance and the map lookup behavior is impacted badly.

Closing this in favour of #12678 which just implements the sorted pools.
