
py/qstr: qstr performance improvements. #10758


Closed
jimmo wants to merge 6 commits

Conversation

jimmo (Member) commented Feb 15, 2023

This (WIP) PR includes four changes to qstr behavior:

  • Allow setting MICROPY_QSTR_BYTES_IN_HASH to zero, and make this the default. This gives a significant code size saving (~3.5kiB on PYBv11) due to removing two bytes per qstr. It also translates to a RAM saving for any strings interned at runtime, and it frees up a size_t field in mp_obj_str_t, which we could use in the future as the buffer for short string data (in particular for slices of bytes this might be a win).

  • Add a mechanism to speed up map lookups by using the qstr ID rather than computing a hash for the string. This is even faster than using the previous qstr hash because it saves mapping the qstr ID to its hash. This isn't entirely straightforward and has an interesting degenerate case because maps could contain both str and qstr keys -- see the commit and code comments for details, and the sketch after this list.

  • Split the built-in qstrs (formerly base+frozen) into (mpy+base+frozen) and sort the base and frozen pools. This allows a much faster lookup of qstrs via binary search, which is important for .mpy loading, .py parsing, and string operations that need to check whether a result can be de-duped. This is the same idea introduced in py: Faster qstr search #6896 by @amirgon, but it side-steps the issue of sorting-by-hash now that hashes are removed, and the implementation is simplified to be a smaller overall change.
    In the future we could also sort the mpy qstrs.

  • Enable CSUPEROPT for the sensitive parts of map.c and qstr.c (mp_map_lookup, find_qstr, qstr_add, qstr_strncmp, qstr_find_strn, qstr_from_str, qstr_from_strn, qstr_hash, qstr_len, qstr_str, qstr_data). This is worth about 3% (and more on some tests) for a cost of 536 bytes. (Note: this only applies to the Make-based ports.) I'm not sure this is worth it. A performance diff for PYBv11 is included at the end.
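
A minimal sketch of the map-lookup idea from the second bullet above. The types, field names and probing scheme here are invented for illustration and are not the PR's actual py/map.c code:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t qstr_id_t;

typedef struct {
    qstr_id_t key;   // 0 means "empty slot" in this toy example
    void *value;
} entry_t;

typedef struct {
    size_t alloc;            // number of slots, a power of two, never completely full
    int all_keys_are_qstrs;  // cleared as soon as a non-interned str key is stored
    entry_t *table;
} map_t;

void *map_lookup_by_qstr(const map_t *map, qstr_id_t key) {
    if (!map->all_keys_are_qstrs) {
        // Degenerate case: str keys hash by content and qstr keys hash by ID,
        // so the two schemes disagree and we fall back to a linear scan.
        for (size_t i = 0; i < map->alloc; ++i) {
            if (map->table[i].key == key) {
                return map->table[i].value;
            }
        }
        return NULL;
    }
    // Fast path: the qstr ID is already a unique hash that costs nothing to compute.
    size_t mask = map->alloc - 1;
    for (size_t i = key & mask;; i = (i + 1) & mask) {
        if (map->table[i].key == 0) {
            return NULL;  // reached an empty slot: key not present
        }
        if (map->table[i].key == key) {
            return map->table[i].value;
        }
    }
}
```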

Overall this PR currently comes in at -2248 bytes on PYBV11. The performance diff on PYBV11 is:

diff of scores (higher is better)
N=100 M=100                /home/jimmo/mpy/perf/qstr-no-hash-pybv11-baseline -> /home/jimmo/mpy/perf/qstr-no-hash-pybv11-enabled-optmini         diff      diff% (error%)
bm_chaos.py                    350.03 ->     368.10 :     +18.07 =  +5.162% (+/-0.01%)
bm_fannkuch.py                  75.41 ->      75.71 :      +0.30 =  +0.398% (+/-0.00%)
bm_fft.py                     2298.91 ->    2308.17 :      +9.26 =  +0.403% (+/-0.00%)
bm_float.py                   5669.53 ->    5884.83 :    +215.30 =  +3.797% (+/-0.02%)
bm_hexiom.py                    40.46 ->      49.80 :      +9.34 = +23.085% (+/-0.00%)
bm_nqueens.py                 4204.21 ->    4300.81 :     +96.60 =  +2.298% (+/-0.00%)
bm_pidigits.py                 627.46 ->     630.21 :      +2.75 =  +0.438% (+/-0.18%)
core_import_mpy_multi.py       509.75 ->     730.32 :    +220.57 = +43.270% (+/-0.00%)
core_import_mpy_single.py       64.49 ->     115.07 :     +50.58 = +78.431% (+/-0.01%)
core_qstr.py                    64.34 ->     284.60 :    +220.26 = +342.338% (+/-0.02%)
core_yield_from.py             351.54 ->     352.18 :      +0.64 =  +0.182% (+/-0.00%)
misc_aes.py                    404.98 ->     404.46 :      -0.52 =  -0.128% (+/-0.00%)
misc_mandel.py                3068.33 ->    3184.00 :    +115.67 =  +3.770% (+/-0.01%)
misc_pystone.py               2273.83 ->    2467.94 :    +194.11 =  +8.537% (+/-0.01%)
misc_raytrace.py               365.51 ->     384.82 :     +19.31 =  +5.283% (+/-0.01%)

Note that the hexiom score here is misleading -- it gets a significant boost (~13%) from sorted qstrs because the baseline happened to hit the pathological case with map lookup caching, where two hot keys (len and already_done) were exactly 128 apart and therefore shared the same cache entry. After sorting the qstrs, len happened to move to a different index. Something similar (~7%) applies to pystone.

On Linux 64-bit:

$ ./run-perfbench.py -s ~/mpy/perf/qstr-no-hash-linux-baseline ~/qstr-no-hash-linux-enabled-optmini
diff of scores (higher is better)
N=2000 M=2000              /home/jimmo/mpy/perf/qstr-no-hash-linux-baseline -> /home/jimmo/qstr-no-hash-linux-enabled-optmini         diff      diff% (error%)
bm_chaos.py                  11692.38 ->   12032.34 :    +339.96 =  +2.908% (+/-3.98%)
bm_fannkuch.py                  62.77 ->      62.50 :      -0.27 =  -0.430% (+/-1.27%)
bm_fft.py                    99832.29 ->  100255.67 :    +423.38 =  +0.424% (+/-4.08%)
bm_float.py                 241110.28 ->  239482.59 :   -1627.69 =  -0.675% (+/-2.14%)
bm_hexiom.py                   605.51 ->     624.37 :     +18.86 =  +3.115% (+/-1.12%)
bm_nqueens.py               222137.51 ->  220873.17 :   -1264.34 =  -0.569% (+/-1.61%)
bm_pidigits.py                6127.69 ->    6138.97 :     +11.28 =  +0.184% (+/-0.18%)
core_import_mpy_multi.py     26046.34 ->   27012.52 :    +966.18 =  +3.709% (+/-2.29%)
core_import_mpy_single.py     2783.01 ->    3152.68 :    +369.67 = +13.283% (+/-10.22%)
core_qstr.py                  6011.98 ->   13206.13 :   +7194.15 = +119.664% (+/-18.54%)
core_yield_from.py           10740.92 ->   10531.59 :    -209.33 =  -1.949% (+/-1.88%)
misc_aes.py                  16423.26 ->   16575.42 :    +152.16 =  +0.926% (+/-2.18%)
misc_mandel.py              118315.24 ->  126364.00 :   +8048.76 =  +6.803% (+/-3.28%)
misc_pystone.py              86531.57 ->   87296.57 :    +765.00 =  +0.884% (+/-1.41%)
misc_raytrace.py             20013.78 ->   20214.44 :    +200.66 =  +1.003% (+/-1.52%)

And rp2040 (Pico)

diff of scores (higher is better)
N=100 M=100                /home/jimmo/mpy/perf/qstr-no-hash-rp2-baseline -> /home/jimmo/mpy/perf/qstr-no-hash-rp2-enabled-optmini         diff      diff% (error%)
bm_chaos.py                    212.85 ->     213.42 :      +0.57 =  +0.268% (+/-0.04%)
bm_fannkuch.py                  51.73 ->      54.58 :      +2.85 =  +5.509% (+/-0.08%)
bm_fft.py                     1400.41 ->    1472.96 :     +72.55 =  +5.181% (+/-0.02%)
bm_float.py                   3757.68 ->    3691.18 :     -66.50 =  -1.770% (+/-0.06%)
bm_hexiom.py                    27.58 ->      28.40 :      +0.82 =  +2.973% (+/-0.03%)
bm_nqueens.py                 2681.04 ->    2638.48 :     -42.56 =  -1.587% (+/-0.04%)
bm_pidigits.py                 398.56 ->     399.45 :      +0.89 =  +0.223% (+/-0.03%)
core_import_mpy_multi.py       280.13 ->     354.09 :     +73.96 = +26.402% (+/-0.08%)
core_import_mpy_single.py       40.83 ->      62.75 :     +21.92 = +53.686% (+/-0.39%)
core_qstr.py                    52.20 ->     158.95 :    +106.75 = +204.502% (+/-0.10%)
core_yield_from.py             217.72 ->     214.77 :      -2.95 =  -1.355% (+/-0.01%)
misc_aes.py                    283.68 ->     294.86 :     +11.18 =  +3.941% (+/-0.04%)
misc_mandel.py                1599.78 ->    1851.54 :    +251.76 = +15.737% (+/-0.06%)
misc_pystone.py               1443.18 ->    1493.99 :     +50.81 =  +3.521% (+/-0.08%)
misc_raytrace.py               227.18 ->     226.88 :      -0.30 =  -0.132% (+/-0.05%)

ESP32 does not work well and will need further investigation. I think this is much more than the usual code-moves-around perf diff we see on ESP32.

$ ./run-perfbench.py -s ~/mpy/perf/qstr-no-hash-esp32-baseline ~/mpy/perf/qstr-no-hash-esp32-enabled-optmini
diff of scores (higher is better)
N=100 M=100                /home/jimmo/mpy/perf/qstr-no-hash-esp32-baseline -> /home/jimmo/mpy/perf/qstr-no-hash-esp32-enabled-optmini         diff      diff% (error%)
bm_chaos.py                    282.11 ->     269.05 :     -13.06 =  -4.629% (+/-0.07%)
bm_fannkuch.py                  71.03 ->      70.96 :      -0.07 =  -0.099% (+/-0.01%)
bm_fft.py                     2170.10 ->    2143.92 :     -26.18 =  -1.206% (+/-0.00%)
bm_float.py                   4390.34 ->    3663.39 :    -726.95 = -16.558% (+/-0.03%)
bm_hexiom.py                    35.78 ->      32.65 :      -3.13 =  -8.748% (+/-0.02%)
bm_nqueens.py                 3125.77 ->    2071.28 :   -1054.49 = -33.735% (+/-0.88%)
bm_pidigits.py                 637.25 ->     626.40 :     -10.85 =  -1.703% (+/-0.20%)
core_import_mpy_multi.py       264.24 ->     271.65 :      +7.41 =  +2.804% (+/-0.05%)
core_import_mpy_single.py       43.99 ->      57.01 :     +13.02 = +29.598% (+/-0.06%)
core_qstr.py                    58.05 ->     139.16 :     +81.11 = +139.724% (+/-0.06%)
core_yield_from.py             106.47 ->     106.46 :      -0.01 =  -0.009% (+/-0.01%)
misc_aes.py                    307.53 ->     307.78 :      +0.25 =  +0.081% (+/-0.01%)
misc_mandel.py                2832.35 ->    2762.71 :     -69.64 =  -2.459% (+/-0.01%)
misc_pystone.py               1665.84 ->    1756.67 :     +90.83 =  +5.453% (+/-0.01%)
misc_raytrace.py               314.52 ->     271.82 :     -42.70 = -13.576% (+/-0.00%)

Performance diff for map_opt.c / qstr_opt.c on PYBv11:

diff of scores (higher is better)
N=100 M=100                /home/jimmo/mpy/perf/qstr-no-hash-pybv11-baseline -> /home/jimmo/mpy/perf/qstr-no-hash-pybv11-baseline-optmini         diff      diff% (error%)
bm_chaos.py                    350.03 ->     357.92 :      +7.89 =  +2.254% (+/-0.00%)
bm_fannkuch.py                  75.41 ->      75.48 :      +0.07 =  +0.093% (+/-0.01%)
bm_fft.py                     2298.91 ->    2279.87 :     -19.04 =  -0.828% (+/-0.00%)
bm_float.py                   5669.53 ->    5764.12 :     +94.59 =  +1.668% (+/-0.02%)
bm_hexiom.py                    40.46 ->      44.65 :      +4.19 = +10.356% (+/-0.00%)
bm_nqueens.py                 4204.21 ->    4304.28 :    +100.07 =  +2.380% (+/-0.00%)
bm_pidigits.py                 627.46 ->     625.31 :      -2.15 =  -0.343% (+/-0.16%)
core_import_mpy_multi.py       509.75 ->     599.04 :     +89.29 = +17.516% (+/-0.00%)
core_import_mpy_single.py       64.49 ->      80.66 :     +16.17 = +25.074% (+/-0.01%)
core_qstr.py                    64.34 ->      94.37 :     +30.03 = +46.674% (+/-0.00%)
core_yield_from.py             351.54 ->     352.29 :      +0.75 =  +0.213% (+/-0.00%)
misc_aes.py                    404.98 ->     404.75 :      -0.23 =  -0.057% (+/-0.01%)
misc_mandel.py                3068.33 ->    3079.79 :     +11.46 =  +0.373% (+/-0.00%)
misc_pystone.py               2273.83 ->    2433.54 :    +159.71 =  +7.024% (+/-0.01%)
misc_raytrace.py               365.51 ->     373.92 :      +8.41 =  +2.301% (+/-0.00%)

github-actions bot commented Feb 15, 2023

Code size report:

   bare-arm:  +384 +0.678% 
minimal x86: +1389 +0.745% [incl +416(data)]
   unix x64: -1832 -0.229% standard[incl +128(data)]
      stm32: -3136 -0.800% PYBV10
     mimxrt: -2688 -0.744% TEENSY40
        rp2: -1088 -0.332% RPI_PICO
       samd:  -872 -0.334% ADAFRUIT_ITSYBITSY_M4_EXPRESS

jimmo force-pushed the qstr-no-hash-fastmap-sorted branch from 5b85ee3 to bbf98c8 on February 15, 2023 12:43
jimmo (Member, Author) commented Feb 15, 2023

The code size diff for bare-arm is due to:

baseline 56624
hash=0 56188  (-436)
hash=0 + map 56240 (+52) 
hash=0 + map + sorted 57040 (+800)
hash=0 + map + sorted + opt 57040 (+0)

The reason the sorted change costs so much is that:

  • It adds the overhead of the extra qstr pool instance
  • It adds a bunch of fixed qstrs (mostly dunder operators) that were not previously included in the bare-arm build but now need to be forced into the first pool so that their id < 256. There are other ways around this (e.g. change const byte mp_binary_op_method_name to const uint16_t mp_binary_op_method_name).
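
For illustration of that last point (not the actual change), widening the operator-name table trades one byte per entry for removing the id < 256 constraint; the names below are invented, the real table is `mp_binary_op_method_name`:

```c
#include <stdint.h>

// A byte-wide table of qstr IDs forces every listed qstr (the dunder
// operator names) to be allocated an ID below 256, i.e. into the first pool:
static const uint8_t op_name_qstrs_u8[] = {12, 47, 201};     // dummy qstr IDs

// A 16-bit table lifts that restriction at the cost of one extra byte per
// entry, so those qstrs could then live in a later (sorted) pool:
static const uint16_t op_name_qstrs_u16[] = {12, 47, 1201};  // dummy qstr IDs
```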

tannewt commented Apr 25, 2023

Would this allow us to have discontinuous qstr numbering? Computing qstr numbering prior to preprocessing across all files would make qstr numbers the same across all builds. That'd make ccache work better and result in many object files being identical across boards.

jimmo (Member, Author) commented Apr 26, 2023

> Would this allow us to have discontinuous qstr numbering? Computing qstr numbering prior to preprocessing across all files would make qstr numbers the same across all builds. That'd make ccache work better and result in many object files being identical across boards.

That's an interesting idea! Making ccache work would certainly be a big improvement.

I'm not sure anything prevents you from doing that right now though -- i.e. you're allowed to have "unused" qstr IDs at the cost of the wasted slots (but no wasted space in the actual data chunks). But this PR would allow you to at least not pay the cost of the hash for the unused slots (you still pay for length+pointer).

The other thing you could do is break up the QSTRs into contiguous ranges, and have each pool contain a contiguous range (but you'll need to make some small changes so the pools know their ranges rather than just continuing from the previous one).
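
A rough sketch of that last suggestion, with invented names (the real qstr_pool_t in py/qstr.c differs): each pool records the first ID it covers, so an ID can be resolved even if the overall numbering has gaps between pools.

```c
#include <stddef.h>

// Invented structure for illustration only.
typedef struct pool {
    const struct pool *prev;      // previously allocated pool
    size_t first_id;              // explicit start of this pool's ID range
    size_t len;                   // number of entries in this pool
    const char *const *strings;   // the interned string data
} pool_t;

// Resolve an ID by walking the pool chain; IDs that fall between pool ranges
// are simply unused ("discontinuous") slots that cost nothing to store.
static const char *pool_lookup(const pool_t *last, size_t id) {
    for (const pool_t *p = last; p != NULL; p = p->prev) {
        if (id >= p->first_id && id < p->first_id + p->len) {
            return p->strings[id - p->first_id];
        }
    }
    return NULL;  // no pool claims this ID
}
```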

dpgeorge added the py-core (Relates to py/ directory in source) label on May 8, 2023
dpgeorge added this to the release-1.22.0 milestone on Oct 6, 2023
jimmo force-pushed the qstr-no-hash-fastmap-sorted branch from bbf98c8 to 9ed31ac on October 9, 2023 02:38
jimmo (Member, Author) commented Oct 9, 2023

I've removed the map.c & qstr.c split into optimised parts. Instead I'm going to investigate doing this via compiler attributes (i.e. a macro that expands to apply whatever the CSUPEROPT setting is to a given function), likely gcc-only for now.
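
One possible shape for that macro (an assumption about the approach, not the eventual implementation), shown applied to a hypothetical hot helper:

```c
#include <stddef.h>

// GCC-only: raise the optimisation level for individual hot functions even
// when the rest of the translation unit is built with -Os. The macro name is
// invented; in practice it would expand based on the port's CSUPEROPT setting.
#if defined(__GNUC__) && !defined(__clang__)
#define MP_HOT_FUNCTION __attribute__((optimize("O2")))
#else
#define MP_HOT_FUNCTION
#endif

// Hypothetical example: a byte-compare helper compiled at O2.
MP_HOT_FUNCTION static int hot_memcmp(const unsigned char *a, const unsigned char *b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            return a[i] - b[i];
        }
    }
    return 0;
}
```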

jimmo force-pushed the qstr-no-hash-fastmap-sorted branch from 9ed31ac to 179ee10 on October 9, 2023 03:11
jimmo added 6 commits October 9, 2023 17:56
This disables using qstr hashes altogether, which saves RAM and flash
(two bytes per interned string on a typical build) as well as code size.
On PYBV11 this is worth over 3k flash.

qstr comparison will now be done just by length then data. This affects
qstr_find_strn although this has a negligible performance impact as, for a
given comparison, the length and first character will ~usually be
different anyway.

String hashing (e.g. builtin `hash()` and map.c) now need to compute the
hash dynamically, and for the map case this does come at a performance
cost.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
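
A small sketch of what this commit describes, with assumed helper names rather than the actual py/qstr.c code: without a stored hash, interning rejects candidates by length and then by bytes, and string hashing is computed on demand when a map or builtin `hash()` needs it.

```c
#include <stddef.h>
#include <string.h>

// Assumed helper: a stored qstr matches only if both length and data match;
// most candidates fail on the length or the first byte, so losing the stored
// hash costs little here.
static int qstr_entry_matches(const char *entry, size_t entry_len,
                              const char *str, size_t str_len) {
    return entry_len == str_len && memcmp(entry, str, str_len) == 0;
}

// With no stored hash, map.c and builtin hash() compute the string hash on
// demand (a djb2-xor loop in the style of qstr_compute_hash; constants and
// masking are simplified for this sketch).
static size_t str_hash_on_demand(const unsigned char *data, size_t len) {
    size_t hash = 5381;
    for (size_t i = 0; i < len; ++i) {
        hash = (hash * 33) ^ data[i];
    }
    return hash;
}
```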
mp_map_lookup now uses the qstr ID directly as the hash. This has the
benefit of being a very good hash (guaranteed unique) and free to compute.
As mp_map_lookup is a very hot function (e.g. regular dictionaries as well
as class locals and globals), this has a significant performance benefit.

However, we also need to deal with the case where a map might contain a
mixture of str (hashed the regular way) and qstr keys (hashed with ID),
so lookup needs to detect this case and fall back to a linear search.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This provides a significant performance boost for qstr_find_strn, which is
called a lot during parsing and loading of .mpy files, as well as
interning of string objects (which happens in most string methods that
return new strings).

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
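
A minimal sketch of the binary search over a sorted pool that this enables (assumed structure and names, not the commit's actual code):

```c
#include <stddef.h>
#include <string.h>

// Invented for illustration: a pool whose entries are sorted by string
// content, so membership checks are O(log n) instead of a linear scan.
typedef struct {
    size_t len;
    const char *const *strings;  // sorted, NUL-terminated entries
} sorted_pool_t;

// Return the index of str (length str_len) in the pool, or (size_t)-1 if it
// is not interned there.
static size_t sorted_pool_find(const sorted_pool_t *pool, const char *str, size_t str_len) {
    size_t lo = 0, hi = pool->len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strncmp(pool->strings[mid], str, str_len);
        if (cmp == 0 && pool->strings[mid][str_len] != '\0') {
            cmp = 1;  // pool entry starts with str but is longer
        }
        if (cmp < 0) {
            lo = mid + 1;
        } else if (cmp > 0) {
            hi = mid;
        } else {
            return mid;  // exact match
        }
    }
    return (size_t)-1;
}
```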
These are part of the .mpy ABI and avoid needing to duplicate string
data for QSTRs known to already be in the firmware.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This test depends on the order in which qstrs are stored in ROM, which
affects the order in which `dir()` will probe the object to see what it
supports. Because of the lazy-loading in asyncio/__init__.py, if it
tries to access e.g. `wait_for_ms` before `funcs` then it will import `funcs`,
making `funcs` succeed later. But the other way around, `funcs` will
initially not be found.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
jimmo force-pushed the qstr-no-hash-fastmap-sorted branch from 179ee10 to 8e28db8 on October 9, 2023 06:57
@@ -220,6 +220,66 @@
"values",
"write",
"zip",
# Additional QSTRs that must have index <255 (these are not part of
# the .mpy compatibility list though).
"__bool__",
Member:

Why are these added here, don't they increase the size of ports that don't use these qstrs? Eg bare-arm and minimal.

Member Author:

Because these qstrs must have index <255 (e.g. the operator code assumes that the qstr for the operator name fits in a byte), they cannot go into a sorted pool and must be put in the unsorted pool at the start.

makeqstrdata.py only appends these here if they appear in the required qstr list from the firmware.

Member:

But in a comment above you said:

> It adds a bunch of fixed qstrs (mostly dunder operators) that were not previously included in the bare-arm build but now need to be forced into the first pool so that their id < 256.

So... is it possible to not include these in the build if they aren't used? (Mainly I'm interested in getting the bare-arm size diff down.)

jimmo (Member, Author) commented Oct 13, 2023

I added some tests for general string performance and the map lookup behavior is impacted badly.

Closing this in favour of #12678 which just implements the sorted pools.
