-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
py/qstr: qstr performance improvements. #10758
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Code size report:
|
5b85ee3
to
bbf98c8
Compare
The code size diff for bare-arm is due to:
The reason for sorted costing so much is that:
|
Would this allow us to have discontinuous qstr numbering? Computing qstr numbering prior to preprocessing across all files would make qstr numbers the same across all builds. That'd make ccache work better and result in many object files being identical across boards. |
That's an interesting idea! Making ccache work would certainly be a big improvement. I'm not sure anything prevents you doing that right now though -- i.e. you're allowed to have "unused" qstr IDs at the cost of the wasted slots (but no wasted space in the actual data chunks). But this PR would allow you to at least not pay the cost of the hash for the unused slots (but you still have length+pointer). The other thing you could do is break up the QSTRs into contiguous ranges, and have each pool contain a contiguous range (but you'll need to make some small changes so the pools know their ranges rather than just continuing from the previous). |
bbf98c8
to
9ed31ac
Compare
I've removed the map.c & qstr.c split into optimised parts. Instead I'm going to investigate implementing a way where we do this via compiler attributes. (i.e. a macro that expands into applying whatever the CSUPEROPT setting is to a given function). Likely gcc-only for now. |
9ed31ac
to
179ee10
Compare
This disables using qstr hashes altogether, which saves RAM and flash (two bytes per interned string on a typical build) as well as code size. On PYBV11 this is worth over 3k flash. qstr comparison will now be done just by length then data. This affects qstr_find_strn although this has a negligible performance impact as, for a given comparison, the length and first character will ~usually be different anyway. String hashing (e.g. builtin `hash()` and map.c) now need to compute the hash dynamically, and for the map case this does come at a performance cost. This work was funded through GitHub Sponsors. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
mp_map_lookup now uses the qstr ID directly as the hash. This has the benefit of being a very good hash (guaranteed unique) and free to compute. As mp_map_lookup is a very hot function (e.g. regular dictionaries as well as class locals and globals), this has a significant performance benefit. However, we also need to deal with the case where a map might contain a mixture of str (hashed the regular way) and qstr keys (hashed with ID), so it needs to detect and fall back to a linear search. This work was funded through GitHub Sponsors. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This provides a significant performance boost for qstr_find_strn, which is called a lot during parsing and loading of .mpy files, as well as interning of string objects (which happens in most string methods that return new strings). This work was funded through GitHub Sponsors. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
These are part of the .mpy ABI and avoid needing to duplicate string data for QSTRs known to already be in the firmware. This work was funded through GitHub Sponsors. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This work was funded through GitHub Sponsors. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This test depends on the order in which qstrs are stored in ROM, which affects the order in which `dir()` will probe the object to see what it supports. Because of the lazy-loading in asyncio/__init__.py, if it tries to do e.g. `wait_for_ms` before `funcs` then it will import funcs, making `funcs` later succeed. But in the other way around, `funcs` will initially not be found. This work was funded through GitHub Sponsors. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
179ee10
to
8e28db8
Compare
@@ -220,6 +220,66 @@ | |||
"values", | |||
"write", | |||
"zip", | |||
# Additional QSTRs that must have index <255 (these are not part of | |||
# the .mpy compatibility list though). | |||
"__bool__", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these added here, don't they increase the size of ports that don't use these qstrs? Eg bare-arm and minimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because these qstrs must have index <255 (e.g. the operator code assumes that the qstr for the operator name is a byte). Therefore they cannot go into a sorted pool, so they must be put in the unsorted pool at the start.
makeqstrdata.py only append these here if they appear in the required qstr list from the firmware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But in a comment you made above you said that:
It adds a bunch of fixed qstrs (mostly dunder operators) that were not previously included in the bare-arm build but now need to be forced into the first pool so that their id < 256.
So... is it possible to not include these in the build if they aren't used? (Mainly I'm interested in getting the bare-arm size diff down.)
I added some tests for general string performance and the map lookup behavior is impacted badly. Closing this in favour of #12678 which just implements the sorted pools. |
This (WIP) PR includes
fourthree changes to qstr behavior:Allow setting
MICROPY_QSTR_BYTES_IN_HASH
to zero, and make this the default. This has a significant code size saving (~3.5kiB on PYBv11) due to removing two bytes per qstr. This also translates to RAM saving at runtime for any strings interned at runtime. It also frees up a size_t field in mp_obj_str_t, which we could use in the future as the buffer for short string data (in particular for slices of bytes this might be a win).Add a mechanism to speed up map lookups by using the qstr ID rather than computing a hash for the string. This is even faster than using the previous qstr hash because it saves looking up the qstr id to its hash. This isn't entirely straightforward and has an interesting degenerate case because maps could contain both str and qstr keys, see the commit and code comments for details.
Split the built-in qstrs (formerly base+frozen) into (mpy+base+frozen) and sort base and frozen. This allows a much faster lookup of qstrs via binary search, which is important for .mpy loading, .py parsing, and string operations that need to check whether a result can be de-duped. This is the exact same idea introduced in py: Faster qstr search #6896 by @amirgon but side-steps the issue of sorting-by-hash now that hashes are removed. The implementation is simplified to be a smaller overall change.
In the future we could also sort the mpy qstrs.
Enable CSUPEROPT for the sensitive parts of map.c and qstr.c (mp_map_lookup
,find_qstr
,qstr_add
,qstr_strncmp
,qstr_find_strn
,qstr_from_str
, qstr_from_strn,
qstr_hash,
qstr_len,
qstr_str,
qstr_data`). This is worth about 3% (and more on some tests) for a cost of 536 bytes. (Note: only applies to the Make-based ports). I'm not sure this is worth it. A performance diff for PYBv11 is included at the end.Overall this PR currently comes in at -2248 bytes on PYBV11. The performance diff on PYBV11 is:
Note that the hexiom score here is misleading -- it get a significant boost (~13%) from sorted qstrs because they happened to hit the pathological case with map lookup caching where two hot keys (
len
andalready_done
were exactly 128 apart, and therefore shared the same cache entry). By sorting the qstrs,len
happened to moved to a different index. Similar (~7%) for pystone.On Linux 64-bit:
And rp2040 (Pico)
ESP32 does not work well. Will need to investigate further. I think this is much more than the usual code-moves-around perf diff we see on ESP32.
Performance diff for map_opt.c / qstr_opt.c on PYBv11: