-
-
Notifications
You must be signed in to change notification settings - Fork 8.3k
py/persistentcode: Add a qstr window to save mpy files more efficiently. #4549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Sorry, didn't have time to comment on that so far, but that idea sounds good, if this sliding window is "optional" and configurable. E.g., an .mpy could specify what sliding window size if used, and it could be from 0 (to allow quick and easy .mpy generation) to the number of unique qstr's used. If it's small enough, it's allocated on the stack, as you propose. |
Yes it is optional for the creator of an mpy file: they can simply create mpy files as before and store all qstsr, the only difference is in the encoding of the length of the qstr because now the LSB of that is used to indicate if the qstr comes from the window or not. So the feature is opt-in for the writer (but not the reader).
The idea was not to make the window size configurable, but raher have it fixed at N=32, which is a balance of stack usage vs compression. Having it configurable means adding another field to the .mpy, adds complexity for the loader to support a variable window size (and means it loses its property of "bounded RAM usage" for this window), and then needs a new option to mpy-cross to specify the window size. So I'd rather keep it fixed for now, and maybe later introduce a configuration option for it. |
Good, didn't yet have a chance to review the code. But does it mean that now we limit number of qstr's to 32K. Oh-oh, that's getting pretty small.
Yes, usually a one byte.
Yes, pretty trivial complexity, similar to how function data stack allocation is handled for a function call: if it's smaller than a threshold, it's allocation on stack, otherwise, on heap.
Not really, can use just hardcoded value of 32. That option is to allow to compress .mpy as much as possible, and all those optimization options don't scale to be implemented in C with mpy-cross, they're for separate tools scalably implemented in Python. So, effectively it's about making .mpy format flexible by including this "window size" param (again, usually a byte). The loader can just error out for now if its >32, but at least format is there for future extension. |
No, it's encoded as a variable uint (like most fields in the mpy file). Also it's the length of the qstr, not the number of qstrs that holds this flag. |
4286058
to
0e6012a
Compare
Ok, I encoded the window size in the mpy, using 3 spare bits from the existing "features byte" which is the 3rd byte in the file. Encoded as a log2 number this allows a window size between 8 and 1024 qstrs (inclusive). Force pushed this branch with some additional comments. |
Thanks much for listening, but all that was a discussion-level so far (if only there was some kind of schedule, e.g. idea to finish off PRs already much time in the queue, while not hasting with the code changes for too-fresh things like this). And "still have arbitrary limit of the table, but also steal bits which will be required to encode future features" is totally not what I meant. I proposed to have a restriction-free table size, which would just cost 1 byte of overhead per .mpy file. (For comparison, will little percentage of functions are closures, and yet 1 byte ("end of closed over var list") is wasted for all non-closure functions. Could have go a bit somewhere. Alas, that would make calling functions slower.)
Ok, as I mentioned, I didn't have a chance to review the code (swamped with work), now I did, and I see that it's pretty far away from picture I had in mind and shared (intermediate indirection table for qstrs). There're still 2 wasted bytes in the encoding of each qstr-taking instruction (btw, could be 0 on saving, to improve external LZ compressability (tarballs, etc.)). So, this is nice practical So, sorry for the confusion. I guess, if I want to continue, I'll need to roll out my own format after all. If you have previous version (without MPY_FEATURE_DECODE_QSTR_WINDOW_SIZE_LOG2(), please revert to it it, if not, it it seems to be pretty isolated and getting back to previous version would be just ~5 line cuts). |
STATIC qstr qstr_window_pull(qstr_window_t *qw, size_t idx) { | ||
qstr qst = qw->window[idx]; | ||
if (idx > qw->idx) { | ||
memmove(&qw->window[idx], &qw->window[idx + 1], (QSTR_WINDOW_SIZE - idx - 1) * sizeof(uint16_t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, with all these memmove's, I wouldn't call this mechanism "simple and fast".
Well, it's still not too late to make it like: 0 value in qstr bytecode means "read from stream", otherwise its index in the LRU table - 1 (then still would make sense to make the table size configurable, just not by stealing bits from other (future) features). |
If these bits are needed to encode future features then the layout of the header can be changed at that time, because anyway it'll need a new mpy version. Right now it tries to be efficient with all the features that are supported.
Any fixed size entry (like 1 byte) will still be a restriction, right?
I did think of this but I couldn't see how to do it "inplace", so left it out. Because you need to write the qstrs after the bytecode you need to record the decisions about where the qstr goes (literal in the file or access from the window) as you write the bytecode. So it either requires 1) to duplicate the bytecode (one to record the decisions, the other to retain the qstrs); 2) make 2 passes over the bytecode and reset the qstr window before the second pass; 3) create the qstr table in RAM as the bytecode is written then write the qstr table out; or 4) actually encode the qstr literals within the bytecode itself (so expand the bytecode as it is written); and 5) maybe another scheme. Well, it might be possible to do (4) inplace. |
So, from my side, the talk was about designing flexible, extensible if needed format, without "premature optimizations" or outright practical hacks. (That desire is based on my experience of writing (well, refactoring mpy-tool.py) into generic .mpy parser/generator module in Python, and cases like #4378 .
I of course mean varint encoding, which would have minimum, and common, overhead of 1 byte for table of size <128. And I also mean arbitrary-size table, not power of two, no randomly low size. Again, the ultimate idea is to be able to specify table with the size of number of unique qstrs in the module, and neither encode the same qstr more than once, nor allocate larger table than needed during load.
I believe I also already said that I don't see any problem with doing multiple passes during saving (as long as there's a way to have non-optimized single-pass write-out for small tools). I also said that doing .mpy compilation in C doesn't scale, because even trivial optimizations take considerable effort, and this discussion of the complication of two-pass save is well proves that ;-). Note that a complication arises because of the "sliding window" scheme, with my original idea of using lossless, fully optimal table, there wouldn't be issue of where to store intermediate data. And of course, it would be just a few lines in Python ;-). But all in all, this looks like you embarked on addressing a long-waiting issue |
I found a way to do this inplace. See #4564 for this and other mpy-reduction PRs combined. |
This feature was merged in 5996eeb |
displayio.Bitmap: Allow modification though the buffer protocol
Following on from the analysis described in #3054 (comment), here is an implementation of a sliding qstr window used to reduce the number of qstrs stored in a .mpy file. The window size is configured to 32 entries which takes a fixed 64 bytes (16-bits each) on the C stack when loading/saving a .mpy file. It allows to remember the most recent 32 qstrs so they don't need to be stored again in the .mpy file. The qstr window uses a simple least-recently-used mechanism to discard the least recently used qstr when the window overflows.
Pros: only needs a single pass to save/load; uses fixed amount of C stack regardless of number of qstrs in the .mpy file; for small to moderate sized .mpy files it's optimal (no duplicate qstrs are stored).
Cons: extra code, extra complexity; doesn't do the best job for large .mpy files (eg saves maybe 20% when in the best case could save up to 30%).