Skip to content

py/persistentcode: Add a qstr window to save mpy files more efficiently. #4549

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

dpgeorge
Copy link
Member

Following on from the analysis described in #3054 (comment), here is an implementation of a sliding qstr window used to reduce the number of qstrs stored in a .mpy file. The window size is configured to 32 entries which takes a fixed 64 bytes (16-bits each) on the C stack when loading/saving a .mpy file. It allows to remember the most recent 32 qstrs so they don't need to be stored again in the .mpy file. The qstr window uses a simple least-recently-used mechanism to discard the least recently used qstr when the window overflows.

Pros: only needs a single pass to save/load; uses fixed amount of C stack regardless of number of qstrs in the .mpy file; for small to moderate sized .mpy files it's optimal (no duplicate qstrs are stored).

Cons: extra code, extra complexity; doesn't do the best job for large .mpy files (eg saves maybe 20% when in the best case could save up to 30%).

@pfalcon
Copy link
Contributor

pfalcon commented Feb 25, 2019

Sorry, didn't have time to comment on that so far, but that idea sounds good, if this sliding window is "optional" and configurable. E.g., an .mpy could specify what sliding window size if used, and it could be from 0 (to allow quick and easy .mpy generation) to the number of unique qstr's used. If it's small enough, it's allocated on the stack, as you propose.

@dpgeorge
Copy link
Member Author

if this sliding window is "optional"

Yes it is optional for the creator of an mpy file: they can simply create mpy files as before and store all qstsr, the only difference is in the encoding of the length of the qstr because now the LSB of that is used to indicate if the qstr comes from the window or not. So the feature is opt-in for the writer (but not the reader).

E.g., an .mpy could specify what sliding window size if used, and it could be from 0 (to allow quick and easy .mpy generation) to the number of unique qstr's used. If it's small enough, it's allocated on the stack, as you propose.

The idea was not to make the window size configurable, but raher have it fixed at N=32, which is a balance of stack usage vs compression. Having it configurable means adding another field to the .mpy, adds complexity for the loader to support a variable window size (and means it loses its property of "bounded RAM usage" for this window), and then needs a new option to mpy-cross to specify the window size. So I'd rather keep it fixed for now, and maybe later introduce a configuration option for it.

@pfalcon
Copy link
Contributor

pfalcon commented Feb 26, 2019

Yes it is optional for the creator of an mpy file

Good, didn't yet have a chance to review the code. But does it mean that now we limit number of qstr's to 32K. Oh-oh, that's getting pretty small.

Having it configurable means adding another field to the .mpy,

Yes, usually a one byte.

adds complexity for the loader to support a variable window size

Yes, pretty trivial complexity, similar to how function data stack allocation is handled for a function call: if it's smaller than a threshold, it's allocation on stack, otherwise, on heap.

and then needs a new option to mpy-cross to specify the window size

Not really, can use just hardcoded value of 32. That option is to allow to compress .mpy as much as possible, and all those optimization options don't scale to be implemented in C with mpy-cross, they're for separate tools scalably implemented in Python.

So, effectively it's about making .mpy format flexible by including this "window size" param (again, usually a byte). The loader can just error out for now if its >32, but at least format is there for future extension.

@dpgeorge
Copy link
Member Author

But does it mean that now we limit number of qstr's to 32K.

No, it's encoded as a variable uint (like most fields in the mpy file). Also it's the length of the qstr, not the number of qstrs that holds this flag.

@dpgeorge
Copy link
Member Author

So, effectively it's about making .mpy format flexible by including this "window size" param (again, usually a byte). The loader can just error out for now if its >32, but at least format is there for future extension.

Ok, I encoded the window size in the mpy, using 3 spare bits from the existing "features byte" which is the 3rd byte in the file. Encoded as a log2 number this allows a window size between 8 and 1024 qstrs (inclusive).

Force pushed this branch with some additional comments.

@pfalcon
Copy link
Contributor

pfalcon commented Feb 28, 2019

Ok, I encoded the window size in the mpy, using 3 spare bits from the existing "features byte" which is the 3rd byte in the file.

Thanks much for listening, but all that was a discussion-level so far (if only there was some kind of schedule, e.g. idea to finish off PRs already much time in the queue, while not hasting with the code changes for too-fresh things like this). And "still have arbitrary limit of the table, but also steal bits which will be required to encode future features" is totally not what I meant. I proposed to have a restriction-free table size, which would just cost 1 byte of overhead per .mpy file. (For comparison, will little percentage of functions are closures, and yet 1 byte ("end of closed over var list") is wasted for all non-closure functions. Could have go a bit somewhere. Alas, that would make calling functions slower.)

No, it's encoded as a variable uint (like most fields in the mpy file). Also it's the length of the qstr, not the number of qstrs that holds this flag.

Ok, as I mentioned, I didn't have a chance to review the code (swamped with work), now I did, and I see that it's pretty far away from picture I had in mind and shared (intermediate indirection table for qstrs). There're still 2 wasted bytes in the encoding of each qstr-taking instruction (btw, could be 0 on saving, to improve external LZ compressability (tarballs, etc.)). So, this is nice practical hack solution which goes with stats on assessing its efficiency, while I was talking of a way to achieve starting-from-theorhetical high efficiency. In my scheme, bytecode bytes of qstr instructions would be used, and there would be possibility to not duplicate any qstr at all (not even by specifying its index).

So, sorry for the confusion. I guess, if I want to continue, I'll need to roll out my own format after all. If you have previous version (without MPY_FEATURE_DECODE_QSTR_WINDOW_SIZE_LOG2(), please revert to it it, if not, it it seems to be pretty isolated and getting back to previous version would be just ~5 line cuts).

STATIC qstr qstr_window_pull(qstr_window_t *qw, size_t idx) {
qstr qst = qw->window[idx];
if (idx > qw->idx) {
memmove(&qw->window[idx], &qw->window[idx + 1], (QSTR_WINDOW_SIZE - idx - 1) * sizeof(uint16_t));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, with all these memmove's, I wouldn't call this mechanism "simple and fast".

@pfalcon
Copy link
Contributor

pfalcon commented Feb 28, 2019

There're still 2 wasted bytes in the encoding of each qstr-taking instruction (btw, could be 0 on saving, to improve external LZ compressability (tarballs, etc.))

Well, it's still not too late to make it like: 0 value in qstr bytecode means "read from stream", otherwise its index in the LRU table - 1 (then still would make sense to make the table size configurable, just not by stealing bits from other (future) features).

@dpgeorge
Copy link
Member Author

And "still have arbitrary limit of the table, but also steal bits which will be required to encode future features" is totally not what I meant.

If these bits are needed to encode future features then the layout of the header can be changed at that time, because anyway it'll need a new mpy version. Right now it tries to be efficient with all the features that are supported.

I proposed to have a restriction-free table size, which would just cost 1 byte of overhead per .mpy file.

Any fixed size entry (like 1 byte) will still be a restriction, right?

Well, it's still not too late to make it like: 0 value in qstr bytecode means "read from stream", otherwise its index in the LRU table - 1

I did think of this but I couldn't see how to do it "inplace", so left it out. Because you need to write the qstrs after the bytecode you need to record the decisions about where the qstr goes (literal in the file or access from the window) as you write the bytecode. So it either requires 1) to duplicate the bytecode (one to record the decisions, the other to retain the qstrs); 2) make 2 passes over the bytecode and reset the qstr window before the second pass; 3) create the qstr table in RAM as the bytecode is written then write the qstr table out; or 4) actually encode the qstr literals within the bytecode itself (so expand the bytecode as it is written); and 5) maybe another scheme. Well, it might be possible to do (4) inplace.

@pfalcon
Copy link
Contributor

pfalcon commented Feb 28, 2019

If these bits are needed to encode future features then the layout of the header can be changed at that time, because anyway it'll need a new mpy version. Right now it tries to be efficient with all the features that are supported.

So, from my side, the talk was about designing flexible, extensible if needed format, without "premature optimizations" or outright practical hacks. (That desire is based on my experience of writing (well, refactoring mpy-tool.py) into generic .mpy parser/generator module in Python, and cases like #4378 .

I proposed to have a restriction-free table size, which would just cost 1 byte of overhead per .mpy file.
Any fixed size entry (like 1 byte) will still be a restriction, right?

I of course mean varint encoding, which would have minimum, and common, overhead of 1 byte for table of size <128. And I also mean arbitrary-size table, not power of two, no randomly low size. Again, the ultimate idea is to be able to specify table with the size of number of unique qstrs in the module, and neither encode the same qstr more than once, nor allocate larger table than needed during load.

I did think of this but I couldn't see how to do it "inplace", so left it out.
2) make 2 passes over the bytecode and reset the qstr window before the second pass; 3) create the qstr table in RAM as the bytecode is written then write the qstr table out;

I believe I also already said that I don't see any problem with doing multiple passes during saving (as long as there's a way to have non-optimized single-pass write-out for small tools). I also said that doing .mpy compilation in C doesn't scale, because even trivial optimizations take considerable effort, and this discussion of the complication of two-pass save is well proves that ;-). Note that a complication arises because of the "sliding window" scheme, with my original idea of using lossless, fully optimal table, there wouldn't be issue of where to store intermediate data. And of course, it would be just a few lines in Python ;-).

But all in all, this looks like you embarked on addressing a long-waiting issue
#3054, but I'm bombarding you with "extensibility" ideas. I don't want to complicate this beyond your original intention. (Even though I'd say it's last-mile issue. Even if you need to add a list/dynarray of qstrs id's which need to be explicitly written out to mpy-cross - so what, it's fairly trivial even in C.)

@dpgeorge
Copy link
Member Author

dpgeorge commented Mar 1, 2019

  1. actually encode the qstr literals within the bytecode itself (so expand the bytecode as it is written);

I found a way to do this inplace. See #4564 for this and other mpy-reduction PRs combined.

@dpgeorge
Copy link
Member Author

dpgeorge commented Mar 5, 2019

This feature was merged in 5996eeb

@dpgeorge dpgeorge closed this Mar 5, 2019
@dpgeorge dpgeorge deleted the py-mpy-qstr-window branch March 5, 2019 05:49
tannewt added a commit to tannewt/circuitpython that referenced this pull request Apr 8, 2021
displayio.Bitmap: Allow modification though the buffer protocol
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants