-
-
Notifications
You must be signed in to change notification settings - Fork 8.3k
all: Compress error strings in ROM (v2) #5861
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
all: Compress error strings in ROM (v2) #5861
Conversation
@jimmo I noticed the new str typedef you introduced is called |
Yes, agreed. Originally my goal was to just find the places that my script had missed, and wasn't planning to keep it around, so hadn't given much thought to the name. But I think it's useful to keep and your suggested name makes much more sense. |
4b0692c
to
a4f66e9
Compare
Ok, now renamed (and force-pushed because I needed to rebase-edit to change the name across multiple commits). |
Instead of compiler-level if-logic. This is necessary to know what error strings are included in the build at the preprocessor stage, so that string compression can be implemented.
So this setting could be used by other source files if needed.
This commit makes all functions and function wrappers in modubinascii.c STATIC and conditional on the MICROPY_PY_UBINASCII setting, which will exclude the file from qstr/ compressed-string searching when ubinascii is not enabled. The now-unused modubinascii.h header file is also removed. The cc3200 port is updated accordingly to use this module in its entirety instead of providing its own top-level definition of ubinascii. This was originally like this because the cc3200 port has its own ubinascii module which referenced these methods. The plan appeared to be that the API might diverge (e.g. hardware crc), but this should be done similar to I2C/SPI via a port-specific handler, rather than the port having its own definition of the module. Having a centralised module definition also enforces consistency of the API among ports.
The idea here is that there's a moderate amount of ROM used up by exception text. Obviously we try to keep the messages short, and the code can enable terse errors, but it still adds up. Listed below is the total string data size for various ports: bare-arm 2860 minimal 2876 stm32 8926 (PYBV11) cc3200 3751 esp32 5721 This commit implements compression of these strings. It takes advantage of the fact that these strings are all 7-bit ascii and extracts the top 128 frequently used words from the messages and stores them packed (dropping their null-terminator), then uses (0x80 | index) inside strings to refer to these common words. Spaces are automatically added around words, saving more bytes. This happens transparently in the build process, mirroring the steps that are used to generate the QSTR data. The MP_COMPRESSED_ROM_TEXT macro wraps any literal string that should compressed, and it's automatically decompressed in mp_decompress_rom_string. There are many schemes that could be used for the compression, and some are included in py/makecompresseddata.py for reference (space, Huffman, ngram, common word). Results showed that the common-word compression gets better results. This is before counting the increased cost of the Huffman decoder. This might be slightly counter-intuitive, but this data is extremely repetitive at a word-level, and the byte-level entropy coder can't quite exploit that as efficiently. Ideally one would combine both approaches, but for now the common-word approach is the one that is used. For additional comparison, the size of the raw data compressed with gzip and zlib is calculated, as a sort of proxy for a lower entropy bound. With this scheme we come within 15% on stm32, and 30% on bare-arm (i.e. we use x% more bytes than the data compressed with gzip -- not counting the code overhead of a decoder, and how this would be hypothetically implemented). The feature is disabled by default and can be enabled by setting MICROPY_ROM_TEXT_COMPRESSION at the Makefile-level.
a4f66e9
to
1946270
Compare
I squashed the commits that enabled it on ports into one commit, and update commit messages on most commits. |
The decompression of error-strings is only done if the string is accessed via printing or via er.args. Tests are added for this feature to ensure the decompression works.
Enabled on: bare-arm, minimal, unix coverage/dev/minimal, stm32, esp32, esp8266, cc3200, teensy, qemu-arm, nrf. Not enabled on others to be able to test the code when the feature is disabled (the default case). Code size change for this commit: bare-arm: -600 -0.906% minimal x86: -308 -0.208% unix x64: +0 +0.000% unix nanbox: +0 +0.000% stm32: -3368 -0.869% PYBV10 cc3200: -1024 -0.558% esp8266: -2512 -0.368% GENERIC esp32: -2876 -0.205% GENERIC[incl -3168(data)] nrf: -1708 -1.173% pca10040 samd: +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
1946270
to
073b9a5
Compare
Ok, I'm happy with this, it's ready to go in. |
Merged! |
Is it possible to use the compression for non-error strings? We compress the messages printed to the serial console as well (after we translate them.) |
hey @tannewt, yes for sure. The help text being an obvious candidate. I did some preliminary tests and there was definitely a benefit! (The help text on STM32 is over 2kiB). I really wanted to avoid the need for any heap allocation, and minimise stack usage while decompressing, hence why the MP_MAX_UNCOMPRESSED_TEXT_LEN constant gets generated during the compression stage. But if we add longer text like the help text then that would skew that. So the missing thing is a way to stream compressed data straight to a printer. There was also a TODO I added in this PR to make string formatting work with a compressed format string (without first decompressing in full) which would work in the same way. |
This is #5168 rebased on latest master, with a minor addition to lazily compute the error string hash only if it's extracted (so raised errors that don't have their message extracted will never spend time unnecessarily computing the hash).
Code size change with this PR (0 for ports that don't enable the string compression):