all: Compress error strings in ROM (v2) #5861

dpgeorge · 2020-04-04T13:03:41Z

This is #5168 rebased on latest master, with a minor addition to lazily compute the error string hash only if it's extracted (so raised errors that don't have their message extracted will never spend time unnecessarily computing the hash).

Code size change with this PR (0 for ports that don't enable the string compression):

   bare-arm:  -600 -0.906% 
minimal x86:  -308 -0.208% 
   unix x64:    +0 +0.000% 
unix nanbox:    +0 +0.000% 
      stm32: -3368 -0.869% PYBV10
     cc3200: -1024 -0.558% 
    esp8266: -2308 -0.338% GENERIC
      esp32: -2688 -0.199% GENERIC[incl -2976(data)]
        nrf: -1708 -1.175% pca10040
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS

dpgeorge · 2020-04-04T13:17:54Z

@jimmo I noticed the new str typedef you introduced is called _mp_rom_error_text. I guess you made it have an underscore prefix because it's considered internal, but such identifiers are technically reserved for the compiler to use, and we avoid them. And for consistency with other types I think it's good that it ends in _t. So, what about changing it to mp_rom_error_text_t? (I can do it in this PR.)

jimmo · 2020-04-04T22:30:33Z

So, what about changing it to mp_rom_error_text_t? (I can do it in this PR.)

Yes, agreed. Originally my goal was to just find the places that my script had missed, and wasn't planning to keep it around, so hadn't given much thought to the name. But I think it's useful to keep and your suggested name makes much more sense.
Thanks!

dpgeorge · 2020-04-05T02:25:18Z

But I think it's useful to keep and your suggested name makes much more sense.

Ok, now renamed (and force-pushed because I needed to rebase-edit to change the name across multiple commits).

Instead of compiler-level if-logic. This is necessary to know what error strings are included in the build at the preprocessor stage, so that string compression can be implemented.

So this setting could be used by other source files if needed.

This commit makes all functions and function wrappers in modubinascii.c STATIC and conditional on the MICROPY_PY_UBINASCII setting, which will exclude the file from qstr/ compressed-string searching when ubinascii is not enabled. The now-unused modubinascii.h header file is also removed. The cc3200 port is updated accordingly to use this module in its entirety instead of providing its own top-level definition of ubinascii. This was originally like this because the cc3200 port has its own ubinascii module which referenced these methods. The plan appeared to be that the API might diverge (e.g. hardware crc), but this should be done similar to I2C/SPI via a port-specific handler, rather than the port having its own definition of the module. Having a centralised module definition also enforces consistency of the API among ports.

The idea here is that there's a moderate amount of ROM used up by exception text. Obviously we try to keep the messages short, and the code can enable terse errors, but it still adds up. Listed below is the total string data size for various ports: bare-arm 2860 minimal 2876 stm32 8926 (PYBV11) cc3200 3751 esp32 5721 This commit implements compression of these strings. It takes advantage of the fact that these strings are all 7-bit ascii and extracts the top 128 frequently used words from the messages and stores them packed (dropping their null-terminator), then uses (0x80 | index) inside strings to refer to these common words. Spaces are automatically added around words, saving more bytes. This happens transparently in the build process, mirroring the steps that are used to generate the QSTR data. The MP_COMPRESSED_ROM_TEXT macro wraps any literal string that should compressed, and it's automatically decompressed in mp_decompress_rom_string. There are many schemes that could be used for the compression, and some are included in py/makecompresseddata.py for reference (space, Huffman, ngram, common word). Results showed that the common-word compression gets better results. This is before counting the increased cost of the Huffman decoder. This might be slightly counter-intuitive, but this data is extremely repetitive at a word-level, and the byte-level entropy coder can't quite exploit that as efficiently. Ideally one would combine both approaches, but for now the common-word approach is the one that is used. For additional comparison, the size of the raw data compressed with gzip and zlib is calculated, as a sort of proxy for a lower entropy bound. With this scheme we come within 15% on stm32, and 30% on bare-arm (i.e. we use x% more bytes than the data compressed with gzip -- not counting the code overhead of a decoder, and how this would be hypothetically implemented). The feature is disabled by default and can be enabled by setting MICROPY_ROM_TEXT_COMPRESSION at the Makefile-level.

dpgeorge · 2020-04-05T04:31:34Z

I squashed the commits that enabled it on ports into one commit, and update commit messages on most commits.

The decompression of error-strings is only done if the string is accessed via printing or via er.args. Tests are added for this feature to ensure the decompression works.

Enabled on: bare-arm, minimal, unix coverage/dev/minimal, stm32, esp32, esp8266, cc3200, teensy, qemu-arm, nrf. Not enabled on others to be able to test the code when the feature is disabled (the default case). Code size change for this commit: bare-arm: -600 -0.906% minimal x86: -308 -0.208% unix x64: +0 +0.000% unix nanbox: +0 +0.000% stm32: -3368 -0.869% PYBV10 cc3200: -1024 -0.558% esp8266: -2512 -0.368% GENERIC esp32: -2876 -0.205% GENERIC[incl -3168(data)] nrf: -1708 -1.173% pca10040 samd: +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS

dpgeorge · 2020-04-05T05:28:06Z

Ok, I'm happy with this, it's ready to go in.

dpgeorge · 2020-04-07T03:36:10Z

Merged!

tannewt · 2020-04-07T17:51:31Z

Is it possible to use the compression for non-error strings? We compress the messages printed to the serial console as well (after we translate them.)

jimmo · 2020-04-08T01:15:44Z

hey @tannewt, yes for sure. The help text being an obvious candidate. I did some preliminary tests and there was definitely a benefit! (The help text on STM32 is over 2kiB).

I really wanted to avoid the need for any heap allocation, and minimise stack usage while decompressing, hence why the MP_MAX_UNCOMPRESSED_TEXT_LEN constant gets generated during the compression stage. But if we add longer text like the help text then that would skew that.

So the missing thing is a way to stream compressed data straight to a printer. There was also a TODO I added in this PR to make string formatting work with a compressed format string (without first decompressing in full) which would work in the same way.

micropython deleted a comment Apr 5, 2020

dpgeorge force-pushed the compress-error-strings-v2 branch from 4b0692c to a4f66e9 Compare April 5, 2020 02:23

dpgeorge added the py-core Relates to py/ directory in source label Apr 5, 2020

jimmo added 6 commits April 5, 2020 14:11

py: Use preprocessor to detect error reporting level (terse/detailed).

a9a745e

Instead of compiler-level if-logic. This is necessary to know what error strings are included in the build at the preprocessor stage, so that string compression can be implemented.

stm32/Makefile: Add missing ordering dependency on generated headers.

994c1dd

py/dynruntime.mk: Set MICROPY_ENABLE_DYNRUNTIME instead of per module.

c34e7b9

So this setting could be used by other source files if needed.

windows: Update genhdr.targets to match makeqstrdefs.py args.

92c83bd

dpgeorge force-pushed the compress-error-strings-v2 branch from a4f66e9 to 1946270 Compare April 5, 2020 04:31

jimmo added 3 commits April 5, 2020 15:02

py/objexcept: Allow compression of exception message text.

85858e7

The decompression of error-strings is only done if the string is accessed via printing or via er.args. Tests are added for this feature to ensure the decompression works.

all: Use MP_ERROR_TEXT for all error messages.

def76fe

dpgeorge force-pushed the compress-error-strings-v2 branch from 1946270 to 073b9a5 Compare April 5, 2020 05:03

dpgeorge merged commit 073b9a5 into micropython:master Apr 7, 2020

dpgeorge deleted the compress-error-strings-v2 branch April 7, 2020 03:36

dpgeorge mentioned this pull request Apr 7, 2020

top: Compress error strings in ROM #5168

Closed

tannewt mentioned this pull request Sep 14, 2020

compression: Implement @ciscorn's dictionary approach adafruit/circuitpython#3398

Merged

romanz mentioned this pull request Apr 25, 2025

perf(core): enable MICROPY_ROM_TEXT_COMPRESSION trezor/trezor-firmware#4961

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

all: Compress error strings in ROM (v2) #5861

all: Compress error strings in ROM (v2) #5861

Uh oh!

dpgeorge commented Apr 4, 2020

Uh oh!

dpgeorge commented Apr 4, 2020

Uh oh!

jimmo commented Apr 4, 2020

Uh oh!

dpgeorge commented Apr 5, 2020

Uh oh!

dpgeorge commented Apr 5, 2020

Uh oh!

dpgeorge commented Apr 5, 2020

Uh oh!

dpgeorge commented Apr 7, 2020

Uh oh!

tannewt commented Apr 7, 2020

Uh oh!

jimmo commented Apr 8, 2020

Uh oh!

Uh oh!

Uh oh!

all: Compress error strings in ROM (v2) #5861

all: Compress error strings in ROM (v2) #5861

Uh oh!

Conversation

dpgeorge commented Apr 4, 2020

Uh oh!

dpgeorge commented Apr 4, 2020

Uh oh!

jimmo commented Apr 4, 2020

Uh oh!

dpgeorge commented Apr 5, 2020

Uh oh!

dpgeorge commented Apr 5, 2020

Uh oh!

dpgeorge commented Apr 5, 2020

Uh oh!

dpgeorge commented Apr 7, 2020

Uh oh!

tannewt commented Apr 7, 2020

Uh oh!

jimmo commented Apr 8, 2020

Uh oh!

Uh oh!