Skip to content

all: Compress error strings in ROM (v2) #5861

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Apr 7, 2020

Conversation

dpgeorge
Copy link
Member

@dpgeorge dpgeorge commented Apr 4, 2020

This is #5168 rebased on latest master, with a minor addition to lazily compute the error string hash only if it's extracted (so raised errors that don't have their message extracted will never spend time unnecessarily computing the hash).

Code size change with this PR (0 for ports that don't enable the string compression):

   bare-arm:  -600 -0.906% 
minimal x86:  -308 -0.208% 
   unix x64:    +0 +0.000% 
unix nanbox:    +0 +0.000% 
      stm32: -3368 -0.869% PYBV10
     cc3200: -1024 -0.558% 
    esp8266: -2308 -0.338% GENERIC
      esp32: -2688 -0.199% GENERIC[incl -2976(data)]
        nrf: -1708 -1.175% pca10040
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS

@dpgeorge
Copy link
Member Author

dpgeorge commented Apr 4, 2020

@jimmo I noticed the new str typedef you introduced is called _mp_rom_error_text. I guess you made it have an underscore prefix because it's considered internal, but such identifiers are technically reserved for the compiler to use, and we avoid them. And for consistency with other types I think it's good that it ends in _t. So, what about changing it to mp_rom_error_text_t? (I can do it in this PR.)

@jimmo
Copy link
Member

jimmo commented Apr 4, 2020

So, what about changing it to mp_rom_error_text_t? (I can do it in this PR.)

Yes, agreed. Originally my goal was to just find the places that my script had missed, and wasn't planning to keep it around, so hadn't given much thought to the name. But I think it's useful to keep and your suggested name makes much more sense.
Thanks!

@micropython micropython deleted a comment Apr 5, 2020
@dpgeorge dpgeorge force-pushed the compress-error-strings-v2 branch from 4b0692c to a4f66e9 Compare April 5, 2020 02:23
@dpgeorge
Copy link
Member Author

dpgeorge commented Apr 5, 2020

But I think it's useful to keep and your suggested name makes much more sense.

Ok, now renamed (and force-pushed because I needed to rebase-edit to change the name across multiple commits).

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label Apr 5, 2020
jimmo added 6 commits April 5, 2020 14:11
Instead of compiler-level if-logic.  This is necessary to know what error
strings are included in the build at the preprocessor stage, so that string
compression can be implemented.
So this setting could be used by other source files if needed.
This commit makes all functions and function wrappers in modubinascii.c
STATIC and conditional on the MICROPY_PY_UBINASCII setting, which will
exclude the file from qstr/ compressed-string searching when ubinascii is
not enabled.  The now-unused modubinascii.h header file is also removed.

The cc3200 port is updated accordingly to use this module in its entirety
instead of providing its own top-level definition of ubinascii.

This was originally like this because the cc3200 port has its own ubinascii
module which referenced these methods.  The plan appeared to be that the
API might diverge (e.g. hardware crc), but this should be done similar to
I2C/SPI via a port-specific handler, rather than the port having its own
definition of the module.  Having a centralised module definition also
enforces consistency of the API among ports.
The idea here is that there's a moderate amount of ROM used up by exception
text.  Obviously we try to keep the messages short, and the code can enable
terse errors, but it still adds up.  Listed below is the total string data
size for various ports:

    bare-arm 2860
    minimal 2876
    stm32 8926  (PYBV11)
    cc3200 3751
    esp32 5721

This commit implements compression of these strings.  It takes advantage of
the fact that these strings are all 7-bit ascii and extracts the top 128
frequently used words from the messages and stores them packed (dropping
their null-terminator), then uses (0x80 | index) inside strings to refer to
these common words.  Spaces are automatically added around words, saving
more bytes.  This happens transparently in the build process, mirroring the
steps that are used to generate the QSTR data.  The MP_COMPRESSED_ROM_TEXT
macro wraps any literal string that should compressed, and it's
automatically decompressed in mp_decompress_rom_string.

There are many schemes that could be used for the compression, and some are
included in py/makecompresseddata.py for reference (space, Huffman, ngram,
common word).  Results showed that the common-word compression gets better
results.  This is before counting the increased cost of the Huffman
decoder.  This might be slightly counter-intuitive, but this data is
extremely repetitive at a word-level, and the byte-level entropy coder
can't quite exploit that as efficiently.  Ideally one would combine both
approaches, but for now the common-word approach is the one that is used.

For additional comparison, the size of the raw data compressed with gzip
and zlib is calculated, as a sort of proxy for a lower entropy bound.  With
this scheme we come within 15% on stm32, and 30% on bare-arm (i.e. we use
x% more bytes than the data compressed with gzip -- not counting the code
overhead of a decoder, and how this would be hypothetically implemented).

The feature is disabled by default and can be enabled by setting
MICROPY_ROM_TEXT_COMPRESSION at the Makefile-level.
@dpgeorge dpgeorge force-pushed the compress-error-strings-v2 branch from a4f66e9 to 1946270 Compare April 5, 2020 04:31
@dpgeorge
Copy link
Member Author

dpgeorge commented Apr 5, 2020

I squashed the commits that enabled it on ports into one commit, and update commit messages on most commits.

jimmo added 3 commits April 5, 2020 15:02
The decompression of error-strings is only done if the string is accessed
via printing or via er.args.  Tests are added for this feature to ensure
the decompression works.
Enabled on: bare-arm, minimal, unix coverage/dev/minimal, stm32, esp32,
esp8266, cc3200, teensy, qemu-arm, nrf.  Not enabled on others to be able
to test the code when the feature is disabled (the default case).

Code size change for this commit:

   bare-arm:  -600 -0.906%
minimal x86:  -308 -0.208%
   unix x64:    +0 +0.000%
unix nanbox:    +0 +0.000%
      stm32: -3368 -0.869% PYBV10
     cc3200: -1024 -0.558%
    esp8266: -2512 -0.368% GENERIC
      esp32: -2876 -0.205% GENERIC[incl -3168(data)]
        nrf: -1708 -1.173% pca10040
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
@dpgeorge dpgeorge force-pushed the compress-error-strings-v2 branch from 1946270 to 073b9a5 Compare April 5, 2020 05:03
@dpgeorge
Copy link
Member Author

dpgeorge commented Apr 5, 2020

Ok, I'm happy with this, it's ready to go in.

@dpgeorge dpgeorge merged commit 073b9a5 into micropython:master Apr 7, 2020
@dpgeorge
Copy link
Member Author

dpgeorge commented Apr 7, 2020

Merged!

@dpgeorge dpgeorge deleted the compress-error-strings-v2 branch April 7, 2020 03:36
@tannewt
Copy link

tannewt commented Apr 7, 2020

Is it possible to use the compression for non-error strings? We compress the messages printed to the serial console as well (after we translate them.)

@jimmo
Copy link
Member

jimmo commented Apr 8, 2020

hey @tannewt, yes for sure. The help text being an obvious candidate. I did some preliminary tests and there was definitely a benefit! (The help text on STM32 is over 2kiB).

I really wanted to avoid the need for any heap allocation, and minimise stack usage while decompressing, hence why the MP_MAX_UNCOMPRESSED_TEXT_LEN constant gets generated during the compression stage. But if we add longer text like the help text then that would skew that.

So the missing thing is a way to stream compressed data straight to a printer. There was also a TODO I added in this PR to make string formatting work with a compressed format string (without first decompressing in full) which would work in the same way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
py-core Relates to py/ directory in source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants