Description
Epigraph:
> * str is now unicode => unicode is no longer a pain in the a****
True. Now byte strings are a pain in the arse.
- https://mail.python.org/pipermail/python-list/2010-June/580107.html
So, one of the changes in Python3 with respect to Python2 is that strings are Unicode by default. And that's exactly the change which is not friendly to constrained environments. It may sound progressive and innovative to have Unicode by default, but a conservative approach says that Unicode is an extra, advanced, special-purpose feature, and forcing it as the "default" is hardly wise. You can get far with just 8-bit-transparent byte strings. For example, you can write fairly advanced web applications which just accept input from the user, store it, and then render it back, without ever trying to interpret the string contents. That's exactly how the Python2 string type worked. And not only did Python3 force Unicode on everyone, it also killed the unobtrusive Python2 8-bit strings. Instead it introduced "byte strings" (the bytes type). But they're not Python2 strings:
$ python3.3
>>> b"123"[0]
49
So, if you had good times with Python2, with Python3 you either need to thrash your heap (all 8KB of it), or write code which looks more complicated than in Python2 ("if s[0] == ord('0')"?) and is not compatible with it.
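To make the difference concrete, here is a small sketch for stock CPython 3 that compares a few ways of testing the first byte of a byte string. The variable names are made up for illustration. Whether the slice form costs a heap allocation depends on the implementation, so treat the comments about cost as an assumption rather than a measurement.

```python
# Indexing a Python3 bytes object gives an int, unlike a Python2 str.
s = b"0123"

first = s[0]      # an int (48), not a one-byte string
sliced = s[0:1]   # a bytes object, b"0"; on a small heap this may be a new allocation

# Three ways to ask "does s start with the byte '0'?"
by_ord = s[0] == ord("0")        # no temporary object, but noisier than Python2
by_slice = s[0:1] == b"0"        # reads like Python2, goes through a temporary
by_prefix = s.startswith(b"0")   # also valid in Python2, where b"0" is a plain str

print(first, sliced, by_ord, by_slice, by_prefix)
```

Only the `ord()` form is what the quoted `if s[0] == ord('0')` refers to. The other two are shown for comparison.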
So, how to deal with this madness in MicroPython? First of all, let's look at what we have now:
$ ./py
>>> "123"[0]
49
Ahem, so unlike what the "// XXX a massive hack!" comments say, it's not a hack; it's just that uPy so far implements byte strings. But:
$ ./py
>>> b"123"[0]
code 0x8b21fac, byte code 0x17 not implemented
py: ../py/vm.c:477: mp_execute_byte_code_2: Assertion `0' failed.
Aborted
So, what to do with "default Unicode" strings in uPy? It goes without saying that the in-memory representation for them should be utf8: we simply don't have the wealth of memory to waste on 2- or 4-byte representations. Of course, using utf8 means expensive random access, so it would be nice to have a special (but oh-so-common) case for ASCII-only strings to support fast random access. Here the Python2 lover says that special-case 1-byte strings are, well, just Python2 strings. Outlawed by Python3, they are still useful for optimizing MCU-specific apps. And while 2- or 4-byte representations don't scale for MCUs, they're not so mad for the POSIX build.
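The random-access tradeoff can be sketched in plain Python. `Utf8Str` below is a hypothetical class, not anything in uPy. It assumes the string stores a one-bit "ASCII only" flag computed at creation, and it leaves out negative indices to stay short.

```python
class Utf8Str:
    """Hypothetical utf8-backed string with an ASCII fast path."""

    def __init__(self, text):
        self.data = text.encode("utf-8")
        # Encoded length equals character count only when every char is ASCII.
        # A real implementation would keep this as a flag bit in the header.
        self.is_ascii = len(self.data) == len(text)

    def __getitem__(self, i):
        if i < 0:
            raise IndexError("negative indices are omitted from this sketch")
        if self.is_ascii:
            # One byte per character: constant-time lookup.
            return chr(self.data[i])
        # General case: scan from the start, counting lead bytes (linear time).
        start = None
        seen = 0
        for pos, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:  # not a continuation byte, so a char starts here
                if seen == i:
                    start = pos
                    break
                seen += 1
        if start is None:
            raise IndexError("string index out of range")
        # Extend over the continuation bytes that belong to this character.
        end = start + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[start:end].decode("utf-8")


print(Utf8Str("123")[0], Utf8Str("h\u00e9llo")[1])
```

The ASCII branch is the "Python2 string" case from the paragraph above: same storage, just indexed per byte.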
Let's also backtrack to byte strings: they also have a mutable counterpart, bytearray.
So, here are all the potential string types which would be nice to support:
1. bytes
2. bytearray
3. utf8
4. ASCII/8bit string
5. 16bit string
6. 32bit string
And don't forget about interned strings, which are apparently utf8, but of course with the ASCII optimization:
7. interned utf8
8. interned ascii
We can also remember array.array: it's very important for uPy, but can stay an extension type.
So, there are clearly no free tag bits in object pointers to store the string type. But #8 proposes to add a string header with a hash and variable-length size encoding. Well, as it's variable-length, we can steal a few bits from the initial size byte to hold the type, still keeping only a 2-byte overhead for short strings.
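A rough sketch of what "stealing bits from the initial size byte" could look like, modelled in Python for readability. The exact bit layout here is an assumption made up for illustration and is not taken from #8: the top 3 bits of the first byte hold the type (enough for the 8 kinds listed above), 1 bit is a continuation flag, the low 4 bits hold the low bits of the length, and any further bytes carry 7 length bits each plus a continuation flag. The hash field is left out.

```python
# Hypothetical type tags, one per kind of string listed above.
TYPES = ["bytes", "bytearray", "utf8", "ascii",
         "str16", "str32", "interned_utf8", "interned_ascii"]


def encode_header(type_id, length):
    """Pack a 3-bit type and a variable-length size into as few bytes as possible."""
    if not (0 <= type_id < 8) or length < 0:
        raise ValueError("type_id must fit in 3 bits and length must be >= 0")
    first = (type_id << 5) | (length & 0x0F)
    length >>= 4
    if length:
        first |= 0x10  # continuation flag: more size bytes follow
    out = bytearray([first])
    while length:
        byte = length & 0x7F
        length >>= 7
        if length:
            byte |= 0x80  # continuation flag on follow-up bytes
        out.append(byte)
    return bytes(out)


def decode_header(buf):
    """Return (type_id, length, header_size) for a header built by encode_header."""
    first = buf[0]
    type_id = first >> 5
    length = first & 0x0F
    shift, pos = 4, 1
    more = bool(first & 0x10)
    while more:
        byte = buf[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        shift += 7
        more = bool(byte & 0x80)
    return type_id, length, pos


print(encode_header(3, 5).hex(), decode_header(encode_header(2, 300)))
```

With this particular layout, strings of up to 15 bytes need a single size/type byte, so if the hash were to take one byte as well (again an assumption), the total would match the 2-byte overhead mentioned above.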
Thoughts?