Dealing with Python3 string-like types zoo #22


Closed
pfalcon opened this issue Jan 1, 2014 · 20 comments
Labels
rfc Request for Comment

Comments

@pfalcon
Contributor

pfalcon commented Jan 1, 2014

Epigraph:

>  * str is now unicode => unicode is no longer a pain in the a****

True. Now byte strings are a pain in the arse.

- https://mail.python.org/pipermail/python-list/2010-June/580107.html

So, one of the changes in Python3 with respect to Python2 is that strings are Unicode by default. And that's actually the change which is not friendly to constrained environments. It may sound progressive and innovative to have Unicode by default, but the conservative view is that Unicode is an extra, advanced, special-purpose feature, and forcing it as the "default" is hardly wise. You can get far with just 8bit-transparent byte strings; for example, you can write fairly advanced web applications which just accept input from the user, store it, and then render it back, without ever trying to interpret the string contents. That's exactly how the Python2 string type worked. And not only did Python3 force Unicode on everyone, it also killed the unobtrusive Python2 8-bit strings. Instead it introduced "byte strings" (the bytes type). But they're not Python2 strings:

$ python3.3
>>> b"123"[0]
49

So, if you had good times with Python2, with Python3 you either need to thrash your heap (all 8KB of it), or write code which looks more complicated than Python2 ("if s[0] == ord('0')"?) and isn't compatible with it.

So, how do we deal with this madness in MicroPython? First of all, let's look at what we have now:

$ ./py
>>> "123"[0]
49

Ahem, so unlike what the "// XXX a massive hack!" comment says, it's not a hack: it's just that uPy so far implements byte strings. But:

$ ./py
>>> b"123"[0]
code 0x8b21fac, byte code 0x17 not implemented
py: ../py/vm.c:477: mp_execute_byte_code_2: Assertion `0' failed.
Aborted

So, what to do with "default Unicode" strings in uPy? It goes without saying that the in-memory representation for them should be UTF-8: we simply don't have the wealth of memory to waste on 2- or 4-byte representations. Of course, using UTF-8 means expensive random access, so it would be nice to have a special (but oh-so-common) case for ASCII-only strings to support fast random access. Here a Python2 lover says that special-cased 1-byte strings are, well, just Python2 strings. Outlawed by Python3, they are still useful for optimizing MCU-specific apps. And while 2- or 4-byte representations don't scale for MCUs, they are not so mad for the POSIX build.
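
A rough sketch of the cost difference, in C (hypothetical names, not actual uPy code): indexing a known-ASCII string is a plain pointer addition, while indexing general UTF-8 means walking continuation bytes from the start.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Return a pointer to the i-th character of a string of 'len' bytes.
// If the string is known to be ASCII-only, this is O(1); for general
// UTF-8 it is O(n), because continuation bytes (10xxxxxx) must be skipped.
const uint8_t *str_index(const uint8_t *s, size_t len, size_t i, bool is_ascii) {
    if (is_ascii) {
        return s + i;                       // fast path: 1 byte per char
    }
    const uint8_t *p = s;
    while (i > 0 && (size_t)(p - s) < len) {
        p++;                                // step off the current lead byte
        while ((size_t)(p - s) < len && (*p & 0xC0) == 0x80) {
            p++;                            // skip continuation bytes
        }
        i--;
    }
    return p;
}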

Let's also backtrack to byte strings: they also have a mutable counterpart, bytearray.

So, here are all the potential string types which would be nice to support:

  1. bytes
  2. bytearray
  3. utf8
  4. ASCII/8bit string
  5. 16bit string
  6. 32bit string

And don't forget about interned strings, which are apparently utf8, but of course with the ascii optimization:

  7. interned utf8
  8. interned ascii

We can also remember array.array: it's very important for uPy, but it can stay an extension type.

So, there are clearly no free tag bits in object pointers to store the string type. But #8 proposes to add a string header with a hash and variable-length size encoding. Well, since it's variable-length, we can steal a few bits from the initial size byte to hold the type, still keeping only a 2-byte overhead for short strings.
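
A minimal sketch of that header idea (field widths, tag values and names are made up for illustration, not a patch): the top bits of the first size byte carry the string type, small sizes fit in the remaining bits, and larger sizes spill into a varint.

#include <stddef.h>
#include <stdint.h>

// Hypothetical type tags; up to 8 fit in 3 bits.
enum { STR_TYPE_BYTES, STR_TYPE_UTF8, STR_TYPE_ASCII, STR_TYPE_INTERNED_UTF8 };

// Encode the header for a string of n bytes of the given type.
// Returns the number of header bytes written into buf.
size_t str_header_encode(uint8_t *buf, unsigned type, size_t n) {
    if (n < 31) {
        buf[0] = (uint8_t)((type << 5) | n);    // 3 type bits + 5 size bits
        return 1;
    }
    buf[0] = (uint8_t)((type << 5) | 31);       // 31 = escape: varint size follows
    size_t k = 1;
    while (n >= 0x80) {                          // LEB128-style continuation bytes
        buf[k++] = (uint8_t)(n & 0x7F) | 0x80;
        n >>= 7;
    }
    buf[k++] = (uint8_t)n;
    return k;
}

With a 1-byte hash stored right after this header, strings shorter than 31 bytes keep the 2-byte overhead mentioned above.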

Thoughts?

@chipaca
Contributor

chipaca commented Jan 2, 2014

As I understand it, micropython is “a lean and fast implementation of the Python 3 programming language”. If that is the case (and if it weren't the case I wouldn't have backed the project), then making it behave differently from Python 3 with respect to unicode strings goes directly against that. If unicode strings are really too much, then don't have strings at all; just bytes would be enough. If I write Python 3 code that uses all the features in micropython, and it produces different output on micropython and on Python 3, then it's no good.

@pfalcon
Contributor Author

pfalcon commented Jan 2, 2014

and it produces different output on micropython and on Python 3, then it's no good.

Well, everything discussed here is implementation details of how to make it possible for micropython to have exactly the same output as Python 3 while not using so much memory.

If unicode strings are really too much, then don't have strings at all -- just bytes would be enough.

That indeed makes sense: you don't need unicode to start blinking LEDs. And I kind of hinted that it makes sense to recast what's currently implemented as byte strings (we just need to support the "b" prefix).

On the other hand, it would be nice to consider object layouts which support further string types down the road without obtrusive redesigns, which is exactly the subject of this ticket.

@chipaca
Contributor

chipaca commented Jan 2, 2014

Well, everything discussed here is implementation details of how to make it possible for micropython to have exactly the same output as Python 3 while not using so much memory.

Ah. It reads a bit like you're ranting against Python 3's strings and proposing that micropython's strings work like Python 2's. If that's not the case, then I'm fine with it.

@dpgeorge
Member

dpgeorge commented Jan 2, 2014

To be clear, my intentions were, are, and will be, to have Micro Python as compatible with Python 3 as possible. At the worst, uPy will be a subset of Python 3, such that if it runs on uPy, it runs on Python 3.

Bytearrays (the mutable one) are very, very useful for microcontrollers (you can use them as a buffer, for example). They have one straightforward implementation (a byte array).

Strings will be unicode; stored as UTF-8 is, I think, best. Perhaps you could have an option to store them as 32-bit wide characters. Note that UTF-8 storage has zero RAM overhead for ASCII-only strings, compared with plain 8-bit storage of an ASCII string. If you were really pressed for speed, then you could restrict what you accept as a unicode code point to lie in the ASCII range (1-127) and implement your unicode_next() function as simply a pointer increment.
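
A sketch of what such a unicode_next() could look like, guarded by a made-up MICROPY_RESTRICT_ASCII compile-time option (the flag name and the code are illustrative assumptions, not the actual uPy API):

#include <stdint.h>

#ifndef MICROPY_RESTRICT_ASCII
#define MICROPY_RESTRICT_ASCII (0)     // hypothetical compile-time option
#endif

// Advance *p past one character and return its code point.
uint32_t unicode_next(const uint8_t **p) {
#if MICROPY_RESTRICT_ASCII
    return *(*p)++;                     // ASCII-only build: pointer increment
#else
    uint32_t c = *(*p)++;
    if (c < 0x80) {
        return c;                       // 1-byte (ASCII) sequence
    }
    int n = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;  // continuation byte count
    c &= 0x3F >> n;                     // keep the payload bits of the lead byte
    while (n--) {
        c = (c << 6) | (*(*p)++ & 0x3F);  // fold in 10xxxxxx bytes
    }
    return c;                           // no validation: sketch only
#endif
}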

@piranna

piranna commented Jan 2, 2014

If you were really pressed for speed, then you could restrict what you accept as a unicode code point to lie in the ASCII range (1-127) and implement your unicode_next() function as simply a pointer increment.

Would it make sense to have this as a compilation option, as a particular optimization? It seems easy to do and it doesn't break compatibility (up to some degree...) with CPython 3.3. 99% of string operations in MicroPython could be done in the ASCII range, so they can take advantage of this optimization, and if full-fledged compliant Unicode strings are required, just disable the flag and you are good to go.

"Si quieres viajar alrededor del mundo y ser invitado a hablar en un monton
de sitios diferentes, simplemente escribe un sistema operativo Unix."
– Linus Tordvals, creador del sistema operativo Linux

@dpgeorge
Member

dpgeorge commented Jan 2, 2014

Yes, it would be a compile time optimisation, enabling you to disable unicode without changing any of the string handling framework.

@pfalcon
Contributor Author

pfalcon commented Jan 2, 2014

@dpgeorge: Sure, that's all more or less clear. This ticket goes further and contemplates how to implement all that. So, do you agree that the 8 string-like types listed above need to be represented? Or more? Or fewer? Do you agree that it makes no sense to try to fit the distinction among them into the tag bits of mp_obj_t? Then, do you agree that it makes sense to take a few bits away from the var-length string size encoding to store those bits in the same byte? (Well, recursive question: do you agree that it makes sense to use var-length encoding for the string size? I don't remember your ack for that in #8, but as you said, you already use varlen encoding for qstr handles, so I wouldn't think you would object to it.)

And if you just skimmed through the description text and don't have time for this so far, no worries: I just wanted to record my ideas before the holiday time is over and I have to go back to work stuff, and thus risk losing some details.

@pfalcon
Contributor Author

pfalcon commented Jan 2, 2014

99% of string operations in MicroPython could be done in the ASCII range

I don't know where that 99% comes from. Anyone whose language is not based on the Latin script will tell you that's not true. Heck, even my HD7780 LCD has a chargen with 256 chars, with Cyrillic or Chinese symbols.

On the other hand, it's possible to do 80% of all operations on any encoding (including Unicode) with just 8bit-clean strings. If you add a strlen operation for a particular encoding, as a separate function, you can cover 90%. Add substr for a particular encoding and that covers 95% of all operations. And all that without forcing a particular encoding on everyone (even if it's Unicode). But Python3 forced Unicode on everyone, and now for MicroPython that choice needs to be worked around, and I personally don't find the workaround of forcing ASCII as an "optimization" to be good at all.

@piranna

piranna commented Jan 2, 2014

99% of string operations in MicroPython could be done in the ASCII range

I don't know where that 99% comes from. Anyone whose language is not based on the Latin script will tell you that's not true. Heck, even my HD7780 LCD has a chargen with 256 chars, with Cyrillic or Chinese symbols.

I admit 99% is an arbitrary number, but your 80% is totally acceptable. English is the current lingua franca, and it can be represented with just ASCII characters. Except when dealing directly with localized UIs, it's perfectly acceptable to just use English (ASCII) for the strings used inside MicroPython, and if full Unicode is needed, you could check whether the speed is acceptable or whether it makes more sense to generate the localized strings outside of the microcontroller (for example, dispatch JSON objects and do the localization on a desktop, or dispatch the English strings and translate them later with gettext).

"Si quieres viajar alrededor del mundo y ser invitado a hablar en un monton
de sitios diferentes, simplemente escribe un sistema operativo Unix."
– Linus Tordvals, creador del sistema operativo Linux

@dpgeorge
Member

dpgeorge commented Jan 5, 2014

@pfalcon yes, I agree to using varlen encoding in qstrs.

Can't qstrs be used to just store an array of bytes, and not care about the encoding? A qstr would be (bytelen, hash, data), with bytelen encoded as a variable-length integer. The data is just bytes.

Different string representations (ASCII, UTF-8, 16-bit, 32-bit) would then be represented by the Python object. You would probably just compile with 1 particular representation.

bytes are always represented as a string of bytes, and can use the same qstr API, since qstr just stores data.

bytearray will be a mutable array of bytes, and not use qstr.
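
A rough sketch of that layout (the names and the 1-byte hash width are assumptions, not the real qstr code): the qstr knows only a varint byte length, a hash and raw bytes; which encoding those bytes use is decided entirely by the Python-level object that refers to the qstr.

#include <stddef.h>
#include <stdint.h>

// A qstr payload: varint bytelen, then 1 hash byte, then the raw bytes.
typedef const uint8_t *qstr_data_t;

// Decode the header; returns the byte length and sets *hash and *data.
size_t qstr_unpack(qstr_data_t q, uint8_t *hash, const uint8_t **data) {
    size_t len = 0;
    unsigned shift = 0;
    while (*q & 0x80) {                 // varint: 7 bits per byte
        len |= (size_t)(*q++ & 0x7F) << shift;
        shift += 7;
    }
    len |= (size_t)*q++ << shift;
    *hash = *q++;                       // cached hash
    *data = q;                          // payload is just bytes, no encoding implied
    return len;
}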

@pfalcon
Contributor Author

pfalcon commented Jan 6, 2014

Can't qstrs be used to just store an array of bytes? Different string representations (ASCII, UTF-8, 16-bit, 32-bit) would then be represented by the Python object.

Bingo! You know, one of the first headers I peered into was parse.h with its tagged pointers, and it seems I had a strange mix-up in my head about how and where things are represented. Of course, at the Python level all objects start with mp_obj_base_t, and that's how they're differentiated and interpreted; there's no need (likely ;-) ) to fit tag bits into qstr.

Then, should the hash be part of the qstr (and not part of the Python object)? Well, the reason I fitted it there is that qstr is used at a level deeper than Python objects, where quick search and comparison is still required. The problem here is that the hash should depend on the semantics of the object: for example, for Unicode, the individual (32-bit) chars should be hashed, so that the hash of UTF-8 and UTF-16 strings with the same chars is the same. Well, we can accommodate that by passing the hash value to store down from the upper levels. With all that, there's still a benefit to storing the hash in the qstr: there we allocate 1 byte for it, while anything added to mp_obj_*_t would likely take 4 bytes due to struct alignment.
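
A sketch of the "hash the chars, not the bytes" idea (it reuses the unicode_next() sketched earlier and picks djb2 purely as an example; everything here is hypothetical): the same text hashes identically whether it's stored as UTF-8 or UTF-16, and the result is truncated to the 1-byte slot in the qstr header.

#include <stddef.h>
#include <stdint.h>

uint32_t unicode_next(const uint8_t **p);   // UTF-8 decoder, as sketched earlier

// Hash a UTF-8 string by its code points, so the value depends only on the
// characters, not on the storage encoding; truncate to the 1-byte qstr slot.
uint8_t str_hash_codepoints(const uint8_t *s, size_t len) {
    const uint8_t *end = s + len;
    uint32_t h = 5381;                      // djb2 seed, example only
    while (s < end) {
        h = h * 33 + unicode_next(&s);      // hash the 32-bit char value
    }
    return (uint8_t)h;
}

A UTF-16 variant would decode 16-bit units instead, but feed the same code point values into the same loop, which is what makes the hashes agree.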

bytearray will be a mutable array of bytes, and not use qstr.

Well, I kind of thought about making a generic storage infrastructure for string-like types. bytearrays are exactly like bytes, except mutable (and I use "qstr" in a somewhat looser sense than it has now; in particular, qstr doesn't imply "interned", per your idea of supporting both interned and non-interned strings).

@pfalcon
Contributor Author

pfalcon commented Jan 6, 2014

>>> b = bytearray(b"123")
>>> hash(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'bytearray'

Crazy. So, there's not much use in storing a bytearray in a qstr. I thought we could cache the hash in it (and support lazy hashing), but that's not supported at the language core level. And given:

>>> b.append(1)
>>> b
bytearray(b'123\x01')

, bytearray would rather be a type with both size and allocsize fields (and varlen encoding for size would just complicate stuff).
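
So a bytearray would carry roughly this shape (a sketch with invented names, not the actual uPy struct): a used length plus an allocated length, so append() can grow the buffer geometrically.

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    size_t len;        // bytes in use
    size_t alloc;      // bytes allocated
    uint8_t *buf;      // mutable storage; not interned, not a qstr
} bytearray_t;

// Append one byte, growing the buffer when full. Returns 0 on success.
int bytearray_append(bytearray_t *a, uint8_t b) {
    if (a->len == a->alloc) {
        size_t new_alloc = a->alloc ? a->alloc * 2 : 8;   // geometric growth
        uint8_t *p = realloc(a->buf, new_alloc);
        if (p == NULL) {
            return -1;
        }
        a->buf = p;
        a->alloc = new_alloc;
    }
    a->buf[a->len++] = b;
    return 0;
}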

@dpgeorge
Member

dpgeorge commented Jan 6, 2014

Yeah, bytearray is really more like list than str.

@dpgeorge
Member

dpgeorge commented Jan 8, 2014

How about this CPython behaviour for a null character in a string:

>>> t = type('a\x00b', (object,), {})
>>> t
<class '__main__.ab'>
>>> t()
<__main__.ab object at 0x7f7022e13810>
>>> t.member
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'a' has no attribute 'member'

Thus, while CPython allows you to create a type whose name has a null character in it, and it prints such a name correctly in some cases (the first 2 outputs, with __main__.ab), it prints it as an ASCIIZ string in other cases (in the error it throws, it just prints a).

If we allow qstrs to have null characters, then everywhere we convert a qstr to its representation we will need to account for the fact that it's a pointer plus a length.

A good option is to make a custom printf format specifier for qstrs that handles them correctly (or just use %.*s).

@Arachnid

Arachnid commented Jan 8, 2014

On Thu, Jan 2, 2014 at 10:48 PM, Paul Sokolovsky notifications@github.com wrote:

99% of string operations in MicroPython could be done in the ASCII range

I don't know where that 99% comes from. Anyone whose language is not based on the Latin script will tell you that's not true. Heck, even my HD7780 LCD has a chargen with 256 chars, with Cyrillic or Chinese symbols.

On the other hand, it's possible to do 80% of all operations on any encoding (including Unicode) with just 8bit-clean strings. If you add a strlen operation for a particular encoding, as a separate function, you can cover 90%. Add substr for a particular encoding and that covers 95% of all operations. And all that without forcing a particular encoding on everyone (even if it's Unicode). But Python3 forced Unicode on everyone, and now for MicroPython that choice needs to be worked around, and I personally don't find the workaround of forcing ASCII as an "optimization" to be good at all.

Unicode isn't an encoding; UTF-8 is an encoding of Unicode code points, and (retrospectively) ASCII is too, of a small subset of Unicode code points.

There's a lot of English-centric discussion going on here. It's important to recognise that not 100% (not even 50%) of the world speaks English. Unicode support should be a basic feature of a modern programming language, and personally I'd want to see some pretty firm figures on the overhead of unicode string support before any "size matters" argument holds sway; so far there's nothing but a lot of hand waving.

-Nick Johnson


@pfalcon
Contributor Author

pfalcon commented Jan 8, 2014

How about this CPython behaviour for a null character in a string

Well, that's a pretty edge case ;-). I guess it comes from the dichotomy that qstr (and its CPython analog) is used to represent not just arbitrary strings, but also identifiers as used in the language syntax. Of course, identifiers have other requirements on their constituent characters.

So, well, we can cheat/overlook how we print qstrs representing identifiers, but of course not user data. I had an idea about the "%.*s" syntax too, but it seems it only limits the length; it doesn't extend it beyond \0:

printf("%.*s|\n", 5, "ab\0cd");

gives me:

ab|

So, a custom printf formatter may be an interesting idea. (Though I'm still not sure I understand how you'd handle the repr() vs str() difference.)
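
A tiny standalone demo of the difference (plain C, nothing uPy-specific): %.*s treats the precision as a maximum and still stops at the first \0, whereas writing the bytes out by length gets everything, which is what a custom qstr formatter would do internally.

#include <stdio.h>

// Print exactly len bytes, embedded NULs included.
void print_strn(const char *s, size_t len) {
    fwrite(s, 1, len, stdout);
}

int main(void) {
    printf("%.*s|\n", 5, "ab\0cd");   // prints "ab|": stops at the NUL
    print_strn("ab\0cd", 5);          // writes all 5 bytes: ab, NUL, cd
    printf("|\n");
    return 0;
}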

@dpgeorge
Member

Saw this, regarding putting back the % operator for bytes and bytearray:

https://mail.python.org/pipermail/python-dev/2014-March/133621.html

Thought immediately of @pfalcon :)

@pfalcon
Contributor Author

pfalcon commented Mar 31, 2014

@dpgeorge : Great news! ;-) Opened #403 to cover that.

@pfalcon
Contributor Author

pfalcon commented May 10, 2014

Btw, I wanted to mention that I kind of feel we should keep the trailing null byte for str/bytes around ~forever. Motivation: interoperability with native C APIs. Ref: https://mail.python.org/pipermail/python-dev/2014-April/134398.html
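
The interop win, sketched (the helper is hypothetical; the point is just that no copy is needed): if the storage layout guarantees data[len] == '\0', the buffer can be handed straight to any NUL-terminated C API.

#include <stddef.h>
#include <stdio.h>

// 'data' comes from a str/bytes object whose storage keeps a trailing \0.
void pass_to_c_api(const char *data, size_t len) {
    (void)len;          // classic C APIs don't need the length...
    puts(data);         // ...because data[len] == '\0' is guaranteed
}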

@pfalcon
Contributor Author

pfalcon commented Oct 25, 2015

We now have proper support for bytes and unicode strings, closing this.

@pfalcon pfalcon closed this as completed Oct 25, 2015