Dealing with Python3 string-like types zoo #22
Comments
As I understand it, micropython is “a lean and fast implementation of the Python 3 programming language”. If that is the case (and if it weren't the case I wouldn't have backed the project), then making it behave differently from Python 3 WRT unicode strings goes directly against that. If unicode strings are really too much, then don't have strings at all -- just bytes would be enough. If I write Python 3 code that uses all the features in micropython, and it produces different output on micropython and on Python 3, then it's no good.
Well, everything discussed here concerns implementation details of how to make it possible for micropython to have exactly the same output as Python 3, while not using so much memory.
That indeed makes sense - you don't need unicode to start blinking LEDs. And I kinda hint that it makes sense to recast what's currently implemented as byte strings (we just need to support the "b" prefix). On the other hand, it would be nice to consider object layouts that support further string types down the road w/o obtrusive redesigns, which is exactly the subject of this ticket.
Ah. It reads a bit like you're ranting against Python 3's strings and
To be clear, my intentions were, are, and will be, to have Micro Python as compatible with Python 3 as possible. At worst, uPy will be a subset of Python 3, such that if it runs on uPy, it runs on Python 3.

Bytearrays (the mutable ones) are very, very useful for microcontrollers (you can use them as a buffer, for example). They have one straightforward implementation (a byte array).

Strings will be unicode, stored as UTF-8 I think is best. Perhaps you could have an option to store them as 32-bit wide characters. Note that UTF-8 storage has zero overhead in RAM for ASCII-only strings, compared with plain 8-bit storage of an ASCII string. If you were really pressed for speed, then you could restrict what you accept as a unicode code point to lie in the ASCII range (1-127) and implement your unicode_next() function as simply a pointer increment.
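To make the trade-off concrete, here is a minimal sketch (not MicroPython's actual implementation) of what a unicode_next() over UTF-8 storage could look like, with the ASCII fast path reducing to a plain offset increment:

```python
def utf8_next(buf: bytes, i: int) -> tuple:
    """Decode the code point starting at byte offset i; return (codepoint, next_offset)."""
    b = buf[i]
    if b < 0x80:
        # ASCII fast path: one byte per character, so "next" is just i + 1
        return b, i + 1
    elif b < 0xE0:
        # 2-byte sequence: 110xxxxx 10xxxxxx
        return ((b & 0x1F) << 6) | (buf[i + 1] & 0x3F), i + 2
    elif b < 0xF0:
        # 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
        cp = ((b & 0x0F) << 12) | ((buf[i + 1] & 0x3F) << 6) | (buf[i + 2] & 0x3F)
        return cp, i + 3
    else:
        # 4-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        cp = ((b & 0x07) << 18) | ((buf[i + 1] & 0x3F) << 12) \
             | ((buf[i + 2] & 0x3F) << 6) | (buf[i + 3] & 0x3F)
        return cp, i + 4

# Zero RAM overhead for ASCII: the UTF-8 bytes ARE the ASCII bytes.
assert "hello".encode("utf-8") == b"hello"
```

For an implementation restricted to ASCII input only the first branch is ever taken, which in C collapses to `*p++`.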
Yes, it would be a compile-time optimisation, enabling you to disable unicode without changing any of the string handling framework.
@dpgeorge: Sure, that's all more or less clear. This ticket goes further and contemplates how to implement all that. So, do you agree that the 8 string-like types listed above need to be represented? Or more? Or fewer? Do you agree that it makes no sense to try to fit the distinction among them into the tag bits of mp_obj_t? Then, do you agree that it makes sense to take a few bits away from the var-length string size encoding to store those type bits in the same byte? (A recursive question: do you agree that it makes sense to use var-length encoding for the string size at all? I don't remember your ack for that in #8, but as you said, you use varlen encoding for qstr handles already, so I wouldn't think you would object to it.) And if you just skimmed through the description text and don't have time for this so far, no worries - I just wanted to record my ideas before holiday time is over and I need to go back to work stuff and thus risk losing some details.
I don't know where that 99% comes from. Anyone whose language is not based on Latin script will tell you that's not true. Heck, even my HD7780 LCD has a chargen with 256 chars, with Cyrillic or Chinese symbols. On the other hand, it's possible to do 80% of all operations on any encoding (including Unicode) with just 8bit-clean strings. If you add a strlen operation for a particular encoding - as a separate function - you can cover 90%. Add substr for a particular encoding and that covers 95% of all operations. And all that without forcing a particular encoding on everyone (even if it's Unicode). But Python3 forced Unicode on everyone, and now that choice needs to be worked around for MicroPython, and I personally don't find the workaround of forcing ASCII as an "optimization" to be good at all.
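As an illustration of the 8bit-clean argument (with a hypothetical helper name, nothing from the uPy codebase): most byte-level operations never need to know the encoding, and character counting can be layered on as a separate, encoding-specific function:

```python
# Concatenation, search and splitting treat the text as opaque bytes,
# regardless of what encoding those bytes happen to be in.
name = "Привет".encode("utf-8")
line = b"user=" + name + b";lang=ru"
key, rest = line.split(b";", 1)
assert rest == b"lang=ru"

def utf8_strlen(buf: bytes) -> int:
    # Length in characters: count bytes that are not UTF-8
    # continuation bytes (10xxxxxx).
    return sum(1 for b in buf if (b & 0xC0) != 0x80)

assert len(name) == 12          # length in bytes
assert utf8_strlen(name) == 6   # length in characters
```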
@pfalcon yes, I agree to using varlen encoding in qstrs. Can't qstrs be used to just store an array of bytes, and not care about the encoding? A qstr would be: (bytelen, hash, data), with bytelen encoded as a variable-length integer. The data is just bytes. Different string representations (ASCII, UTF-8, 16-bit, 32-bit) would then be distinguished by the Python object. You would probably just compile with 1 particular representation. bytes are always represented as a string of bytes, and can use the same qstr API, since qstr just stores data. bytearray will be a mutable array of bytes, and not use qstr.
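A rough sketch of that (bytelen, hash, data) layout, with an illustrative varint and a placeholder 1-byte hash (the real qstr hash function is not specified here):

```python
def encode_varint(n: int) -> bytes:
    # 7 bits per byte, high bit set on all but the last byte
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def make_qstr(data: bytes) -> bytes:
    h = sum(data) & 0xFF  # placeholder hash, just to show the layout
    return encode_varint(len(data)) + bytes([h]) + data

q = make_qstr(b"len")
assert q[0] == 3          # 1-byte varint length
assert q[2:] == b"len"    # payload follows the 1-byte hash
# Total overhead for any string under 128 bytes: 2 bytes.
assert len(q) == len(b"len") + 2
```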
Bingo! You know, one of the first headers I peered into was parse.h and its tagged pointers, and it seems I had a strange mix-up in my head about how and where things are represented. Of course, on the Python level all objects start with mp_obj_base_t and that's how they're differentiated and interpreted; there's no need (likely ;-) ) to fit tag bits into qstr. Then, should the hash be part of qstr (and not part of the Python object)? Well, the reason I fitted it there is that qstr is used at a level deeper than Python objects, where quick search and comparison are still required. The problem here is that the hash should depend on the semantics of the object; for example, for Unicode, individual (32-bit) chars should be hashed, so that the hash of UTF-8 and UTF-16 strings with the same chars is the same. Well, we can accommodate that by passing the hash value to store from upper levels. With all that, there's still a benefit to storing the hash in qstr: there we allocate 1 byte, while anything added to mp_obj_*_t would likely take 4 bytes due to struct alignment.
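The encoding-independence requirement can be stated concretely: hash over decoded code points, not over stored bytes. A toy illustration (djb2-style mixing, standing in for whatever hash function uPy actually adopts):

```python
def hash_codepoints(codepoints) -> int:
    # djb2-style hash over 32-bit code points (illustrative only)
    h = 5381
    for cp in codepoints:
        h = ((h * 33) ^ cp) & 0xFFFFFFFF
    return h

def hash_utf8(buf: bytes) -> int:
    return hash_codepoints(ord(c) for c in buf.decode("utf-8"))

def hash_utf16(buf: bytes) -> int:
    return hash_codepoints(ord(c) for c in buf.decode("utf-16-le"))

s = "héllo"
# Different byte representations of the same characters...
assert s.encode("utf-8") != s.encode("utf-16-le")
# ...hash identically when the hash is computed over code points:
assert hash_utf8(s.encode("utf-8")) == hash_utf16(s.encode("utf-16-le"))
```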
Well, I kind of thought about making a generic storage infra for string-like types. bytearrays are exactly like bytes, except they're mutable (and I use qstr in a bit looser manner than it is now; in particular, qstr doesn't imply "interned", per your idea of supporting both interned and non-interned strings).
Crazy. So, there's not much use in storing a bytearray in a qstr - I thought we could cache the hash in it (and support lazy hashing), but that's not supported at the language core level. And given:

, bytearray would rather be a type with both size and allocsize fields (and varlen encoding for the size would just complicate stuff).
Yeah, bytearray is really more like list than str. |
How about this CPython behaviour for a null character in a string:
Thus, while CPython allows you to create a type whose name has a null character in it, it only prints such a name correctly in some cases. If we allow qstrs to have null characters, then everywhere we convert a qstr to its representation we will need to account for the fact that it's a pointer plus a length. A good option is to make a custom printf format specifier for qstrs that handles them correctly.
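The underlying problem can be shown without any CPython internals: Python-level strings may embed NUL bytes, while C's string functions treat NUL as a terminator. (`c_strlen` below is a hypothetical helper emulating what `printf("%s")` would see.)

```python
def c_strlen(buf: bytes) -> int:
    # Emulate C's strlen: stop at the first NUL byte
    n = 0
    while n < len(buf) and buf[n] != 0:
        n += 1
    return n

name = b"weird\x00name"
assert len(name) == 10      # Python tracks the full length explicitly
assert c_strlen(name) == 5  # C string APIs would silently see only "weird"
# Hence qstr-to-text conversion must pass (pointer, length), not rely on NUL.
```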
Unicode isn't an encoding; UTF-8 is an encoding of Unicode code points. There's a lot of English-centric discussion going on here. —Nick Johnson
Well, that's a pretty edge case ;-). I guess it comes from the dichotomy that qstr (and its CPython analog) is used to represent not just arbitrary strings, but also identifiers as used in the language syntax. Of course, identifiers have other constituent-character requirements. So, well, we can cheat/overlook how we print qstrs representing identifiers, but of course not user data. I had the idea about "%*s" syntax too, but it seems it allows limiting the length, not extending it beyond \0:
gives me:
So, a custom printf formatter may be an interesting idea. (Though I'm still not sure I understand how you handle the repr() vs str() difference.)
Saw this, regarding putting back % operator for bytes and bytearray: https://mail.python.org/pipermail/python-dev/2014-March/133621.html Thought immediately of @pfalcon :) |
Btw, I wanted to mention that I kinda feel we should keep the trailing null byte for str/bytes around ~forever. Motivation: interoperability with native C APIs. Ref: https://mail.python.org/pipermail/python-dev/2014-April/134398.html
We now have proper support for bytes and unicode strings, closing this. |
Epigraph:
So, one of the changes in Python3 with respect to Python2 is that strings are by default Unicode. And that's actually a change which is not friendly to constrained environments. It may sound progressive and innovative to have Unicode by default, but the conservative approach says that Unicode is an extra, advanced, special-purpose feature, and forcing it as the "default" is hardly wise. You can get far along with just 8bit-transparent byte strings; for example, you can write fairly advanced web applications which just accept input from the user, store it, and then render it back - without ever trying to interpret the string contents. That's exactly how the Python2 string type worked.

And not only did Python3 force Unicode on everyone, it also killed the unobtrusive Python2 8-bit strings. Instead it introduced "byte strings" (the bytes type). But they're not Python2 strings:
So, if you had good times with Python2, with Python3 you either need to thrash your heap (all 8Kb of it), or write code which looks more complicated than Python2 ("if s[0] == ord('0')"?) and is not compatible with it.
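For the record, the incompatibility complained about above looks like this in plain CPython 3 (nothing uPy-specific):

```python
s = b"0123"
# Indexing bytes yields an int in Python 3, not a 1-byte string as in Python 2:
assert s[0] == 48            # == ord("0")
assert s[0] != b"0"
# So the Python 2 idiom  s[0] == "0"  must become one of:
assert s[0] == ord("0")
assert s[0:1] == b"0"        # slicing still yields bytes
# Decoding avoids the awkward idiom but allocates a new str object on the heap:
assert s.decode()[0] == "0"
```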
So, how do we deal with this madness in MicroPython? First of all, let's look at what we have now:
Ahem, so unlike what the "// XXX a massive hack!" comment says, it's not a hack - it's just that uPy so far implements byte strings. But:
So, what to do with "default Unicode" strings in uPy? It goes w/o saying that the in-memory representation for them should be utf8 - we simply don't have a wealth of memory to waste on 2- or 4-byte representations. Of course, using utf8 means expensive random access, so it would be nice to have a special (but oh-so-common) case for ASCII-only strings to support fast random access. Here a Python2 lover notes that those special-case 1-byte strings are, well, just Python2 strings. Outlawed by Python3, they are still useful for optimizing MCU-specific apps. And while 2- or 4-byte representations don't scale for MCUs, they are not so mad for the POSIX build.
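The random-access cost difference is easy to demonstrate (a sketch, not proposed uPy code): with an "is ASCII" flag, indexing is a direct byte access; otherwise a UTF-8 string has to be scanned from the start:

```python
def char_at(buf: bytes, idx: int, is_ascii: bool) -> str:
    if is_ascii:
        return chr(buf[idx])  # O(1): one byte per character
    # O(n): walk the buffer, counting character-start bytes
    # (anything that is not a 10xxxxxx continuation byte).
    pos = -1
    start = 0
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:
            pos += 1
            if pos == idx:
                start = i
                break
    end = start + 1
    while end < len(buf) and (buf[end] & 0xC0) == 0x80:
        end += 1
    return buf[start:end].decode("utf-8")

assert char_at(b"hello", 1, True) == "e"
assert char_at("héllo".encode("utf-8"), 1, False) == "é"
```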
Let's also backtrack to byte strings - they also have a mutable counterpart, bytearray.
So, here are all the potential string types which would be nice to support:
1. bytes
2. bytearray (mutable bytes)
3. utf8
4. ascii (1-byte, fast random access)
5. 2-byte
6. 4-byte
And don't forget about interned strings, which are apparently utf8, but of course with ascii optimization:
7. interned utf8
8. interned ascii
We can also remember array.array - it's very important for uPy, but it can stay an extension type.
So, there are clearly no free tag bits in object pointers to store the string type. But #8 proposes to add a string header with a hash and variable-length size encoding. Well, as it's variable-length, we can steal a few bits from the initial size byte to hold the type, still keeping only 2-byte overhead for short strings.
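One possible bit layout (purely illustrative; neither #8 nor uPy specifies this exact format): 3 type bits for the 8 types above, a continuation flag, and 4 size bits in the first byte, with further size bytes only for longer strings:

```python
def pack_header(str_type: int, size: int) -> bytes:
    # First byte: [ttt c ssss] = 3 type bits, continuation flag, 4 size bits
    out = bytearray()
    first = (str_type << 5) | (size & 0x0F)
    size >>= 4
    if size:
        first |= 0x10  # more size bytes follow
    out.append(first)
    while size:  # remaining size, 7 bits per byte
        b = size & 0x7F
        size >>= 7
        if size:
            b |= 0x80
        out.append(b)
    return bytes(out)

def unpack_header(buf: bytes) -> tuple:
    str_type = buf[0] >> 5
    size = buf[0] & 0x0F
    if buf[0] & 0x10:
        shift, i = 4, 1
        while True:
            size |= (buf[i] & 0x7F) << shift
            if not (buf[i] & 0x80):
                break
            shift += 7
            i += 1
    return str_type, size

# Strings up to 15 bytes: 1 header byte, so 2 bytes total with the hash byte.
assert len(pack_header(7, 15)) == 1
assert unpack_header(pack_header(2, 1000)) == (2, 1000)
```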
Thoughts?