First shot at Unicode support #671


Closed
wants to merge 33 commits into from

Conversation

Rosuav
Contributor

Rosuav commented Jun 7, 2014

This may well not yet be ready for merging. Request for comments, mainly!

I've added a test, and it and all the rest of the tests pass. There are quite a few places in the code where things assume correct UTF-8 encoding, and things may crash and burn if given malformed byte streams.

In theory, a bytes() is allowed to have anything in it, and will always have its charlen and its len the same; a str() must be correctly-formed UTF-8, and its charlen is less than its len by the number of continuation bytes.
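The invariant described above can be sketched in Python (an illustration of the rule, not the actual C implementation): a str's charlen is its byte length minus the number of UTF-8 continuation bytes, i.e. bytes of the form 10xxxxxx.

```python
def utf8_charlen(data: bytes) -> int:
    # Continuation bytes have the bit pattern 10xxxxxx (0x80-0xBF);
    # every other byte starts exactly one character.
    return len(data) - sum(1 for b in data if b & 0xC0 == 0x80)

s = "héllo"                 # 'é' encodes to two bytes in UTF-8
data = s.encode("utf-8")
assert len(data) == 6                   # byte length
assert utf8_charlen(data) == len(s)     # character length: 5
```

For a pure-ASCII string (or a bytes object) there are no continuation bytes, so charlen and len coincide, as stated above.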

The flags byte is currently unused and always 1. Later on, this could get a marker specifying that the string is ASCII-only, which could simplify some code paths for the very common case; alternatively, that could be dropped.

Previously:
   text    data     bss     dec     hex filename
 275857    1352    1216  278425   43f99 micropython
Now:
   text    data     bss     dec     hex filename
 281009    1352    1216  283577   453b9 micropython

If I'm reading this correctly, Unicode support has cost you 5KB.

@pfalcon
Contributor

pfalcon commented Jun 7, 2014

Ok, as I mentioned, we didn't discuss many points before, so there's going to be a lot of comments and questions now ;-)

There are quite a few places in the code where things assume correct UTF-8 encoding, and things may crash and burn if given malformed byte streams.

That's one of the points worth having in the requirements. My take would be: handling definitely should not segfault, so bounds checking is needed. But otherwise, I don't think there should be any validity checks during various operations, at least not for version 1.0 of unicode support. And even going forward, I'd say that stringent checks should be applied on string input (as in I/O), not again and again during processing. Let me know if that makes sense.

Other comments will likely go inline.

@pfalcon
Contributor

pfalcon commented Jun 7, 2014

The flags byte is currently unused and always 1. Later on, this could get a marker specifying that the string is ASCII-only, which could simplify some code paths for the very common case; alternatively, that could be dropped.

I'd say UTF-8 is complicated enough by itself that we shouldn't plan for unproven optimizations. And the flags byte is clearly superfluous (and may cost a whole machine word, as was mentioned earlier). Suggestion: remove it.

} else if (c < 0x800) {
str[0] = (c >> 6) | 0xC0;
str[1] = (c & 0x3F) | 0x80;
len = 2;
Contributor

It seems you have a mix of tabs and spaces, here and in a lot of other places. (It's visible even with GitHub rendering: the indent differs between consecutive lines.)
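For context, the branch quoted above is the two-byte case of the standard UTF-8 encoder. The full mapping can be sketched in Python like this (my own illustration of the encoding scheme, not code from this patch):

```python
def utf8_encode(c: int) -> bytes:
    # Encode a single code point as UTF-8, mirroring the C branches above.
    if c < 0x80:                        # 1 byte: 0xxxxxxx
        return bytes([c])
    elif c < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([(c >> 6) | 0xC0, (c & 0x3F) | 0x80])
    elif c < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([(c >> 12) | 0xE0,
                      ((c >> 6) & 0x3F) | 0x80,
                      (c & 0x3F) | 0x80])
    else:                               # 4 bytes, code points up to 0x10FFFF
        return bytes([(c >> 18) | 0xF0,
                      ((c >> 12) & 0x3F) | 0x80,
                      ((c >> 6) & 0x3F) | 0x80,
                      (c & 0x3F) | 0x80])

assert utf8_encode(0x20AC) == "\u20ac".encode("utf-8")  # € -> b'\xe2\x82\xac'
```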

@pfalcon
Contributor

pfalcon commented Jun 7, 2014

Commented on specific code pieces; back to generic requirements.

I'm sure we'd want to keep the current (Py2-like) strings as a config option, and apparently the only manageable way to do that is to introduce a new type for unicode strings. That also follows from the requirement inspired by "// For byte strings, the 'character' length (really the "exposed length" or "Python length") equals the byte length." So why should unicode handling affect bytes, which are good as they are? Everything points to an objstruni.c with the new type, and then apparently uni-specific methods should be hosted there too (objstr.c remains for bytes & 8-bit str).

@pfalcon
Contributor

pfalcon commented Jun 7, 2014

With all that, the work looks like a great start, thanks! Hope you'll find the comments sensible.

@Rosuav
Contributor Author

Rosuav commented Jun 7, 2014

Alright, here come the responses to the responses....

Checking on input - absolutely. Currently, I try to do those sorts of checks at string creation, but I'm not prepared to guarantee that it's reliable. Obviously the intention is for it to be impossible to segfault the engine.

Removal of flags - yeah, that's looking likely. Having a fast-path for ASCII-only strings may be possible, but may well not be worth it.

Tabs and spaces - argh, grumble. I've been fighting SciTE over that. Will deal with that in a separate commit in a bit.

"Stupidly inefficient" was actually counting the length every time you index it. No, I didn't have firm data to back that, but I do know that counting the full length when you ask for the first character is pretty wasteful! I left most of the indexing code exactly as it was, which means passing a length along and having mp_get_index process it. That may change, and in fact it must change if we're to get any optimization based on negative indexing, but I want to get some benchmarks going before looking at optimization.
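One way a negative-index optimization could avoid counting the full length (a sketch of the idea being alluded to, not code from this patch): scan backwards from the end of the byte string, stepping over continuation bytes, so that s[-1] only ever touches the tail of the string.

```python
def byte_offset_from_end(data: bytes, neg_index: int) -> int:
    # Find the byte offset of character data[neg_index] (neg_index < 0)
    # by walking backwards and skipping UTF-8 continuation bytes.
    i = len(data)
    for _ in range(-neg_index):
        i -= 1
        while data[i] & 0xC0 == 0x80:   # continuation byte: keep going back
            i -= 1
    return i

data = "a€b".encode("utf-8")            # b'a\xe2\x82\xacb'
assert data[byte_offset_from_end(data, -1):] == b"b"
off = byte_offset_from_end(data, -2)
assert data[off:off + 3].decode("utf-8") == "€"
```

The cost is proportional to the distance from the end, not to the total string length.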

You want macros for UTF8_IS_LEAD / UTF8_IS_CONT? Sure, will get to that. I'm personally more happy seeing what the code's doing there, because I'm comfortable with bit manipulations; putting in lots of function-like macros suggests to my mind that they might change (as with the GET_STR_DATA_LEN macro family - change the structure and just change the macros), and the definition of UTF-8 won't change. But no problem.

TBH, I'm not sure you would want to keep the current strings as a config option, because they are broken compared to the Python 3 spec. Breaking out Unicode strings from the existing string types would be an option, but that involves a lot more digging into the deeper parts of the code than I'm prepared to do right now. I'll get to these easy fixes you've mentioned, but then I might leave the open-heart surgery to someone like you.

@Rosuav
Contributor Author

Rosuav commented Jun 7, 2014

Alright. Whitespace and UTF8_* macros added (not _LEAD as it's actually never something I care about, but _NONASCII is - it catches continuation bytes and multibyte lead bytes, but not single-byte chars).
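For reference, both predicates are one-line tests on the top bits of a byte. Shown here in Python for illustration; the names follow the UTF8_* macros mentioned above, with the exact C definitions living in the patch:

```python
def utf8_is_nonascii(b: int) -> bool:
    # Any byte with the top bit set: continuation bytes and multibyte leads,
    # but not single-byte (ASCII) characters.
    return bool(b & 0x80)

def utf8_is_cont(b: int) -> bool:
    # Continuation bytes specifically: bit pattern 10xxxxxx.
    return (b & 0xC0) == 0x80

data = "é".encode("utf-8")              # b'\xc3\xa9'
assert utf8_is_nonascii(data[0]) and not utf8_is_cont(data[0])  # lead byte
assert utf8_is_nonascii(data[1]) and utf8_is_cont(data[1])      # continuation
assert not utf8_is_nonascii(ord("a"))                           # plain ASCII
```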

@@ -58,7 +59,8 @@ def do_work(infiles):
for order, ident, qstr in sorted(qstrs.values(), key=lambda x: x[0]):
qhash = compute_hash(qstr)
qlen = len(qstr)
print('Q({}, (const byte*)"\\x{:02x}\\x{:02x}\\x{:02x}\\x{:02x}" "{}")'.format(ident, qhash & 0xff, (qhash >> 8) & 0xff, qlen & 0xff, (qlen >> 8) & 0xff, qstr))
qchlen = len(qstr.decode("utf-8"))
Member

Are you using Python 3 to run this script? For me it fails, because qstr is a str, and strs don't have a decode method. qstr should be bytes, or don't use decode.

Contributor Author

I just let the makefile run it, and if I had to guess, I'd say it either used the system Python (Debian Wheezy, so that's 2.7.3) or it chose a Python that it wanted.

The point may become moot, though, if charlen is removed from qstrs.
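If the charlen does stay in the qstr table, one portable way for the script to compute it under both Python 2 and Python 3 (a hypothetical sketch, not the actual patch) is to decode only when the value really is a byte string:

```python
def char_length(qstr):
    # Under Python 3 qstr may already be str; under Python 2 it is bytes.
    # Decode only in the bytes case, so both interpreters agree.
    if isinstance(qstr, bytes):
        qstr = qstr.decode("utf-8")
    return len(qstr)

assert char_length(b"h\xc3\xa9") == 2   # bytes input: 3 bytes, 2 characters
assert char_length("hé") == 2           # str input passes straight through
```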

@dpgeorge
Member

dpgeorge commented Jun 7, 2014

A note on binary size: stmhal/ grows by about 3.5k, and 2k of this is due to the increase in size of the qstrs: there are almost 700 qstrs, and each grew by 3 bytes.

bare-arm/ grew by 2k, with about 700 bytes going to increase in qstr size.

Surely there is room for space optimisation here.

@Rosuav
Contributor Author

Rosuav commented Jun 7, 2014

There, flags byte removed from all strings, which should reduce sizes a bit. I'll dig into the charlen changes later; they'll require more than "git revert c239f5" followed by merge resolution :)

@pfalcon
Contributor

pfalcon commented Jun 7, 2014

Breaking out Unicode strings from the existing string types would be an option, but that involves a lot more digging into the deeper parts of the code than I'm prepared to do right now. I'll get to these easy fixes you've mentioned, but then I might leave the open-heart surgery to someone like you.

Yes, sure: unicode support is a big enough undertaking that you can count on others to help. On the other hand, the current patchset shows that you navigate the code pretty well already, so looking into how types are made in uPy might be the next step. I could suggest objfloat.c as a not-too-big example. const mp_obj_type_t mp_type_float is the definition of the type, and you'll surely find a lot of similarity with CPy.

I elaborated in #657 (comment) on why I think keeping support for 8-bit strs is important, so deciding on a separate unicode str type is on the critical path to see how to move forward. And I guess we should create a "unicode" branch upstream and land the code there. So please see if you'd be interested in splitting out a separate type for uni on your side; otherwise I can do that (but it will probably require squashing and rebasing).

@@ -83,8 +84,8 @@ const static qstr_pool_t const_pool = {
10, // set so that the first dynamically allocated pool is twice this size; must be <= the len (just below)
MP_QSTR_number_of, // corresponds to number of strings in array just below
{
(const byte*) "\0\0\0\0", // invalid/no qstr has empty data
Contributor

These now need to be reverted too, right?

Contributor Author

Good catch, thanks. Fixed.

@Rosuav
Contributor Author

Rosuav commented Jun 9, 2014

Changes made as per comments. The executable has shrunk somewhat.

Previously:
   text    data     bss     dec     hex filename
 275857    1352    1216  278425   43f99 micropython
As of the above:
   text    data     bss     dec     hex filename
 281009    1352    1216  283577   453b9 micropython
Now:
   text    data     bss     dec     hex filename
 277561    1352    1216  280129   44641 micropython

So Unicode now costs just under 2KB.
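Working the deltas from the "dec" column of the tables above confirms both figures quoted in this thread:

```python
# Sizes from the three builds reported above (the "dec" column).
baseline, first_cut, current = 278425, 283577, 280129

assert first_cut - baseline == 5152   # initial cost: ~5KB, as noted earlier
assert current - baseline == 1704     # after the fixes: just under 2KB
```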

@pfalcon
Contributor

pfalcon commented Jun 9, 2014

Now, that's what I love! We'll surely need some more patching of the string methods, but it's good to know that this basic ("len() and subscription") unicode support costs (fingers crossed) under 3KB.

Which reminds me that we should plan for next level - isalpha() and friends (in #657).

@dpgeorge
Member

dpgeorge commented Jun 9, 2014

Very nice indeed! I'll review it properly as soon as I have time.

@Rosuav
Contributor Author

Rosuav commented Jun 9, 2014

By the way, I added some extra test cases, including slicing and such. What I have not added, so far, is bounds checking that involves thrown exceptions - so, for instance, the tests verify that "asdf"[0] == "a", but not that "asdf"[7] throws.
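Such a bounds-checking test would look something like this (a hypothetical test case, not part of the suite as it stands):

```python
# Verify that out-of-range indexing raises IndexError rather than crashing.
try:
    "asdf"[7]
except IndexError:
    pass
else:
    raise AssertionError("expected IndexError for out-of-range index")
```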

@Rosuav
Contributor Author

Rosuav commented Jun 9, 2014

Just added another test (which already passes). String encoding currently seems to assume UTF-8, but that's going to be the most important anyway.

@@ -100,6 +100,9 @@ bool unichar_isupper(unichar c);
bool unichar_islower(unichar c);
unichar unichar_tolower(unichar c);
unichar unichar_toupper(unichar c);
uint unichar_charlen(const char *str, uint len);
Member

Should this be called utf8_charlen?

@pfalcon
Contributor

pfalcon commented Jun 14, 2014

Squashed and merged to https://github.com/micropython/micropython/tree/unicode branch, thanks. Further discussion happens (so far) in #657.

@pfalcon pfalcon closed this Jun 14, 2014
3 participants