First shot at Unicode support #671


Closed
wants to merge 33 commits into from

Conversation

Rosuav
Contributor

Rosuav commented Jun 7, 2014

This may well not yet be ready for merging. Request for comments, mainly!

I've added a test, and it and all the rest of the tests pass. There are quite a few places in the code where things assume correct UTF-8 encoding, and things may crash and burn if given malformed byte streams.

In theory, a bytes() is allowed to have anything in it, and will always have its charlen and its len the same; a str() must be correctly-formed UTF-8, and its charlen is less than its len by the number of continuation bytes.
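The invariant described above can be sketched in Python (an illustration of the rule, not the actual C implementation): a str's charlen is its byte length minus the number of UTF-8 continuation bytes, i.e. bytes of the form 10xxxxxx.

```python
def utf8_charlen(data: bytes) -> int:
    # Continuation bytes have the bit pattern 10xxxxxx (0x80-0xBF);
    # every other byte starts exactly one character.
    return len(data) - sum(1 for b in data if b & 0xC0 == 0x80)

s = "héllo"                 # 'é' encodes to two bytes in UTF-8
data = s.encode("utf-8")
assert len(data) == 6                   # byte length
assert utf8_charlen(data) == len(s)     # character length: 5
```

For a pure-ASCII string (or a bytes object) there are no continuation bytes, so charlen and len coincide, as stated above.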

The flags byte is currently unused and always 1. Later on, this could get a marker specifying that the string is ASCII-only, which could simplify some code paths for the very common case; alternatively, that could be dropped.

Previously:
   text    data     bss     dec     hex filename
 275857    1352    1216  278425   43f99 micropython
Now:
   text    data     bss     dec     hex filename
 281009    1352    1216  283577   453b9 micropython

If I'm reading this correctly, Unicode support has cost you 5KB.

@pfalcon
Contributor

pfalcon commented Jun 7, 2014

Ok, as I mentioned, we didn't discuss many points before, so there's going to be a lot of comments and questions now ;-)

There are quite a few places in the code where things assume correct UTF-8 encoding, and things may crash and burn if given malformed byte streams.

That's one of the points worth having in the requirements. My take would be: handling definitely should not segfault, so bounds checking is needed. But otherwise, I don't think there should be any validity checks during various operations, at least not for version 1.0 of unicode support. And even going forward, I'd say that stringent checks should be applied on string input (as in I/O), not again and again during processing. Let me know if that makes sense.

Other comments will likely go inline.

@pfalcon
Contributor

pfalcon commented Jun 7, 2014

The flags byte is currently unused and always 1. Later on, this could get a marker specifying that the string is ASCII-only, which could simplify some code paths for the very common case; alternatively, that could be dropped.

I'd say UTF-8 is complicated enough by itself that we shouldn't plan for unproven optimizations. And the flags byte is clearly superfluous (and may cost a whole machine word, as was mentioned earlier). Suggestion: remove it.

} else if (c < 0x800) {
str[0] = (c >> 6) | 0xC0;
str[1] = (c & 0x3F) | 0x80;
len = 2;
Contributor

It seems you have a mix of tabs and spaces, here and in a lot of other places. (It's visible even with GitHub rendering: the indent differs between consecutive lines.)
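For context, the branch quoted above is the two-byte case of the standard UTF-8 encoder. The full mapping can be sketched in Python like this (my own illustration of the encoding scheme, not code from this patch):

```python
def utf8_encode(c: int) -> bytes:
    # Encode a single code point as UTF-8, mirroring the C branches above.
    if c < 0x80:                        # 1 byte: 0xxxxxxx
        return bytes([c])
    elif c < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([(c >> 6) | 0xC0, (c & 0x3F) | 0x80])
    elif c < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([(c >> 12) | 0xE0,
                      ((c >> 6) & 0x3F) | 0x80,
                      (c & 0x3F) | 0x80])
    else:                               # 4 bytes, code points up to 0x10FFFF
        return bytes([(c >> 18) | 0xF0,
                      ((c >> 12) & 0x3F) | 0x80,
                      ((c >> 6) & 0x3F) | 0x80,
                      (c & 0x3F) | 0x80])

assert utf8_encode(0x20AC) == "\u20ac".encode("utf-8")  # € -> b'\xe2\x82\xac'
```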

@pfalcon
Contributor

pfalcon commented Jun 7, 2014

Commented on specific code pieces; back to generic requirements.

I'm sure we'd want to keep the current (Py2-like) strings as a config option, and apparently the only manageable way to do that is to introduce a new type for unicode strings. That also follows from the requirement inspired by "// For byte strings, the 'character' length (really the "exposed length" or "Python length") equals the byte length." So why should unicode handling affect bytes, which are good as they are? Everything points to an objstruni.c with the new type, and then apparently uni-specific methods should be hosted there too (objstr.c remains for bytes & 8-bit str).

@pfalcon
Contributor

pfalcon commented Jun 7, 2014

With all that, the work looks like a great start, thanks! Hope you'll find the comments sensible.

@Rosuav
Contributor Author

Rosuav commented Jun 7, 2014

Alright, here come the responses to the responses....

Checking on input - absolutely. Currently, I try to do those sorts of checks at string creation, but I'm not prepared to guarantee that it's reliable. Obviously the intention is for it to be impossible to segfault the engine.

Removal of flags - yeah, that's looking likely. Having a fast-path for ASCII-only strings may be possible, but may well not be worth it.

Tabs and spaces - argh, grumble. I've been fighting SciTE over that. Will deal with that in a separate commit in a bit.

"Stupidly inefficient" was actually counting the length every time you index it. No, I didn't have firm data to back that, but I do know that counting the full length when you ask for the first character is pretty wasteful! I left most of the indexing code exactly as it was, which means passing a length along and having mp_get_index process it. That may change, and in fact it must change if we're to get any optimization based on negative indexing, but I want to get some benchmarks going before looking at optimization.
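One way a negative-index optimization could avoid counting the full length (a sketch of the idea being alluded to, not code from this patch): scan backwards from the end of the byte string, stepping over continuation bytes, so that s[-1] only ever touches the tail of the string.

```python
def byte_offset_from_end(data: bytes, neg_index: int) -> int:
    # Find the byte offset of character data[neg_index] (neg_index < 0)
    # by walking backwards and skipping UTF-8 continuation bytes.
    i = len(data)
    for _ in range(-neg_index):
        i -= 1
        while data[i] & 0xC0 == 0x80:   # continuation byte: keep going back
            i -= 1
    return i

data = "a€b".encode("utf-8")            # b'a\xe2\x82\xacb'
assert data[byte_offset_from_end(data, -1):] == b"b"
off = byte_offset_from_end(data, -2)
assert data[off:off + 3].decode("utf-8") == "€"
```

The cost is proportional to the distance from the end, not to the total string length.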

You want macros for UTF8_IS_LEAD / UTF8_IS_CONT? Sure, will get to that. I'm personally more happy seeing what the code's doing there, because I'm comfortable with bit manipulations; putting in lots of function-like macros suggests to my mind that they might change (as with the GET_STR_DATA_LEN macro family - change the structure and just change the macros), and the definition of UTF-8 won't change. But no problem.

TBH, I'm not sure you would want to keep the current strings as a config option, because they are broken compared to the Python 3 spec. Breaking out Unicode strings from the existing string types would be an option, but that involves a lot more digging into the deeper parts of the code than I'm prepared to do right now. I'll get to these easy fixes you've mentioned, but then I might leave the open-heart surgery to someone like you.

@Rosuav
Contributor Author

Rosuav commented Jun 7, 2014

Alright. Whitespace and UTF8_* macros added (not _LEAD as it's actually never something I care about, but _NONASCII is - it catches continuation bytes and multibyte lead bytes, but not single-byte chars).
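For reference, both predicates are one-line tests on the top bits of a byte. Shown here in Python for illustration; the names follow the UTF8_* macros mentioned above, with the exact C definitions living in the patch:

```python
def utf8_is_nonascii(b: int) -> bool:
    # Any byte with the top bit set: continuation bytes and multibyte leads,
    # but not single-byte (ASCII) characters.
    return bool(b & 0x80)

def utf8_is_cont(b: int) -> bool:
    # Continuation bytes specifically: bit pattern 10xxxxxx.
    return (b & 0xC0) == 0x80

data = "é".encode("utf-8")              # b'\xc3\xa9'
assert utf8_is_nonascii(data[0]) and not utf8_is_cont(data[0])  # lead byte
assert utf8_is_nonascii(data[1]) and utf8_is_cont(data[1])      # continuation
assert not utf8_is_nonascii(ord("a"))                           # plain ASCII
```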

@@ -58,7 +59,8 @@ def do_work(infiles):
for order, ident, qstr in sorted(qstrs.values(), key=lambda x: x[0]):
qhash = compute_hash(qstr)
qlen = len(qstr)
print('Q({}, (const byte*)"\\x{:02x}\\x{:02x}\\x{:02x}\\x{:02x}" "{}")'.format(ident, qhash & 0xff, (qhash >> 8) & 0xff, qlen & 0xff, (qlen >> 8) & 0xff, qstr))
qchlen = len(qstr.decode("utf-8"))
Member

Are you using Python 3 to run this script? For me it fails, because qstr is a str, and strs don't have a decode method. qstr should be bytes, or don't use decode.

Contributor Author

I just let the makefile run it, and if I had to guess, I'd say it either used the system Python (Debian Wheezy, so that's 2.7.3) or it chose a Python that it wanted.

The point may become moot, though, if charlen is removed from qstrs.
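If the charlen does stay in the qstr table, one portable way for the script to compute it under both Python 2 and Python 3 (a hypothetical sketch, not the actual patch) is to decode only when the value really is a byte string:

```python
def char_length(qstr):
    # Under Python 3 qstr may already be str; under Python 2 it is bytes.
    # Decode only in the bytes case, so both interpreters agree.
    if isinstance(qstr, bytes):
        qstr = qstr.decode("utf-8")
    return len(qstr)

assert char_length(b"h\xc3\xa9") == 2   # bytes input: 3 bytes, 2 characters
assert char_length("hé") == 2           # str input passes straight through
```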

@dpgeorge
Member

dpgeorge commented Jun 7, 2014

A note on binary size: stmhal/ grows by about 3.5k, and 2k of this is due to the increase in size of the qstrs: there are almost 700 qstrs, and each grew by 3 bytes.

bare-arm/ grew by 2k, with about 700 bytes going to increase in qstr size.

Surely there is room for space optimisation here.

@Rosuav
Contributor Author

Rosuav commented Jun 7, 2014

There, flags byte removed from all strings, which should reduce sizes a bit. I'll dig into the charlen changes later; they'll require more than "git revert c239f5" followed by merge resolution :)

@pfalcon
Contributor

pfalcon commented Jun 7, 2014

Breaking out Unicode strings from the existing string types would be an option, but that involves a lot more digging into the deeper parts of the code than I'm prepared to do right now. I'll get to these easy fixes you've mentioned, but then I might leave the open-heart surgery to someone like you.

Yes, sure: unicode support is a big enough undertaking that you can count on others to help. On the other hand, the current patchset shows that you navigate the code pretty well already, so looking into how types are made in uPy might be the next step. I could suggest objfloat.c as a not-too-big example. const mp_obj_type_t mp_type_float is the definition of the type, and you'll surely find a lot of similarity with CPy.

I elaborated in #657 (comment) on why I think keeping support for 8-bit strs is important, so deciding on a separate unicode str type is on the critical path to see how to move forward. And I guess we should create a "unicode" branch upstream and land the code there. So please see if you'd be interested in splitting out a separate type for uni on your side; otherwise I can do that (but it will probably require squashing and rebasing).

@@ -83,8 +84,8 @@ const static qstr_pool_t const_pool = {
10, // set so that the first dynamically allocated pool is twice this size; must be <= the len (just below)
MP_QSTR_number_of, // corresponds to number of strings in array just below
{
(const byte*) "\0\0\0\0", // invalid/no qstr has empty data
Contributor

These now need to be reverted too, right?

Contributor Author

Good catch, thanks. Fixed.

@Rosuav
Contributor Author

Rosuav commented Jun 9, 2014

Changes made as per comments. The executable has shrunk somewhat.

Previously:
   text    data     bss     dec     hex filename
 275857    1352    1216  278425   43f99 micropython
As of the above:
   text    data     bss     dec     hex filename
 281009    1352    1216  283577   453b9 micropython
Now:
   text    data     bss     dec     hex filename
 277561    1352    1216  280129   44641 micropython

So Unicode now costs just under 2KB.
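Working the deltas from the "dec" column of the tables above confirms both figures quoted in this thread:

```python
# Sizes from the three builds reported above (the "dec" column).
baseline, first_cut, current = 278425, 283577, 280129

assert first_cut - baseline == 5152   # initial cost: ~5KB, as noted earlier
assert current - baseline == 1704     # after the fixes: just under 2KB
```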

@pfalcon
Contributor

pfalcon commented Jun 9, 2014

Now, that's what I love! We'll surely need some more patching of the string methods, but it's good to know that this basic ("len() and subscription") unicode support costs (fingers crossed) under 3KB.

Which reminds me that we should plan for next level - isalpha() and friends (in #657).

@dpgeorge
Member

dpgeorge commented Jun 9, 2014

Very nice indeed! I'll review it properly as soon as I have time.

@Rosuav
Contributor Author

Rosuav commented Jun 9, 2014

By the way, I added some extra test cases, including slicing and such. What I have not added, so far, is bounds checking that involves thrown exceptions - so, for instance, the tests verify that "asdf"[0] == "a", but not that "asdf"[7] throws.
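Such a bounds-checking test would look something like this (a hypothetical test case, not part of the suite as it stands):

```python
# Verify that out-of-range indexing raises IndexError rather than crashing.
try:
    "asdf"[7]
except IndexError:
    pass
else:
    raise AssertionError("expected IndexError for out-of-range index")
```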

@Rosuav
Contributor Author

Rosuav commented Jun 9, 2014

Just added another test (which already passes). String encoding currently seems to assume UTF-8, but that's going to be the most important anyway.

@@ -100,6 +100,9 @@ bool unichar_isupper(unichar c);
bool unichar_islower(unichar c);
unichar unichar_tolower(unichar c);
unichar unichar_toupper(unichar c);
uint unichar_charlen(const char *str, uint len);
Member

Should this be called utf8_charlen?

@pfalcon
Contributor

pfalcon commented Jun 14, 2014

Squashed and merged to https://github.com/micropython/micropython/tree/unicode branch, thanks. Further discussion happens (so far) in #657.

@pfalcon pfalcon closed this Jun 14, 2014
3 participants