-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
Basic implementation of unicode strings handling #695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks @pfalcon, really nice. I will take a good look through it. |
Squashed commit of the following: commit 99dc21b Author: Chris Angelico <rosuav@gmail.com> Date: Thu Jun 12 02:18:54 2014 +1000 Optimize as per TODO (thanks Damien!) commit 5bf0153 Author: Chris Angelico <rosuav@gmail.com> Date: Tue Jun 10 08:42:06 2014 +1000 Test a default (= UTF-8) encode and decode commit c962057 Merge: e2c9782 195de32 Author: Chris Angelico <rosuav@gmail.com> Date: Tue Jun 10 05:23:03 2014 +1000 Merge branch 'master' into unicode, resolving conflict on py/obj.h commit e2c9782 Author: Chris Angelico <rosuav@gmail.com> Date: Tue Jun 10 05:05:57 2014 +1000 More whitespace fixups commit 086a2a0 Author: Chris Angelico <rosuav@gmail.com> Date: Tue Jun 10 05:04:20 2014 +1000 Properly implement string slicing commit 0d339a1 Author: Chris Angelico <rosuav@gmail.com> Date: Tue Jun 10 02:24:11 2014 +1000 Support slicing in str_index_to_ptr, and fix a bounds error commit 24371c7 Author: Chris Angelico <rosuav@gmail.com> Date: Tue Jun 10 02:10:22 2014 +1000 Break out index-to-pointer calculation into a function commit 616c24a Author: Chris Angelico <rosuav@gmail.com> Date: Tue Jun 10 02:03:11 2014 +1000 Add tests of string slicing, which currently fail commit a24d19f Author: Chris Angelico <rosuav@gmail.com> Date: Tue Jun 10 01:56:53 2014 +1000 Change string indexing to not precalculate the charlen, and add test for neg indexing commit 0bcc7ab Author: Chris Angelico <rosuav@gmail.com> Date: Sun Jun 8 22:09:17 2014 +1000 Clean up constant qstr declarations now that charlen isn't needed commit 5473e1a Author: Chris Angelico <rosuav@gmail.com> Date: Sun Jun 8 07:18:42 2014 +1000 Remove the charlen field from strings, calculating it when required commit 5c1658e Author: Chris Angelico <rosuav@gmail.com> Date: Sun Jun 8 07:11:27 2014 +1000 Get rid of mp_obj_str_get_data_len() which was used in only one place commit a019ba9 Author: Chris Angelico <rosuav@gmail.com> Date: Sun Jun 8 06:58:26 2014 +1000 Add a unichar_charlen() function to calculate length-in-characters from length-in-bytes commit 44b0d5c Author: Chris Angelico <rosuav@gmail.com> Date: Sun Jun 8 06:32:44 2014 +1000 Use utf8_get/next_char in building up a string's repr commit 30d1bad Author: Chris Angelico <rosuav@gmail.com> Date: Sun Jun 8 06:10:45 2014 +1000 Make utf8_get_char() and utf8_next_char() actually do what their names say commit bc990da Author: Chris Angelico <rosuav@gmail.com> Date: Sun Jun 8 02:10:59 2014 +1000 Revert "Add PEP 393-flags to strings and stub usage." This reverts commit c239f50. commit f9bebb2 Author: Chris Angelico <rosuav@gmail.com> Date: Sat Jun 7 15:41:48 2014 +1000 Whitespace fixes commit 279de0c Author: Chris Angelico <rosuav@gmail.com> Date: Sat Jun 7 15:28:35 2014 +1000 Formatting/layout improvements - introduce macros for UTF-8 byte detection, add braces. No functional changes. commit f1911f5 Author: Chris Angelico <rosuav@gmail.com> Date: Sat Jun 7 11:56:02 2014 +1000 Make chr() Unicode-aware commit f51ad73 Author: Chris Angelico <rosuav@gmail.com> Date: Sat Jun 7 11:44:07 2014 +1000 Make a string's repr Unicode-aware commit 01bd686 Author: Chris Angelico <rosuav@gmail.com> Date: Sat Jun 7 11:33:43 2014 +1000 Expand the Unicode tests commit 7bc9190 Author: Chris Angelico <rosuav@gmail.com> Date: Sat Jun 7 11:27:30 2014 +1000 Record byte lengths for byte strings commit bb13212 Author: Chris Angelico <rosuav@gmail.com> Date: Sat Jun 7 11:25:06 2014 +1000 Make ord() Unicode-aware commit 03f0cbe Author: Chris Angelico <rosuav@gmail.com> Date: Sat Jun 7 10:24:35 2014 +1000 Retain characters as UTF-8 encoded Unicode commit e924659 Author: Chris Angelico <rosuav@gmail.com> Date: Sat Jun 7 08:37:27 2014 +1000 Add support for \u and \U escapes, but not \N (with explanatory comment) commit 231031a Author: Chris Angelico <rosuav@gmail.com> Date: Sat Jun 7 05:09:35 2014 +1000 Add character length to qstr commit 6df1b94 Author: Chris Angelico <rosuav@gmail.com> Date: Fri Jun 6 13:48:36 2014 +1000 Add test of UTF-8 encoded source file resulting in properly formed string commit 16429b8 Author: Chris Angelico <rosuav@gmail.com> Date: Fri Jun 6 13:44:15 2014 +1000 Make len(s) return character length (even though creation's still buggy) commit cd2cf66 Author: Chris Angelico <rosuav@gmail.com> Date: Fri Jun 6 13:15:36 2014 +1000 HACK - When indexing a qstr, count its charlen. Stupidly inefficient but POC. All tests pass now, though string creation is still buggy. commit 47c2345 Author: Chris Angelico <rosuav@gmail.com> Date: Fri Jun 6 13:15:32 2014 +1000 objstr: Record character length separately from byte length CAUTION: Buggy, may crash stuff - qstr needs equivalent functionality too commit b0f41c7 Author: Chris Angelico <rosuav@gmail.com> Date: Fri Jun 6 05:37:36 2014 +1000 Beginnings of UTF-8 support - construct strings from that many UTF-8-encoded chars, and subscript bytes the same way commit 89452be Author: Chris Angelico <rosuav@gmail.com> Date: Fri Jun 6 05:28:47 2014 +1000 Update comments - now aiming for UTF-8 rather than PEP 393 strings commit c239f50 Author: Chris Angelico <rosuav@gmail.com> Date: Wed Jun 4 05:28:12 2014 +1000 Add PEP 393-flags to strings and stub usage. The test suite all passes, but nothing has actually been changed.
Useful when we have pointer to char inside string, but need to return char index. (E.g. str.find()).
Based on config define.
} | ||
|
||
STATIC const mp_map_elem_t str_locals_dict_table[] = { | ||
#if MICROPY_CPYTHON_COMPAT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think that we can safely exclude encode and decode methods (with CPYTHON_COMPAT option)? You can use bytes() and str() instead, but is that the more "Pythonic" way / more used way?
Would it be better to have a different (more descriptive) config option for these methods, eg MICROPY_PY_BUILTINS_STR_ENCODE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think that we can safely exclude encode and decode methods
No ;-(
You can use bytes() and str() instead, but is that the more "Pythonic" way / more used way?
So, converting via encoding arg to constructors is Py3-only feature, and so nobody uses it (well, I never saw a case so far), whereas .encode()/.decode() are well known Py2 methods.
Would it be better to have a different (more descriptive) config option for these methods, eg MICROPY_PY_BUILTINS_STR_ENCODE?
Yes, now that we have good config scheme, makes sense to do that. But that's not exactly unicode-specific, as objstr.c has similar usage.
Conflicts: py/mpconfig.h
This enables testing unicode and non-unicode implementations.
Unicode is disabled by default for now, since FileIO.read(n) is currently not implemented for text-mode files, and this is an often function.
Okay, that's merged. I made a few small changes, and added some comments. It would be good to enable by default unicode support in both unix and stmhal ports. To do that, need to implement FileIO.read(n) for text-mode files. Shall we delete the unicode branch? |
Thanks! I submitted #726 for buffered streams implementation. More smaller tweaks needed here and there too. Yes, I'll delete the branch. |
This implements basic, but IMHO, mergeable, unicode strings support. Features added beyond initial patchset (#671) are discussed at
#657 (comment) .