Basic implementation of unicode strings handling #695

pfalcon · 2014-06-15T20:32:36Z

This implements basic but, IMHO, mergeable unicode string support. Features added beyond the initial patchset (#671) are discussed at #657 (comment).

dpgeorge · 2014-06-19T15:48:21Z

Thanks @pfalcon, really nice. I will take a good look through it.


Squashed commit of the following:

commit 99dc21b
Author: Chris Angelico <rosuav@gmail.com>
Date: Thu Jun 12 02:18:54 2014 +1000

    Optimize as per TODO (thanks Damien!)

commit 5bf0153
Author: Chris Angelico <rosuav@gmail.com>
Date: Tue Jun 10 08:42:06 2014 +1000

    Test a default (= UTF-8) encode and decode

commit c962057
Merge: e2c9782 195de32
Author: Chris Angelico <rosuav@gmail.com>
Date: Tue Jun 10 05:23:03 2014 +1000

    Merge branch 'master' into unicode, resolving conflict on py/obj.h

commit e2c9782
Author: Chris Angelico <rosuav@gmail.com>
Date: Tue Jun 10 05:05:57 2014 +1000

    More whitespace fixups

commit 086a2a0
Author: Chris Angelico <rosuav@gmail.com>
Date: Tue Jun 10 05:04:20 2014 +1000

    Properly implement string slicing

commit 0d339a1
Author: Chris Angelico <rosuav@gmail.com>
Date: Tue Jun 10 02:24:11 2014 +1000

    Support slicing in str_index_to_ptr, and fix a bounds error

commit 24371c7
Author: Chris Angelico <rosuav@gmail.com>
Date: Tue Jun 10 02:10:22 2014 +1000

    Break out index-to-pointer calculation into a function

commit 616c24a
Author: Chris Angelico <rosuav@gmail.com>
Date: Tue Jun 10 02:03:11 2014 +1000

    Add tests of string slicing, which currently fail

commit a24d19f
Author: Chris Angelico <rosuav@gmail.com>
Date: Tue Jun 10 01:56:53 2014 +1000

    Change string indexing to not precalculate the charlen, and add test for neg indexing

commit 0bcc7ab
Author: Chris Angelico <rosuav@gmail.com>
Date: Sun Jun 8 22:09:17 2014 +1000

    Clean up constant qstr declarations now that charlen isn't needed

commit 5473e1a
Author: Chris Angelico <rosuav@gmail.com>
Date: Sun Jun 8 07:18:42 2014 +1000

    Remove the charlen field from strings, calculating it when required

commit 5c1658e
Author: Chris Angelico <rosuav@gmail.com>
Date: Sun Jun 8 07:11:27 2014 +1000

    Get rid of mp_obj_str_get_data_len() which was used in only one place

commit a019ba9
Author: Chris Angelico <rosuav@gmail.com>
Date: Sun Jun 8 06:58:26 2014 +1000

    Add a unichar_charlen() function to calculate length-in-characters from length-in-bytes

commit 44b0d5c
Author: Chris Angelico <rosuav@gmail.com>
Date: Sun Jun 8 06:32:44 2014 +1000

    Use utf8_get/next_char in building up a string's repr

commit 30d1bad
Author: Chris Angelico <rosuav@gmail.com>
Date: Sun Jun 8 06:10:45 2014 +1000

    Make utf8_get_char() and utf8_next_char() actually do what their names say

commit bc990da
Author: Chris Angelico <rosuav@gmail.com>
Date: Sun Jun 8 02:10:59 2014 +1000

    Revert "Add PEP 393-flags to strings and stub usage."

    This reverts commit c239f50.

commit f9bebb2
Author: Chris Angelico <rosuav@gmail.com>
Date: Sat Jun 7 15:41:48 2014 +1000

    Whitespace fixes

commit 279de0c
Author: Chris Angelico <rosuav@gmail.com>
Date: Sat Jun 7 15:28:35 2014 +1000

    Formatting/layout improvements - introduce macros for UTF-8 byte detection, add braces. No functional changes.

commit f1911f5
Author: Chris Angelico <rosuav@gmail.com>
Date: Sat Jun 7 11:56:02 2014 +1000

    Make chr() Unicode-aware

commit f51ad73
Author: Chris Angelico <rosuav@gmail.com>
Date: Sat Jun 7 11:44:07 2014 +1000

    Make a string's repr Unicode-aware

commit 01bd686
Author: Chris Angelico <rosuav@gmail.com>
Date: Sat Jun 7 11:33:43 2014 +1000

    Expand the Unicode tests

commit 7bc9190
Author: Chris Angelico <rosuav@gmail.com>
Date: Sat Jun 7 11:27:30 2014 +1000

    Record byte lengths for byte strings

commit bb13212
Author: Chris Angelico <rosuav@gmail.com>
Date: Sat Jun 7 11:25:06 2014 +1000

    Make ord() Unicode-aware

commit 03f0cbe
Author: Chris Angelico <rosuav@gmail.com>
Date: Sat Jun 7 10:24:35 2014 +1000

    Retain characters as UTF-8 encoded Unicode

commit e924659
Author: Chris Angelico <rosuav@gmail.com>
Date: Sat Jun 7 08:37:27 2014 +1000

    Add support for \u and \U escapes, but not \N (with explanatory comment)

commit 231031a
Author: Chris Angelico <rosuav@gmail.com>
Date: Sat Jun 7 05:09:35 2014 +1000

    Add character length to qstr

commit 6df1b94
Author: Chris Angelico <rosuav@gmail.com>
Date: Fri Jun 6 13:48:36 2014 +1000

    Add test of UTF-8 encoded source file resulting in properly formed string

commit 16429b8
Author: Chris Angelico <rosuav@gmail.com>
Date: Fri Jun 6 13:44:15 2014 +1000

    Make len(s) return character length (even though creation's still buggy)

commit cd2cf66
Author: Chris Angelico <rosuav@gmail.com>
Date: Fri Jun 6 13:15:36 2014 +1000

    HACK - When indexing a qstr, count its charlen. Stupidly inefficient but POC.

    All tests pass now, though string creation is still buggy.

commit 47c2345
Author: Chris Angelico <rosuav@gmail.com>
Date: Fri Jun 6 13:15:32 2014 +1000

    objstr: Record character length separately from byte length

    CAUTION: Buggy, may crash stuff - qstr needs equivalent functionality too

commit b0f41c7
Author: Chris Angelico <rosuav@gmail.com>
Date: Fri Jun 6 05:37:36 2014 +1000

    Beginnings of UTF-8 support - construct strings from that many UTF-8-encoded chars, and subscript bytes the same way

commit 89452be
Author: Chris Angelico <rosuav@gmail.com>
Date: Fri Jun 6 05:28:47 2014 +1000

    Update comments - now aiming for UTF-8 rather than PEP 393 strings

commit c239f50
Author: Chris Angelico <rosuav@gmail.com>
Date: Wed Jun 4 05:28:12 2014 +1000

    Add PEP 393-flags to strings and stub usage.

    The test suite all passes, but nothing has actually been changed.


dpgeorge · 2014-06-27T06:25:37Z

py/objstrunicode.c

+}
+
+STATIC const mp_map_elem_t str_locals_dict_table[] = {
+#if MICROPY_CPYTHON_COMPAT


Do you think that we can safely exclude encode and decode methods (with CPYTHON_COMPAT option)? You can use bytes() and str() instead, but is that the more "Pythonic" way / more used way?

Would it be better to have a different (more descriptive) config option for these methods, eg MICROPY_PY_BUILTINS_STR_ENCODE?

pfalcon · 2014-06-27

> Do you think that we can safely exclude encode and decode methods

No ;-(

> You can use bytes() and str() instead, but is that the more "Pythonic" way / more used way?

So, converting via the encoding arg to the constructors is a Py3-only feature, and so nobody uses it (well, I have not seen a case so far), whereas .encode()/.decode() are well-known Py2 methods.

> Would it be better to have a different (more descriptive) config option for these methods, eg MICROPY_PY_BUILTINS_STR_ENCODE?

Yes, now that we have a good config scheme, it makes sense to do that. But that's not exactly unicode-specific, as objstr.c has similar usage.
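For illustration only, a dedicated config option guarding a method table entry could look roughly like the sketch below. The option name is the one floated in the review and may not exist in this form; `sketch_method_t`, `sketch_str_methods` and `sketch_has_method` are hypothetical stand-ins, not the real `mp_map_elem_t` table from py/objstrunicode.c.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical option, using the name suggested in the review.
 * A port would normally set this in its own config header. */
#ifndef MICROPY_PY_BUILTINS_STR_ENCODE
#define MICROPY_PY_BUILTINS_STR_ENCODE (1)
#endif

/* Simplified stand-in for a method table entry (assumption, not mp_map_elem_t). */
typedef struct {
    const char *name;
} sketch_method_t;

static const sketch_method_t sketch_str_methods[] = {
#if MICROPY_PY_BUILTINS_STR_ENCODE
    { "encode" },   /* only compiled in when the option is enabled */
#endif
    { "find" },
    { "rfind" },
    { "index" },
};

#define SKETCH_NUM_METHODS (sizeof(sketch_str_methods) / sizeof(sketch_str_methods[0]))

/* Return 1 if the table contains a method with the given name, else 0. */
static int sketch_has_method(const char *name) {
    for (size_t i = 0; i < SKETCH_NUM_METHODS; i++) {
        if (strcmp(sketch_str_methods[i].name, name) == 0) {
            return 1;
        }
    }
    return 0;
}
```

Building with the option defined as 0 drops the "encode" entry from the table without touching the other methods, which is the point of giving it its own switch rather than tying it to a broad compatibility flag.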


dpgeorge · 2014-06-28T09:36:20Z

Okay, that's merged. I made a few small changes, and added some comments.

It would be good to enable unicode support by default in both the unix and stmhal ports. To do that, we need to implement FileIO.read(n) for text-mode files.

Shall we delete the unicode branch?

pfalcon · 2014-06-28T10:33:59Z

Thanks! I submitted #726 for a buffered streams implementation. More small tweaks are needed here and there too. Yes, I'll delete the branch.

Paul Sokolovsky and others added 28 commits June 27, 2014 00:04

mpconfig.h: Add MICROPY_PY_BUILTINS_STR_UNICODE.

12bc13e

py: Implement basic unicode functions.

c88987c

objstrunicode: Complete copy of objstr, to be patched for unicode support.

8386534

builtin: ord, chr: Unicode support.

9a1a4be

tests: Add unicode test.

1e3781b

lexer, vstr: Add unicode support.

2ba2299

builtin: Restore bytestr compatibility.

42a5251

vstr: Restore bytestr compatibility.

165eb69

py: Prune unneeded code from objstrunicode, reuse code in objstr.

9731912

py: Make MICROPY_PY_BUILTINS_STR_UNICODE=1 buildable.

d215ee1

objstrunicode: Get rid of bytes checking, it's separate type.

86d3898

objstrunicode: Revamp len() handling for unicode, and optimize bool().

e7f2b4c

objstrunicode: Re-add buffer protocol back for now, required for io.StringIO.

cdc020d

objstrunicode: Implement iterator.

79b7fe2

tests: Add test for unicode string iteration.

17994d1

py: Add dedicated unicode header.

ded0fc7

unicode: Add utf8_ptr_to_index().

46d31e9

Useful when we have pointer to char inside string, but need to return char index. (E.g. str.find()).
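As a rough sketch of the idea, a pointer into a UTF-8 string can be turned into a character index by counting the bytes before it that start a character. The function name `sketch_utf8_ptr_to_index` and its signature are assumptions for illustration; the actual utf8_ptr_to_index() in the commit may differ.

```c
#include <assert.h>
#include <stddef.h>

/* True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx. */
#define SKETCH_IS_UTF8_CONT(b) (((b) & 0xC0) == 0x80)

/* Sketch only: return the character index of ptr within the UTF-8 string
 * starting at s. Assumes valid UTF-8 and that ptr points at the first
 * byte of a character at or after s. */
static size_t sketch_utf8_ptr_to_index(const unsigned char *s, const unsigned char *ptr) {
    assert(ptr >= s);
    size_t index = 0;
    for (; s < ptr; s++) {
        /* Every non-continuation byte begins a new character. */
        if (!SKETCH_IS_UTF8_CONT(*s)) {
            index++;
        }
    }
    return index;
}
```

A byte-oriented search can then locate a match as a pointer, and a function like this converts it to the character index that str.find() is expected to return.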

objstr: find(), rfind(), index(): Make return value be unicode-aware.

5048df0

tests: Add tests for unicode find()/rfind()/index().

b1949e4

unicode: Make get_char()/next_char()/charlen() be 8-bit compatible.

1044c3d

Based on config define.
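One way such helpers can be made to work for both builds is to switch their bodies on a config macro, as in this minimal sketch. `SKETCH_STR_UNICODE` is a hypothetical stand-in for MICROPY_PY_BUILTINS_STR_UNICODE, and the function names and signatures are illustrative rather than those in the commit.

```c
#include <stddef.h>

/* Hypothetical switch standing in for MICROPY_PY_BUILTINS_STR_UNICODE. */
#ifndef SKETCH_STR_UNICODE
#define SKETCH_STR_UNICODE (1)
#endif

/* Number of characters in the first len bytes of s. */
static size_t sketch_charlen(const unsigned char *s, size_t len) {
#if SKETCH_STR_UNICODE
    /* UTF-8 build: count bytes that are not continuation bytes (10xxxxxx). */
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
#else
    /* 8-bit build: one byte is one character. */
    (void)s;
    return len;
#endif
}

/* Pointer to the start of the character following the one at s.
 * Assumes s is NUL-terminated, so the scan stops at the terminator. */
static const unsigned char *sketch_next_char(const unsigned char *s) {
#if SKETCH_STR_UNICODE
    s++;
    while ((*s & 0xC0) == 0x80) {
        s++;
    }
    return s;
#else
    return s + 1;
#endif
}
```

Callers stay the same in both configurations; only the helper bodies change with the define.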

objstrunicode: Signedness issues.

00c904b

objstr: 64-bit issues.

26fda6d

objstrunicode: Refactor str_index_to_ptr() following objstr.

ea2c936

tests: Test for explicit start/end args to str methods for unicode.

63143c9

misc: Add count_lead_ones() function, useful for UTF-8 handling.

ce81312
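A function of this kind is useful because the number of leading 1 bits in the first byte of a UTF-8 sequence gives the sequence length. The following is a minimal sketch with an assumed name and signature; the count_lead_ones() added in this commit may be implemented differently.

```c
/* Sketch only: count the consecutive 1 bits at the top of a byte. */
static unsigned sketch_count_lead_ones(unsigned char b) {
    unsigned n = 0;
    while (b & 0x80) {
        n++;
        b <<= 1;  /* shift the next bit into the top position */
    }
    return n;
}
```

For a UTF-8 byte, a result of 0 means an ASCII character, 1 means a continuation byte, and 2, 3 or 4 means the lead byte of a sequence of that many bytes.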

streams: Reading by char count from unicode text streams is not implemented.

f5f6c3b

tests: Add basic test for unicode file i/o.

ed07d03

dpgeorge reviewed Jun 27, 2014

dpgeorge added 5 commits June 28, 2014 10:27

Merge branch 'master' into unicode

b3a50f0

Conflicts: py/mpconfig.h

py: Small comments, name changes, use of machine_int_t.

e04a44e

tests: Write output in byte mode, not text mode.

41736f8

This enables testing unicode and non-unicode implementations.

py: Add missing #endif.

8546ce1

unix, stmhal: Add option for STR_UNICODE to mpconfigport.h.

635b60e

Unicode is disabled by default for now, since FileIO.read(n) is currently not implemented for text-mode files, and this is an often-used function.

dpgeorge merged commit 635b60e into master Jun 28, 2014
