.itemsize vs .dtype.itemsize on np.unicode_ objects #8901

Closed
eric-wieser opened this issue Apr 6, 2017 · 9 comments

eric-wieser commented Apr 6, 2017

Was slightly surprised today to find that these are not equal:

>>> sys.version
'2.7.11 (v2.7.11:6d1b6a68f775, Dec  5 2015, 20:40:30) [MSC v.1500 64 bit (AMD64)]'
>>> sys.maxunicode
65535  # a narrow (UCS2) build - pretty sure this only happens in this situation

>>> x = np.unicode_("test")
>>> x.itemsize
8
>>> x.dtype.itemsize
16
>>> np.array(x).itemsize
16
>>> np.array(x).dtype.itemsize
16

One of these is not like the others. Why?

eric-wieser commented Apr 6, 2017

This is problematic with something like this:

>>> x.view((np.byte, x.nbytes))  # this should usually work
ValueError: new type not compatible with array.
>>> x.view((np.byte, x.nbytes * 2))  # ???
array([116,   0,   0,   0, 101,   0,   0,   0, 115,   0,   0,   0, 116,
         0,   0,   0], dtype=int8)

Although it seems that it is sometimes correct:

>>> b = buffer(x)
>>> len(b) == x.nbytes
True
>>> list(b)
['t', '\x00', 'e', '\x00', 's', '\x00', 't', '\x00']

Why does unicode seem to have different binary representations depending on how you look at it? Aren't both buffer and .view((np.byte,...)) supposed to simply expose the underlying memory?
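
For reference, a minimal sketch (assuming a little-endian machine) showing that the array storage, at least, really is four bytes per character:

    import numpy as np

    x = np.array(u"test")   # 0-d array, dtype '<U4': 4 bytes per character
    raw = x.tobytes()       # the raw bytes NumPy actually stores
    print(len(raw))         # 16 == x.nbytes
    print(raw[:8])          # 't\x00\x00\x00e\x00\x00\x00' -- little-endian UCS4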

charris commented Apr 6, 2017

Python 2.7 can be compiled with either 16-bit or 32-bit unicode, whereas NumPy always uses 32-bit unicode when storing unicode strings, in order to avoid problems.
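
A minimal way to check both sides of that (a sketch; on the narrow Windows build from the original report the first number comes out as 2):

    import sys
    import numpy as np

    # Narrow (UCS2) builds of Python 2 report sys.maxunicode == 0xFFFF;
    # wide (UCS4) builds report 0x10FFFF.
    py_char_bytes = 2 if sys.maxunicode == 0xFFFF else 4

    # NumPy's 'U' dtype always allocates 4 bytes per character,
    # regardless of how the interpreter was built.
    np_char_bytes = np.dtype('U1').itemsize

    print(py_char_bytes, np_char_bytes)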

eric-wieser commented Apr 6, 2017

Numpy always uses 32 bit unicode when storing unicode strings

Except in np.unicode_, when it somehow both does and doesn't?

charris commented Apr 6, 2017

I'll guess that np.unicode_ aliases the Python type. My Python is compiled to use UCS4:

/usr/include/python2.7/pyconfig-64.h:#define Py_UNICODE_SIZE 4

and

In [1]: np.unicode_("test").itemsize
Out[1]: 16

I believe Microsoft itself uses wide characters (16 bits) for unicode.

eric-wieser commented Apr 6, 2017

I'll guess that np.unicode_ aliases the Python type

If you mean np.unicode_ is unicode, then that's not the case. Perhaps unicode_ internally stores a unicode.

I believe Microsoft itself uses wide characters (16 bits) for unicode.

Indeed - I'm only seeing this for that reason, but I think the difference in storage between arrays and numpy scalars is screwy.

charris commented Apr 6, 2017

The numpy scalar type appears to subclass unicode in order to add some attributes:

In [16]: isinstance(np.unicode_("test"), unicode)
Out[16]: True
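
If so, that would explain the split: .itemsize reports the subclassed builtin's own buffer, while .dtype.itemsize reports NumPy's 'U' dtype. A sketch consistent with the numbers in this thread (the 8 assumes the narrow build from the original report):

    import numpy as np

    x = np.unicode_(u"test")
    # Scalar side: the buffer of the underlying builtin unicode object.
    print(x.itemsize)        # 8 on a narrow build (4 chars x 2 bytes)
    # dtype side: NumPy's always-UCS4 array storage.
    print(x.dtype.itemsize)  # 16 on any build (4 chars x 4 bytes)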

charris commented Apr 6, 2017

See numpy/core/numerictypes.py, lines 262-275

    if sys.version_info[0] >= 3:
        if name == 'bytes_':
            char = 'S'
            base = 'bytes'
        elif name == 'str_':
            char = 'U'
            base = 'str'
    else:
        if name == 'string_':
            char = 'S'
            base = 'string'
        elif name == 'unicode_':
            char = 'U'
            base = 'unicode'
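
So on Python 2 the scalar type really does inherit from the builtin unicode type; a one-line check (sketch, Python 2 names matching the else branch above):

    import numpy as np

    # The scalar's own buffer -- what .itemsize and buffer(x) report --
    # is therefore whatever representation the interpreter uses.
    print(unicode in np.unicode_.__mro__)   # True on Python 2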

@eric-wieser

Yep, you're right, that is the reason.

@eric-wieser

Pretty sure this was fixed by #15385; now all unicode objects in numpy are UCS4.
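
A quick way to confirm on a current NumPy (a sketch; np.str_ is the Python 3 spelling of np.unicode_, and the 16s are what the fix should give for a 4-character string):

    import numpy as np

    x = np.str_("test")
    # With UCS4 used everywhere, scalar and array should now agree:
    print(x.itemsize, x.dtype.itemsize, np.array(x).itemsize)
    # expected: 16 16 16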
