.itemsize vs .dtype.itemsize on np.unicode_ objects #8901

Closed
eric-wieser opened this issue Apr 6, 2017 · 9 comments

eric-wieser commented Apr 6, 2017

Was slightly surprised today to find that these are not equal:

>>> sys.version
'2.7.11 (v2.7.11:6d1b6a68f775, Dec  5 2015, 20:40:30) [MSC v.1500 64 bit (AMD64)]'
>>> sys.maxunicode
65535  # a narrow (UCS2) build - pretty sure this only happens in this situation

>>> x = np.unicode_("test")
>>> x.itemsize
8
>>> x.dtype.itemsize
16
>>> np.array(x).itemsize
16
>>> np.array(x).dtype.itemsize
16

One of these is not like the others. Why?

eric-wieser commented Apr 6, 2017

This is problematic with something like this:

>>> x.view((np.byte, x.nbytes))  # this should usually work
ValueError: new type not compatible with array.
>>> x.view((np.byte, x.nbytes * 2))  # ???
array([116,   0,   0,   0, 101,   0,   0,   0, 115,   0,   0,   0, 116,
         0,   0,   0], dtype=int8)

Although it seems that it is sometimes correct:

>>> b = buffer(x)
>>> len(b) == x.nbytes
True
>>> list(b)
['t', '\x00', 'e', '\x00', 's', '\x00', 't', '\x00']

Why does unicode seem to have different binary representations depending on how you look at it? Aren't both buffer and .view((np.byte,...)) supposed to simply expose the underlying memory?
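
For reference, a minimal sketch (assuming a little-endian machine) showing that the array storage, at least, really is four bytes per character:

    import numpy as np

    x = np.array(u"test")   # 0-d array, dtype '<U4': 4 bytes per character
    raw = x.tobytes()       # the raw bytes NumPy actually stores
    print(len(raw))         # 16 == x.nbytes
    print(raw[:8])          # 't\x00\x00\x00e\x00\x00\x00' -- little-endian UCS4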

charris commented Apr 6, 2017

Python 2.7 can be compiled with either 16-bit or 32-bit unicode, whereas NumPy always uses 32-bit unicode when storing unicode strings, in order to avoid problems.
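
A minimal way to check both sides of that (a sketch; on the narrow Windows build from the original report the first number comes out as 2):

    import sys
    import numpy as np

    # Narrow (UCS2) builds of Python 2 report sys.maxunicode == 0xFFFF;
    # wide (UCS4) builds report 0x10FFFF.
    py_char_bytes = 2 if sys.maxunicode == 0xFFFF else 4

    # NumPy's 'U' dtype always allocates 4 bytes per character,
    # regardless of how the interpreter was built.
    np_char_bytes = np.dtype('U1').itemsize

    print(py_char_bytes, np_char_bytes)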

eric-wieser commented Apr 6, 2017

Numpy always uses 32 bit unicode when storing unicode strings

Except in np.unicode_, when it somehow both does and doesn't?

charris commented Apr 6, 2017

I'll guess that np.unicode_ aliases the Python type. My Python is compiled to use UCS4:

/usr/include/python2.7/pyconfig-64.h:#define Py_UNICODE_SIZE 4

and

In [1]: np.unicode_("test").itemsize
Out[1]: 16

I believe Microsoft itself uses wide characters (16 bits) for unicode.

eric-wieser commented Apr 6, 2017

I'll guess that np.unicode_ aliases the Python type

If you mean np.unicode_ is unicode, then that's not the case. Perhaps unicode_ internally stores a unicode.

I believe Microsoft itself uses wide characters (16 bits) for unicode.

Indeed - I'm only seeing this for that reason, but I think the difference in storage between arrays and numpy scalars is screwy.

charris commented Apr 6, 2017

The numpy scalar type appears to subclass unicode in order to add some attributes:

In [16]: isinstance(np.unicode_("test"), unicode)
Out[16]: True
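
If so, that would explain the split: .itemsize reports the subclassed builtin's own buffer, while .dtype.itemsize reports NumPy's 'U' dtype. A sketch consistent with the numbers in this thread (the 8 assumes the narrow build from the original report):

    import numpy as np

    x = np.unicode_(u"test")
    # Scalar side: the buffer of the underlying builtin unicode object.
    print(x.itemsize)        # 8 on a narrow build (4 chars x 2 bytes)
    # dtype side: NumPy's always-UCS4 array storage.
    print(x.dtype.itemsize)  # 16 on any build (4 chars x 4 bytes)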

charris commented Apr 6, 2017

See numpy/core/numerictypes.py, lines 262-275

    if sys.version_info[0] >= 3:
        if name == 'bytes_':
            char = 'S'
            base = 'bytes'
        elif name == 'str_':
            char = 'U'
            base = 'str'
    else:
        if name == 'string_':
            char = 'S'
            base = 'string'
        elif name == 'unicode_':
            char = 'U'
            base = 'unicode'
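
So on Python 2 the scalar type really does inherit from the builtin unicode type; a one-line check (sketch, Python 2 names matching the else branch above):

    import numpy as np

    # The scalar's own buffer -- what .itemsize and buffer(x) report --
    # is therefore whatever representation the interpreter uses.
    print(unicode in np.unicode_.__mro__)   # True on Python 2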

@eric-wieser

Yep, you're right, that is the reason.

@eric-wieser

Pretty sure this was fixed by #15385; now all unicode objects in numpy are UCS4.
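
A quick way to confirm on a current NumPy (a sketch; np.str_ is the Python 3 spelling of np.unicode_, and the 16s are what the fix should give for a 4-character string):

    import numpy as np

    x = np.str_("test")
    # With UCS4 used everywhere, scalar and array should now agree:
    print(x.itemsize, x.dtype.itemsize, np.array(x).itemsize)
    # expected: 16 16 16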
