Skip to content

BUG, MAINT: Stop using the error-prone deprecated Py_UNICODE apis #15385

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 14, 2020

Conversation

eric-wieser
Copy link
Member

@eric-wieser eric-wieser commented Jan 22, 2020

These APIs work with either UCS2 or UCS4, depending on the value of Py_UNICODE_WIDE.
After python 3.3, there's a better way to handle this type of thing, which means we no longer have to care about this.

Fixes gh-3258
Fixes gh-15363

@eric-wieser eric-wieser force-pushed the fix-unicode-ucs2 branch 3 times, most recently from 4e40618 to 719c892 Compare January 30, 2020 16:22
@eric-wieser eric-wieser changed the title WIP,BUG: Remove some internal UCS2 uses BUG, MAINT: Stop using the error-prone deprecated Py_UNICODE apis Jan 30, 2020
Comment on lines -2647 to -2657
#ifdef Py_UNICODE_WIDE
#define PyArray_UCS4_ISSPACE Py_UNICODE_ISSPACE
#else
#define PyArray_UCS4_ISSPACE(ch) Py_STRING_ISSPACE((char)ch)
#endif
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure this was a bug, but constructing a failing string is non-trivial

Comment on lines +394 to +351
* FIXME:
* is this really a good idea?
* stop using Py_UNICODE here.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #15477

@eric-wieser eric-wieser marked this pull request as ready for review January 30, 2020 16:30
@Zac-HD
Copy link
Contributor

Zac-HD commented Feb 2, 2020

Is this the patch you mentioned for #15363? If so the example would be a nice regression test 😄

@eric-wieser

This comment has been minimized.

@eric-wieser
Copy link
Member Author

It's not the cause of the segfault, but it turns out that using PyUnicode_New(size, 0x10ffff) is not legal unless I actually write a character greater than 0x10000 to the underlying buffer. Otherwise str.__eq__ behaves incorrectly.

@charris
Copy link
Member

charris commented Feb 2, 2020

Welcome to the swamp :)

@eric-wieser
Copy link
Member Author

@Zac-HD: Regression test added

@mattip
Copy link
Member

mattip commented Feb 5, 2020

LGTM

@eric-wieser
Copy link
Member Author

I have two possible worries with this patch:

  • np.str_ scalars may get a little more memory intensive.
  • Making np.str_ support the buffer interface might allow silent conversion to bytes where previously none was allowed. I haven't found a case of this yet.

If reviewers can constructs possible corner cases for this second point, it would be helpful.

@eric-wieser
Copy link
Member Author

I haven't found a case of this yet.

Here's one such case, which used to error, but now assigns the UCS4 codepoints directly:

>>> a = np.zeros(1, 'V8')
>>> a[:] = 'te'
>>> a
array([b'\x74\x00\x00\x00\x65\x00\x00\x00'], dtype='|V8')

I don't think this is any weirder than the behavior of a[:] = 1.0, which assigns the float bits to the void element.

@eric-wieser
Copy link
Member Author

eric-wieser commented Feb 5, 2020

And another one:

>>> a = np.str_("test")
>>> b"%s" % a
b't\x00\x00\x00e\x00\x00\x00s\x00\x00\x00t\x00\x00\x00'

which again, isn't much weirder than b'%s' % np.float_(1.0)

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me, although I did not look super closely. I think the Length discovery in the dtype discovery code was correct before. Was worried for a second about the delayed initialization, but it seems fine.
As for supporting the buffer interface... Yeah, it should mostly be strange with respect to void (and maybe including buffered) datatypes. But overall, it should be strange enough that it does not matter. I suppose we could mention it in the release notes.

These APIs work with either UCS2 or UCS4, depending on the value of `Py_UNICODE_WIDE`.
After python 3.3, there's a better way to handle this type of thing, which means we no longer have to care about this.

Fixes numpygh-3258
Fixes numpygh-15363
@eric-wieser
Copy link
Member Author

Updated with a test of memoryview(np.str_)

Comment on lines -2682 to 2683
if unicode:
if sys.maxunicode == 0xffff:
# On a narrow Python build, the buffer for Unicode
# strings is UCS2, which doesn't match the buffer for
# NumPy Unicode types, which is ALWAYS UCS4.
# Therefore, we need to convert the buffer. On Python
# 2.6 and later, we can use the utf_32 codec. Earlier
# versions don't have that codec, so we convert to a
# numerical array that matches the input buffer, and
# then use NumPy to convert it to UCS4. All of this
# should happen in native endianness.
obj = obj.encode('utf_32')
else:
obj = str(obj)
else:
# Let the default Unicode -> string encoding (if any) take
# precedence.
obj = bytes(obj)

return chararray(shape, itemsize=itemsize, unicode=unicode,
buffer=obj, order=order)
Copy link
Member Author

@eric-wieser eric-wieser Feb 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code never made sense in the first place, as chararray.__new__ has an identity crisis over whether it's trying to be np.ndarray.__new__ or np.array, and accepts str objects in place of the buffer.

Edit: Perhaps it was a workaround for the original bug.

@mattip
Copy link
Member

mattip commented Feb 12, 2020

The test failure is the windows heisenbug. I will put this in soon if there are no objections.

@mattip mattip added the triage review Issue/PR to be discussed at the next triage meeting label Feb 12, 2020
@seberg
Copy link
Member

seberg commented Feb 14, 2020

Thanks for the threat Matti, and thanks Eric. Had another glance over and it looks good to me, so putting it in. The test takes around half a second on my computer. A bit slow, but probably fine (and easily changed later).

@seberg seberg merged commit 1f9ab28 into numpy:master Feb 14, 2020
Zac-HD added a commit to Zac-HD/numpy that referenced this pull request Jun 30, 2020
The bug was reported in numpy#15363 and fixed in numpy#15385, before Numpy decided to allow Hypothesis in it's own test suite.  Since it does now, I thought it would be nice to include the test that found the bug as well as the more specific regression test I wrote.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug triage review Issue/PR to be discussed at the next triage meeting
Projects
None yet
6 participants