-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
BUG, MAINT: Stop using the error-prone deprecated Py_UNICODE apis #15385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
4e40618
to
719c892
Compare
719c892
to
41fc5e8
Compare
#ifdef Py_UNICODE_WIDE | ||
#define PyArray_UCS4_ISSPACE Py_UNICODE_ISSPACE | ||
#else | ||
#define PyArray_UCS4_ISSPACE(ch) Py_STRING_ISSPACE((char)ch) | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure this was a bug, but constructing a failing string is non-trivial
* FIXME: | ||
* is this really a good idea? | ||
* stop using Py_UNICODE here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #15477
Is this the patch you mentioned for #15363? If so the example would be a nice regression test 😄 |
This comment has been minimized.
This comment has been minimized.
It's not the cause of the segfault, but it turns out that using |
Welcome to the swamp :) |
a6a7957
to
a5d8554
Compare
7f3fd86
to
c1e7eae
Compare
@Zac-HD: Regression test added |
c1e7eae
to
634263a
Compare
LGTM |
I have two possible worries with this patch:
If reviewers can constructs possible corner cases for this second point, it would be helpful. |
Here's one such case, which used to error, but now assigns the UCS4 codepoints directly: >>> a = np.zeros(1, 'V8')
>>> a[:] = 'te'
>>> a
array([b'\x74\x00\x00\x00\x65\x00\x00\x00'], dtype='|V8') I don't think this is any weirder than the behavior of |
And another one: >>> a = np.str_("test")
>>> b"%s" % a
b't\x00\x00\x00e\x00\x00\x00s\x00\x00\x00t\x00\x00\x00' which again, isn't much weirder than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good to me, although I did not look super closely. I think the Length discovery in the dtype discovery code was correct before. Was worried for a second about the delayed initialization, but it seems fine.
As for supporting the buffer interface... Yeah, it should mostly be strange with respect to void (and maybe including buffered) datatypes. But overall, it should be strange enough that it does not matter. I suppose we could mention it in the release notes.
These APIs work with either UCS2 or UCS4, depending on the value of `Py_UNICODE_WIDE`. After python 3.3, there's a better way to handle this type of thing, which means we no longer have to care about this. Fixes numpygh-3258 Fixes numpygh-15363
634263a
to
9a49dfb
Compare
This eliminates the need for special casing in `np.generic.__reduce__`
9a49dfb
to
d0b7b66
Compare
Updated with a test of |
if unicode: | ||
if sys.maxunicode == 0xffff: | ||
# On a narrow Python build, the buffer for Unicode | ||
# strings is UCS2, which doesn't match the buffer for | ||
# NumPy Unicode types, which is ALWAYS UCS4. | ||
# Therefore, we need to convert the buffer. On Python | ||
# 2.6 and later, we can use the utf_32 codec. Earlier | ||
# versions don't have that codec, so we convert to a | ||
# numerical array that matches the input buffer, and | ||
# then use NumPy to convert it to UCS4. All of this | ||
# should happen in native endianness. | ||
obj = obj.encode('utf_32') | ||
else: | ||
obj = str(obj) | ||
else: | ||
# Let the default Unicode -> string encoding (if any) take | ||
# precedence. | ||
obj = bytes(obj) | ||
|
||
return chararray(shape, itemsize=itemsize, unicode=unicode, | ||
buffer=obj, order=order) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code never made sense in the first place, as chararray.__new__
has an identity crisis over whether it's trying to be np.ndarray.__new__
or np.array
, and accepts str
objects in place of the buffer
.
Edit: Perhaps it was a workaround for the original bug.
The test failure is the windows heisenbug. I will put this in soon if there are no objections. |
Thanks for the threat Matti, and thanks Eric. Had another glance over and it looks good to me, so putting it in. The test takes around half a second on my computer. A bit slow, but probably fine (and easily changed later). |
The bug was reported in numpy#15363 and fixed in numpy#15385, before Numpy decided to allow Hypothesis in it's own test suite. Since it does now, I thought it would be nice to include the test that found the bug as well as the more specific regression test I wrote.
These APIs work with either UCS2 or UCS4, depending on the value of
Py_UNICODE_WIDE
.After python 3.3, there's a better way to handle this type of thing, which means we no longer have to care about this.
Fixes gh-3258
Fixes gh-15363