-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
Fix issues with zero-width string fields #6430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Do you have a sense of how difficult it would be to simply enable
|
It wouldn't be difficult technically, but it would be difficult for preserving backwards compat. For example currently if one does: >>> np.array(['abc', 'def'], dtype=str) this is equivalent (on Python 3) to >>> np.array(['abc', 'def'], dtype='U0') but this will automatically set the dtype to fit the length of the largest string in the list. There's no obvious way to distinguish this "auto-string" feature from explicitly requesting a dtype of zero width. Perhaps one exception would be if all the strings in the input array are zero width (currently this results in 'U1' where one would expect 'U0'). |
In either case, I think the behavior should be at least documented somewhere, if it isn't already. |
OK, so right now "U0" means "whatever width fits the data, or else width 1".
|
No, it's like, a 1 line change or so. However, it would impact anything that uses |
Thanks!
|
(In the long run I think the ideal would be something like: dtype=str and
|
So basically allowing zero-width strings in general is just a matter of deleting these lines: numpy/numpy/core/src/multiarray/ctors.c Lines 946 to 955 in 6cbd724
it doesn't break any of the existing tests that I can tell. It would change the existing behavior of several functions--but it's behavior that it would seem wrong for anyone to be relying on. In principle I think we could even delete the entire outer if statement that belongs to, and not prevent empty void types either. It doesn't seem to be a problem. |
I agree that might be ideal. But currently it would be too big a change for almost no benefit. I don't think anyone really cares about making arrays of zero-width strings in general. If they really wanted to, with the change I suggested above they can with |
Can you send a note to the list describing the issue and giving an example
|
Sure. |
I ran py2 unit tests for scipy and pandas after deleting the lines you pointed to. FWIW I don't see any relevant test failures, though probably we are more worried about random user scripts. Also, it looks like numpy.io might not be set up to handle 0-sized arrays:
though I haven't investigated that much. (Also, in any case probably older numpy will not be able to load zero-sized arrays saved in newer numpy with these changes). I like the idea of always allowing zero-width strings. I haven't thought of any clear cases we are breaking someone's code, but maybe that's just for lack of imagination! |
Interesting--thanks for pointing out the above case. Should probably be fixed. I can add the fix for that to this PR. I agree I have a hard time thinking of anything it would break. Who knows... |
Another one to think about (I'm not sure it's a problem):
Old result:
New result:
Note that |
Oof, that's terrible :) Especially considering >>> np.zeros(3, dtype='S')
array(['', '', ''], dtype='|S1') and not array(['0', '0', '0'], dtype='|S1) |
Well, didn't seem to be a lot of strong feelings about this on the mailing list, so I'll give it a few more days. In the meantime, how about this: For the sake of getting this in soon as a bug fix maybe merge this PR as is (once I add tests). I'll then open a separate issue for making the change to support zero-width strings in general. (Yes, that would mean going from a more complicated change to a simpler change, but it would still probably have the least impact overall, and only changes internal functions). |
☔ The latest upstream changes (presumably #6208) made this pull request unmergeable. Please resolve the merge conflicts. |
Will rebase this shortly and add the requisite tests. It turns out this affects me more than I previously thought: I'm now finding myself in a position where I need to be able to create dtypes with arbitrary offsets for the fields, and supporting zero-width fields. As far as I've been able to find this is currently impossible but I may be missing some subtle workaround...? If I can't find a workaround I'm going to have to write my own custom dtype constructor in C to use in my code, for now :( |
I would guess the problem is this line. If you fix the dtype conversion behavior in this PR, I am pretty sure that line can be removed, presumably solving your problem. |
This PR does fix that line. Doesn't help me for older Numpy versions though. |
Nevermind, it was just due to #6208 after all. At first it wasn't obvious to me what changed there that would impact this but now it's clear. |
Okay, I was able to mostly fix this against #6208, but the test case from #1901 is still failing. This is because, for some reason, assigning to a zero-width string array from a non-zero string array is still resulting in data being copied into the destination array. I'm going to double check, but I'm pretty sure that this did not happen before. Not clear why that would have changed though. Any ideas, @ahaldane ? Nevermind--I double checked and that case was not fixed previously. |
This is probably about ready to go, I think. It would be nice to have this as a bug fix, so my suggestion would be to use this version for now, but maybe open an issue to later outright remove the behavior of converting empty-string dtypes to 1 character string dtypes for a later version. Let me know if there's anything I should do in terms of rebasing / squashing commits. |
The code looks good to me. I would just like to try to think briefly of any special cases we might have missed. For instance, saving the zero-sized arrays with Here are a few more cases I thought about:
That seems fine to leave alone.
same.
That's more dubious.
This also seems a bit dubious. Any thoughts on those? Edit: The last one definitely needs to be fixed (segfault risk). I think you might fix the middle two as well, as they just need a change in logic in the region of |
Probably nobody would want to do it explicitly, though it could crash code that isn't expecting it. I'll see if I can fix that one. The The reshape example is interesting. That should be fixed too. |
Oh right, I forgot about that. This PR ended up being a bug-fix/enhancement rather than changed behavior, but you're right it should be noted in case of problems. I think in summary we allowed 'S0' types in Ping @embray for first dibs on making the PR for that. The note goes in |
Sorry; didn't know what Numpy's procedure is for adding to release notes. I can do so--as @ahaldane wrote the most visible change is in |
So what sections should I put thins in the release notes? The |
I don't think we're too picky about exactly which section it goes. Since these are all related changes I might to keep them together in a single note though. I kind of like putting it in "Improvements", since I don't think we're breaking any code that previously worked (so it's not a compatibility issue), but we are fixing a bunch of cases. |
DOC: Mention the changes of #6430 in the release notes.
what is the use case of an 'S0' array? |
Can you give a code example that dumps uninitialized memory? |
here it happens due to the ravel in repr, but tostring() as used in the testsuite also triggers it |
It seems to me that we should always use that constructor, and that the one that does not should not even exist |
I am wondering why we allow it at all, I cannot think of a practical use case for this, it is an array of zeros that cannot be modified. |
I wonder if this is because of astropys fits module. However that's just an educated guess. |
That doesn't sound right - it's an empty array with no contents that occupies 0 bytes (all of which are modifiable, but that's kinda meaningless). I think there's a consistency argument that says we should allow it, but I do struggle to construct an actual use case. Perhaps so that a function can use |
For one there is a consistency argument to be made, but irrespective of any use case known to you, you could just point out bugs--perhaps in a separate issue where it would be relevant--without superfluous challenges to the legitimacy of some use case.
In fact this is specifically the use case where for me it was most important. As @MSeifert04 suggests, the FITS format allows columns in binary tables that zero width. Why? I'm not sure. I've explored some possible reasons for this in the past but I don't remember. Nevertheless they have and do occur in the wild. It's very difficult to represent such a table as a structured dtype without supporting zero-width fields in the structure because it means lots of extra book-keeping to map fields in a structured array with logical fields in the FITS format. It's much easier and more consistent to allow fields that are just empty arrays. In fact, as pointed out, this was already supported in the form of
In fact I (tried) to do exactly that. And it's not exactly a difficult case to support--in fact in a way it takes less effort and doing so in fact makes for more correct code in any many cases, rather than brushing it under the rug (as the conversion from |
I've had cases like this as well come up too. You may want to have the same data structure throughout even if |
hm it should be fixable by not attempting to change the string description when creating a view. |
Surgery complete at #8970 |
There is a behavior that probably dates far back, that Numpy tries really hard not to let you make arrays with zero-width strings as the dtype. It normally will promote them to a width of at least one (or more, depending on the context). That can be a little surprising, but we probably don't want to change it.
However, (perhaps as an oversight) it is possible to make a structured array that has zero-width strings for one of its fields. But when viewing such a field the dtype of the view is converted to 'S1' (or 'U1' for unicode), leading to data corruption, overflows, etc. So this PR ensures that arrays of zero-width strings can be created at least in the special case of viewing a field of a structured array.
This also lifts the restrictions in the
dtype
constructor that prevented creation of structured dtypes with zero-width fields.Fixes #2196, #473, #4955, #2585, and #1901 .
(I'll add some regression tests in a bit)