
Bug in numpy.copy for views of structured arrays in NumPy 1.16.x #13299


Closed
dmbelov opened this issue Apr 10, 2019 · 20 comments

@dmbelov
Contributor

dmbelov commented Apr 10, 2019

The change "multi-field views return a view instead of a copy" in NumPy 1.16 breaks the reasonable behavior of the copy function: it no longer returns a packed array. This bug in copy causes a lot of problems, because it copies much more data than expected.
Below is an example that shows the problem.

Reproducing code example:

import numpy as np
x_i = np.array([(1, 2, 3)] * 10, dtype=[(n, 'f8') for n in ['a', 'b', 'c']])
x_i[['a', 'c']]  # this returns a view as expected
print(x_i[['a', 'c']].dtype)
# {'names':['a','c'], 'formats':['<f8','<f8'], 'offsets':[0,16], 'itemsize':24}

# I expect the code below to return a *packed* structured array
# of dtype=[('a', 'f8'), ('c', 'f8')],
# but instead I see the same output as above.
x_i[['a', 'c']].copy()
print(x_i[['a', 'c']].copy().dtype)
# {'names':['a','c'], 'formats':['<f8','<f8'], 'offsets':[0,16], 'itemsize':24}

Numpy/Python version information:

('1.16.2', '2.7.15 |Anaconda custom (64-bit)| (default, Oct 10 2018, 21:32:13) \n[GCC 7.3.0]')

@mhvk
Contributor

mhvk commented Apr 10, 2019

I agree this is unexpected! Certainly does not happen for regular arrays where a view leads to gaps. cc @ahaldane - it may mean that in .copy() offsets in the dtype are removed.

@ahaldane
Member

Hi @dmbelov ,

The issue here actually isn't in the copy function; that works exactly the same as always. copy has always preserved the dtype exactly, including padding.

What is tripping you up is that x_i[['a', 'c']] now has padding bytes. copy merely preserves those new padding bytes.

In the new docs (visible here) we discuss this change. In your case, I think the best solution is to use the function numpy.lib.recfunctions.repack_fields:

>>> import numpy.lib.recfunctions as rf
>>> x_i = np.array([(1, 2, 3)] * 10, dtype=[(n, 'f8') for n in ['a', 'b', 'c']])
>>> print(rf.repack_fields(x_i[['a', 'c']]).dtype)                             
[('a', '<f8'), ('c', '<f8')]

Note that repack_fields avoids making a copy if the dtype is already packed, which may actually be an optimization for your code.
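The no-copy claim can be checked with np.shares_memory (a minimal sketch; the packed and padded dtypes below are chosen to match the example above):

```python
import numpy as np
import numpy.lib.recfunctions as rf

# A packed dtype: fields sit back to back, no padding bytes.
packed = np.zeros(3, dtype=[('a', 'f8'), ('c', 'f8')])
# A padded dtype, like the one produced by x_i[['a', 'c']] in 1.16.
padded = np.zeros(3, dtype={'names': ['a', 'c'],
                            'formats': ['f8', 'f8'],
                            'offsets': [0, 16],
                            'itemsize': 24})

# Already packed: repack_fields hands back the same buffer.
no_copy = np.shares_memory(rf.repack_fields(packed), packed)
# Padded: repack_fields must produce a fresh, compact copy.
made_copy = not np.shares_memory(rf.repack_fields(padded), padded)
print(no_copy, made_copy)
```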

@mhvk
Contributor

mhvk commented Apr 10, 2019

@ahaldane - I now see that in a way this is not a regression, but I still wonder whether the current behaviour is desired: it does feel like the copy should be maximally compact, just like for a strided view, and not dissimilar to the fact that copy also changes the memory order (well, at least ndarray.copy() does...). I basically see no advantage whatsoever to keeping the padding! (In some sense the same holds for strings, yet I'd not argue to shorten those to the longest string in the array. But at least for strings you can set an entry to a longer string; here the memory is essentially inaccessible, so why waste it?)
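For comparison, this is what copy() already does for a strided view of a regular array (a minimal sketch):

```python
import numpy as np

a = np.arange(10.0)      # 10 contiguous f8 values, 80 bytes
v = a[::2]               # strided view: every other element, 16-byte stride
c = v.copy()             # the copy is compacted back to an 8-byte stride

print(v.strides, c.strides)  # (16,) (8,)
print(c.nbytes)              # 40: no space wasted on the skipped elements
```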

@ahaldane
Member

The main reason to keep the padding, in my mind, is for consistency.

copy currently satisfies a "contract" with the user that it never changes the dtype, or really almost any property of the array being copied.

If copy changed the dtype by removing padding, that might be unexpected. For instance, someone might find that a function that worked fine on the original array, failed when passed a copy of the original array, which seems nonsensical.

@dmbelov
Contributor Author

dmbelov commented Apr 10, 2019

I do not understand the reason for keeping the padding: one expects that x_i[['a', 'c']].copy() copies only columns 'a' and 'c', but to get the padding it must also copy column 'b' (or reserve some empty space), which I do not want to copy. This results in a waste of memory (think of the situation where you have 100 columns and want to copy only 2). Therefore, I think this behavior of copy should be changed.

@mhvk
Contributor

mhvk commented Apr 10, 2019

I think your point about phrasing it about a copy behaving the same as the original is helpful, but I'm not sure I agree with the conclusion.

Certainly, two dtypes holding the same fields but with different offsets are not considered equal. On the other hand, two arrays with those dtypes and the same numbers are considered equal, and to me it seems more logical that that is what counts: the contract would be that the visible content of the array is preserved, i.e., any way you access an array element, you get the identical answer. I don't quite see why the dtype itself has to be identical, just as I cannot expect the strides to be identical. (Then again, following this logic, endianness should be adjusted too...)

Also, if it does matter to preserve the holes, then shouldn't what is in them be preserved too? Currently, that is not the case:

np.arange(9.).reshape(3, 3).view('f8,f8,f8')[['f0', 'f2']].copy().view('3f8').squeeze()
array([[0.000000e+000, 0.000000e+000, 2.000000e+000],
       [3.000000e+000, 1.221188e-316, 5.000000e+000],
       [6.000000e+000,           nan, 8.000000e+000]])

I'm not sure how that is going to help any code!

Anyway, I'm not trying to conclude I'm right here, just that there is I think a valid argument for packing copies. Also I'm not sure what is the bigger risk for regressions: that copies of arrays created with dtypes with holes no longer have holes, or that, because of the change to a view, copies of parts now do have holes. Certainly, @dmbelov ran into the second problem...

p.s. If we do not change the behaviour, it would be good to update the documentation of copy to point to repack_fields... And one might also have to add warnings for things like np.save - most people will not be pleased to be writing more data than needed!

@dmbelov
Contributor Author

dmbelov commented Apr 10, 2019

Comment on @ahaldane's updates:

  1. I do not see how fixing the behavior of copy in this case can break anybody's code, because this strange behavior only appeared in NumPy 1.16.
  2. Following this logic, why does x_i[2:3].copy() not copy the whole array x_i?

@ahaldane
Member

Again, I think the copy is a red herring here. It is the new indexing behavior that creates padding, not the copy operation. The copy operation never has, and still does not, remove any padding.

For instance, np.save('fname', x[['a', 'c']]) does not involve a copy operation, and in 1.16 it writes padding bytes while previously it did not. To repeat, in numpy 1.13 this operation already removed padding bytes, without a copy. It is not the copy that removes padding. Likewise, the code np.save('fname', np.array([1], dtype={'names': ['f0'], 'formats': ['i4'], 'offsets':[8]}).copy()) writes padding bytes to file, both in numpy 1.13 and 1.16, and it does involve a copy. The copy does not remove any padding in any version of numpy. If we did make copy remove padding, that would break backcompat for cases like the latter.

Furthermore, I really think we should stick to the principle that a copy of an array behaves exactly like the array itself. Consider the following:

def mysave(filename, arr):
    if arr.itemsize <= 16:
        raise Exception("not supported by my file format")
    my_save_impl(filename, arr)

a = np.array([1], dtype={'names': ['f0'], 'formats': ['i4'], 'offsets': [16]})

mysave('fname', a)         # works (itemsize is 20)
mysave('fname', a.copy())  # wouldn't work if copy removed padding

x_i = np.array([(1, 2, 3)] * 10, dtype=[(n, 'f8') for n in ['a', 'b', 'c']])
b = x_i[['a', 'c']]

mysave('fname', b)         # works in 1.16 (itemsize is 24)
mysave('fname', b.copy())  # wouldn't work if copy removed padding

I think it would be very strange if a function failed on a copy of an array when it worked on the array itself. If we remove padding, this example shows that would happen.

@ahaldane
Member

Following this logic why x_i[2:3].copy() does not copy the whole array x_i?

Let's unpack this line. It does two things at once: indexing, and then a copy operation. It is equivalent to:

b = x_i[2:3]
b_copy = b.copy()

So you are not making a copy of x_i, you are making a copy of b which is a view of x_i. That is why it does not copy all of x_i.

@mhvk
Contributor

mhvk commented Apr 11, 2019

Point taken about np.save... I think it is not very logical either, but let's keep that as a separate discussion.

Still, your example about a routine that might break by changing the behaviour of copy does seem a bit contrived... (I see itemsize as something like strides: there is little guarantee for it to be a given number, at least for a dtype that has variable size to begin with, like V or S.)

Also, I think the comparison that @dmbelov should have been looking at is something like x_i[::10].copy(), in which case the 9 skipped elements are not copied, nor is any space kept for them. So a routine similar to yours (though admittedly even more contrived) would break similarly...
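In numbers, the two kinds of copy compare like this (a sketch; the padded itemsize is what NumPy 1.16 reports for the multi-field view):

```python
import numpy as np

x_i = np.array([(1, 2, 3)] * 10, dtype=[(n, 'f8') for n in ['a', 'b', 'c']])

# Strided slicing: the copy keeps no room for the 9 skipped elements.
print(x_i[::10].copy().nbytes)          # 24: a single 3-field element

# Multi-field indexing: the view keeps the 8 padding bytes where 'b' was,
# so in 1.16 copying 2 of the 3 fields still costs 24 bytes per element.
print(x_i[['a', 'c']].dtype.itemsize)   # 24, not 16
```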

I do think the behaviour @dmbelov expected is something logical to support. A common use case would be opening a very large file, maybe in memory-mapped mode, and extracting just the relevant fields, copying them to make access faster and/or to make sure one can write to the result without destroying the original. (For this example, the fact that repack_fields does not copy if things are already packed is actually not good!) In contrast, I can see very few cases where it is useful that the padding bytes are not removed.

Anyway, perhaps this is something to bring up on the mailing list? I think the question of what the contract is that one gives the user is one that could use some broader input.

@ahaldane
Member

ahaldane commented Apr 11, 2019

That's a good point: the strides aren't preserved by copy, so that is a demonstration that copy does sometimes change the array. On the other hand, itemsize is really a property of the dtype and not of the array, unlike strides, and copy does currently preserve the dtype. A mailing list discussion sounds fine to me.

I also agree your memmap'd file example is something that would be nice to support. One option is to add a copy kwarg to repack_fields, but that's too late for 1.16. With the current 1.16 code I would do:

import numpy.lib.recfunctions as rf
mm = np.memmap(filename, dtype='i4,i4,i4', shape=(3, 4))[['f0', 'f1']]
val = np.require(rf.repack_fields(mm), requirements='WRITEABLE')

Also, there is another issue/PR where we discuss np.save not saving the padding bytes. I'd be all for that!

@dmbelov
Contributor Author

dmbelov commented Apr 11, 2019

comment on @ahaldane:
A lot of companies use NumPy, and there is a lot of free code that uses NumPy. The amount of that code is much, much bigger than NumPy itself. Every time you make an unreasonable and unexpected change in NumPy that is not backward compatible, you cause a lot of trouble for many, many people. Please always remember this and act responsibly.

@ahaldane
Member

ahaldane commented Apr 11, 2019

@dmbelov You can read the discussion of the change here: link. We spent a lot of time and energy discussing the pros and cons of this change, and backcompat, there and in other places.

In fact, I see we discussed this very case involving copy!

@dmbelov
Contributor Author

dmbelov commented Apr 11, 2019

I understand the reasons why we do not want to change behavior of copy. I have an alternative proposal:

  1. We keep the meaning of the list slice as it was before --- namely, it returns a copy (the way it was originally introduced). (We do not want to break other people's code => changes need to be backward compatible.)
  2. To get the view we add support for slice by list to dtype object, so to get view one need to write
    x_i.view(x_i.dtype[['a', 'c']]).
  3. We do not change the behavior of copy.

What do you think?

@mhvk
Contributor

mhvk commented Apr 11, 2019

@dmbelov - I think having a view is really preferable in most situations, and much more consistent with how we do slicing - it is unlikely we will go back to a copy! But as I hope is clear, you have brought up a good point about how copy should behave.

There is a variant on your proposal that I wondered about: x[['a', 'c']].astype(<something without gaps>) - but I worry people would simply not find it easily enough (an advantage, though, is that astype already has a copy kwarg).
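For reference, the astype variant with the gap-free target dtype spelled out explicitly (a sketch using the example array from earlier in the thread):

```python
import numpy as np

x_i = np.array([(1, 2, 3)] * 10, dtype=[(n, 'f8') for n in ['a', 'b', 'c']])

# Cast the padded multi-field view to an explicitly packed dtype;
# astype copies the field data into a compact layout.
packed = x_i[['a', 'c']].astype([('a', 'f8'), ('c', 'f8')])

print(packed.dtype.itemsize)            # 16: no padding left
print(packed['a'][0], packed['c'][0])   # 1.0 3.0
```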

Anyway, I think the mailing list is a good idea here; I'll try to post tomorrow.

@ahaldane
Member

Your proposal is connected to #10417, and seems a quite elegant way to avoid the copy issue.

A real issue with that change though, besides that many people prefer views by default, is not the technical side but the ambiguous backcompat situation here. As noted in the linked email, the view change was planned since numpy 1.7 with FutureWarnings since then, so arguably the copy behavior you depend on was explicitly unsupported. (Interestingly, I think by doing an extra copy your code may have avoided the FutureWarning, though I'm not sure if your copy was in preexisting code or if you introduced it now to fix a field packing issue). In addition to all the issues brought up previously, we now have a new backcompat issue, that this change has already been publicly released in numpy 1.16.

In any case it's worth discussing strategies like modifying copy. Also, I'm not at all opposed to changing the behavior again if it turns out to be the right thing to do for backcompat.

@dmbelov
Contributor Author

dmbelov commented Apr 22, 2019

This buggy behavior of list slicing is inconsistent with the assumptions made in recfunctions.rename_fields.

Example:

import numpy as np
x_i = np.array([(1, 2, 3)] * 10, dtype=[(n, 'f8') for n in ['a', 'b', 'c']])

np.lib.recfunctions.rename_fields(x_i[['a', 'c']], {'a': 'A', 'c': 'C'})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-682d1739d414> in <module>()
----> 1 np.lib.recfunctions.rename_fields(x_i[['a', 'c']], {'a': 'A', 'c': 'C'})

/usr/bin/anaconda/envs/numpy16/lib/python2.7/site-packages/numpy/lib/recfunctions.pyc in rename_fields(base, namemapper)
    666         return newdtype
    667     newdtype = _recursive_rename_fields(base.dtype, namemapper)
--> 668     return base.view(newdtype)
    669 
    670 

ValueError: When changing to a smaller dtype, its size must be a divisor of the size of original dtype

It is very bad that code in a numpy release is inconsistent. Do you think it would be a good idea to revert the list-view change until all inconsistencies and errors are fixed?

@ahaldane
Member

That is indeed a bug, thank you for finding it. A few observations: rename_fields fails for any dtype with padding bytes, in all versions of numpy. The fact that padding bytes now appear from multi-field indexing is triggering this old bug. Evidently our test suite did not cover this.

I'm still not convinced we should revert the copy->view changes, but that is always on the table. The example you found seems a little unusual/contrived, is an old bug that is now somewhat more likely to turn up, and I think we can still fix it for numpy 1.16.4.
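Until such a fix lands, a workaround in the spirit of the earlier advice is to repack before renaming (a sketch):

```python
import numpy as np
import numpy.lib.recfunctions as rf

x_i = np.array([(1, 2, 3)] * 10, dtype=[(n, 'f8') for n in ['a', 'b', 'c']])

# rename_fields calls base.view(newdtype) internally, which fails on a
# padded dtype; repacking first gives it a dtype without padding bytes.
renamed = rf.rename_fields(rf.repack_fields(x_i[['a', 'c']]),
                           {'a': 'A', 'c': 'C'})
print(renamed.dtype.names)  # ('A', 'C')
```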

@seberg
Member

seberg commented Jan 30, 2024

Didn't bother with the whole history here, but I changed copies: structured copies do return packed structs now.

@seberg seberg closed this as completed Jan 30, 2024
@seberg
Member

seberg commented Jan 30, 2024

Sorry, nvm: type promotion does, but copy doesn't use that to normalize the dtype...

However, the above bug looks like a duplicate anyway
