Skip to content

BUG: Fix np.load for aligned dtypes. #10931

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

ihormelnyk
Copy link

@ihormelnyk ihormelnyk commented Apr 19, 2018

import numpy as np

t = np.dtype('i1, i4, i1', align=True)
d = np.zeros(1, t)

np.save("test.npy", d)
data = np.load("test.npy")

Traceback (most recent call last):
  File "D:\Projects\numpy_bug.py", line 8, in <module>
    data = np.load("test.npy")
  File "D:\Projects\vendors\numpy\lib\npyio.py", line 314, in load
    return format.read_array(fid)
  File "D:\Projects\vendors\numpy\lib\format.py", line 440, in read_array
    shape, fortran_order, dtype = read_array_header_1_0(fp)
  File "D:\Projects\vendors\numpy\lib\format.py", line 358, in read_array_header_1_0
    dtype = numpy.dtype(d['descr'])
ValueError: two fields with the same name

Fixes gh-2215

old_header = header
# replace paddings like ('', '|V4')
header = re.sub(r"\s*,*\s*\(\s*''\s*,\s*'\|*V\d'\s*\)", '', header)
aligned = header != old_header
Copy link
Member

@ahaldane ahaldane Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this test assumes that the pressence of padding-fields implies an aligned dtype. but actually there are many ways to create a dtype which has padding fields yet which is not "aligned".

For instance, try saving and loading the array

>>> a = np.zeros(3, dtype={'names': ['a', 'b'], 'formats': ['i4', 'i4'], 'offsets': [1, 7]})
>>> np.save('test', a)
>>> np.load('test.npy')

Also, I think we should avoid regular expression processing if possible, I am worried it is too fragile. It would be better to turn the string into a list of tuples first using save_eval, and then do some computations on the list of tuples to get the right offsets, names, dtypes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to save npy where fields in dtype have empty names?
It will be more complex code to rebuild list of tuples

Copy link
Member

@ahaldane ahaldane Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code already there rebuilds the list of tuples right now, using safe_eval: You can access it in d['descr'], as is done a few lines below in the file.

What I am suggesting is adding some code like:

def descr_to_dtype(descr):
    if isisinstance(descr, str):
        return np.dtype(descr)  # Question: Is this guaranteed to work?
    
    fields = []
    offset = 0
    for name, descr_i in d['descr']:
        dt = process_descr(descr_i)
        if name == '':  # skip padding
            assert(dt.type is np.void and dt.names is None)
        else:
            fields.append((name, dt, offset))
        offset += dt.itemsize
    
    names, formats, offsets = zip(*fields)
    return np.dtype('names': names, 'formats': formats, 
                    'offsets': offsets, 'itemsize': offset)

dtype = descr_to_dtype(d['descr'])

(completely untested, just to give an idea)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it technically is possible to create a dtype with empty field, but only using the dict specification:

np.dtype({'names': [''], 'formats': 'i4'})

(see this comment)

I think that if we want to fix up the save/load code here, we may want to disallow that possibility, to avoid confusion when such an array can't be saved (after this PR).

@ahaldane
Copy link
Member

Also, you may not have seen it, but there is an ongoing discussion starting here (link) about exactly how to fix this bug.

In this PR you are going with what I called option 2 in a comment there. I think that is probably the easiest, and possibly best, thing to do. That is, we should continue storing the dtype using the .descr representation of the dtype, which is a list of tuples, but we need to modify how this list is read back in to remove the padding names and compute the right offsets, like I suggested in my inline comment to your code.

@mhvk also suggested that we "pack" the fields before saving to file, but that is an extra detail that we can do in a separate PR, if desired.

@charris charris changed the title Fixes #2215 Loading numpy with aligned dtype BUG: Fix Loading numpy with aligned dtype May 5, 2018
@charris charris changed the title BUG: Fix Loading numpy with aligned dtype BUG: Fix np.load for aligned dtypes. May 5, 2018
@mattip
Copy link
Member

mattip commented Jul 7, 2018

@ihormelnyk do you think you will continue with this PR? It seems we need to fix this in order to move on with some other cleanups

@ihormelnyk
Copy link
Author

Hi @mattip ,
What do you mean by "continue"?

@mattip
Copy link
Member

mattip commented Jul 7, 2018

there were suggestions for fixes to this PR that you have not yet responded to

@ihormelnyk
Copy link
Author

I think, I'll be able to look into it on Monday

@ihormelnyk
Copy link
Author

Hi @mattip , @ahaldane
Sorry, but I have no enoght time to be deeply involved in numpy development and discussion.
Here is a patch which we have been using since 2010 till now to load numpy with aligned dtype:

 numpy/lib/format.py | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/numpy/lib/format.py b/numpy/lib/format.py
index ef5ec57e3..1e4915d11 100644
--- a/numpy/lib/format.py
+++ b/numpy/lib/format.py
@@ -343,6 +343,27 @@ def _write_array_header(fp, d, version=None):
     fp.write(header)
     return version
 
+
+def get_descr_without_paddings(descr):
+    aligned = False
+    if not isinstance(descr, list):
+        return descr, aligned
+    newdescr = []
+    for i in xrange(len(descr)):
+        if descr[i][0]=='' and (descr[i][1].startswith('|V') or descr[i][1].startswith('V')) and isinstance(descr[i][1], str):
+            aligned = True #padding
+        else:
+            if isinstance(descr[i][1], list):
+                d, a = get_descr_without_paddings(descr[i][1])
+                aligned =  aligned or a
+                l = list(descr[i])
+                l[1] = d
+                newdescr.append(tuple(l))
+            else:
+                newdescr.append(descr[i])
+    return newdescr, aligned
+
+
 def write_array_header_1_0(fp, d):
     """ Write the header for an array using the 1.0 format.
 
@@ -524,7 +545,8 @@ def _read_array_header(fp, version):
         msg = "fortran_order is not a valid bool: %r"
         raise ValueError(msg % (d['fortran_order'],))
     try:
-        dtype = numpy.dtype(d['descr'])
+        descr, aligned = get_descr_without_paddings(d['descr'])
+        dtype = numpy.dtype(descr, align=aligned)
     except TypeError as e:
         msg = "descr is not a valid dtype descriptor: %r"
         raise ValueError(msg % (d['descr'],))


Aligned dtype is very important for our computation framework, due to we use it to pass large data structs to python C extension (to avoid time consuming copying) and further calculation on the GPU.
Also there is code generation phase, which perepare python and c++ interfaces for computation models, so numpy structs have to be C compatible.
Looks like this is last fix related to aligned dtype after which we will be able to use numpy for our clients without patching.

P.S. But again we can't upgrade version of numpy for clients higher than 1.13.3 due to "Multiple-field indexing/assignment of structured arrays" changes in 1.14.
#6053 (comment)

@mattip
Copy link
Member

mattip commented Nov 9, 2018

@ihormelnyk I opened a new PR #12358 to replace this one. It is not quite this fix since it does not assume the dtype should be aligned, that information is simply lost in the storage format. I would like to close this one in favor of continuing the conversation there.

@mattip
Copy link
Member

mattip commented Nov 13, 2018

Closing in favor of #12358

@mattip mattip closed this Nov 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants