-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
BUG: test, fix loading structured dtypes with padding #12358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e2cb037
to
deb2396
Compare
also closes #8100 |
now will round trip fields with |
numpy/lib/format.py
Outdated
tup = ('%s%d' % (unique_name, itr),) + tup[1:] | ||
if isinstance(tup[1], list): | ||
# record of records, recurse | ||
if len(tup) > 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this ever be 4, or is (name, type, shape)
the longest possible tuple? What about titles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test, fixed for titles
I looked at doing this a while ago, but I got sidetracked by the issue that this strategy will "lose" the >>> arr = np.array([(1, 2), (3, 4)], dtype=np.dtype([('a','u2'), ('b','u4')], align=True))
>>> np.save('test', arr)
>>> np.load('test.npy').dtype
dtype({'names':['a','b'], 'formats':['<u2','<u4'], 'offsets':[0,4], 'itemsize':8})
>>> arr.dtype
dtype({'names':['a','b'], 'formats':['<u2','<u4'], 'offsets':[0,4], 'itemsize':8}, align=True) Then I started working on trying to avoid that, by making the aligned flag be recomputed on the fly when accessed (there is no fundamental reason it needs to be stored as a flag - we can always re-compute if something is aligned). That led to snowballing problems I never finished. FWIW my implementation of def descr_to_dtype(descr):
if isinstance(descr, str):
# descr was produced by dtype.str, so this always works
return numpy.dtype(descr)
fields = []
offset = 0
for field in descr:
if len(field) == 2:
name, descr_str = field
dt = descr_to_dtype(descr_str)
else:
name, descr_str, shape = field
dt = numpy.dtype((descr_to_dtype(descr_str), shape))
# ignore padding bytes, which will be void bytes with '' as name
# (once blank fieldnames are deprecated, only "if name == ''" needed)
is_pad = (name == '' and dt.type is numpy.void and dt.names is None)
if not is_pad:
fields.append((name, dt, offset))
offset += dt.itemsize
names, formats, offsets = zip(*fields)
return numpy.dtype({'names': names, 'formats': formats,
'offsets': offsets, 'itemsize': offset}) I would also recommend that we simultaneously deprecate blank fieldnames in this PR, because the code here will be unable to deal with them. I did it by adding this to
|
@ahaldane your function does seem to be simpler. It does handle empty names as long as they are not a void type, but I have added the deprecation since it is confusing to have blank fields on purpose. As for |
2f7fcf1
to
893dd2b
Compare
rebased and squashed to two commits since there were merge conflicts, so I guess that also implies restarting review |
I'd rather leave the deprecation to another PR, just so that it can actually be discussed in isolation. |
893dd2b
to
bbe1522
Compare
bbe1522
to
1956ada
Compare
Besides style comment above, LGTM. This was less painful that I thought. Thanks @mattip! |
Thanks Matti. |
I'm wondering why I.e., looks like #7797 is still broken? (it would need a very similar fix) |
|
Ah indeed, my bad - apologies. Fixing #7797 would require fixing/rewriting |
Replaces PR #10931 which it draws heavily from. Fixes #2215.
The original PR has a discussion of why this is not the best fix, it assumes a field
('', |Vn)
- wheren
is the void size - is a padding field when someone could theoretically be using such a field for data. It also will force a possibly aligned dtype to now be not aligned, but unfortunately that information is not preserved in the storage format.I think this is the least evil of the possibilities, although we could consider saving the padding fields with a unique name. It would have to be better than the existing choice:
f1
,f2
, ... which in the original issue clashed with names commonly used by users