Skip to content

BUG: test, fix loading structured dtypes with padding #12358

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 14, 2018

Conversation

mattip
Copy link
Member

@mattip mattip commented Nov 9, 2018

Replaces PR #10931 which it draws heavily from. Fixes #2215.

The original PR has a discussion of why this is not the best fix, it assumes a field ('', |Vn) - where n is the void size - is a padding field when someone could theoretically be using such a field for data. It also will force a possibly aligned dtype to now be not aligned, but unfortunately that information is not preserved in the storage format.

I think this is the least evil of the possibilities, although we could consider saving the padding fields with a unique name. It would have to be better than the existing choice: f1, f2, ... which in the original issue clashed with names commonly used by users

@mattip
Copy link
Member Author

mattip commented Nov 10, 2018

also closes #8100

@mattip
Copy link
Member Author

mattip commented Nov 10, 2018

now will round trip fields with name == '' if the field is not a void type

tup = ('%s%d' % (unique_name, itr),) + tup[1:]
if isinstance(tup[1], list):
# record of records, recurse
if len(tup) > 2:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this ever be 4, or is (name, type, shape) the longest possible tuple? What about titles?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test, fixed for titles

@ahaldane
Copy link
Member

I looked at doing this a while ago, but I got sidetracked by the issue that this strategy will "lose" the aligned flag:

>>> arr = np.array([(1, 2), (3, 4)], dtype=np.dtype([('a','u2'), ('b','u4')], align=True))
>>> np.save('test', arr)
>>> np.load('test.npy').dtype
dtype({'names':['a','b'], 'formats':['<u2','<u4'], 'offsets':[0,4], 'itemsize':8})
>>> arr.dtype
dtype({'names':['a','b'], 'formats':['<u2','<u4'], 'offsets':[0,4], 'itemsize':8}, align=True)

Then I started working on trying to avoid that, by making the aligned flag be recomputed on the fly when accessed (there is no fundamental reason it needs to be stored as a flag - we can always re-compute if something is aligned). That led to snowballing problems I never finished.

FWIW my implementation of descr_without_padding (which I pasted in another PR) was very similar to yours. Here it is:

def descr_to_dtype(descr):
    if isinstance(descr, str):
        # descr was produced by dtype.str, so this always works
        return numpy.dtype(descr)

    fields = []
    offset = 0
    for field in descr:
        if len(field) == 2:
            name, descr_str = field
            dt = descr_to_dtype(descr_str)
        else:
            name, descr_str, shape = field
            dt = numpy.dtype((descr_to_dtype(descr_str), shape))

        # ignore padding bytes, which will be void bytes with '' as name
        # (once blank fieldnames are deprecated, only "if name == ''" needed)
        is_pad = (name == '' and dt.type is numpy.void and dt.names is None)
        if not is_pad:
            fields.append((name, dt, offset))

        offset += dt.itemsize

    names, formats, offsets = zip(*fields)
    return numpy.dtype({'names': names, 'formats': formats,
                        'offsets': offsets, 'itemsize': offset})

I would also recommend that we simultaneously deprecate blank fieldnames in this PR, because the code here will be unable to deal with them. I did it by adding this to _convert_from_dict around line 1169:

        if (PyObject_IsTrue(name) == 0) {
            if (DEPRECATE("Blank field names are deprecated and will result in "
                          "an error in the future.") < 0) {
                Py_DECREF(tup);
                goto fail;
            }
        }

@mattip
Copy link
Member Author

mattip commented Nov 12, 2018

@ahaldane your function does seem to be simpler. It does handle empty names as long as they are not a void type, but I have added the deprecation since it is confusing to have blank fields on purpose.

As for alignement=True I think we should punt on that for now, as we have never been able to round-trip the attribute so we are not breaking backward compatibility here. Calculating it on the fly would force non-aligned dtypes that happen to specify an aligned dtype using offsets to be aligned after round-tripping. I think it must be part of the storage protocol. One solution would be to use str(dtype) rather than dtype.descr but that should be a separate PR.

@mattip mattip force-pushed the roundtrip-record-arrays branch from 2f7fcf1 to 893dd2b Compare November 13, 2018 00:01
@mattip
Copy link
Member Author

mattip commented Nov 13, 2018

rebased and squashed to two commits since there were merge conflicts, so I guess that also implies restarting review

@eric-wieser
Copy link
Member

I'd rather leave the deprecation to another PR, just so that it can actually be discussed in isolation.

@mattip mattip force-pushed the roundtrip-record-arrays branch from bbe1522 to 1956ada Compare November 13, 2018 14:38
@ahaldane
Copy link
Member

Besides style comment above, LGTM. This was less painful that I thought. Thanks @mattip!

@charris charris merged commit 13a69b5 into numpy:master Nov 14, 2018
@charris
Copy link
Member

charris commented Nov 14, 2018

Thanks Matti.

@aldanor
Copy link

aldanor commented Jan 2, 2019

I'm wondering why np.load() was special-cased here? Isn't it a more general problem?

I.e., looks like #7797 is still broken? (it would need a very similar fix)

@mattip mattip deleted the roundtrip-record-arrays branch January 2, 2019 13:55
@mattip
Copy link
Member Author

mattip commented Jan 2, 2019

np.save does not use PEP3118 nomenclature for the dtype format.

@aldanor
Copy link

aldanor commented Jan 2, 2019

Ah indeed, my bad - apologies.

Fixing #7797 would require fixing/rewriting _dtype_from_pep3118() which I'm not sure anyone wants to touch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Loading aligned dtype error (Trac #1619)
5 participants