Skip to content

ENH: properly account for trailing padding in PEP3118 #7798

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

ahaldane
Copy link
Member

@ahaldane ahaldane commented Jul 2, 2016

This changes how numpy treats trailing padding in the PEP3118 buffer interface, to fix #7797.

Previously, dtypes with trailing padding would have their trailing padding truncated when converted to memoryview using pep3118. Now trailing padding is maintained.

Then, I also needed to modify the pep3118-to-array conversion to stop explicitly adding void fields for the trailing padding. This way code of the form np.array(memoryview(arr)) will give back the original array even if there is trailing padding.

One thing I still need to think about is what to do for "aligned" types. The pep3118 doc is not clear what "alignment" means. It appears that ix and ixxx are supposed to mean the same thing: a structure of 8 bytes containing an integer in the first 4 bytes and 4 bytes of padding, but I am not 100% sure that is right.

self._check('ixxxx', [('f0', 'i'), ('', VV(4))])
self._check('i7x', [('f0', 'i'), ('', VV(7))])
# XXX not sure if this aligment is desired behavior
self._check('ix', dtype_itemsize(8))
Copy link
Member

@pv pv Jul 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alignment in the native mode is supposed to follow that of the C compiler, so the test should check the results are in agreement with dtype.alignment (which is determined by the compiler on the platform)

@ahaldane ahaldane force-pushed the pep3118_trailing_pad branch 3 times, most recently from 842dd34 to e540543 Compare July 2, 2016 22:10
@ahaldane
Copy link
Member Author

ahaldane commented Jul 3, 2016

Thanks @pv, I made some small corrections.

I need to read and think about the proper behavior for the unstructured types and aligned structured types.

It seems to me the PEP3118 format supports arrangements that can't really be done in numpy. For example, while numpy does support trailing padding in structured types like T{i:f0:xx}, there is no natural way in numpy to include trailing padding without a structure, like in ixx. Therefore it may be reasonable to raise an error in the latter case, when trying to convert memoryviews to arrays.

Also, I was looking at how the struct module treats padding bytes. While it adds internal padding bytes to maintain alignment, it doesn't add trailing padding bytes the way numpy does (in this PR or before):

# b gets 4 bytes here, like in dtype('i,u1,i', align=True)
>>> struct.pack('ibi', 1, 2, 3)  
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'

# trailing x only adds 1 byte, not 4 as numpy would
>>> struct.pack('ibix', 1, 2, 3)  
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x00'

# in numpy (and usually C structs), 4 more trailing pad bytes would be added here
>>> struct.pack('qi', 1, 2) 
b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00'

The pep3118 docs say the struct module is supposed to support the new format strings, but it doesn't appear to yet. So maybe the discrepancy between struct and numpy is struct's problem, not numpy's.

@ahaldane
Copy link
Member Author

ahaldane commented Jul 3, 2016

Ah, see on the python bug tracker

struct.pack(): trailing padding bytes on x64 last changed 2016-05-10
implement PEP 3118 struct changes last changed 2016-04-13

@ahaldane
Copy link
Member Author

ahaldane commented Jul 4, 2016

After reading, I think numpy's current interpretation of trailing padding in pep3118 format strings is wrong. The struct module is pretty clear about how format strings treat trailing padding:

Padding is only automatically added between successive structure members. No padding is added at the beginning or the end of the encoded struct.

No padding is added when using non-native size and alignment, e.g. with ‘<’, ‘>’, ‘=’, and ‘!’.

To align the end of a structure to the alignment requirement of a particular type, end the format with the code for that type with a repeat count of zero.

So in a format like 'ix', even though the type is implicitly @ (aligned) we should only add 1 trailing padding byte. If we wanted numpy-style padding bytes the format string should instead be something like 'ix0i'. Unfortunately this means that even though ix is in "aligned" mode, the resulting numpy dtype does not count as aligned.

I've updated the code accordingly. There is still more to do.

@charris
Copy link
Member

charris commented Jul 4, 2016

Might take a look at #7467 while you are working on this.

@ahaldane ahaldane changed the title WIP: ENH: properly account for tailing padding in PEP3118 WIP: ENH: properly account for trailing padding in PEP3118 Jul 6, 2016
@ahaldane
Copy link
Member Author

ahaldane commented Jul 6, 2016

All right, the story gets more complicated.

While the struct module does not add C-style trailing bytes to unstructured aligned data, the memoryview function does within T{}:

>>> arr = np.zeros(1, dtype=np.dtype([('x', '<f8'), ('y', '<b')], align=True))
>>> m1 = memoryview(arr)
>>> m1.format, m1.itemsize
('T{d:x:b:y:}', 16)

#  ctypes has the same behavior
>>> class POINT(ctypes.Structure):
...     _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_byte)]
...
>>> m2 = memoryview((POINT)())
>>> m2.format, m2.itemsize
('T{<d:x:<b:y:}', 16)

>>> len(struct.pack('db', 0.0, 0))
9

I don't know how to construct a memoryview of format ix so I can't test that case. But it appears that trailing padding is added within T{} (which does not currently exist in the struct module) while struct does not add it outside of T{}.

@aldanor
Copy link

aldanor commented Oct 31, 2016

@ahaldane Just wondering, are there any plans in regards to moving this forward?

@ahaldane
Copy link
Member Author

Yeah I sort of forgot about this because it was so messy.

As I recall I think the current state of this PR is a pretty good approximation of how things "should work", given the current somewhat incomplete state of pep3118 and memoryviews in python.

I'll revisit it sometime soon, comments on it currently are welcome.

@eric-wieser
Copy link
Member

Some overlap here with #9054

@homu
Copy link
Contributor

homu commented May 9, 2017

☔ The latest upstream changes (presumably #9054) made this pull request unmergeable. Please resolve the merge conflicts.

@ahaldane ahaldane force-pushed the pep3118_trailing_pad branch 2 times, most recently from 9ce1070 to 8f72316 Compare May 9, 2017 19:11
@ahaldane
Copy link
Member Author

ahaldane commented May 9, 2017

Rebased.

As a reminder, this PR fixes the PEP3118 trailing padding, which was not accounted for at all before when converting an array->memoryview. Importantly, I now interpret ix format as 5 bytes, and ix0i as 8 (on x64), as the struct modele documents should be the case. This means I had to tweak the memoryview->array code since it did something else before. This should have no effect on any real code since such formats are never created in practice.

@ahaldane
Copy link
Member Author

ahaldane commented May 9, 2017

Here are two more things we might want to think more about:

  1. Currently, although the format string numpy produces from an array works properly, is perhaps a bit ugly. Consider:
>>> memoryview(np.zeros(3, dtype=np.dtype('u1,i4,u1', align=True))).format
'T{B:f0:3xi:f1:B:f2:3x}'

Maybe it should be T{B:f0:0ii:f1:B:f2:0i} or T{B:f0:i:f1:B:f2:0i}.

  1. While fixing this up I realized there are dtypes which we will never be able to represent with the PEP3118 sytnax: Those with overlapping fields:
>>> dt = dtype({'names': ['a', 'b'], 'formats': ['i4','i4'], 'offsets': [0,1]})
>>> memoryview(np.zeros(3, dtype=d))
RuntimeError: This should never happen: Invalid offset in buffer format string g
eneration. Please report a bug to the Numpy developers.

@ahaldane ahaldane force-pushed the pep3118_trailing_pad branch from 8f72316 to 20f5478 Compare May 9, 2017 19:16
"buffer format string generation. Please "
"report a bug to the Numpy developers.");
"The buffer interface does not support "
"overlapping fields");
Copy link
Member

@eric-wieser eric-wieser May 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this path also trigger for fields that are not in sequence? (like the type returned by #6053)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I should make a better message

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we going to do about #6053 here? Seems like it's going to make casting to memoryview a lot harder...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, you're right. The PEP3118 format cannot account for either overlapping fields or out-of-order fields (when order matters). That's something to think about...

Copy link
Member

@eric-wieser eric-wieser May 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by PEP3118, as it:

  • Seems to be designed to standardize some numpy behaviour for interoperability
  • Causes discussions in the python bugtracker that refer to numpy as the canonical implementation
  • (yet) Is not actually sufficient to represent either all C types or all numpy types

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought (weeks later),

A more general way of describing a structured type is as a list of (name, type, offset) tuples. This is how, for example, MPI serializes structs (see here). It is also how numpy already stores structured types internally.

Since a purpose of PEP3118 is serialization, maybe it would have made sense to specify structured types this way. Eg we could have had something like 'T24{i,0,f0:f,8,f1} to represent dtype({'formats': ['i', 'f'], 'offsets': [0, 8], 'names': ['f0', 'f1'], 'itemsize': 24}).

Since numpy is the main user of the PEP3118 interface, and because the PEP seems to be only partly implemented in cpython currently, there is probably still opportunity for us to work out the details or add to the specification, with some effort.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference, one more datapoint: The HDF5 datatype specification also stores "compound datatypes" as an ordered list of (name, format, offset) tuples. So that is now 3 cases where structured datatypes are stored that way: numpy dtypes, MPI serialization, and HDF5 files.

I want to record these examples here in case in the future we want to argue to the CPython devs that we should add a new PEP3118 specification style for structured types, that is equivalent to a list of tuples.

@@ -5918,6 +5921,10 @@ def aligned(n):
itemsize=aligned(size + 1)
), (3,)))

expected_dtype = {'names': ['f0'], 'formats': ['i'],
'itemsize': np.dtype('i,V1', align=True).itemsize}
self._check('(3)T{ix}', (expected_dtype, (3,)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this an exact duplicate of the above test? What am I missing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sorry, I fudged the rebase here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although perhaps using np.dtype(..., align=True) is an improvement over aligned(size + 1)

@eric-wieser
Copy link
Member

What is the itemsize of T{ix} with this patch?

self._check('ixx', dict(itemsize=aligned(size + 2), **base))
self._check('ixxx', dict(itemsize=aligned(size + 3), **base))
self._check('ixxxx', dict(itemsize=aligned(size + 4), **base))
self._check('i7x', dict(itemsize=aligned(size + 7), **base))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these tests might remain valid with T{...}? In particular, I would expect trailing padding to be kept in a struct context, to match the behaviour of sizeof(T) in C, and what happens when structs are repeated

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PEP3118 spec is unclear about this.

One could argue the struct module has set no precedent in judging how alignment and padding work here since it doesn't implement the T{} format. We might then feel free to set the precedent here in this PR, deciding that aligned formats add trailing padding only inside T{}.

Copy link
Member

@eric-wieser eric-wieser Feb 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whereas the ctypes module has set the precedent of ignoring all the remarks that the struct docs make about alignment...

@charris
Copy link
Member

charris commented Dec 16, 2020

@seberg @eric-wieser Is this still of interest?

@charris
Copy link
Member

charris commented Jan 25, 2021

Needs rebase. I'm inclined to close this unless it is still relevant. @seberg @eric-wieser thoughts?

@seberg
Copy link
Member

seberg commented Jan 25, 2021

It is still relevant, there was no change in behaviour with respect to the issue.

I am not sure whether the trailing bytes themselves should or should not be relevant. The main interest would be with respect to getting alignment right in the buffer protocol (this PR does that as far as I can tell). But the mismatch between buffer protocol itemsize and the length that is set, does currently prevent roundtripping array -> buffer -> array in certain cases, so I guess there is something wrong.

(I am not sure how easy it is to rebase this PR, it might be annoying, but I expect all the pieces here are still relevant just as much)

Comment on lines 279 to 294
PyExc_ValueError,
"dtypes with overlapping or out-of-order fields are not "
"representable as buffers. Consider reordering the fields."
);
PyExc_ValueError, "The buffer interface does not support "
"overlapping fields or out-of-order "
"fields");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change of message deliberate, or just a consequence of the merge?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deliberate, based on your review above.

I just did a little bit more than a simple rebase. I also added support for 0-sized unnamed padding, see the release note.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see what you mean, the previous message seems good enough.

@ahaldane ahaldane force-pushed the pep3118_trailing_pad branch from 11bfeca to d7a697a Compare February 21, 2021 16:25
@ahaldane
Copy link
Member Author

ahaldane commented Feb 21, 2021

  • rebased
  • tidied up based on review comments from last time
  • added release note
  • NEW: no automatic assumption of trailing align-padding if the format is not inside T{}, as the struct doc says to do.
  • NEW: added support for 0-sized unnamed fields (eg to add trailing padding), as struct supports.

See: struct doc in particular "Note" 3 and the last example.

@ahaldane ahaldane force-pushed the pep3118_trailing_pad branch 4 times, most recently from ec97206 to 101cec8 Compare February 21, 2021 18:10
@ahaldane ahaldane force-pushed the pep3118_trailing_pad branch from 101cec8 to dfc8ec7 Compare February 21, 2021 18:19
@ahaldane ahaldane changed the title WIP: ENH: properly account for trailing padding in PEP3118 ENH: properly account for trailing padding in PEP3118 Feb 22, 2021
Base automatically changed from master to main March 4, 2021 02:03
@aldanor
Copy link

aldanor commented Jan 12, 2022

@ahaldane It's been quite a while, just wondering what's the status on this?

@ahaldane
Copy link
Member Author

As far as I remember, this is good-to-go, after review.

I can rebase and re-review it myself in the next few days. As I recall the challenge was in figuring out a consistent way to interpret the various python/numpy docs.

@InessaPawson
Copy link
Member

@ahaldane Please let us know if you need help with moving this PR forward.

@mattip
Copy link
Member

mattip commented Apr 17, 2024

We discussed this at a recent triage meeting and decided to close it due to a lack of interest in moving it forward. If someone wants to pick it up, we would review the work. Thanks @ahaldane for looking at this.

@mattip mattip closed this Apr 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Array from memoryview fails if there's trailing padding
9 participants