-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: properly account for trailing padding in PEP3118 #7798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
numpy/core/tests/test_multiarray.py
Outdated
self._check('ixxxx', [('f0', 'i'), ('', VV(4))]) | ||
self._check('i7x', [('f0', 'i'), ('', VV(7))]) | ||
# XXX not sure if this aligment is desired behavior | ||
self._check('ix', dtype_itemsize(8)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alignment in the native mode is supposed to follow that of the C compiler, so the test should check the results are in agreement with dtype.alignment
(which is determined by the compiler on the platform)
842dd34
to
e540543
Compare
Thanks @pv, I made some small corrections. I need to read and think about the proper behavior for the unstructured types and aligned structured types. It seems to me the PEP3118 format supports arrangements that can't really be done in numpy. For example, while numpy does support trailing padding in structured types like Also, I was looking at how the # b gets 4 bytes here, like in dtype('i,u1,i', align=True)
>>> struct.pack('ibi', 1, 2, 3)
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'
# trailing x only adds 1 byte, not 4 as numpy would
>>> struct.pack('ibix', 1, 2, 3)
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x00'
# in numpy (and usually C structs), 4 more trailing pad bytes would be added here
>>> struct.pack('qi', 1, 2)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00' The pep3118 docs say the |
Ah, see on the python bug tracker struct.pack(): trailing padding bytes on x64 last changed 2016-05-10 |
After reading, I think numpy's current interpretation of trailing padding in pep3118 format strings is wrong. The struct module is pretty clear about how format strings treat trailing padding:
So in a format like 'ix', even though the type is implicitly I've updated the code accordingly. There is still more to do. |
Might take a look at #7467 while you are working on this. |
All right, the story gets more complicated. While the struct module does not add C-style trailing bytes to unstructured aligned data, the memoryview function does within
I don't know how to construct a memoryview of format |
@ahaldane Just wondering, are there any plans in regards to moving this forward? |
Yeah I sort of forgot about this because it was so messy. As I recall I think the current state of this PR is a pretty good approximation of how things "should work", given the current somewhat incomplete state of pep3118 and memoryviews in python. I'll revisit it sometime soon, comments on it currently are welcome. |
Some overlap here with #9054 |
☔ The latest upstream changes (presumably #9054) made this pull request unmergeable. Please resolve the merge conflicts. |
9ce1070
to
8f72316
Compare
Rebased. As a reminder, this PR fixes the PEP3118 trailing padding, which was not accounted for at all before when converting an |
Here are two more things we might want to think more about:
>>> memoryview(np.zeros(3, dtype=np.dtype('u1,i4,u1', align=True))).format
'T{B:f0:3xi:f1:B:f2:3x}' Maybe it should be
>>> dt = dtype({'names': ['a', 'b'], 'formats': ['i4','i4'], 'offsets': [0,1]})
>>> memoryview(np.zeros(3, dtype=d))
RuntimeError: This should never happen: Invalid offset in buffer format string g
eneration. Please report a bug to the Numpy developers. |
8f72316
to
20f5478
Compare
numpy/core/src/multiarray/buffer.c
Outdated
"buffer format string generation. Please " | ||
"report a bug to the Numpy developers."); | ||
"The buffer interface does not support " | ||
"overlapping fields"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this path also trigger for fields that are not in sequence? (like the type returned by #6053)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I should make a better message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we going to do about #6053 here? Seems like it's going to make casting to memoryview a lot harder...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, you're right. The PEP3118 format cannot account for either overlapping fields or out-of-order fields (when order matters). That's something to think about...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused by PEP3118, as it:
- Seems to be designed to standardize some numpy behaviour for interoperability
- Causes discussions in the python bugtracker that refer to numpy as the canonical implementation
- (yet) Is not actually sufficient to represent either all C types or all numpy types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought (weeks later),
A more general way of describing a structured type is as a list of (name, type, offset)
tuples. This is how, for example, MPI serializes structs (see here). It is also how numpy already stores structured types internally.
Since a purpose of PEP3118 is serialization, maybe it would have made sense to specify structured types this way. Eg we could have had something like 'T24{i,0,f0:f,8,f1}
to represent dtype({'formats': ['i', 'f'], 'offsets': [0, 8], 'names': ['f0', 'f1'], 'itemsize': 24})
.
Since numpy is the main user of the PEP3118 interface, and because the PEP seems to be only partly implemented in cpython currently, there is probably still opportunity for us to work out the details or add to the specification, with some effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future reference, one more datapoint: The HDF5 datatype specification also stores "compound datatypes" as an ordered list of (name, format, offset)
tuples. So that is now 3 cases where structured datatypes are stored that way: numpy dtypes, MPI serialization, and HDF5 files.
I want to record these examples here in case in the future we want to argue to the CPython devs that we should add a new PEP3118 specification style for structured types, that is equivalent to a list of tuples.
@@ -5918,6 +5921,10 @@ def aligned(n): | |||
itemsize=aligned(size + 1) | |||
), (3,))) | |||
|
|||
expected_dtype = {'names': ['f0'], 'formats': ['i'], | |||
'itemsize': np.dtype('i,V1', align=True).itemsize} | |||
self._check('(3)T{ix}', (expected_dtype, (3,))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this an exact duplicate of the above test? What am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh sorry, I fudged the rebase here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although perhaps using np.dtype(..., align=True)
is an improvement over aligned(size + 1)
What is the itemsize of |
self._check('ixx', dict(itemsize=aligned(size + 2), **base)) | ||
self._check('ixxx', dict(itemsize=aligned(size + 3), **base)) | ||
self._check('ixxxx', dict(itemsize=aligned(size + 4), **base)) | ||
self._check('i7x', dict(itemsize=aligned(size + 7), **base)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these tests might remain valid with T{...}
? In particular, I would expect trailing padding to be kept in a struct context, to match the behaviour of sizeof(T)
in C, and what happens when structs are repeated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PEP3118 spec is unclear about this.
One could argue the struct
module has set no precedent in judging how alignment and padding work here since it doesn't implement the T{}
format. We might then feel free to set the precedent here in this PR, deciding that aligned formats add trailing padding only inside T{}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whereas the ctypes
module has set the precedent of ignoring all the remarks that the struct docs make about alignment...
@seberg @eric-wieser Is this still of interest? |
Needs rebase. I'm inclined to close this unless it is still relevant. @seberg @eric-wieser thoughts? |
It is still relevant, there was no change in behaviour with respect to the issue. I am not sure whether the trailing bytes themselves should or should not be relevant. The main interest would be with respect to getting alignment right in the buffer protocol (this PR does that as far as I can tell). But the mismatch between buffer protocol itemsize and the (I am not sure how easy it is to rebase this PR, it might be annoying, but I expect all the pieces here are still relevant just as much) |
20f5478
to
7d21238
Compare
7d21238
to
11bfeca
Compare
numpy/core/src/multiarray/buffer.c
Outdated
PyExc_ValueError, | ||
"dtypes with overlapping or out-of-order fields are not " | ||
"representable as buffers. Consider reordering the fields." | ||
); | ||
PyExc_ValueError, "The buffer interface does not support " | ||
"overlapping fields or out-of-order " | ||
"fields"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change of message deliberate, or just a consequence of the merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deliberate, based on your review above.
I just did a little bit more than a simple rebase. I also added support for 0-sized unnamed padding, see the release note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I see what you mean, the previous message seems good enough.
11bfeca
to
d7a697a
Compare
See: struct doc in particular "Note" 3 and the last example. |
ec97206
to
101cec8
Compare
101cec8
to
dfc8ec7
Compare
@ahaldane It's been quite a while, just wondering what's the status on this? |
As far as I remember, this is good-to-go, after review. I can rebase and re-review it myself in the next few days. As I recall the challenge was in figuring out a consistent way to interpret the various python/numpy docs. |
@ahaldane Please let us know if you need help with moving this PR forward. |
We discussed this at a recent triage meeting and decided to close it due to a lack of interest in moving it forward. If someone wants to pick it up, we would review the work. Thanks @ahaldane for looking at this. |
This changes how numpy treats trailing padding in the PEP3118 buffer interface, to fix #7797.
Previously, dtypes with trailing padding would have their trailing padding truncated when converted to memoryview using pep3118. Now trailing padding is maintained.
Then, I also needed to modify the pep3118-to-array conversion to stop explicitly adding void fields for the trailing padding. This way code of the form
np.array(memoryview(arr))
will give back the original array even if there is trailing padding.One thing I still need to think about is what to do for "aligned" types. The pep3118 doc is not clear what "alignment" means. It appears that
ix
andixxx
are supposed to mean the same thing: a structure of 8 bytes containing an integer in the first 4 bytes and 4 bytes of padding, but I am not 100% sure that is right.