ENH: properly account for trailing padding in PEP3118 #7798

ahaldane · 2016-07-02T19:27:30Z

This changes how numpy treats trailing padding in the PEP3118 buffer interface, to fix #7797.

Previously, dtypes with trailing padding would have their trailing padding truncated when converted to memoryview using pep3118. Now trailing padding is maintained.

Then, I also needed to modify the pep3118-to-array conversion to stop explicitly adding void fields for the trailing padding. This way code of the form np.array(memoryview(arr)) will give back the original array even if there is trailing padding.
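
For reference, a minimal sketch of the round trip being discussed. The dtype below is only an illustration, and because the result of the round trip depends on the NumPy version in use, the sketch prints what happens instead of assuming an outcome.

```python
import numpy as np

# An aligned struct dtype: one double followed by one byte, so a C-style
# layout ends with trailing padding after the byte.
dt = np.dtype([('x', 'f8'), ('y', 'i1')], align=True)
arr = np.zeros(3, dtype=dt)

mv = memoryview(arr)
# Bytes actually covered by fields: offset of the last field plus its size.
used_bytes = dt.fields['y'][1] + dt.fields['y'][0].itemsize
print("format:", mv.format)
print("itemsize:", mv.itemsize, "bytes used by fields:", used_bytes)

# Round trip. The outcome is version dependent, so nothing is asserted here.
try:
    back = np.array(mv)
    print("round trip dtype matches:", back.dtype == arr.dtype)
except Exception as exc:  # broad on purpose: the failure mode may vary
    print("round trip failed:", type(exc).__name__, exc)
```

The gap between `mv.itemsize` and `used_bytes` is the trailing padding that the format string has to account for.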

One thing I still need to think about is what to do for "aligned" types. The pep3118 doc is not clear about what "alignment" means. It appears that ix and ixxx are supposed to mean the same thing: a structure of 8 bytes containing an integer in the first 4 bytes and 4 bytes of padding, but I am not 100% sure that is right.

pv · 2016-07-02T20:06:02Z

numpy/core/tests/test_multiarray.py

-        self._check('ixxxx', [('f0', 'i'), ('', VV(4))])
-        self._check('i7x', [('f0', 'i'), ('', VV(7))])
+        # XXX not sure if this aligment is desired behavior
+        self._check('ix',    dtype_itemsize(8))


The alignment in native mode is supposed to follow that of the C compiler, so the test should check that the results agree with dtype.alignment (which is determined by the compiler on the platform).
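
A small sketch of what deriving the expected size from dtype.alignment could look like, instead of hard-coding 8. The helper name `round_up` is hypothetical and not part of the test suite.

```python
import numpy as np

def round_up(n, alignment):
    """Round n up to the next multiple of alignment (hypothetical helper)."""
    return -(-n // alignment) * alignment

int_dt = np.dtype('i')  # C int on this platform

# An aligned struct holding one int and one byte.
aligned_dt = np.dtype([('f0', 'i'), ('f1', 'u1')], align=True)

# Expected size: the int, plus one byte, rounded up to the int's alignment.
expected = round_up(int_dt.itemsize + 1, int_dt.alignment)
print(int_dt.alignment, aligned_dt.itemsize, expected)
```

Written this way, the check follows whatever alignment the compiler chose for the platform.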

ahaldane · 2016-07-03T00:32:59Z

Thanks @pv, I made some small corrections.

I need to read and think about the proper behavior for the unstructured types and aligned structured types.

It seems to me the PEP3118 format supports arrangements that can't really be done in numpy. For example, while numpy does support trailing padding in structured types like T{i:f0:xx}, there is no natural way in numpy to include trailing padding without a structure, like in ixx. Therefore it may be reasonable to raise an error in the latter case, when trying to convert memoryviews to arrays.

Also, I was looking at how the struct module treats padding bytes. While it adds internal padding bytes to maintain alignment, it doesn't add trailing padding bytes the way numpy does (in this PR or before):

# b gets 4 bytes here, like in dtype('i,u1,i', align=True)
>>> struct.pack('ibi', 1, 2, 3)  
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'

# trailing x only adds 1 byte, not 4 as numpy would
>>> struct.pack('ibix', 1, 2, 3)  
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x00'

# in numpy (and usually C structs), 4 more trailing pad bytes would be added here
>>> struct.pack('qi', 1, 2) 
b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00'

The pep3118 docs say the struct module is supposed to support the new format strings, but it doesn't appear to yet. So maybe the discrepancy between struct and numpy is struct's problem, not numpy's.

ahaldane · 2016-07-03T00:58:28Z

Ah, see on the python bug tracker

struct.pack(): trailing padding bytes on x64 last changed 2016-05-10
implement PEP 3118 struct changes last changed 2016-04-13

ahaldane · 2016-07-04T01:39:19Z

After reading, I think numpy's current interpretation of trailing padding in pep3118 format strings is wrong. The struct module is pretty clear about how format strings treat trailing padding:

Padding is only automatically added between successive structure members. No padding is added at the beginning or the end of the encoded struct.

No padding is added when using non-native size and alignment, e.g. with ‘<’, ‘>’, ‘=’, and ‘!’.

To align the end of a structure to the alignment requirement of a particular type, end the format with the code for that type with a repeat count of zero.

So in a format like 'ix', even though the type is implicitly @ (aligned) we should only add 1 trailing padding byte. If we wanted numpy-style padding bytes the format string should instead be something like 'ix0i'. Unfortunately this means that even though ix is in "aligned" mode, the resulting numpy dtype does not count as aligned.
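
The struct rules quoted above can be checked with struct.calcsize. This is a sketch that assumes a platform where a C int is 4 bytes with 4-byte alignment, which is the common case.

```python
import struct

int_size = struct.calcsize('i')

# Native mode ('@' is implicit): one int plus one explicit pad byte.
# No trailing alignment padding is added.
native_plain = struct.calcsize('ix')

# A zero-count 'i' at the end aligns the end of the struct to an int boundary.
native_aligned = struct.calcsize('ix0i')

# Standard mode ('<'): no alignment at all, so the zero-count code adds nothing.
standard = struct.calcsize('<ix0i')

print(native_plain, native_aligned, standard)
```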

I've updated the code accordingly. There is still more to do.

charris · 2016-07-04T15:33:09Z

Might take a look at #7467 while you are working on this.

ahaldane · 2016-07-06T19:02:13Z

All right, the story gets more complicated.

While the struct module does not add C-style trailing bytes to unstructured aligned data, the memoryview function does within T{}:

>>> arr = np.zeros(1, dtype=np.dtype([('x', '<f8'), ('y', '<b')], align=True))
>>> m1 = memoryview(arr)
>>> m1.format, m1.itemsize
('T{d:x:b:y:}', 16)

#  ctypes has the same behavior
>>> class POINT(ctypes.Structure):
...     _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_byte)]
...
>>> m2 = memoryview((POINT)())
>>> m2.format, m2.itemsize
('T{<d:x:<b:y:}', 16)

>>> len(struct.pack('db', 0.0, 0))
9

I don't know how to construct a memoryview of format ix so I can't test that case. But it appears that trailing padding is added within T{} (which does not currently exist in the struct module) while struct does not add it outside of T{}.
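
To make the discrepancy concrete, here is a sketch comparing the three sizes side by side. The class name `Point` is illustrative, and the zero-count `0d` is the struct module's documented way to request end alignment.

```python
import ctypes
import struct

class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_byte)]

# struct: padding only between members, none at the end.
packed_size = struct.calcsize('db')

# struct with a zero-count double: the end is aligned like a C struct.
aligned_size = struct.calcsize('db0d')

# ctypes follows the C compiler's sizeof, which includes trailing padding.
ctypes_size = ctypes.sizeof(Point)

print(packed_size, aligned_size, ctypes_size)
```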

aldanor · 2016-10-31T14:21:23Z

@ahaldane Just wondering, are there any plans in regards to moving this forward?

ahaldane · 2016-10-31T15:43:59Z

Yeah I sort of forgot about this because it was so messy.

As I recall I think the current state of this PR is a pretty good approximation of how things "should work", given the current somewhat incomplete state of pep3118 and memoryviews in python.

I'll revisit it sometime soon, comments on it currently are welcome.

eric-wieser · 2017-05-08T00:00:58Z

Some overlap here with #9054

homu · 2017-05-09T16:59:14Z

☔ The latest upstream changes (presumably #9054) made this pull request unmergeable. Please resolve the merge conflicts.

ahaldane · 2017-05-09T19:12:39Z

Rebased.

As a reminder, this PR fixes the PEP3118 trailing padding, which was not accounted for at all before when converting an array->memoryview. Importantly, I now interpret the ix format as 5 bytes and ix0i as 8 (on x64), as the struct module documentation says should be the case. This means I had to tweak the memoryview->array code since it did something else before. This should have no effect on any real code since such formats are never created in practice.

ahaldane · 2017-05-09T19:13:11Z

Here are two more things we might want to think more about:

Currently, although the format string numpy produces from an array works properly, it is perhaps a bit ugly. Consider:

>>> memoryview(np.zeros(3, dtype=np.dtype('u1,i4,u1', align=True))).format
'T{B:f0:3xi:f1:B:f2:3x}'

Maybe it should be T{B:f0:0ii:f1:B:f2:0i} or T{B:f0:i:f1:B:f2:0i}.

While fixing this up I realized there are dtypes which we will never be able to represent with the PEP3118 syntax: those with overlapping fields:

>>> dt = np.dtype({'names': ['a', 'b'], 'formats': ['i4','i4'], 'offsets': [0,1]})
>>> memoryview(np.zeros(3, dtype=dt))
RuntimeError: This should never happen: Invalid offset in buffer format string generation. Please report a bug to the Numpy developers.

eric-wieser · 2017-05-09T22:12:22Z

numpy/core/src/multiarray/buffer.c

-                                "buffer format string generation. Please "
-                                "report a bug to the Numpy developers.");
+                                "The buffer interface does not support "
+                                "overlapping fields");


Doesn't this path also trigger for fields that are not in sequence? (like the type returned by #6053)

yes, I should make a better message

What are we going to do about #6053 here? Seems like it's going to make casting to memoryview a lot harder...

Oh I see, you're right. The PEP3118 format cannot account for either overlapping fields or out-of-order fields (when order matters). That's something to think about...
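
A sketch of the condition in question: a sequential format string can only describe fields that appear in increasing offset order and do not overlap. The function name `fields_in_order` is hypothetical, and the check only looks at top-level fields.

```python
import numpy as np

def fields_in_order(dt):
    """Hypothetical check: True if each field starts at or after the end
    of the previous one, in the order the names are listed."""
    end = 0
    for name in dt.names:
        field_dt, offset = dt.fields[name][:2]
        if offset < end:
            return False  # overlaps, or sits before, an earlier field
        end = offset + field_dt.itemsize
    return True

overlapping = np.dtype({'names': ['a', 'b'], 'formats': ['i4', 'i4'],
                        'offsets': [0, 1]})
out_of_order = np.dtype({'names': ['a', 'b'], 'formats': ['i4', 'i4'],
                         'offsets': [4, 0]})
regular = np.dtype('u1,i4,u1', align=True)

print(fields_in_order(overlapping), fields_in_order(out_of_order),
      fields_in_order(regular))
```

Both failing dtypes are valid in numpy, which is why they have no equivalent in a purely sequential format.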

I'm a little confused by PEP3118, as it:

Seems to be designed to standardize some numpy behaviour for interoperability

Causes discussions in the python bugtracker that refer to numpy as the canonical implementation

(yet) Is not actually sufficient to represent either all C types or all numpy types

Just a thought (weeks later),

A more general way of describing a structured type is as a list of (name, type, offset) tuples. This is how, for example, MPI serializes structs (see here). It is also how numpy already stores structured types internally.

Since a purpose of PEP3118 is serialization, maybe it would have made sense to specify structured types this way. E.g., we could have had something like 'T24{i,0,f0:f,8,f1}' to represent dtype({'formats': ['i', 'f'], 'offsets': [0, 8], 'names': ['f0', 'f1'], 'itemsize': 24}).
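
A sketch of the (name, type, offset) idea using numpy's existing dict-based dtype constructor. The helper names `to_triples` and `from_triples` are hypothetical, and the sketch only handles flat dtypes with scalar fields.

```python
import numpy as np

def to_triples(dt):
    """Describe a flat structured dtype as (name, format, offset) tuples."""
    return [(name, dt.fields[name][0].str, dt.fields[name][1])
            for name in dt.names]

def from_triples(triples, itemsize):
    """Rebuild a dtype from (name, format, offset) tuples and an itemsize."""
    names, formats, offsets = zip(*triples)
    return np.dtype({'names': list(names), 'formats': list(formats),
                     'offsets': list(offsets), 'itemsize': itemsize})

dt = np.dtype({'formats': ['i', 'f'], 'offsets': [0, 8],
               'names': ['f0', 'f1'], 'itemsize': 24})

triples = to_triples(dt)
rebuilt = from_triples(triples, dt.itemsize)
print(triples, rebuilt == dt)
```

Because every offset is stored explicitly, this description also covers overlapping and out-of-order fields, as well as any amount of trailing padding.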

Since numpy is the main user of the PEP3118 interface, and because the PEP seems to be only partly implemented in cpython currently, there is probably still opportunity for us to work out the details or add to the specification, with some effort.

For future reference, one more datapoint: The HDF5 datatype specification also stores "compound datatypes" as an ordered list of (name, format, offset) tuples. So that is now 3 cases where structured datatypes are stored that way: numpy dtypes, MPI serialization, and HDF5 files.

I want to record these examples here in case in the future we want to argue to the CPython devs that we should add a new PEP3118 specification style for structured types, that is equivalent to a list of tuples.

eric-wieser · 2017-05-09T22:25:59Z

numpy/core/tests/test_multiarray.py

@@ -5918,6 +5921,10 @@ def aligned(n):
            itemsize=aligned(size + 1)
        ), (3,)))

+        expected_dtype = {'names': ['f0'], 'formats': ['i'],
+                          'itemsize': np.dtype('i,V1', align=True).itemsize}
+        self._check('(3)T{ix}', (expected_dtype, (3,)))


Isn't this an exact duplicate of the above test? What am I missing?

oh sorry, I fudged the rebase here

Although perhaps using np.dtype(..., align=True) is an improvement over aligned(size + 1)

eric-wieser · 2017-05-09T22:26:55Z

What is the itemsize of T{ix} with this patch?

eric-wieser · 2017-05-09T23:39:18Z

numpy/core/tests/test_multiarray.py

-        self._check('ixx',   dict(itemsize=aligned(size + 2), **base))
-        self._check('ixxx',  dict(itemsize=aligned(size + 3), **base))
-        self._check('ixxxx', dict(itemsize=aligned(size + 4), **base))
-        self._check('i7x',   dict(itemsize=aligned(size + 7), **base))


I think these tests might remain valid with T{...}? In particular, I would expect trailing padding to be kept in a struct context, to match the behaviour of sizeof(T) in C, and what happens when structs are repeated

The PEP3118 spec is unclear about this.

One could argue the struct module has set no precedent in judging how alignment and padding work here since it doesn't implement the T{} format. We might then feel free to set the precedent here in this PR, deciding that aligned formats add trailing padding only inside T{}.

Whereas the ctypes module has set the precedent of ignoring all the remarks that the struct docs make about alignment...
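
To illustrate why trailing padding matters once structs are repeated, here is a sketch comparing a padded and an unpadded single-int struct dtype. The helper `int_fields_aligned` is hypothetical, and the result assumes a platform where C int has an alignment greater than 1.

```python
import numpy as np

int_dt = np.dtype('i')

# The dtype the T{ix} test expects: one int, padded to the aligned size.
padded = np.dtype({'names': ['f0'], 'formats': ['i'],
                   'itemsize': np.dtype('i,V1', align=True).itemsize})

# The same field with exactly one pad byte and no alignment padding.
packed = np.dtype({'names': ['f0'], 'formats': ['i'],
                   'itemsize': int_dt.itemsize + 1})

def int_fields_aligned(dt, count=3):
    """True if f0 of every element starts on an int-aligned offset
    relative to the start of the array."""
    arr = np.zeros(count, dtype=dt)
    stride = arr.strides[0]
    return all((i * stride) % int_dt.alignment == 0 for i in range(count))

print(padded.itemsize, int_fields_aligned(padded))
print(packed.itemsize, int_fields_aligned(packed))
```

With the packed layout the second element's int already sits at an unaligned offset, and that is the situation C's sizeof(T) padding avoids.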

charris · 2020-12-16T16:58:25Z

@seberg @eric-wieser Is this still of interest?

charris · 2021-01-25T18:42:27Z

Needs rebase. I'm inclined to close this unless it is still relevant. @seberg @eric-wieser thoughts?

seberg · 2021-01-25T19:13:21Z

It is still relevant, there was no change in behaviour with respect to the issue.

I am not sure whether the trailing bytes themselves should or should not be relevant. The main interest would be with respect to getting alignment right in the buffer protocol (this PR does that as far as I can tell). But the mismatch between buffer protocol itemsize and the length that is set, does currently prevent roundtripping array -> buffer -> array in certain cases, so I guess there is something wrong.

(I am not sure how easy it is to rebase this PR, it might be annoying, but I expect all the pieces here are still relevant just as much)

Fixes numpy#7797

eric-wieser · 2021-02-21T16:16:37Z

numpy/core/src/multiarray/buffer.c

-                    PyExc_ValueError,
-                    "dtypes with overlapping or out-of-order fields are not "
-                    "representable as buffers. Consider reordering the fields."
-                );
+                    PyExc_ValueError, "The buffer interface does not support "
+                                      "overlapping fields or out-of-order "
+                                      "fields");


Is this change of message deliberate, or just a consequence of the merge?

deliberate, based on your review above.

I just did a little bit more than a simple rebase. I also added support for 0-sized unnamed padding, see the release note.

oh I see what you mean, the previous message seems good enough.

ahaldane · 2021-02-21T16:30:09Z

rebased
tidied up based on review comments from last time
added release note
NEW: no automatic assumption of trailing align-padding if the format is not inside T{}, as the struct doc says to do.
NEW: added support for 0-sized unnamed fields (eg to add trailing padding), as struct supports.

See: struct doc in particular "Note" 3 and the last example.

aldanor · 2022-01-12T14:35:07Z

@ahaldane It's been quite a while, just wondering what's the status on this?

ahaldane · 2022-01-13T18:07:00Z

As far as I remember, this is good-to-go, after review.

I can rebase and re-review it myself in the next few days. As I recall the challenge was in figuring out a consistent way to interpret the various python/numpy docs.

InessaPawson · 2022-08-07T21:55:52Z

@ahaldane Please let us know if you need help with moving this PR forward.

mattip · 2024-04-17T19:29:58Z

We discussed this at a recent triage meeting and decided to close it due to a lack of interest in moving it forward. If someone wants to pick it up, we would review the work. Thanks @ahaldane for looking at this.

pv reviewed Jul 2, 2016
ahaldane force-pushed the pep3118_trailing_pad branch 3 times, most recently from 842dd34 to e540543 Compare July 2, 2016 22:10

ahaldane mentioned this pull request Jul 4, 2016

Array from memoryview fails if there's trailing padding #7797

Open

charris added 00 - Bug component: numpy._core labels Jul 4, 2016

ahaldane mentioned this pull request Jul 6, 2016

BUG: interpret 'c' PEP3118/struct type as 'S1'. #7803

Merged

ahaldane changed the title ~~WIP: ENH: properly account for tailing padding in PEP3118~~ WIP: ENH: properly account for trailing padding in PEP3118 Jul 6, 2016

patstew mentioned this pull request Nov 21, 2016

Add the buffer interface for wrapped STL vectors pybind/pybind11#488

Merged

ahaldane mentioned this pull request May 8, 2017

BUG: Various fixes to _dtype_from_pep3118 #9054

Merged

ahaldane force-pushed the pep3118_trailing_pad branch 2 times, most recently from 9ce1070 to 8f72316 Compare May 9, 2017 19:11

ahaldane force-pushed the pep3118_trailing_pad branch from 8f72316 to 20f5478 Compare May 9, 2017 19:16

eric-wieser reviewed May 9, 2017

ahaldane mentioned this pull request Nov 27, 2017

Numpy does not recognize ctypes arrays with c_wchar field #10100

Closed

ahaldane mentioned this pull request Apr 16, 2018

BUG: Revert multifield-indexing adds padding bytes for NumPy 1.15. #10411

Merged

ahaldane mentioned this pull request Sep 26, 2018

BUG: define "uint-alignment", fixes complex64 alignment #6377

Merged

ahaldane mentioned this pull request Nov 18, 2018

DEP: deprecate empty field names #12375

Closed

ENH: output trailing padding in PEP3118 format strings

08734b1

Fixes numpy#7797

ahaldane force-pushed the pep3118_trailing_pad branch from 20f5478 to 7d21238 Compare February 21, 2021 15:51

github-actions bot added the 25 - WIP label Feb 21, 2021

ahaldane force-pushed the pep3118_trailing_pad branch from 7d21238 to 11bfeca Compare February 21, 2021 16:05

eric-wieser reviewed Feb 21, 2021

ahaldane force-pushed the pep3118_trailing_pad branch from 11bfeca to d7a697a Compare February 21, 2021 16:25

ahaldane force-pushed the pep3118_trailing_pad branch 4 times, most recently from ec97206 to 101cec8 Compare February 21, 2021 18:10

ENH: allow 0-sized elements in PEP3118 format strings to align

dfc8ec7

ahaldane force-pushed the pep3118_trailing_pad branch from 101cec8 to dfc8ec7 Compare February 21, 2021 18:19

ahaldane changed the title ~~WIP: ENH: properly account for trailing padding in PEP3118~~ ENH: properly account for trailing padding in PEP3118 Feb 22, 2021

github-actions bot added the 01 - Enhancement label Feb 22, 2021

Base automatically changed from master to main March 4, 2021 02:03

charris added 58 - Ready for review and removed 25 - WIP labels Jun 9, 2022

mattip closed this Apr 17, 2024
