
WIP, BUG: preserve endianness of concat #21319


Closed
wants to merge 1 commit

Conversation

tylerjereddy
Contributor

Fixes #7829

  • add a regression test for the described issue,
    and C code changes to make it pass

  • one curious problem, likely at the C level:
    this causes a few test failures in the full test suite,
    but none of them are reproducible if you tell pytest
    to run them in isolation; also, which tests fail is
    somewhat stochastic -- a reference counting issue, maybe?

  • other things it may be useful to consider:

    • most sensible behavior when combining arrays with different
      endianness when out and/or dtype are not specified (I've assumed
      this is covered by the tests already in place, though it's quite
      possible it is not)
    • if we want to do this, do we need a versionchanged directive?
      maybe just in concatenate, but not in the functions that leverage
      concatenate?
    • run the tests on a big-endian machine on the GCC compile farm to
      be safe


if (single_byteorder == 1 && dtype == NULL) {
    descr->byteorder = initial_byteorder;
}
Contributor Author

These 3 lines are the ones that cause the somewhat-stochastic test failures in the full suite, which in my hands are not reproducible when running those individual tests (or even their entire modules) in isolation on this branch. If they are commented out, the new test I added fails, but everything else passes.

Reference counting or something else?

Member

The problem is that descr is a singleton (or can be). You would have to create a new dtype and swap the bytes. (There should be a NewByteOrder function for that, I think.)
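For reference, a minimal sketch of that suggestion, assuming descr holds an owned reference at this point (variable names follow the diff in this PR) and using NumPy's existing PyArray_DescrNewByteorder:

if (single_byteorder == 1 && dtype == NULL) {
    /* Never mutate descr in place: it may be a shared singleton.
     * Build a fresh descriptor carrying the common byte order instead. */
    PyArray_Descr *swapped = PyArray_DescrNewByteorder(descr, initial_byteorder);
    if (swapped == NULL) {
        return NULL;  /* propagate the error, as the surrounding code does */
    }
    Py_DECREF(descr);
    descr = swapped;
}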

@@ -382,6 +382,9 @@ PyArray_ConcatenateArrays(int narrays, PyArrayObject **arrays, int axis,
                          NPY_CASTING casting)
{
    int iarrays, idim, ndim;
    char initial_byteorder;
    char iterative_byteorder;
    int single_byteorder = 1;
Contributor Author

a smaller integer/bool type could maybe be used here?
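For what it's worth, a sketch of that idea using the one-byte npy_bool typedef from numpy/npy_common.h instead of an int:

/* npy_bool is a one-byte unsigned type; NPY_TRUE/NPY_FALSE come with it */
npy_bool single_byteorder = NPY_TRUE;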

    if (iterative_byteorder != initial_byteorder) {
        single_byteorder = 0;
    }
}
Contributor Author

we already have loops that do this iteration, and I did have success fusing this check into those loops, if that is preferred; obviously it doesn't change the asymptotic complexity

Contributor Author

(we could also break early I suppose, if that condition is ever satisfied)
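For illustration, a sketch of that early-exit variant fused into a single pass, assuming narrays has already been validated as positive (as the surrounding function does) and using NumPy's PyArray_DESCR accessor:

initial_byteorder = PyArray_DESCR(arrays[0])->byteorder;
single_byteorder = 1;
for (iarrays = 1; iarrays < narrays; iarrays++) {
    if (PyArray_DESCR(arrays[iarrays])->byteorder != initial_byteorder) {
        single_byteorder = 0;
        break;  /* one mismatch settles it; no need to scan the rest */
    }
}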

@seberg
Member

seberg commented Apr 11, 2022

Are we sure we want to preserve endianness here? My feeling is that the only thing I would seriously consider is preserving the dtype exactly when all dtypes are equivalent (we probably can't require identical; that would be asking for too much in practice).

In general, I am not sure it makes sense. Users are normally better off with native byte order. Concatenate always copies the data, so swapping bytes in the concat operation probably makes sense for most users, unless they really rely on byte order.
The question is whether we can't expect those users to write dtype=...?

@tylerjereddy
Contributor Author

Ok, there isn't much discussion in the issue, and I don't think it is necessarily labeled a bug, so I guess I'm not sure what tests I should be aiming to make pass just yet.

@seberg
Member

seberg commented Apr 12, 2022

There is a bit more discussion around this topic in gh-15088. I do think that for np.result_type, "canonicalizing" the dtype makes sense. For concatenate, I agree there could be an argument against it. But I still think that unless you happen to be writing anything to do with serialization, you are better off with the canonicalizing behaviour anyway.
So right now, it seems to me that preserving byte order would really only benefit very few users, who probably need to double-check a lot of things anyway.

@tylerjereddy
Contributor Author

Is it best to close this and summarize your points in the matching issue for now?

@seberg
Member

seberg commented Apr 12, 2022

Yeah, let me close it then. Thanks Tyler.

@seberg seberg closed this Apr 12, 2022

Successfully merging this pull request may close these issues.

np.concatenate loses endianness / byte order