
WIP, BUG: preserve endianness of concat #21319


Closed
wants to merge 1 commit

Conversation

tylerjereddy
Contributor

Fixes #7829

  • add a regression test for the described issue,
    and C code changes to make it pass

  • one curious problem, likely at the C level:
    this causes a few test failures in the full test suite,
    but none of them are reproducible if you tell pytest
    to run them in isolation; also, which tests fail is
    somewhat stochastic -- a reference counting issue, maybe?

  • other things it may be useful to consider:

    • most sensible behavior when combining arrays with different
      endianness when out and/or dtype are not specified (I've assumed
      this is covered by the tests already in place, though it's quite
      possible it is not)
    • if we want to do this, do we need a versionchanged directive?
      maybe just in concatenate, but not in the functions that leverage
      concatenate?
    • run the tests on a big-endian machine on the GCC compile farm to
      be safe


if (single_byteorder == 1 && dtype == NULL) {
    descr->byteorder = initial_byteorder;
}
Contributor Author

These 3 lines are the ones that cause the somewhat-stochastic test failures in the full suite, which in my hands are not reproducible when running those individual tests (or even their entire modules) in isolation on this branch. If they are commented out, the new test I added fails, but everything else passes.

Reference counting or something else?

Member

The problem is that descr is a singleton (or can be). You would have to create a new dtype and swap the bytes. (There should be a NewByteOrder function for that, I think.)
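For reference, a minimal sketch of that suggestion, assuming descr holds an owned reference at this point (variable names follow the diff in this PR) and using NumPy's existing PyArray_DescrNewByteorder:

if (single_byteorder == 1 && dtype == NULL) {
    /* Never mutate descr in place: it may be a shared singleton.
     * Build a fresh descriptor carrying the common byte order instead. */
    PyArray_Descr *swapped = PyArray_DescrNewByteorder(descr, initial_byteorder);
    if (swapped == NULL) {
        return NULL;  /* propagate the error, as the surrounding code does */
    }
    Py_DECREF(descr);
    descr = swapped;
}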

@@ -382,6 +382,9 @@ PyArray_ConcatenateArrays(int narrays, PyArrayObject **arrays, int axis,
                          NPY_CASTING casting)
{
    int iarrays, idim, ndim;
    char initial_byteorder;
    char iterative_byteorder;
    int single_byteorder = 1;
Contributor Author

a smaller integer/bool type could maybe be used here?
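For what it's worth, a sketch of that idea using the one-byte npy_bool typedef from numpy/npy_common.h instead of an int:

/* npy_bool is a one-byte unsigned type; NPY_TRUE/NPY_FALSE come with it */
npy_bool single_byteorder = NPY_TRUE;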

    if (iterative_byteorder != initial_byteorder) {
        single_byteorder = 0;
    }
}
Contributor Author

we already have loops that do this iteration, and I did have success fusing this check into those loops, if that is preferred; obviously it doesn't change the asymptotic complexity

Contributor Author

(we could also break early I suppose, if that condition is ever satisfied)
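For illustration, a sketch of that early-exit variant fused into a single pass, assuming narrays has already been validated as positive (as the surrounding function does) and using NumPy's PyArray_DESCR accessor:

initial_byteorder = PyArray_DESCR(arrays[0])->byteorder;
single_byteorder = 1;
for (iarrays = 1; iarrays < narrays; iarrays++) {
    if (PyArray_DESCR(arrays[iarrays])->byteorder != initial_byteorder) {
        single_byteorder = 0;
        break;  /* one mismatch settles it; no need to scan the rest */
    }
}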

@seberg
Member

seberg commented Apr 11, 2022

Are we sure we want to preserve endianness here? My feeling is that the only thing I would seriously consider is preserving the dtype exactly when all dtypes are equivalent (we probably can't require identical; that would be asking for too much in practice).

In general, I am not sure it makes sense. Users are normally better off with native byte order. Concatenate always copies the data, so swapping bytes in the concat operation probably makes sense for most users, unless they really rely on byte order.
The question is whether we can't expect those users to write dtype=...?

@tylerjereddy
Contributor Author

Ok, there isn't much discussion in the issue, and I don't think it is necessarily labeled a bug, so I guess I'm not sure what tests I should be aiming to make pass just yet.

@seberg
Member

seberg commented Apr 12, 2022

There is a bit more discussion around this topic in gh-15088. I do think that for np.result_type, "canonicalizing" the dtype makes sense. For concatenate, I agree there could be an argument against it. But I still think that unless you happen to be writing anything to do with serialization, you are better off with the canonicalizing behaviour anyway.
So right now, it seems to me that preserving byte order would really only benefit very few users, who probably need to double-check a lot of things anyway.

@tylerjereddy
Contributor Author

Is it best to close this and summarize your points in the matching issue for now?

@seberg
Member

seberg commented Apr 12, 2022

Yeah, let me close it then. Thanks Tyler.

@seberg seberg closed this Apr 12, 2022

Successfully merging this pull request may close these issues.

np.concatenate loses endianness / byte order