Fixed a cross-platform endian issue #17644
Conversation
Thanks for this PR @qzhang90! It's a valid fix but I have mixed feelings about it, as pickle is explicitly not portable across Python versions (and likely architectures). If we try to correct for it, I'm afraid it would require a lot of fixes across the code base and still be unsatisfactory. The easiest and most reliable way is still to train on the same architecture as the deployment. Unless you are deploying in a low-resource environment, in which case something like ONNX would likely be more appropriate for deployment.
In particular, here I don't understand why we need to swap byte order for `node_ndarray` but not `value_ndarray`.
Let's see what other reviewers think.
I agree, pickle is not meant to be used cross-platform.
Thanks for your pull request, but I think the consensus is that pickle is not supposed to be used cross-platform, and fixing the issue would be better done at the pickle level (if ever done). For serving purposes, please refer to ONNX, which should solve the multi-arch issue.
I have changed my mind on this, sorry. Pickle won't work across versions, but it should work cross-platform, if it doesn't require much work on our side. The answer of using ONNX has its own constraints (I imagine I can't encode very generic Python pre-processing with it). Also, I was recently surprised that pickle generally works between x86_64 and pyodide (32-bit WebAssembly), and trying to use ONNX there would add more complexity. Actually, the issue that this PR aims to solve was earlier reported in pyodide/pyodide#521. We just need a more generic and backward-compatible solution for big-endian/little-endian serialization in trees...
Thinking about it with @rth: pickles from the standard lib and joblib are cross-platform compatible as long as you use the same version of Python and libraries, so maybe we should accept reviewing PRs to fix this issue. I am not sure how many models would have similar problems in scikit-learn. Maybe NearestNeighbor models based on KD/ball trees? It can be very helpful to support training on a big Intel/AMD server and then deploying on low-power ARM machines. Mandating ONNX for this quite common use case might be a bit too restrictive, as ONNX does not necessarily cover the full scikit-learn system. The only problem is that I am not sure how to properly test this.
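To illustrate what "cross-platform pickle" trips over: numpy records byte order in the array's `dtype`, so an array pickled on an opposite-endian machine arrives with a non-native dtype even though its values are intact. A minimal sketch (simulating the foreign-endian array with `byteswap` + `newbyteorder` on a single machine):

```python
import numpy as np

# Native-order array, as it would be pickled on the training machine.
a = np.arange(5, dtype=np.float64)

# Simulate loading that pickle on a machine with the opposite byte order:
# the raw bytes are unchanged, but the dtype now records a non-native order.
swapped = a.byteswap().view(a.dtype.newbyteorder())

# The values survive, because the dtype correctly describes the swapped bytes...
assert np.array_equal(a, swapped)
# ...but strict dtype equality no longer holds, which is what breaks loading.
assert swapped.dtype != a.dtype
```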
This is still an open question AFAIK.
ARM64 is now supported in Azure Pipelines, that might help:
If we are going to pursue cross-platform support, can we do it incrementally? As suggested by @ogrisel, we can test the most common use case: training on x86 Linux and predicting on ARM Linux.
Time to see if this can help with building and testing on ARM!
Right, but even with an Arm64 CI, testing this would be difficult, I think? Maybe having an Arm64 CI (unrelated to this issue) and manually checking that the proposed fix works as expected would already be a good start.
If someone knows the command-line instructions to manually test this on a guest ARM VM using qemu on a Linux x86_64 host machine, I would be very interested :)
So I think the way to go is via https://github.com/multiarch/qemu-user-static:

```shell
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
docker run -v`pwd`:/io --rm -it arm64v8/ubuntu /bin/bash
$ uname -a
Linux bacac753e50f 5.4.0-39-generic #43-Ubuntu SMP Fri Jun 19 10:28:31 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
```

I'm not fully sure how it works, but it feels like magic :) That's what conda-forge is using. A number of other architectures (e.g. ppc64) are also supported. We should probably add this to the maintainers doc; trying to build scikit-learn now.
Nested docker / qemu / docker for the win!
I updated your command line to add
The issue with mounting the local folder is that files written from Docker will end up being owned by root. So we can do this in CI, but for local development it might be better to just re-clone a fresh copy of scikit-learn inside Docker.
My point was to build a pickle file of the model on the host (amd64) and pass the pickle file via the shared folder to the arm64v8 guest inside the container to manually check that it can be loaded and works as expected. |
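A minimal sketch of that host-side half of the check (a temp file stands in for the shared `/io` mount, and the small model is illustrative; the guest would run only the load-and-predict part):

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# "Host" (amd64) side: train a small model and pickle it into the
# folder shared with the container.
X, y = make_classification(random_state=0)
clf = GradientBoostingClassifier(n_estimators=5, random_state=0).fit(X, y)
path = os.path.join(tempfile.gettempdir(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(clf, f)

# "Guest" (arm64v8) side: load the pickle from the shared folder and
# check it still predicts as expected.
with open(path, "rb") as f:
    restored = pickle.load(f)
assert (restored.predict(X) == clf.predict(X)).all()
```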
@ogrisel @rth https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L660 this line compares
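The dtype comparison at that line is strict, so a byte-swapped array fails it even when the data is recoverable. A sketch of the behaviour (`NODE_DTYPE` here is a simplified hypothetical stand-in for the real structured node dtype in `_tree.pyx`):

```python
import numpy as np

# Hypothetical, simplified stand-in for the structured dtype that the
# unpickled node array is checked against in sklearn/tree/_tree.pyx.
NODE_DTYPE = np.dtype([("left_child", np.intp), ("threshold", np.float64)])

nodes = np.zeros(3, dtype=NODE_DTYPE)

# An array unpickled from an opposite-endian machine carries a
# byte-swapped dtype, so a strict equality check rejects it...
foreign = nodes.byteswap().view(nodes.dtype.newbyteorder())
assert foreign.dtype != NODE_DTYPE

# ...even though swapping the bytes back yields a compatible array.
recovered = foreign.byteswap().view(foreign.dtype.newbyteorder())
assert recovered.dtype == NODE_DTYPE
```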
Thanks @qzhang90! Generally LGTM, but someone should try to reproduce before merging.
Please add an entry to the change log at doc/whats_new/v0.24.rst. Like the other entries there, please reference this pull request with `:pr:` and credit yourself with `:user:`.
Added comments Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
…earn into endian-support
@rth can you advise what is meant by
I tried to reproduce with qemu, but I realized that the architectures in https://github.com/multiarch/qemu-user-static are all little endian (at least the ones I tried, arm64v8 and ppc64le), same as x86_64, so that's not going to help.
Anyway, aside from the two comments below, this LGTM.
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
@rth s390x should be big endian
> s390x should be big endian
Thanks @qzhang90, indeed. TBH I don't want to re-build scikit-learn there again, it takes a while with qemu. So if you can confirm that this resolves the original issue for you I'm OK with the proposed fix.
Added a space after "if", to be PEP8 compliant. Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Just tested the latest version of this PR on my local s390x machine, which worked well (i.e., it was able to load models built on both little-endian and big-endian machines), thus I can confirm. Thanks @rth!
Hi @rth, just wondering, is there any planned or estimated release date for v0.24?
We try to have a 6-month release schedule. v0.23 was released in May, so v0.24 will likely be in October.
I see no easy way to set up a big-endian build on our current CI infrastructure. However, I think @qzhang90's change adds very little additional code complexity and is unlikely to impact any use of scikit-learn that used to work prior to this change (pickling and loading on LE architectures + OS).
So +1 for merging based on @qzhang90's manual check on his mainframe.
A very tricky one, hopefully things don't break :D
* fixed a cross-platform endian issue
* removed a duplicated dtype checking
* Trigger [arm64] CI
* added an entry to doc/whats_new/v0.24.rst
* moved my fix entry to sklearn.tree section
* Update sklearn/tree/_tree.pyx — Added comments (Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>)
* code simplified
* fixed a typo in sklearn/tree/_tree.pyx
* Update doc/whats_new/v0.24.rst (Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>)
* updated doc/whats_new/v0.24.rst
* Update sklearn/tree/_tree.pyx — Added a space after "if", to be PEP8 compliant. (Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>)
* Update sklearn/tree/_tree.pyx

Co-authored-by: Qi Zhang <q.zhang@ibm.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
This PR fixes a problem that I encountered when loading a GradientBoostingClassifier/GradientBoostingRegressor model, trained on a little-endian machine, on a big-endian machine. The error occurred at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L671. The reason was that the loaded `node_ndarray` was in little-endian byte order while the machine was expecting a big-endian one. The fix in this PR is, before raising an error, to try swapping the byte order of the `node_ndarray` and see whether it then satisfies the `dtype` check.
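The "swap before erroring" idea can be sketched as follows (a simplified stand-alone sketch, not the actual Cython code from `_tree.pyx`; the function name and plain-float dtype are illustrative):

```python
import numpy as np

def _coerce_byte_order(arr, expected_dtype):
    """Hypothetical sketch of the PR's approach: if the unpickled array's
    dtype doesn't match, try interpreting it as byte-swapped data from an
    opposite-endian machine before giving up."""
    if arr.dtype == expected_dtype:
        return arr
    # Swap the raw bytes and flip the dtype's recorded byte order.
    swapped = arr.byteswap().view(arr.dtype.newbyteorder())
    if swapped.dtype == expected_dtype:
        return swapped
    raise ValueError("Did not recognise loaded array dtype")

# A foreign-endian float64 array is accepted after conversion,
# with its values intact.
expected = np.dtype(np.float64)
foreign = np.arange(4.0).byteswap().view(expected.newbyteorder())
fixed = _coerce_byte_order(foreign, expected)
assert fixed.dtype == expected
assert np.array_equal(fixed, np.arange(4.0))
```

This keeps the happy path (native byte order) untouched, which is why the change is low-risk for existing pickles.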