Fixed a cross-platform endian issue #17644


Merged
merged 16 commits into from
Aug 12, 2020

Conversation

qzhang90
Contributor

@qzhang90 qzhang90 commented Jun 19, 2020

This PR fixes a problem I encountered when loading a GradientBoostingClassifier/GradientBoostingRegressor model, trained on a little-endian machine, onto a big-endian machine. The error occurred at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L671

The reason was that the loaded node_ndarray was in little-endian byte order while the machine expected a big-endian one.

The fix in this PR is, before raising an error, to try swapping the byte order of the node_ndarray and check whether the swapped array satisfies the dtype check.
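The idea can be sketched with NumPy alone (a minimal sketch; the two-field dtype and the helper name are illustrative, not scikit-learn's actual code, whose node dtype has more fields):

```python
import numpy as np

# Illustrative stand-in for the tree's node dtype in sklearn/tree/_tree.pyx.
NODE_DTYPE = np.dtype([("left_child", "<i8"), ("threshold", "<f8")])

def coerce_byte_order(node_ndarray):
    """Accept an array whose only mismatch with NODE_DTYPE is byte order."""
    if node_ndarray.dtype == NODE_DTYPE:
        return node_ndarray
    # Same field layout in the opposite byte order: convert instead of erroring.
    if node_ndarray.dtype.newbyteorder() == NODE_DTYPE:
        return node_ndarray.astype(NODE_DTYPE)
    raise ValueError("Did not recognise loaded array layout")

# Simulate an array unpickled from a machine of the opposite endianness.
foreign = np.zeros(3, dtype=NODE_DTYPE.newbyteorder())
fixed = coerce_byte_order(foreign)
```

After the conversion, `fixed.dtype` compares equal to `NODE_DTYPE`, so the existing dtype check passes.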

@rth rth left a comment
Member

Thanks for this PR @qzhang90! It's a valid fix, but I have mixed feelings about it, as pickle is explicitly not portable across Python versions (and likely architectures). If we try to correct for it, I'm afraid it would require a lot of fixes across the code and still be unsatisfactory. The easiest and most reliable way is still to train on the same architecture as the deployment target. Unless you are deploying in a low-resource environment, in which case something like ONNX would likely be more appropriate for deployment.

In particular, here I don't understand why we need to swap byte order for node_ndarray but not value_ndarray.

Let's see what other reviewers think.

@amueller
Member

I agree, pickle is not meant to be used cross-platform.

@adrinjalali
Member

Thanks for your pull request, but I think the consensus is that pickle is not supposed to be used cross-platform, and fixing the issue would be better done at the pickle level (if ever). For serving purposes, please refer to ONNX, which should solve the multi-arch issue.

@rth
Member

rth commented Jul 21, 2020

I have changed my mind on this, sorry. Pickle won't work across versions, but it should work cross-platform, if it doesn't require much work on our side.

The answer of using ONNX has its own constraints (I imagine I can't encode very generic Python pre-processing with it).

Also I was recently surprised that pickle generally works between x86_64 and pyodide (32bit WebAssembly) and trying to use ONNX there would add more complexity. Actually the issue that this PR aims to solve was earlier reported in pyodide/pyodide#521.

We just need a more generic and backward compatible solution for big endian/little endian serialization in trees...

@rth rth reopened this Jul 21, 2020
@ogrisel
Member

ogrisel commented Jul 21, 2020

Thinking about it with @rth, pickle from the standard lib and joblib pickles are cross-platform compatible as long as you use the same version of Python and libraries, so maybe we should accept reviewing PRs to fix this issue.

I am not sure how many models would have similar problems in scikit-learn. Maybe NearestNeighbors models based on KD/Ball trees?

It can be very helpful to support training on a big Intel/AMD server and then deploying on low-power ARM machines. Mandating ONNX for this quite common use case might be a bit too restrictive, as ONNX does not necessarily cover everything in scikit-learn.

The only problem is that I am not sure how to properly test this.

@ogrisel
Member

ogrisel commented Jul 21, 2020

In particular, here I don't understand why we need to swap byte order for node_ndarray but not value_ndarray.

This is still an open question AFAIK.

@ogrisel
Member

ogrisel commented Jul 21, 2020

ARM64 is now supported in Azure Pipelines, that might help:

https://docs.microsoft.com/en-us/azure/devops/release-notes/2020/sprint-171-update#additional-agent-platform-arm64

@thomasjpfan
Member

If we are going to add cross-platform support, can we do it incrementally? As suggested by @ogrisel, we can test for the most common use case: training on x86 Linux and predicting on ARM Linux.

ARM64 is now supported in Azure Pipelines, that might help:

Time to see if this can help with building and testing on arm!

@rth
Member

rth commented Jul 21, 2020

The only problem is that I am not sure how to properly test this.

we can test for the most common use case: training on x86 linux and predicting on arm linux.

Right, but even with an ARM64 CI, testing this would be difficult, I think? Maybe having an ARM64 CI (unrelated to this issue) and manually checking that the proposed fix works as expected would already be a good start.

@ogrisel
Member

ogrisel commented Jul 25, 2020

manually checking that the proposed fix works as expected would already be a good start.

If someone knows the command-line instructions to manually test this on a guest ARM VM using QEMU on a Linux x86_64 host machine, I would be very interested :)

@rth
Member

rth commented Jul 25, 2020

If someone knows the command-line instructions to manually test this on an guest ARM VM using qemu on a Linux x86_64 host machine,

So I think the way to go is via https://github.com/multiarch/qemu-user-static:

docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

docker run -v`pwd`:/io --rm -it arm64v8/ubuntu /bin/bash
$ uname -a
Linux bacac753e50f 5.4.0-39-generic #43-Ubuntu SMP Fri Jun 19 10:28:31 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux

I'm not fully sure how it works, but it feels like magic :) That's what conda-forge is using. A number of other architectures (e.g. ppc64) are also supported. We should probably add this to the maintainers doc; trying to build scikit-learn now.

@ogrisel
Member

ogrisel commented Jul 25, 2020

nested docker / qemu / docker for the win!

@ogrisel
Member

ogrisel commented Jul 25, 2020

I updated your command line to add -v`pwd`:/io, which exposes the local folder of the host under the /io folder on the guest, to make it easy to pass pickle files around.

@rth
Member

rth commented Jul 25, 2020

The issue with mounting the local folder is that files written from Docker will end up being owned by root. So we can do this in CI, but for local development it might be better to just re-clone a fresh copy of scikit-learn inside Docker.

@ogrisel
Member

ogrisel commented Jul 27, 2020

So we can do this in CI, but for local development it might be better to just re-clone a fresh copy of scikit-learn inside docker.

My point was to build a pickle file of the model on the host (amd64) and pass the pickle file via the shared folder to the arm64v8 guest inside the container to manually check that it can be loaded and works as expected.
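In miniature, that round trip looks like this (a sketch with a plain NumPy array standing in for the fitted model; in the real check, the load step would run inside the arm64v8 container via the shared /io folder):

```python
import pickle

import numpy as np

# Host side (x86_64): "train" something and pickle it, as if writing to /io.
model_like = np.arange(5, dtype="<f8")
blob = pickle.dumps(model_like)

# Guest side: load the pickle back. NumPy pickles record the byte order the
# array was saved with, which is exactly why a strict dtype comparison can
# fail when the pickle is loaded on a machine of the opposite endianness.
loaded = pickle.loads(blob)
```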

@qzhang90
Contributor Author

In particular, here I don't understand why we need to swap byte order for node_ndarray but not value_ndarray.

This is still an open question AFAIK.

@ogrisel @rth
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L656 compares node_ndarray.dtype with NODE_TYPE. When a model is built on a big-endian machine, its node_ndarray.dtype is:
[('left_child', '>i8'), ('right_child', '>i8'), ('feature', '>i8'), ('threshold', '>f8'), ('impurity', '>f8'), ('n_node_samples', '>i8'), ('weighted_n_node_samples', '>f8')],
while on a little-endian machine NODE_TYPE is:
[('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')],
so they don't match, which is what this PR fixes.

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L660 compares value_ndarray.dtype with np.float64, and regardless of whether the model is built on a big-endian or little-endian machine, value_ndarray.dtype is always np.float64, so there is no endian issue for value_ndarray.
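The mismatch is easy to reproduce with NumPy alone (a two-field sketch of the node dtype; the real one has the seven fields quoted above):

```python
import numpy as np

# Node layout as seen on a little-endian loading machine...
le = np.dtype([("left_child", "<i8"), ("threshold", "<f8")])
# ...and the same layout as pickled on a big-endian machine.
be = le.newbyteorder()

# Structured dtypes record per-field byte order, so an exact comparison
# fails even though the field layout is identical.
assert le != be

# Swapping the byte order makes them compare equal, which is the extra
# check this PR performs before raising an error.
assert be.newbyteorder() == le
```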

@qzhang90
Contributor Author

qzhang90 commented Aug 5, 2020

Hi @ogrisel @rth, thank you for making progress on this PR. Since all checks have passed, I am wondering what the next steps are and whether there is anything I can help with. Thanks!

@rth rth left a comment
Member

Thanks @qzhang90! Generally LGTM, but someone should try to reproduce before merging.

Please add an entry to the change log at doc/whats_new/v0.24.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

@qzhang90
Contributor Author

qzhang90 commented Aug 6, 2020

Hi @rth @ogrisel , added an entry to doc/whats_new/v0.24.rst. Thanks!

@qzhang90
Contributor Author

qzhang90 commented Aug 6, 2020

Thanks @qzhang90! Generally LGTM, but someone should try to reproduce before merging.

Please add an entry to the change log at doc/whats_new/v0.24.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

@rth, can you advise what is meant by "someone should try to reproduce"? Thanks!

@rth rth left a comment
Member

I tried to reproduce with qemu, but actually I realized that the architectures in https://github.com/multiarch/qemu-user-static are all little-endian (at least the ones I tried: arm64v8, ppc64le), same as x86_64, so that's not going to help.

Anyway aside from the below two comments this LGTM.

@qzhang90
Contributor Author

qzhang90 commented Aug 6, 2020

I tried to reproduce with qemu, but actually I realized that architectures in https://github.com/multiarch/qemu-user-static are all little endian (at least I tried arm64v8, ppc64le) same as x86_64, so that's not going to help.

Anyway aside from the below two comments this LGTM.

@rth s390x should be big endian

@rth rth left a comment
Member

s390x should be big endian

Thanks @qzhang90, indeed. TBH I don't want to re-build scikit-learn there again; it takes a while with qemu. So if you can confirm that this resolves the original issue for you, I'm OK with the proposed fix.

@qzhang90
Contributor Author

qzhang90 commented Aug 6, 2020

s390x should be big endian

Thanks @qzhang90, indeed. TBH I don't want to re-build scikit-learn there again, it takes a while with qemu. So if you can confirm that this resolves the original issue for you I'm OK with the proposed fix.

Just tested the latest version of this PR on my local s390x machine, and it worked well (i.e., it was able to load models built on both little-endian and big-endian machines), so I can confirm. Thanks @rth!

@qzhang90
Contributor Author

Hi @rth, just wondering, is there any planned or estimated release date for v0.24?

@rth
Member

rth commented Aug 11, 2020

We try to have a 6 month release schedule. v0.23 was released in May so v0.24 will likely be in October.

@ogrisel ogrisel left a comment
Member

I see no easy way to set up a big-endian build on our current CI infrastructure. However, I think @qzhang90's change adds very little additional code complexity and is unlikely to impact any use of scikit-learn that worked prior to this change (pickling and loading on LE architectures + OS).

So +1 for merging based on @qzhang90's manual check on his mainframe.

@adrinjalali adrinjalali left a comment
Member

A very tricky one, hopefully things don't break :D

@adrinjalali adrinjalali merged commit e217b68 into scikit-learn:master Aug 12, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
* fixed a cross-platform endian issue

* removed a duplicated dtype checking

* Trigger [arm64] CI

* added an entry to doc/whats_new/v0.24.rst

* moved my fix entry to sklearn.tree section

* Update sklearn/tree/_tree.pyx

Added comments

Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>

* code simplified

* fixed a typo in sklearn/tree/_tree.pyx

* Update doc/whats_new/v0.24.rst

Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>

* updated doc/whats_new/v0.24.rst

* Update sklearn/tree/_tree.pyx

Added a space after "if", to be PEP8 compliant.

Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>

* Update sklearn/tree/_tree.pyx

Co-authored-by: Qi Zhang <q.zhang@ibm.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>