Fixed a cross-platform endian issue #17644


Merged
merged 16 commits into from
Aug 12, 2020

Conversation

qzhang90
Contributor

@qzhang90 qzhang90 commented Jun 19, 2020

This PR fixes a problem I encountered when loading a GradientBoostingClassifier/GradientBoostingRegressor model, trained on a little-endian machine, onto a big-endian machine. The error occurred at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L671

The reason was that the loaded node_ndarray was in little-endian byte order while the machine expected a big-endian one.

The fix in this PR is, before raising an error, to try swapping the byte order of the node_ndarray and check whether the swapped array satisfies the dtype check.
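The idea can be sketched with NumPy alone (a minimal sketch; the two-field dtype and the helper name are illustrative, not scikit-learn's actual code, whose node dtype has more fields):

```python
import numpy as np

# Illustrative stand-in for the tree's node dtype in sklearn/tree/_tree.pyx.
NODE_DTYPE = np.dtype([("left_child", "<i8"), ("threshold", "<f8")])

def coerce_byte_order(node_ndarray):
    """Accept an array whose only mismatch with NODE_DTYPE is byte order."""
    if node_ndarray.dtype == NODE_DTYPE:
        return node_ndarray
    # Same field layout in the opposite byte order: convert instead of erroring.
    if node_ndarray.dtype.newbyteorder() == NODE_DTYPE:
        return node_ndarray.astype(NODE_DTYPE)
    raise ValueError("Did not recognise loaded array layout")

# Simulate an array unpickled from a machine of the opposite endianness.
foreign = np.zeros(3, dtype=NODE_DTYPE.newbyteorder())
fixed = coerce_byte_order(foreign)
```

After the conversion, `fixed.dtype` compares equal to `NODE_DTYPE`, so the existing dtype check passes.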

@rth rth left a comment
Member

Thanks for this PR @qzhang90! It's a valid fix, but I have mixed feelings about it, as pickle is explicitly not portable across Python versions (and likely architectures). If we try to correct for it, I'm afraid it would require a lot of fixes across the code and still be unsatisfactory. The easiest and most reliable way is still to train on the same architecture as the deployment target. Unless you are deploying in a low-resource environment, in which case something like ONNX would likely be more appropriate for deployment.

In particular, here I don't understand why we need to swap byte order for node_ndarray but not value_ndarray.

Let's see what other reviewers think.

@amueller
Member

I agree, pickle is not meant to be used cross-platform.

@adrinjalali
Member

Thanks for your pull request, but I think the consensus is that pickle is not supposed to be used cross-platform, and fixing the issue would be better done at the pickle level (if ever). For serving purposes, please refer to ONNX, which should solve the multi-arch issue.

@rth
Member

rth commented Jul 21, 2020

I have changed my mind on this, sorry. Pickle won't work across versions, but it should work cross-platform, if it doesn't require much work on our side.

The answer of using ONNX has its own constraints (I imagine I can't encode very generic Python pre-processing with it).

Also I was recently surprised that pickle generally works between x86_64 and pyodide (32bit WebAssembly) and trying to use ONNX there would add more complexity. Actually the issue that this PR aims to solve was earlier reported in pyodide/pyodide#521.

We just need a more generic and backward compatible solution for big endian/little endian serialization in trees...

@rth rth reopened this Jul 21, 2020
@ogrisel
Member

ogrisel commented Jul 21, 2020

Thinking about it with @rth, pickle from the standard lib and joblib pickles are cross-platform compatible as long as you use the same version of Python and libraries, so maybe we should accept reviewing PRs to fix this issue.

I am not sure how many models would have similar problems in scikit-learn. Maybe NearestNeighbors models based on KD/Ball trees?

It can be very helpful to support training on a big Intel/AMD server and then deploying on low-power ARM machines. Mandating ONNX for this quite common use case might be a bit too restrictive, as ONNX does not necessarily cover everything in scikit-learn.

The only problem is that I am not sure how to properly test this.

@ogrisel
Member

ogrisel commented Jul 21, 2020

In particular, here I don't understand why we need to swap byte order for node_ndarray but not value_ndarray.

This is still an open question AFAIK.

@ogrisel
Member

ogrisel commented Jul 21, 2020

ARM64 is now supported in Azure Pipelines, that might help:

https://docs.microsoft.com/en-us/azure/devops/release-notes/2020/sprint-171-update#additional-agent-platform-arm64

@thomasjpfan
Member

If we are going to add cross-platform support, can we do it incrementally? As suggested by @ogrisel, we can test for the most common use case: training on x86 Linux and predicting on ARM Linux.

ARM64 is now supported in Azure Pipelines, that might help:

Time to see if this can help with building and testing on arm!

@rth
Member

rth commented Jul 21, 2020

The only problem is that I am not sure how to properly test this.

we can test for the most common use case: training on x86 linux and predicting on arm linux.

Right, but even with an ARM64 CI, testing this would be difficult, I think? Maybe having an ARM64 CI (unrelated to this issue) and manually checking that the proposed fix works as expected would already be a good start.

@ogrisel
Member

ogrisel commented Jul 25, 2020

manually checking that the proposed fix works as expected would already be a good start.

If someone knows the command-line instructions to manually test this on a guest ARM VM using QEMU on a Linux x86_64 host machine, I would be very interested :)

@rth
Member

rth commented Jul 25, 2020

If someone knows the command-line instructions to manually test this on an guest ARM VM using qemu on a Linux x86_64 host machine,

So I think the way to go is via https://github.com/multiarch/qemu-user-static:

docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

docker run -v`pwd`:/io --rm -it arm64v8/ubuntu /bin/bash
$ uname -a
Linux bacac753e50f 5.4.0-39-generic #43-Ubuntu SMP Fri Jun 19 10:28:31 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux

I'm not fully sure how it works, but it feels like magic :) That's what conda-forge is using. A number of other architectures (e.g. ppc64) are also supported. We should probably add this to the maintainers doc; trying to build scikit-learn now.

@ogrisel
Member

ogrisel commented Jul 25, 2020

nested docker / qemu / docker for the win!

@ogrisel
Member

ogrisel commented Jul 25, 2020

I updated your command line to add -v`pwd`:/io, which exposes the local folder of the host under the /io folder on the guest, to make it easy to pass pickle files around.

@rth
Member

rth commented Jul 25, 2020

The issue with mounting the local folder is that files written from Docker will end up being owned by root. So we can do this in CI, but for local development it might be better to just re-clone a fresh copy of scikit-learn inside Docker.

@ogrisel
Member

ogrisel commented Jul 27, 2020

So we can do this in CI, but for local development it might be better to just re-clone a fresh copy of scikit-learn inside docker.

My point was to build a pickle file of the model on the host (amd64) and pass the pickle file via the shared folder to the arm64v8 guest inside the container to manually check that it can be loaded and works as expected.
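In miniature, that round trip looks like this (a sketch with a plain NumPy array standing in for the fitted model; in the real check, the load step would run inside the arm64v8 container via the shared /io folder):

```python
import pickle

import numpy as np

# Host side (x86_64): "train" something and pickle it, as if writing to /io.
model_like = np.arange(5, dtype="<f8")
blob = pickle.dumps(model_like)

# Guest side: load the pickle back. NumPy pickles record the byte order the
# array was saved with, which is exactly why a strict dtype comparison can
# fail when the pickle is loaded on a machine of the opposite endianness.
loaded = pickle.loads(blob)
```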

@qzhang90
Contributor Author

In particular, here I don't understand why we need to swap byte order for node_ndarray but not value_ndarray.

This is still an open question AFAIK.

@ogrisel @rth
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L656 compares node_ndarray.dtype with NODE_TYPE. When a model is built on a big-endian machine, its node_ndarray.dtype is:
[('left_child', '>i8'), ('right_child', '>i8'), ('feature', '>i8'), ('threshold', '>f8'), ('impurity', '>f8'), ('n_node_samples', '>i8'), ('weighted_n_node_samples', '>f8')],
while on a little-endian machine NODE_TYPE is:
[('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')],
so they don't match, which is what this PR fixes.

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L660 compares value_ndarray.dtype with np.float64, and regardless of whether the model is built on a big-endian or little-endian machine, value_ndarray.dtype is always np.float64, so there is no endian issue for value_ndarray.
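The mismatch is easy to reproduce with NumPy alone (a two-field sketch of the node dtype; the real one has the seven fields quoted above):

```python
import numpy as np

# Node layout as seen on a little-endian loading machine...
le = np.dtype([("left_child", "<i8"), ("threshold", "<f8")])
# ...and the same layout as pickled on a big-endian machine.
be = le.newbyteorder()

# Structured dtypes record per-field byte order, so an exact comparison
# fails even though the field layout is identical.
assert le != be

# Swapping the byte order makes them compare equal, which is the extra
# check this PR performs before raising an error.
assert be.newbyteorder() == le
```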

@qzhang90
Contributor Author

qzhang90 commented Aug 5, 2020

Hi @ogrisel @rth, thank you for making progress on this PR. Since all checks have passed, I am wondering what the next steps are and whether there is anything I can help with. Thanks!

@rth rth left a comment
Member

Thanks @qzhang90! Generally LGTM, but someone should try to reproduce before merging.

Please add an entry to the change log at doc/whats_new/v0.24.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

@qzhang90
Contributor Author

qzhang90 commented Aug 6, 2020

Hi @rth @ogrisel , added an entry to doc/whats_new/v0.24.rst. Thanks!

@qzhang90
Contributor Author

qzhang90 commented Aug 6, 2020

Thanks @qzhang90! Generally LGTM, but someone should try to reproduce before merging.

Please add an entry to the change log at doc/whats_new/v0.24.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

@rth, can you advise what is meant by "someone should try to reproduce"? Thanks!

@rth rth left a comment
Member

I tried to reproduce with qemu, but actually I realized that the architectures in https://github.com/multiarch/qemu-user-static are all little-endian (at least the ones I tried: arm64v8, ppc64le), same as x86_64, so that's not going to help.

Anyway aside from the below two comments this LGTM.

@qzhang90
Contributor Author

qzhang90 commented Aug 6, 2020

I tried to reproduce with qemu, but actually I realized that architectures in https://github.com/multiarch/qemu-user-static are all little endian (at least I tried arm64v8, ppc64le) same as x86_64, so that's not going to help.

Anyway aside from the below two comments this LGTM.

@rth s390x should be big endian

@rth rth left a comment
Member

s390x should be big endian

Thanks @qzhang90, indeed. TBH I don't want to re-build scikit-learn there again; it takes a while with qemu. So if you can confirm that this resolves the original issue for you, I'm OK with the proposed fix.

@qzhang90
Contributor Author

qzhang90 commented Aug 6, 2020

s390x should be big endian

Thanks @qzhang90, indeed. TBH I don't want to re-build scikit-learn there again, it takes a while with qemu. So if you can confirm that this resolves the original issue for you I'm OK with the proposed fix.

Just tested the latest version of this PR on my local s390x machine, and it worked well (i.e., it was able to load models built on both little-endian and big-endian machines), so I can confirm. Thanks @rth!

@qzhang90
Contributor Author

Hi @rth, just wondering, is there any planned or estimated release date for v0.24?

@rth
Member

rth commented Aug 11, 2020

We try to have a 6 month release schedule. v0.23 was released in May so v0.24 will likely be in October.

@ogrisel ogrisel left a comment
Member

I see no easy way to set up a big-endian build on our current CI infrastructure. However, I think @qzhang90's change adds very little additional code complexity and is unlikely to impact any use of scikit-learn that worked prior to this change (pickling and loading on LE architectures + OS).

So +1 for merging based on @qzhang90's manual check on his mainframe.

@adrinjalali adrinjalali left a comment
Member

A very tricky one, hopefully things don't break :D

@adrinjalali adrinjalali merged commit e217b68 into scikit-learn:master Aug 12, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
* fixed a cross-platform endian issue

* removed a duplicated dtype checking

* Trigger [arm64] CI

* added an entry to doc/whats_new/v0.24.rst

* moved my fix entry to sklearn.tree section

* Update sklearn/tree/_tree.pyx

Added comments

Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>

* code simplified

* fixed a typo in sklearn/tree/_tree.pyx

* Update doc/whats_new/v0.24.rst

Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>

* updated doc/whats_new/v0.24.rst

* Update sklearn/tree/_tree.pyx

Added a space after "if", to be PEP8 compliant.

Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>

* Update sklearn/tree/_tree.pyx

Co-authored-by: Qi Zhang <q.zhang@ibm.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>