### Describe the bug

HistGradientBoosting models use `np.intp` to represent the `feature_idx` in `TreePredictor` nodes (`sklearn/ensemble/_hist_gradient_boosting/common.pyx`, lines 19 to 36 at commit 0f8a777).
This seems to cause issues when a pickled HistGradientBoosting model trained in a 64-bit environment is loaded in a 32-bit environment (such as Pyodide, which is where I encountered this issue).

I know that for a while the other tree models in sklearn had a similar problem, but I am not 100% sure what the solution was. Would changing the type to `np.uint32` be an acceptable solution here?
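The root of the mismatch is that `np.intp` follows the platform's pointer width, so a structured dtype built from it is not portable across 32/64-bit builds, while a fixed-width type like `np.uint32` is. A minimal sketch of the difference (the field name mirrors the node dtype; the recast mirrors the workaround below):

```python
import numpy as np

# np.intp matches the platform's pointer width: 8 bytes on a 64-bit
# interpreter, 4 bytes on a 32-bit build such as Pyodide's.
print(np.dtype(np.intp).itemsize)  # 8 on 64-bit, 4 on 32-bit

# A structured dtype containing np.intp therefore has a different memory
# layout on each platform, which is what the pickle carries across.
nodes = np.zeros(1, dtype=[("feature_idx", np.intp)])

# A fixed-width dtype has the same layout everywhere, which is why
# recasting the pickled nodes works around the buffer mismatch.
nodes_fixed = nodes.astype([("feature_idx", np.uint32)])
print(nodes_fixed.dtype.itemsize)  # always 4
```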
### Steps/Code to Reproduce

- Train a model in Python on a 64-bit system
- Pickle the output
- Load that pickle in a 32-bit Python environment such as Pyodide
- Attempt to run prediction with the loaded model

See this repo for a full example: https://github.com/stuartlynn/hist_gradient_boost_bug
### Expected Results

The Pyodide code runs and gives the expected output.

### Actual Results

Running the above gives the following error message when trying to execute the Pyodide code:
```
PythonError: Traceback (most recent call last):
  File "/lib/python311.zip/_pyodide/_base.py", line 571, in eval_code_async
    await CodeRunner(
  File "/lib/python311.zip/_pyodide/_base.py", line 394, in run_async
    coroutine = eval(self.code, globals, locals)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<exec>", line 61, in <module>
  File "/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", l
    return self._loss.link.inverse(self._raw_predict(X).ravel())
                                   ^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", l
    self._predict_iterations(
  File "/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", l
    raw_predictions[:, k] += predict(X)
                             ^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/predictor.py", line 71,
    _predict_from_raw_data(
  File "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx", line 18, in sklearn.ensemble._hist_gr
ValueError: Buffer dtype mismatch, expected 'intp_t' but got 'long long' in 'const node_struct.feature_
    at new_error (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules/.pnpm/pyodide@0.24.1/node_mod
    at wasm://wasm/02250ad6:wasm-function[295]:0x158827
    at wasm://wasm/02250ad6:wasm-function[452]:0x15fcd5
    at _PyCFunctionWithKeywords_TrampolineCall (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules
    at wasm://wasm/02250ad6:wasm-function[1057]:0x1a3091
    at wasm://wasm/02250ad6:wasm-function[3387]:0x289e4d
    at wasm://wasm/02250ad6:wasm-function[2037]:0x1e3f77
    at wasm://wasm/02250ad6:wasm-function[1064]:0x1a3579
    at wasm://wasm/02250ad6:wasm-function[1067]:0x1a383a
    at wasm://wasm/02250ad6:wasm-function[1068]:0x1a38dc
    at wasm://wasm/02250ad6:wasm-function[3200]:0x2685c5
    at wasm://wasm/02250ad6:wasm-function[3201]:0x26e3d0
    at wasm://wasm/02250ad6:wasm-function[1070]:0x1a3a04
    at wasm://wasm/02250ad6:wasm-function[1065]:0x1a3694
    at wasm://wasm/02250ad6:wasm-function[440]:0x15f45e
    at Module.callPyObjectKwargs (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules/.pnpm/pyodide@0.24.1/node_modules/pyodide/pyodide.asm.js:9:81732)
    at Module.callPyObject (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules/.pnpm/pyodide@0.24.1/node_modules/pyodide/pyodide.asm.js:9:82066)
    at Timeout.wrapper [as _onTimeout] (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules/.pnpm/pyodide@0.24.1/node_modules/pyodide/pyodide.asm.js:9:58562)
    at listOnTimeout (node:internal/timers:569:17)
    at process.processTimers (node:internal/timers:512:7) {
  type: 'ValueError',
  __error_address: 116329376
}
```
### Things I have already checked

- All versions of the libraries used are the same in both environments
- Tried with both pickle and joblib
### Hacky fix

What I found to work is the following: in Pyodide, after loading the model, if we manually recast the dtype of each predictor's nodes, the model runs fine. There is an example of this in the example repo.
```python
import joblib
import numpy as np

Y_DTYPE = np.float64
X_DTYPE = np.float64
X_BINNED_DTYPE = np.uint8  # hence max_bins == 256
# dtype for gradients and hessians arrays
G_H_DTYPE = np.float32
X_BITSET_INNER_DTYPE = np.uint32

# Same record dtype as sklearn's PREDICTOR_RECORD_DTYPE, but with
# feature_idx as a fixed-width np.int32 instead of np.intp.
PREDICTOR_RECORD_DTYPE_2 = np.dtype([
    ('value', Y_DTYPE),
    ('count', np.uint32),
    ('feature_idx', np.int32),
    ('num_threshold', X_DTYPE),
    ('missing_go_to_left', np.uint8),
    ('left', np.uint32),
    ('right', np.uint32),
    ('gain', Y_DTYPE),
    ('depth', np.uint32),
    ('is_leaf', np.uint8),
    ('bin_threshold', X_BINNED_DTYPE),
    ('is_categorical', np.uint8),
    # The index of the corresponding bitsets in the Predictor's bitset arrays.
    # Only used if is_categorical is True
    ('bitset_idx', np.uint32),
])

model = joblib.load("/model.joblib")
for i, _ in enumerate(model._predictors):
    model._predictors[i][0].nodes = model._predictors[i][0].nodes.astype(PREDICTOR_RECORD_DTYPE_2)
model.predict(data)  # `data` is the prediction input from the example repo
```
### Versions
```
System:
       python: 3.11.3 (main, May 15 2023, 10:43:03) [Clang 14.0.6 ]
   executable: /Users/slynn/miniconda3/envs/demoland/bin/python
      machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.3.1
          pip: 23.3
   setuptools: 68.0.0
        numpy: 1.25.2
        scipy: 1.11.3
       Cython: None
       pandas: 1.5.3
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 10
         prefix: libomp
       filepath: /Users/slynn/miniconda3/envs/demoland/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /Users/slynn/miniconda3/envs/demoland/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Nehalem

       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /Users/slynn/miniconda3/envs/demoland/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: Nehalem
```