Description
Describe the bug
If you use the OrdinalEncoder when there are nulls, you need to encode them as a negative value; otherwise you get the following error:
The used value for unknown_value is one of the values already used for encoding the seen categories.
For example, this could be done as:
OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, encoded_missing_value=-1)
This later causes an issue when predict_proba() is run on the trained HistGradientBoostingClassifier.
OverflowError: can't convert negative value to npy_uint8
It seems odd to me that the model can train but not predict.
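As a rough illustration of the error message only: the traceback ends in a conversion to npy_uint8, an unsigned 8-bit type. The sketch below just checks the representable range with NumPy; it does not reproduce scikit-learn's internal code path, and the variable names are made up for the example.

```python
import numpy as np

# Range of the unsigned 8-bit integer type named in the error message.
uint8_range = np.iinfo(np.uint8)

# The code used for missing/unknown categories in the encoder above.
encoded_missing_value = -1

# A negative code falls outside the range an unsigned type can represent.
fits_in_uint8 = uint8_range.min <= encoded_missing_value <= uint8_range.max
print(uint8_range.min, uint8_range.max, fits_in_uint8)
```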
The super hackish way I have been getting around this is to just add one to the encoded values. However, this means I need to do it outside of the pipeline, which makes my code less clean.
train[categorical_features] = (train[categorical_features]+1).astype("category")
Steps/Code to Reproduce
# Modified from https://scikit-learn.org/stable/auto_examples/inspection/plot_partial_dependence.html
# The same error also happens with HistGradientBoostingClassifier
from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from time import time
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np
bikes = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True, parser="pandas")
# Make an explicit copy to avoid "SettingWithCopyWarning" from pandas
X, y = bikes.data.copy(), bikes.target
X["weather"].replace(to_replace="heavy_rain", value=np.nan, inplace=True)
mask_training = X["year"] == 0.0
X = X.drop(columns=["year"])
X_train, y_train = X[mask_training], y[mask_training]
X_test, y_test = X[~mask_training], y[~mask_training]
numerical_features = ["temp", "feel_temp", "humidity", "windspeed"]
categorical_features = X_train.columns.drop(numerical_features)
hgbdt_preprocessor = ColumnTransformer(
    transformers=[
        (
            "cat",
            OrdinalEncoder(
                handle_unknown="use_encoded_value",
                unknown_value=-1,
                encoded_missing_value=-1,
            ),
            categorical_features,
        ),
        ("num", "passthrough", numerical_features),
    ],
    sparse_threshold=1,
    verbose_feature_names_out=False,
).set_output(transform="pandas")
hgbdt_model = make_pipeline(
    hgbdt_preprocessor,
    HistGradientBoostingRegressor(
        categorical_features=categorical_features, random_state=0
    ),
)
hgbdt_model.fit(X_train, y_train)
hgbdt_model.predict(X_test)
Expected Results
I would expect this to run and return the predictions.
Actual Results
Traceback (most recent call last):
  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/spyder_kernels/py3compat.py", line 356, in compat_exec
    exec(code, globals, locals)
  File " ", line 62, in <module>
    hgbdt_model.predict(X_test)
  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/pipeline.py", line 482, in predict
    return self.steps[-1][1].predict(Xt, **predict_params)
  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1487, in predict
    return self._loss.link.inverse(self._raw_predict(X).ravel())
  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1043, in _raw_predict
    self._predict_iterations(
  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1054, in _predict_iterations
    ) = self._bin_mapper.make_known_categories_bitsets()
  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/binning.py", line 314, in make_known_categories_bitsets
    set_bitset_memoryview(known_cat_bitsets[mapped_f_idx], raw_cat_val)
  File "sklearn/ensemble/_hist_gradient_boosting/_bitset.pyx", line 48, in sklearn.ensemble._hist_gradient_boosting._bitset.set_bitset_memoryview
OverflowError: can't convert negative value to npy_uint8
Versions
System:
    python: 3.10.8 (main, Nov 4 2022, 08:45:18) [Clang 12.0.0 ]
    executable: /opt/anaconda3/envs/sklearn12/bin/python
    machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
    sklearn: 1.2.0
    pip: 22.3.1
    setuptools: 65.6.3
    numpy: 1.23.5
    scipy: 1.9.3
    Cython: None
    pandas: 1.5.3
    matplotlib: 3.6.2
    joblib: 1.1.1
    threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
    filepath: /opt/anaconda3/envs/sklearn12/lib/libmkl_rt.1.dylib
    prefix: libmkl_rt
    user_api: blas
    internal_api: mkl
    version: 2021.4-Product
    num_threads: 10
    threading_layer: intel

    filepath: /opt/anaconda3/envs/sklearn12/lib/libomp.dylib
    prefix: libomp
    user_api: openmp
    internal_api: openmp
    version: None
    num_threads: 10