OrdinalEncoder does not work with HistGradientBoostingClassifier when there are NULLs #25627

Closed
@KeithEdmondsMcK

Describe the bug

If you use OrdinalEncoder on data that contains NULLs, you need to map them to a negative value; otherwise you get the following error:

The used value for unknown_value is one of the values already used for encoding the seen categories.

For example, this can be done with
OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, encoded_missing_value=-1)
However, this later causes an error when predict_proba() is run on the trained HistGradientBoostingClassifier:

OverflowError: can't convert negative value to npy_uint8

It seems odd to me that the model can be trained but cannot predict.

The hackish workaround I have been using is to add one to the encoded values. However, this has to be done outside the pipeline, which makes my code less clean.
train[categorical_features] = (train[categorical_features]+1).astype("category")

Steps/Code to Reproduce

# Modified from https://scikit-learn.org/stable/auto_examples/inspection/plot_partial_dependence.html
# The same error also occurs with HistGradientBoostingClassifier

from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from time import time
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np

bikes = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True, parser="pandas")
# Make an explicit copy to avoid "SettingWithCopyWarning" from pandas
X, y = bikes.data.copy(), bikes.target

# Turn one category into missing values (assignment avoids chained inplace modification)
X["weather"] = X["weather"].replace(to_replace="heavy_rain", value=np.nan)

mask_training = X["year"] == 0.0
X = X.drop(columns=["year"])
X_train, y_train = X[mask_training], y[mask_training]
X_test, y_test = X[~mask_training], y[~mask_training]

numerical_features = ["temp", "feel_temp", "humidity", "windspeed"]
categorical_features = X_train.columns.drop(numerical_features)

hgbdt_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, encoded_missing_value=-1), categorical_features),
        ("num", "passthrough", numerical_features),
    ],
    sparse_threshold=1,
    verbose_feature_names_out=False
).set_output(transform="pandas")

hgbdt_model = make_pipeline(
    hgbdt_preprocessor,
    HistGradientBoostingRegressor(
        categorical_features=categorical_features, random_state=0
    ),
)
hgbdt_model.fit(X_train, y_train)

hgbdt_model.predict(X_test)

Expected Results

I would expect this to run and return predictions.

Actual Results

Traceback (most recent call last):

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/spyder_kernels/py3compat.py", line 356, in compat_exec
    exec(code, globals, locals)

  File " ", line 62, in <module>
    hgbdt_model.predict(X_test)

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/pipeline.py", line 482, in predict
    return self.steps[-1][1].predict(Xt, **predict_params)

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1487, in predict
    return self._loss.link.inverse(self._raw_predict(X).ravel())

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1043, in _raw_predict
    self._predict_iterations(

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1054, in _predict_iterations
    ) = self._bin_mapper.make_known_categories_bitsets()

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/binning.py", line 314, in make_known_categories_bitsets
    set_bitset_memoryview(known_cat_bitsets[mapped_f_idx], raw_cat_val)

  File "sklearn/ensemble/_hist_gradient_boosting/_bitset.pyx", line 48, in sklearn.ensemble._hist_gradient_boosting._bitset.set_bitset_memoryview

OverflowError: can't convert negative value to npy_uint8

Versions

System:
    python: 3.10.8 (main, Nov  4 2022, 08:45:18) [Clang 12.0.0 ]
executable: /opt/anaconda3/envs/sklearn12/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.0
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.23.5
        scipy: 1.9.3
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.6.2
       joblib: 1.1.1
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /opt/anaconda3/envs/sklearn12/lib/libmkl_rt.1.dylib
         prefix: libmkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 10
threading_layer: intel

       filepath: /opt/anaconda3/envs/sklearn12/lib/libomp.dylib
         prefix: libomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 10
