OrdinalEncoder does not work with HistGradientBoostingClassifier when there are NULLs #25627

Closed
KeithEdmondsMcK opened this issue Feb 16, 2023 · 5 comments · Fixed by #25629
KeithEdmondsMcK commented Feb 16, 2023

Describe the bug

If you use OrdinalEncoder when there are NULLs, you need to map them to a negative value; otherwise you get the following error:

The used value for unknown_value is one of the values already used for encoding the seen categories.

For example, this could be done as
OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, encoded_missing_value=-1)
This later causes an issue when predict_proba() is run on the trained HistGradientBoostingClassifier:

OverflowError: can't convert negative value to npy_uint8

It seems odd to me that it can train but not predict.

The super hackish way I have been getting around this is to add one to every value. However, this means I have to do it outside the pipeline, which makes my code less clean.
train[categorical_features] = (train[categorical_features]+1).astype("category")

Steps/Code to Reproduce

#Modified from https://scikit-learn.org/stable/auto_examples/inspection/plot_partial_dependence.html
#This also happens in the classifier

from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np

bikes = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True, parser="pandas")
# Make an explicit copy to avoid "SettingWithCopyWarning" from pandas
X, y = bikes.data.copy(), bikes.target

X["weather"].replace(to_replace="heavy_rain", value=np.nan, inplace=True)

mask_training = X["year"] == 0.0
X = X.drop(columns=["year"])
X_train, y_train = X[mask_training], y[mask_training]
X_test, y_test = X[~mask_training], y[~mask_training]

numerical_features = ["temp", "feel_temp", "humidity", "windspeed"]
categorical_features = X_train.columns.drop(numerical_features)

hgbdt_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, encoded_missing_value=-1), categorical_features),
        ("num", "passthrough", numerical_features),
    ],
    sparse_threshold=1,
    verbose_feature_names_out=False
).set_output(transform="pandas")

hgbdt_model = make_pipeline(
    hgbdt_preprocessor,
    HistGradientBoostingRegressor(
        categorical_features=categorical_features, random_state=0
    ),
)
hgbdt_model.fit(X_train, y_train)

hgbdt_model.predict(X_test)

Expected Results

I would expect this to run and make the prediction

Actual Results

Traceback (most recent call last):

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/spyder_kernels/py3compat.py", line 356, in compat_exec
    exec(code, globals, locals)

  File " ", line 62, in <module>
    hgbdt_model.predict(X_test)

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/pipeline.py", line 482, in predict
    return self.steps[-1][1].predict(Xt, **predict_params)

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1487, in predict
    return self._loss.link.inverse(self._raw_predict(X).ravel())

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1043, in _raw_predict
    self._predict_iterations(

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1054, in _predict_iterations
    ) = self._bin_mapper.make_known_categories_bitsets()

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/binning.py", line 314, in make_known_categories_bitsets
    set_bitset_memoryview(known_cat_bitsets[mapped_f_idx], raw_cat_val)

  File "sklearn/ensemble/_hist_gradient_boosting/_bitset.pyx", line 48, in sklearn.ensemble._hist_gradient_boosting._bitset.set_bitset_memoryview

OverflowError: can't convert negative value to npy_uint8

Versions

System:
    python: 3.10.8 (main, Nov  4 2022, 08:45:18) [Clang 12.0.0 ]
executable: /opt/anaconda3/envs/sklearn12/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.0
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.23.5
        scipy: 1.9.3
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.6.2
       joblib: 1.1.1
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /opt/anaconda3/envs/sklearn12/lib/libmkl_rt.1.dylib
         prefix: libmkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 10
threading_layer: intel

       filepath: /opt/anaconda3/envs/sklearn12/lib/libomp.dylib
         prefix: libomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 10
KeithEdmondsMcK added the Bug and Needs Triage labels on Feb 16, 2023
thomasjpfan commented Feb 16, 2023

Since HistGradientBoosting can handle nans directly, you can use nan as the encoded value:

OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan, encoded_missing_value=np.nan) 

As to why fit works but predict does not: we are not validating correctly in fit, which should have failed when there is a -1 category. The documentation states that negative values do not work:

For each categorical feature, there must be at most `max_bins` unique
categories, and each categorical value must be in [0, max_bins -1].

@thomasjpfan

After looking at the code, I think it's worth treating negative values as missing values and have it "just work". I opened #25629 to propose this patch.

@KeithEdmondsMcK

Can confirm that the NULL method works, thanks. In fact, it is the default, so this should not even have come up.

The categorical handling in HistGradientBoostingRegressor allows only 8 bits of storage, so max_bins can hold at most 256 values (2^8). I have more than 256 levels and have pushed all the less important ones into an "Other" category, leaving me with 255 levels plus a space for the NULLs. Can you confirm that a NULL is treated as just one more level, or does something more nuanced happen with NULLs in this case?

Additionally, would it be a reasonable "feature request" to have this become part of OrdinalEncoder? A new integer parameter would cap the final number of encoded levels at the given integer. An additional benefit would be that handle_unknown could map unknowns to this "Other" value, since they are clearly rare. Or is there another pipeline object I could add as a step to do this cleanly?

@thomasjpfan

I have more than 256 levels and have pushed all the less important ones to "Other" such that I have 255 and this leaves a space for the NULLs. Can you confirm that if I leave it as NULL it will be just treated as one more level or does something more nuanced happen with NULLs in this case?

HistGradientBoosting always reserves one of its bins for nan; this is why max_bins can be at most 255. If you are okay with treating actual nans and "Other" as the same category, then it is fine to map the "Other" category to nan.
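If conflating the two is acceptable, a small preprocessing helper could do the mapping (`collapse_rare_to_nan` is a hypothetical name I am introducing, not a scikit-learn function):

```python
import numpy as np
import pandas as pd

def collapse_rare_to_nan(s: pd.Series, min_count: int) -> pd.Series:
    """Map categories seen fewer than min_count times to NaN, so they
    share HistGradientBoosting's reserved missing-values bin."""
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    return s.where(~s.isin(rare), np.nan)

s = pd.Series(["a", "a", "a", "b", "c"])
out = collapse_rare_to_nan(s, min_count=2)
# "b" and "c" occur once each, so both become NaN
```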

Additionally, would a it be a reasonable "feature request" to have this become part of OrdinalEncoder?

Yes, I think it's a reasonable feature request. While developing the infrequent-categories feature for OneHotEncoder, I considered also adding it to OrdinalEncoder, so that rare categories are grouped together.

An additional benefit of this would be that the handle_unknown could be to put it to this "other" value since it is clearly rare.

The feature will likely be handle_unknown="infrequent_if_exist", with the following behavior:

  1. The training data has many categories and the rare ones are grouped into an "Other" category. In this case, an unknown category can be mapped to the "Other" category.
  2. The training data has few categories, so no grouping happens and the "Other" category does not exist. In this case, we still need unknown_value to map unknown values to.

amueller commented Feb 21, 2023

I was just about to open this as a feature request, lol. This is also potentially an issue for other estimators that want to use OrdinalEncoder, say a random forest with lots of categoricals.
I couldn't come up with the workaround you posted @thomasjpfan, and it only works if there are no NaNs in the training set, right? If NaNs are frequent in the training set, I think the natural way would be to treat them as a normal category.
