OrdinalEncoder does not work with HistGradientBoostingClassifier when there are NULLs #25627

Closed
KeithEdmondsMcK opened this issue Feb 16, 2023 · 5 comments · Fixed by #25629
KeithEdmondsMcK commented Feb 16, 2023

Describe the bug

If you use OrdinalEncoder when there are NULLs, you need to map them to a negative value; otherwise you get the following error:

The used value for unknown_value is one of the values already used for encoding the seen categories.

For example, this could be done as
OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, encoded_missing_value=-1)
This later causes an issue when predict_proba() is run on the trained HistGradientBoostingClassifier:

OverflowError: can't convert negative value to npy_uint8

It seems odd to me that it can train but not predict.

The super hackish way I have been getting around this is to add one to every value. However, this means I have to do it outside the pipeline, which makes my code less clean.
train[categorical_features] = (train[categorical_features]+1).astype("category")

Steps/Code to Reproduce

#Modified from https://scikit-learn.org/stable/auto_examples/inspection/plot_partial_dependence.html
#This also happens in the classifier

from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np

bikes = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True, parser="pandas")
# Make an explicit copy to avoid "SettingWithCopyWarning" from pandas
X, y = bikes.data.copy(), bikes.target

X["weather"].replace(to_replace="heavy_rain", value=np.nan, inplace=True)

mask_training = X["year"] == 0.0
X = X.drop(columns=["year"])
X_train, y_train = X[mask_training], y[mask_training]
X_test, y_test = X[~mask_training], y[~mask_training]

numerical_features = ["temp", "feel_temp", "humidity", "windspeed"]
categorical_features = X_train.columns.drop(numerical_features)

hgbdt_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, encoded_missing_value=-1), categorical_features),
        ("num", "passthrough", numerical_features),
    ],
    sparse_threshold=1,
    verbose_feature_names_out=False
).set_output(transform="pandas")

hgbdt_model = make_pipeline(
    hgbdt_preprocessor,
    HistGradientBoostingRegressor(
        categorical_features=categorical_features, random_state=0
    ),
)
hgbdt_model.fit(X_train, y_train)

hgbdt_model.predict(X_test)

Expected Results

I would expect this to run and make the prediction

Actual Results

Traceback (most recent call last):

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/spyder_kernels/py3compat.py", line 356, in compat_exec
    exec(code, globals, locals)

  File " ", line 62, in <module>
    hgbdt_model.predict(X_test)

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/pipeline.py", line 482, in predict
    return self.steps[-1][1].predict(Xt, **predict_params)

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1487, in predict
    return self._loss.link.inverse(self._raw_predict(X).ravel())

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1043, in _raw_predict
    self._predict_iterations(

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1054, in _predict_iterations
    ) = self._bin_mapper.make_known_categories_bitsets()

  File "/opt/anaconda3/envs/sklearn12/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/binning.py", line 314, in make_known_categories_bitsets
    set_bitset_memoryview(known_cat_bitsets[mapped_f_idx], raw_cat_val)

  File "sklearn/ensemble/_hist_gradient_boosting/_bitset.pyx", line 48, in sklearn.ensemble._hist_gradient_boosting._bitset.set_bitset_memoryview

OverflowError: can't convert negative value to npy_uint8

Versions

System:
    python: 3.10.8 (main, Nov  4 2022, 08:45:18) [Clang 12.0.0 ]
executable: /opt/anaconda3/envs/sklearn12/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.0
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.23.5
        scipy: 1.9.3
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.6.2
       joblib: 1.1.1
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /opt/anaconda3/envs/sklearn12/lib/libmkl_rt.1.dylib
         prefix: libmkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 10
threading_layer: intel

       filepath: /opt/anaconda3/envs/sklearn12/lib/libomp.dylib
         prefix: libomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 10
KeithEdmondsMcK added the Bug and Needs Triage labels on Feb 16, 2023
thomasjpfan commented Feb 16, 2023

Since HistGradientBoosting can handle nans directly, you can use nan as the encoded value:

OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan, encoded_missing_value=np.nan) 

As to why fit works but predict does not: we are not validating correctly in fit, which should have failed when there is a -1 category. The documentation states that negative values do not work:

For each categorical feature, there must be at most `max_bins` unique
categories, and each categorical value must be in [0, max_bins -1].

@thomasjpfan

After looking at the code, I think it's worth treating negative values as missing values and have it "just work". I opened #25629 to propose this patch.

@KeithEdmondsMcK

Can confirm that the NULL method works, thanks. In fact, it is the default, so this should not even have come up.

The categorical handling in HistGradientBoostingRegressor allows only 8 bits of storage, so max_bins can hold at most 256 values (2^8). I have more than 256 levels and have pushed all the less important ones into an "Other" category, leaving me with 255 levels plus a space for the NULLs. Can you confirm that a NULL is treated as just one more level, or does something more nuanced happen with NULLs in this case?

Additionally, would it be a reasonable "feature request" to have this become part of OrdinalEncoder? A new integer parameter would cap the final number of encoded levels at the given integer. An additional benefit would be that handle_unknown could map unknowns to this "Other" value, since they are clearly rare. Or is there another pipeline object I could add as a step to do this cleanly?

@thomasjpfan

I have more than 256 levels and have pushed all the less important ones to "Other" such that I have 255 and this leaves a space for the NULLs. Can you confirm that if I leave it as NULL it will be just treated as one more level or does something more nuanced happen with NULLs in this case?

HistGradientBoosting always reserves one of its bins for nan; this is why max_bins can be at most 255. If you are okay with treating actual nans and "Other" as the same category, then it is fine to map the "Other" category to nan.
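If conflating the two is acceptable, a small preprocessing helper could do the mapping (`collapse_rare_to_nan` is a hypothetical name I am introducing, not a scikit-learn function):

```python
import numpy as np
import pandas as pd

def collapse_rare_to_nan(s: pd.Series, min_count: int) -> pd.Series:
    """Map categories seen fewer than min_count times to NaN, so they
    share HistGradientBoosting's reserved missing-values bin."""
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    return s.where(~s.isin(rare), np.nan)

s = pd.Series(["a", "a", "a", "b", "c"])
out = collapse_rare_to_nan(s, min_count=2)
# "b" and "c" occur once each, so both become NaN
```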

Additionally, would a it be a reasonable "feature request" to have this become part of OrdinalEncoder?

Yes, I think it's a reasonable feature request. While developing the infrequent-categories feature for OneHotEncoder, I considered also adding it to OrdinalEncoder, so that rare categories are grouped together.

An additional benefit of this would be that the handle_unknown could be to put it to this "other" value since it is clearly rare.

The feature will likely be handle_unknown="infrequent_if_exist", with the following behavior:

  1. The training data has many categories and the rare ones are grouped into an "Other" category. In this case, an unknown category can be mapped to the "Other" category.
  2. The training data has few categories, so no grouping happens and the "Other" category does not exist. In this case, we still need unknown_value to map unknown values to.

amueller commented Feb 21, 2023

I was just about to open this as a feature request, lol. This is also potentially an issue for other estimators that want to use OrdinalEncoder, say a random forest with lots of categoricals.
I couldn't come up with the workaround you posted @thomasjpfan, and it only works if there are no NaNs in the training set, right? If NaNs are frequent in the training set, I think the natural way would be to treat them as a normal category.
