OrdinalEncoder does not work with HistGradientBoostingClassifier when there are NULLs #25627
Comments
Since HistGradientBoosting can handle missing values natively, you can use OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan, encoded_missing_value=np.nan). As for why the negative encoding fails, see sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py, lines 1270 to 1271 at commit 4ff92e0.
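A minimal sketch of that suggested configuration (the pipeline wiring here is assumed, not taken from the thread):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Both unknown and missing categories are encoded as np.nan, which
# HistGradientBoosting treats as a missing value rather than as a
# (negative) category code.
encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=np.nan,
    encoded_missing_value=np.nan,
)
pipe = make_pipeline(encoder, HistGradientBoostingClassifier())
```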
After looking at the code, I think it's worth treating negative values as missing values and having it "just work". I opened #25629 to propose this patch.
Can confirm that the NULL method works, thanks. In fact np.nan is the default encoded_missing_value, so it should not even have come up. The categorical handling in HistGradientBoostingRegressor allows only 8 bits of storage, so it can hold at most 256 values (2^8) for max_bins. I have more than 256 levels and have pushed all the less important ones into an "Other" bucket, leaving me with 255 levels and one spare slot for the NULLs. Can you confirm that if I leave a value as NULL it is just treated as one more level, or does something more nuanced happen with NULLs in this case?

Additionally, would it be a reasonable feature request to make this part of OrdinalEncoder? A new integer parameter would be added so that the final number of encoded levels is no greater than the given integer. An additional benefit would be that handle_unknown could map unknown categories to this "other" value, since they are clearly rare. Or is there another pipeline object I could put in as an additional step to do this in a clean way?
HistGradientBoosting always reserves one of its bins for missing values, which is why max_bins cannot exceed 255.
Yeah, I think it's a reasonable feature request. While developing the infrequent categories feature for OneHotEncoder, I considered also having it in OrdinalEncoder.
The feature will likely be
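For reference, a sketch of the existing OneHotEncoder grouping API that the comment above refers to (the parameter values here are illustrative); the request would essentially mirror this in OrdinalEncoder:

```python
from sklearn.preprocessing import OneHotEncoder

# Levels seen fewer than min_frequency times are grouped into a single
# "infrequent" bucket, and max_categories caps the total number of
# output categories. Unknown categories at transform time can be
# routed into that same bucket.
enc = OneHotEncoder(
    min_frequency=10,
    max_categories=255,
    handle_unknown="infrequent_if_exist",
)
```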
I was just about to open this as a feature request lol. This is also potentially an issue for other estimators that want to use OrdinalEncoder, say if you want to use a random forest with lots of categoricals or something like that.
Describe the bug
If you use OrdinalEncoder when there are NULLs, you need to map them to a negative value, otherwise you get the following error:
The used value for unknown_value is one of the values already used for encoding the seen categories.
For example, this could be done as:
OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, encoded_missing_value=-1)
This later causes an issue when predict_proba() is run on the trained HistGradientBoostingClassifier.
OverflowError: can't convert negative value to npy_uint8
It seems odd to me that it can train but not predict.
The super hackish way I have been getting around this is to just add one to the encoded values. However, this means I need to do it outside the pipeline, which makes my code less clean.
train[categorical_features] = (train[categorical_features]+1).astype("category")
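If it helps, here is a sketch of one way to keep that +1 shift inside the pipeline instead of mutating the DataFrame, using FunctionTransformer (illustrative only; the np.nan encoding suggested in the comments avoids the shift entirely):

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder

# Shift the encoded values by +1 so the -1 used for unknown/missing
# becomes 0, keeping all codes non-negative for the uint8 bin indices.
pipe = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value",
                   unknown_value=-1, encoded_missing_value=-1),
    FunctionTransformer(lambda X: X + 1),
    HistGradientBoostingClassifier(),
)
```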
Steps/Code to Reproduce
# Modified from https://scikit-learn.org/stable/auto_examples/inspection/plot_partial_dependence.html
# This also happens in the classifier
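A minimal sketch of a reproducer with synthetic data (the original script, adapted from the partial-dependence example above, is not reproduced here):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic data: one categorical column with some NULLs injected.
X = pd.DataFrame({"cat": ["a", "b"] * 50})
X.iloc[::10, 0] = np.nan
y = np.array([0, 1] * 50)

pipe = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value",
                   unknown_value=-1, encoded_missing_value=-1),
    HistGradientBoostingClassifier(categorical_features=[0]),
)
pipe.fit(X, y)          # training succeeds, per the report
pipe.predict_proba(X)   # OverflowError: can't convert negative value to npy_uint8
```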
Expected Results
I would expect this to run and make the prediction.
Actual Results
OverflowError: can't convert negative value to npy_uint8
Versions