[MRG] Support unknown_value=np.nan in OrdinalEncoder #18406
Thank you for working on this!
LGTM
LGTM
Thanks @NicolasHug !
Excellent work. Just to clarify: Will the new options allow to both
If yes, this will be super good news for fitting boosted trees!
This PR supports 2, but 1 is still not supported. An error is raised when NaNs are present in the training data: it's unclear where to map them, as the output of OrdinalEncoder is supposed to be interpreted as ordered quantities.
HistGradientBoostingClassifier and the corresponding regressor natively support both missing values (as NaNs) and categorical data now :) https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting
Yes, with 1 and 2 being inverted.
@NicolasHug: Thanks for clarifying. From a practical perspective, it is not desirable for remaining NaNs to raise an error. If the downstream model cannot natively deal with NaNs, we can simply add an imputer after the encoder and voilà.
@mayer79 Couldn't you just run the encoder on the non-NaN values only to get the desired behavior?
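A minimal sketch of the "imputer after the encoder" idea, assuming the unknown_value=np.nan option from this PR is available. The toy data and the fill value of -1 are illustrative assumptions, not something proposed in this thread:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): "bird" never appears during fit.
X_train = np.array([["cat"], ["dog"], ["cat"]], dtype=object)
X_test = np.array([["dog"], ["bird"]], dtype=object)

pipe = make_pipeline(
    # Unknown categories are encoded as NaN at transform time.
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
    # The imputer then replaces those NaNs with a constant (assumed: -1).
    SimpleImputer(strategy="constant", fill_value=-1),
)
pipe.fit(X_train)
X_test_out = pipe.transform(X_test)
```

Here "dog" keeps its learned code, and the unseen "bird" ends up as the constant fill value rather than NaN.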
This PR adds support for unknown_value=np.nan in OrdinalEncoder. (The parameter was introduced in #17406 by @FelixWick.)
CC @thomasjpfan @ogrisel
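For illustration, a short sketch of how the option might be used once merged. The toy data is an assumption; handle_unknown="use_encoded_value" is the companion setting that unknown_value requires:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): "bird" is not seen during fit.
X_train = np.array([["cat"], ["dog"], ["cat"]], dtype=object)
X_test = np.array([["dog"], ["bird"]], dtype=object)

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
enc.fit(X_train)

# Known categories get their ordinal code; unknown ones become NaN.
# The default output dtype is float64, which can hold NaN.
X_test_enc = enc.transform(X_test)
```

With categories sorted as ["cat", "dog"], "dog" is encoded as 1.0 and the unseen "bird" as NaN.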