Skip to content

[MRG] Support unknown_value=np.nan in OrdinalEncoder #18406

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Sep 23, 2020

Conversation

NicolasHug
Copy link
Member

This PR adds support for unknown_value=np.nan in OrdinalEncoder.

(Parameter was introduced in #17406 by @FelixWick)

CC @thomasjpfan @ogrisel

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this!

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NicolasHug !

@rth rth merged commit 4aada4e into scikit-learn:master Sep 23, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
@mayer79
Copy link
Contributor

mayer79 commented Dec 4, 2020

Excellent work. Just to clarify: Will the new options allow to both

  1. keep np.nan as they are and
  2. map new unknown categories to np.nan?

If yes, this will be super good news for fitting boosted trees!

@NicolasHug
Copy link
Member Author

this PR supports 2 but 1 is still not supported. An error is raised when nans are present in the training data: it's unclear where to map them, as the output of OrdinalEncoder is supposed to be interpreted as ordered quantities.

@NicolasHug
Copy link
Member Author

If yes, this will be super good news for fitting boosted trees

HistGradientBoostingClassifier and the correspoding regressor natively support both missing values (as nans) and categorical data now :)

https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting

@mfeurer
Copy link
Contributor

mfeurer commented Dec 4, 2020

It's great to see all the progress in handling categorical data with native scikit-learn components!

Is what @mayer79 asked in 1. what's proposed in #17123 ?

@NicolasHug
Copy link
Member Author

yes, with 1 and 2 being inverted

@mayer79
Copy link
Contributor

mayer79 commented Dec 4, 2020

@NicolasHug : Thx for clarifying. From a practical perspective, it is not desirable that remaining nans would raise an error. If my subsequent model algorithm cannot natively deal with nans, we can simply add an imputer after the encoder and voila.

@FelixWick
Copy link
Contributor

@mayer79 Couldn’t you just run the encoder for not nans only to get the desired behavior?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants