[MRG+2] Refactor CategoricalEncoder into OneHotEncoder (with deprecated kwargs) and OrdinalEncoder #10523
Conversation
In the meantime, I added deprecation warnings for the old behaviour, keywords and attributes.
Clarifying the difference between encoded_input=False/True, and what will happen to it in two versions' time, still needs elucidation in the docs.
sklearn/base.py (outdated)
    @@ -225,12 +225,27 @@ def get_params(self, deep=True):
            Parameter names mapped to their values.
            """
            out = dict()
            for key in self._get_param_names():
This is getting a bit adventurous of you! Propose this separately. It's not exclusive to this change.
It somehow is, in the sense that it is quite essential for this PR: otherwise just showing the repr of the new OneHotEncoder shows deprecation warnings.
Has it happened before that keyword arguments were deprecated? How was it dealt with then?
I can certainly do it in a separate PR, but then that PR would be a blocker for this one IMO (which is not necessarily a problem, so I'm fine with doing that).
I don't get this. Can you elaborate? If deprecated keyword arguments are used, they raise a DeprecationWarning during fit, right? Why would the repr do that?
This is to ensure get_params() does not raise any deprecation warnings. See the bigger non-inline comment with an overview of the deprecation handling.
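As an aside for readers, a minimal standalone sketch of the problem being discussed: a deprecated keyword exposed as a warning property, with get_params() silencing that warning so the repr stays clean. The class and attribute names are illustrative, not the PR's actual code.

    import warnings


    class EncoderSketch:
        def __init__(self, n_values=None):
            self._n_values = n_values

        @property
        def n_values(self):
            warnings.warn("'n_values' is deprecated; use 'categories' instead.",
                          DeprecationWarning)
            return self._n_values

        def get_params(self, deep=True):
            out = {}
            for key in ('n_values',):
                with warnings.catch_warnings():
                    # Silence the deprecation warning raised by the property so
                    # get_params()/repr() stay quiet. Note that catch_warnings
                    # is not thread-safe, a caveat raised later in this thread.
                    warnings.simplefilter('ignore', DeprecationWarning)
                    out[key] = getattr(self, key)
            return out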
sklearn/preprocessing/_encoders.py (outdated)
    The categories of each feature determined during fitting
    (in order corresponding with output of ``transform``).

    Deprecated Attributes
I'm sure Numpydoc doesn't handle this.
sklearn/preprocessing/_encoders.py (outdated)
    will be all zeros. In the inverse transform, an unknown category
    will be denoted as None.

    Deprecated Parameters
I'm sure Numpydoc doesn't handle this. Just use ``.. deprecated:: 0.20``, yeah?
sklearn/preprocessing/_encoders.py (outdated)
    is set to 'ignore' and an unknown category is encountered during
    transform, the resulting one-hot encoded columns for this feature
    will be all zeros. In the inverse transform, an unknown category
    will be denoted as None.
(Perhaps add a note that this can be used to handle missing values)
sklearn/preprocessing/_encoders.py (outdated)
    encoded_input=False : categorical features that still need to be
        encoded.
    encoded_input=True : already integer encoded data, and the categories
"In the range 0...(n values - 1)
"
sklearn/preprocessing/_encoders.py (outdated)
    The used categories can be found in the ``categories_`` attribute.

    encoded_input : boolean
I think ordinal_input is better than encoded_input, especially seeing as we introduce an OrdinalEncoder...
sklearn/preprocessing/_encoders.py (outdated)
    encoded_input : boolean
        How to interpret the input data:

        encoded_input=False : categorical features that still need to be
Don't use ``: descr``. The correct syntax for definition lists is:

    encoded_input=False
        categorical features ...

I think the default should be 'auto', not None, with the default changing to False in two versions.

Do you think this option will be useful (for efficiency, for error handling, or for quality assurance) after the deprecation is finished? If not, should it just be called legacy_mode?

Does this setting affect the handling of n_values? Can one use encoded_input=True and categories together?

This seems to mostly affect the handling of errors, in that even if a column isn't active in input, that value is not considered unknown...?
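To make the distinction under discussion concrete, a small hedged illustration of the two kinds of input the proposed encoded_input flag would distinguish (data and semantics as described in this thread, not the final API):

    import numpy as np

    # encoded_input=False: raw categorical values (possibly strings); the
    # encoder is expected to learn the categories itself.
    X_raw = np.array([['red', 'S'], ['blue', 'M'], ['red', 'L']], dtype=object)

    # encoded_input=True: values are already ordinal integers, each column
    # taking values in range(n_values[i]), as the legacy OneHotEncoder assumed.
    X_ordinal = np.array([[0, 0], [1, 1], [0, 2]])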
sklearn/preprocessing/_encoders.py
Outdated
OneHotEncoder(categories='auto', dtype=<... 'numpy.float64'>, | ||
encoded_input=True, handle_unknown='error', sparse=True) | ||
>>> enc.transform([[0, 1, 1]]).toarray() | ||
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]]) |
Didn't this used to only show active features? Shouldn't this have columns for [x0=0, x0=1, x1=0, x1=1, x1=2, x2=0, x2=1, x2=3]? Here it has 9 columns, not 8.
Some quick answers:
See my (long) comment on the issue asking about this: #10521
Good question, this is one of the things I didn't really think through yet. In principle, you can pass
Yes, I know, but I wanted it to be very explicit for now to see the consequences (for initial reviewing). I didn't yet fully update the documentation either.
I think we may have previously handled such warnings in get_params using a check_warnings context, IIRC, but it resulted in dangerous race conditions in parallel processing.

We usually don't bother raising warnings for constructor parameter access: if the user doesn't have that parameter set at fit, they are unlikely to try to access it otherwise in saved code. I know, it's a bit dodgy.

I personally think legacy_mode='auto' is the way to go, switching on the basis of the data whether to use only the new params/attributes, or the old. Most people don't care about the differences between the old and new model. It can either default to False in v0.22 then disappear in v0.23 or 0.24, or False can become the *only* behaviour in v0.22 before it disappears in v0.23 or 0.24.
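A rough sketch of the kind of data-based switch described here; the function name and heuristic are hypothetical, not the PR's actual logic:

    import numpy as np


    def infer_legacy_mode(X):
        # Hypothetical heuristic: only all-integer, non-negative data could
        # have been handled by the pre-0.20 OneHotEncoder, so only such input
        # is a candidate for the legacy behaviour; anything else (e.g. string
        # data) must use the new categories-based behaviour.
        X = np.asarray(X)
        return X.dtype.kind in 'iu' and bool((X >= 0).all())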
The last commit added the logic to deal with all the different cases (whether to raise a warning or not, whether to use legacy mode or not). I still have to clean up a lot, so don't review in detail, but you could already take a look at
I've not looked at fit/transform
sklearn/preprocessing/_encoders.py (outdated)
    def _handle_deprecations(self, X):

        if self.categories != 'auto':
Let's make self.categories='legacy' by default, so that the user can select auto explicitly.
Yesterday I made a commit to make the default categories=None for the same purpose, but only pushed it now (the last commit). I personally find that cleaner, as 'legacy' would not give the correct meaning if you are using string data.
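A sketch of the sentinel-based dispatch being debated here, whether the default is None or 'legacy'; the function and the helper it calls are illustrative only, not the PR's code:

    def resolve_legacy_mode(categories, X):
        """Illustrative only: decide whether the legacy behaviour applies.

        If the user explicitly set `categories`, the new behaviour applies; if
        it was left at the sentinel default (None in the latest commit, or
        'legacy' in the alternative proposal), fall back to a data-based guess.
        """
        if categories is not None:
            return False
        return infer_legacy_mode(X)  # hypothetical helper sketched earlier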
sklearn/preprocessing/_encoders.py (outdated)
    self.dtype = dtype
    self.handle_unknown = handle_unknown

    if n_values is not None:
Why don't you just use self.n_values = n_values? The warning is done there...
> Why don't you just use self.n_values = n_values? The warning is done there...

Because I also want to deprecate the access / writing of the attribute n_values on the class object.
Huh? But doing self.n_values = n_values here will call the setter and raise the warning, just as you have done.
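To make the mechanism in this exchange explicit, a sketch of a deprecated attribute implemented as a property whose getter and setter both warn, with an __init__ that only goes through the setter when the deprecated keyword is actually passed (mirroring the `if n_values is not None:` guard in the diff above); names are illustrative, not the PR's code:

    import warnings


    class DeprecatedAttributeSketch:
        def __init__(self, n_values=None):
            if n_values is not None:
                # goes through the property setter below, raising the warning
                # only when the deprecated keyword was actually passed
                self.n_values = n_values

        @property
        def n_values(self):
            warnings.warn("Reading 'n_values' is deprecated; use 'categories'.",
                          DeprecationWarning)
            return self._n_values

        @n_values.setter
        def n_values(self, value):
            warnings.warn("Setting 'n_values' is deprecated; use 'categories'.",
                          DeprecationWarning)
            self._n_values = value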
sklearn/preprocessing/_encoders.py (outdated)
    if self._legacy_mode:
        # TODO not with _transform_selected ??
        self._fit_transform_old(X)
call it _legacy_fit_transform
sklearn/preprocessing/_encoders.py (outdated)
    ``X[:, i]``. Each feature value should be
    in ``range(n_values[i])``

    categorical_features : "all" or array of indices or mask
I'm not sure users will be happy that we are deprecating this! But I suppose that in the deprecation notice, we'll be able to point to ColumnTransformer?
Regarding categorical_features: in principle I could keep this, as it is not related to the inherent behaviour of how the encoding works in OneHotEncoder (legacy or new behaviour), but it makes the implementation more complex, is not done in any other transformer, and can indeed be replaced with ColumnTransformer.
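For reference, a hedged sketch of the ColumnTransformer-based replacement mentioned here (the column indices are made up for illustration):

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # Roughly equivalent to OneHotEncoder(categorical_features=[0, 2]) in the
    # legacy API: one-hot encode columns 0 and 2, pass the other columns
    # through unchanged (column ordering of the output may differ).
    ct = ColumnTransformer(
        [('onehot', OneHotEncoder(), [0, 2])],
        remainder='passthrough')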
sklearn/preprocessing/_encoders.py (outdated)
    self._legacy_mode = True

    if self._deprecated_categorical_features != 'all':
        self._legacy_mode = True
This isn't safe to do if categories has been set.
Yes, I know, but I don't want to implement this ability for the new behaviour. Either you were already using OneHotEncoder, in which case categories was not set and this is OK; or you already updated your usage (e.g. setting categories instead of n_values), but then you can directly update this deprecated keyword as well; or you are new to this class, and then you shouldn't use it.
What I can do is detect that categories was set by the user (and not internally), and in that case raise a plain error here instead of a warning.
Absolutely. Error if categorical_features is set and not legacy_mode (including if string input)
Test failing? Is this waiting for reviews?
This is waiting for your opinion about the general idea in #10521 (I would say let's keep the general discussion there for now; I would need to look back at this PR to know its actual status).
OK, made some small updates:
Great work!
I'm happy with it as it is, IIRC.
I am also ok with deprecating … I would also introduce a stub … Even if …
This PR doesn't yet remove CategoricalEncoder. I'm okay with Olivier's suggestion:

    class CategoricalEncoder:
        "Removed"
        def __init__(self, *args, **kwargs):
            raise RuntimeError('CategoricalEncoder briefly existed in 0.19dev. '
                               'Its functionality has been rolled into '
                               'OneHotEncoder and OrdinalEncoder. This stub '
                               'will be removed in version 0.21.')

The only problem I see with it is that an ImportError would warn sooner.
It does, but I will add the stub as proposed above.
Yeah, I was also thinking about that. But I don't see a way to both support
Oh yeah... I always forget to load the diff when looking for things.
@amueller are you OK with going ahead and merging this as is? (thus, deprecating
I'm okay with that, yes.
LGTM.
Are we otherwise ready to give this a whirl?
sklearn/preprocessing/data.py (outdated)
    class CategoricalEncoder:
        """
        CategoricalEncoder briefly existed in 0.19dev. Its functionality
sorry, should be 0.20dev
Great work!
Great work, @jorisvandenbossche!
Yesssss! Congratulations, team.
…On Jun 21, 2018, at 12:14, Joel Nothman wrote:
Merged #10523.
Yes, I'm ok with this for now ;) Thank you for all the work, this is great!
Possibly fixes #10521
This splits the CategoricalEncoder into two separate classes for one-hot and ordinal encoding, and then integrates the one-hot encoding into the existing OneHotEncoder.
I also moved them to a separate file, so for reviewing the actual changes it might be better not to view the first commit.
I also did not yet introduce deprecation warnings for the old kwargs / attributes (or compute the new attributes in the old setting); instead, I infer from the data passed to fit whether it would have been accepted before (in which case we should raise a deprecation warning) and, if not, directly use the new behaviour. This 'inferring' of the behaviour can be overridden with a newly added encoded_input=True/False keyword. The main drawback for new users of the OneHotEncoder is the 'pollution' of the docstring, the repr of the object and tab completion with the old keywords and attributes.
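For completeness, a brief hedged illustration of how the two resulting classes are used, as described above (the data is made up, and early 0.20-era versions may additionally emit deprecation or future warnings):

    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    X = [['red', 'S'], ['blue', 'M'], ['red', 'L']]

    # One-hot encoding: one binary column per category of each feature.
    ohe = OneHotEncoder()
    print(ohe.fit_transform(X).toarray().shape)   # (3, 5): 2 colours + 3 sizes

    # Ordinal encoding: one integer-coded column per feature.
    oe = OrdinalEncoder()
    print(oe.fit_transform(X).shape)              # (3, 2)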