
[MRG+2] Refactor CategoricalEncoder into OneHotEncoder (with deprecated kwargs) and OrdinalEncoder #10523


Merged: 41 commits, Jun 21, 2018

Conversation

jorisvandenbossche (Member)

Possibly fixes #10521

This splits the CategoricalEncoder into two separate classes for one-hot and ordinal encoding, and then integrates the one-hot encoding into the existing OneHotEncoder.

I also moved them to a separate file, so for reviewing the actual changes it might be better to skip the first commit.

I have not yet introduced deprecation warnings for the old kwargs / attributes (or computed the new attributes in the old setting); instead, I infer from the data passed to fit whether it would have been accepted before (in which case we should raise a deprecation warning), and otherwise directly use the new behaviour. This 'inferring' of the behaviour can be overridden with a newly added encoded_input=True/False keyword.
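
To make that concrete, a rough usage sketch of the keyword described above (encoded_input is the name proposed in this PR; it does not appear in the released OneHotEncoder API, and the exact behaviour is illustrative, not verified against the branch):

    from sklearn.preprocessing import OneHotEncoder

    # New behaviour: raw categorical input, categories learned from the data.
    enc = OneHotEncoder(encoded_input=False)
    enc.fit([['cat', 'red'], ['dog', 'blue']])

    # Legacy-style behaviour: input is already integer-encoded per feature,
    # with values assumed to lie in range(n_categories).
    enc_legacy = OneHotEncoder(encoded_input=True)
    enc_legacy.fit([[0, 1], [1, 0]])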

The main drawback for new users of the OneHotEncoder is the 'pollution' of the docstring, the repr of the object, and tab completion with the old keywords and attributes.


@jorisvandenbossche (Member Author)

In the meantime, I added deprecation warnings for the old behaviour, keywords and attributes.
(still need to update deprecation messages and the docs)

@jorisvandenbossche jorisvandenbossche changed the title WIP: proof of concent of CategoricalEncoder refactor WIP: proof of concept of CategoricalEncoder refactor Jan 23, 2018
@jnothman (Member) left a comment

The difference between encoded_input=False/True, and what will happen to it in two versions' time, still needs to be clarified in the docs.

sklearn/base.py Outdated
@@ -225,12 +225,27 @@ def get_params(self, deep=True):
Parameter names mapped to their values.
"""
out = dict()
for key in self._get_param_names():
Member

This is getting a bit adventurous of you! Propose this separately. It's not exclusive to this change.

Member Author

It somehow is, in the sense that it is quite essential for this PR: otherwise, just showing the repr of the new OneHotEncoder raises deprecation warnings.

Has it happened before that keyword arguments were deprecated? How was that handled then?

I can certainly do it in a separate PR, but then that PR would be a blocker for this one IMO (which is not necessarily a problem, so fine for me to do that).

Member

I don't get this. Can you elaborate? If deprecated keyword arguments are used, they raise a DeprecationWarning during fit, right? Why would the repr do that?

Member Author

This is to ensure get_params() does not raise any deprecation warnings. See the bigger non-inline comment for an overview of the deprecation handling.
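
A minimal sketch of one way get_params() could stay silent while reading possibly deprecated parameters (this assumes a warnings.catch_warnings-based approach; the actual diff to sklearn/base.py is only partially shown above):

    import warnings

    out = dict()
    for key in self._get_param_names():
        # Reading a deprecated parameter may go through a property that warns;
        # suppress the warning so repr() and get_params() stay clean.
        with warnings.catch_warnings():
            warnings.simplefilter('ignore', DeprecationWarning)
            value = getattr(self, key, None)
        out[key] = value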

The categories of each feature determined during fitting
(in order corresponding with output of ``transform``).

Deprecated Attributes
Member

I'm sure Numpydoc doesn't handle this.

will be all zeros. In the inverse transform, an unknown category
will be denoted as None.

Deprecated Parameters
Member

I'm sure Numpydoc doesn't handle this. Just use .. deprecated:: 0.20, yeah?
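
For example, in numpydoc style the parameter entry could carry the directive directly (a sketch; the exact wording is not from the PR):

    n_values : 'auto', int or array of ints
        Number of values per feature.

        .. deprecated:: 0.20
            The ``n_values`` keyword is deprecated; use ``categories`` instead.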

is set to 'ignore' and an unknown category is encountered during
transform, the resulting one-hot encoded columns for this feature
will be all zeros. In the inverse transform, an unknown category
will be denoted as None.
Member

(Perhaps add a note that this can be used to handle missing values)


encoded_input=False : categorical features that still need to be
encoded.
encoded_input=True : already integer encoded data, and the categories
Member

"In the range 0...(n values - 1)"


The used categories can be found in the ``categories_`` attribute.

encoded_input : boolean
Member

I think ordinal_input is better than encoded_input, especially seeing as we introduce an OrdinalEncoder...

encoded_input : boolean
How to interpret the input data:

encoded_input=False : categorical features that still need to be
Member

Don't use : descr. The correct syntax for definition lists is:

encoded_input=False
    categorical features ...

I think the default should be 'auto', not None, with the default changing to False in two versions.

Do you think this option will be useful (for efficiency, for error handling, or for quality assurance) after the deprecation is finished? If not, should it just be called legacy_mode?

Does this setting affect the handling of n_values? Can one use encoded_input=True and categories together?

This seems to mostly affect the handling of errors, in that even if a column isn't active in input, that value is not considered unknown....?

OneHotEncoder(categories='auto', dtype=<... 'numpy.float64'>,
encoded_input=True, handle_unknown='error', sparse=True)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
Member

Didn't this used to only show active features? Shouldn't this have columns for [x0=0, x0=1, x1=0, x1=1, x1=2, x2=0, x2=1, x2=3]? Here it has 9 columns, not 8.


@jorisvandenbossche (Member Author)

Some quick answers

Do you think this option will be useful (for efficiency, for error handling, or for quality assurance) after the deprecation is finished? If not, should it just be called legacy_mode?

See my (long) comment on the issue asking about this: #10521
Short answer: I personally don't really know if it (after deprecation) is worth its own option. Would appreciate input there (as indeed the design would change a bit then)

Does this setting affect the handling of n_values? Can one use encoded_input=True and categories together?

Good question, this is one of the things I haven't really thought through yet. In principle, you can pass categories, but using the same format as for categorical input data does not make much sense (you would need to pass categories=[[0, 1, 2, 3, 4], [0, 1, 2]] for 5 and 3 categories, and we would need to check that each is a consecutive range).
But changing the way to specify categories depending on encoded_input=True (e.g. by giving the number of categories, like categories=[5, 3], similar to the current n_values) is also not the nicest API design (but maybe the better solution...).
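
For concreteness, the two formats under discussion would look something like this (both hypothetical; neither is settled in this PR):

    from sklearn.preprocessing import OneHotEncoder

    # Option 1: explicit category lists, same format as for categorical input
    # (5 categories for the first feature, 3 for the second); would require a
    # check that each list is a consecutive range starting at 0.
    OneHotEncoder(encoded_input=True, categories=[[0, 1, 2, 3, 4], [0, 1, 2]])

    # Option 2: only the number of categories per feature, mirroring n_values.
    OneHotEncoder(encoded_input=True, categories=[5, 3])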

I'm sure Numpydoc doesn't handle this.

Yes, I know, but I wanted it to be very explicit for now to see the consequences (for initial reviewing). I haven't yet fully updated the documentation either.

@jnothman (Member)

jnothman commented Jan 23, 2018 via email

@jorisvandenbossche (Member Author)

The last commit added the logic to deal with all the different cases (whether to raise a warning or not, whether to use legacy mode or not). I still have to clean up a lot, so don't review in detail, but you could already take a look at the _handle_deprecations method.

@jnothman (Member) left a comment

I've not looked at fit/transform


def _handle_deprecations(self, X):

if self.categories != 'auto':
Member

Let's make self.categories='legacy' by default, so that the user can select auto explicitly.

Member Author

Yesterday I made a commit to make the default categories=None for the same purpose, but I only pushed it now (the last commit).
I personally find that cleaner, as 'legacy' would not convey the correct meaning if you are using string data.

self.dtype = dtype
self.handle_unknown = handle_unknown

if n_values is not None:
Member

why don't you just use self.n_values = n_values? The warning is done there...

Member Author

why don't you just use self.n_values = n_values? The warning is done there...

Because I also want to deprecate access to / writing of the attribute n_values on the class object.

Member

Huh? But doing self.n_values = n_values here will call the setter and raise the warning, just as you have done.
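
A minimal sketch of the setter-based pattern being discussed here, with a simplified stand-in class (the PR's actual attribute handling is more involved):

    import warnings


    class Encoder:  # simplified stand-in, not scikit-learn's OneHotEncoder
        def __init__(self, n_values=None):
            self._deprecated_n_values = None
            if n_values is not None:
                # Assigning through the property below triggers the warning.
                self.n_values = n_values

        @property
        def n_values(self):
            warnings.warn("The 'n_values' attribute is deprecated.",
                          DeprecationWarning)
            return self._deprecated_n_values

        @n_values.setter
        def n_values(self, value):
            warnings.warn("The 'n_values' parameter is deprecated.",
                          DeprecationWarning)
            self._deprecated_n_values = value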


if self._legacy_mode:
# TODO not with _transform_selected ??
self._fit_transform_old(X)
Member

call it _legacy_fit_transform

``X[:, i]``. Each feature value should be
in ``range(n_values[i])``

categorical_features : "all" or array of indices or mask
Member

I'm not sure users will be happy that we are deprecating this! But I suppose that in the deprecation notice, we'll be able to point to ColumnTransformer?

Member Author

Regarding categorical_features: in principle I could keep this, as it is not related to the inherent behaviour of how the encoding works in OneHotEncoder (legacy or new behaviour), but it makes the implementation more complex, is not done in any other transformer, and can indeed be replaced with ColumnTransformer.
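
For reference, a sketch of the ColumnTransformer-based replacement for categorical_features (assuming the 0.20 ColumnTransformer API; the data and column indices are made up):

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # Columns 0 and 2 hold integer-coded categories, column 1 is numeric.
    X = np.array([[0, 1.5, 2],
                  [1, 2.0, 0]])

    # Roughly what OneHotEncoder(categorical_features=[0, 2]) used to do:
    # encode the selected columns and pass the remaining ones through.
    ct = ColumnTransformer([('onehot', OneHotEncoder(categories='auto'), [0, 2])],
                           remainder='passthrough')
    X_encoded = ct.fit_transform(X)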

self._legacy_mode = True

if self._deprecated_categorical_features != 'all':
self._legacy_mode = True
Member

This isn't safe to do if categories has been set

Member Author

Yes, I know, but I don't want to implement this ability for the new behaviour.
And either you were already using OneHotEncoder, in which case categories was not set and this is OK; or you already updated your usage (e.g. setting categories instead of n_values), in which case you can directly update for this deprecated keyword as well; or you are new to this class, in which case you shouldn't use it.

What I can do is detect that categories is set by the user (and not internally set), and in that case just raise a plain error here instead of a warning.

@jnothman (Member), Feb 3, 2018

Absolutely. Error if categorical_features is set and not legacy_mode (including if string input)
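
A sketch of that check, using the internal attribute names visible in the diff above (the exact message is hypothetical):

    if self._deprecated_categorical_features != 'all' and not self._legacy_mode:
        raise ValueError(
            "The 'categorical_features' keyword is deprecated and is only "
            "supported in legacy mode; use ColumnTransformer to select "
            "columns instead.")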


@jorisvandenbossche jorisvandenbossche changed the title WIP: proof of concept of CategoricalEncoder refactor WIP: Refactor CategoricalEncoder into OneHotEncoder (with deprecated kwargs) and OrdinalEncoder Feb 7, 2018
@amueller (Member)

Tests failing? Is this waiting for reviews?

@jorisvandenbossche (Member Author)

This is waiting for your opinion about the general idea in #10521 (I would say, let's keep the general discussion there for now; I would need to look back at this PR to know its actual status).

@jorisvandenbossche (Member Author)

OK, made some small updates:

  • some minor clean-up for PEP 8, doctests, and correct cross-references (hopefully Travis is happy now)
  • suppressed warnings in RandomTreesEmbedding, which uses a OneHotEncoder under the hood, by adding categories='auto' to that call (it uses integers, but the new behaviour should be identical in this case, as the tree leaf indices should be 0, 1, ..., n)
  • rebased on [MRG + 1] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034 #11042 (so that dtype is honoured) and added similar tests for the new OneHotEncoder (which already passed)

@jnothman (Member)

jnothman commented Jun 6, 2018 via email

@glemaitre glemaitre added this to the 0.20 milestone Jun 8, 2018
@jorisvandenbossche (Member Author)

@jnothman @amueller can we try to move this forward to include it in the release?

@jnothman (Member)

I'm happy with it as it is, IIRC.

@ogrisel (Member)

ogrisel commented Jun 13, 2018

I am also OK with deprecating categorical_features and delegating feature dispatching to ColumnTransformer. We can always revise that decision later.

I would also introduce a stub CategoricalEncoder class that raises a TypeError in its constructor with an explicit error message telling users to use OneHotEncoder.

Even if CategoricalEncoder was never part of a public scikit-learn release, I am afraid that it was already mentioned in several blog posts and even in a revised version of @ageron's book (we had the pleasure to chat with him today). In @ageron's book there was a warning that it was an unreleased experimental feature, but it is better to be nice to the users and introduce that stub to help them upgrade their code quickly.

@jnothman (Member)

This PR doesn't yet remove CategoricalEncoder.

I'm okay with Olivier's suggestion:

class CategoricalEncoder:
    "Removed"
    def __init__(self, *args, **kwargs):
        raise RuntimeError('CategoricalEncoder briefly existed in 0.19dev. '
                           'Its functionality has been rolled into '
                           'OneHotEncoder and OrdinalEncoder. This stub '
                           'will be removed in version 0.21.')

The only problem I see with it is that an ImportError would warn sooner.

@jorisvandenbossche (Member Author)

This PR doesn't yet remove CategoricalEncoder.

It does, but I will add the stub as proposed above.

The only problem I see with it is that an ImportError would warn sooner.

Yeah, I was also thinking about that. But I don't see a way to support both from sklearn.preprocessing import CategoricalEncoder and from sklearn import preprocessing; preprocessing.CategoricalEncoder().
For the first, I could add a CategoricalEncoder.py that raises an import error, but that doesn't work in the second case.

@jnothman (Member)

jnothman commented Jun 18, 2018 via email

@jorisvandenbossche (Member Author)

@amueller are you OK with merging this as is? (thus deprecating the categorical_features keyword)

@jnothman (Member)

jnothman commented Jun 19, 2018 via email

@ogrisel (Member) left a comment

LGTM.

@jnothman (Member) left a comment

Are we otherwise ready to give this a whirl?


class CategoricalEncoder:
"""
CategoricalEncoder briefly existed in 0.19dev. Its functionality
Member

sorry, should be 0.20dev

@jnothman jnothman changed the title [MRG+1] Refactor CategoricalEncoder into OneHotEncoder (with deprecated kwargs) and OrdinalEncoder [MRG+2] Refactor CategoricalEncoder into OneHotEncoder (with deprecated kwargs) and OrdinalEncoder Jun 21, 2018
@TomDLT (Member)

TomDLT commented Jun 21, 2018

Great work!

nitpick: You may also want to update examples/compose/column_transformer_mixed_types.py
(done in 73b7d07)

@jnothman jnothman merged commit 007aa71 into scikit-learn:master Jun 21, 2018
@jnothman (Member)

Great work, @jorisvandenbossche!

@GaelVaroquaux (Member)

GaelVaroquaux commented Jun 21, 2018 via email

@jorisvandenbossche jorisvandenbossche deleted the categorical-refactor branch June 21, 2018 14:21
@ageron (Contributor)

ageron commented Jun 22, 2018

Awesome! Thanks everyone for your work on this important change, and special thanks to @jnothman
and @ogrisel for adding the CategoricalEncoder stub, I really appreciate it. :)

@amueller (Member)

Yes, I'm OK with this for now ;) Thank you for all the work, this is great!


Successfully merging this pull request may close these issues: Rethinking the CategoricalEncoder API?