
[MRG+2] Refactor CategoricalEncoder into OneHotEncoder (with deprecated kwargs) and OrdinalEncoder #10523


Merged: 41 commits, Jun 21, 2018

Conversation

jorisvandenbossche (Member)

Possibly fixes #10521

This splits the CategoricalEncoder into two separate classes for one-hot and ordinal encoding, and then integrates the one-hot encoding into the existing OneHotEncoder.

I also moved them to a separate file, so for reviewing the actual changes it might be better to skip the first commit.

I have not yet introduced deprecation warnings for the old kwargs / attributes (or computed the new attributes in the old setting); instead, I infer from the data passed to fit whether it would have been accepted before (in which case we should raise a deprecation warning), and otherwise directly use the new behaviour. This 'inferring' of the behaviour can be overridden with a newly added encoded_input=True/False keyword.
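
To make that concrete, a rough usage sketch of the keyword described above (encoded_input is the name proposed in this PR; it does not appear in the released OneHotEncoder API, and the exact behaviour is illustrative, not verified against the branch):

    from sklearn.preprocessing import OneHotEncoder

    # New behaviour: raw categorical input, categories learned from the data.
    enc = OneHotEncoder(encoded_input=False)
    enc.fit([['cat', 'red'], ['dog', 'blue']])

    # Legacy-style behaviour: input is already integer-encoded per feature,
    # with values assumed to lie in range(n_categories).
    enc_legacy = OneHotEncoder(encoded_input=True)
    enc_legacy.fit([[0, 1], [1, 0]])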

The main drawback for new users of the OneHotEncoder is the 'pollution' of the docstring, the repr of the object, and tab completion with the old keywords and attributes.


@jorisvandenbossche (Member Author)

In the meantime, I added deprecation warnings for the old behaviour, keywords and attributes.
(still need to update deprecation messages and the docs)

@jorisvandenbossche jorisvandenbossche changed the title WIP: proof of concent of CategoricalEncoder refactor WIP: proof of concept of CategoricalEncoder refactor Jan 23, 2018
@jnothman (Member) left a comment

The difference between encoded_input=False/True, and what will happen to it in two versions' time, still needs to be clarified in the docs.

sklearn/base.py Outdated
@@ -225,12 +225,27 @@ def get_params(self, deep=True):
Parameter names mapped to their values.
"""
out = dict()
for key in self._get_param_names():
Member

This is getting a bit adventurous of you! Propose this separately. It's not exclusive to this change.

Member Author

It somehow is, in the sense that it is quite essential for this PR: otherwise, just showing the repr of the new OneHotEncoder raises deprecation warnings.

Has it happened before that keyword arguments were deprecated? How was that handled then?

I can certainly do it in a separate PR, but then that PR would be a blocker for this one IMO (which is not necessarily a problem, so fine for me to do that).

Member

I don't get this. Can you elaborate? If deprecated keyword arguments are used, they raise a DeprecationWarning during fit, right? Why would the repr do that?

Member Author

This is to ensure get_params() does not raise any deprecation warnings. See the bigger non-inline comment for an overview of the deprecation handling.
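
A minimal sketch of one way get_params() could stay silent while reading possibly deprecated parameters (this assumes a warnings.catch_warnings-based approach; the actual diff to sklearn/base.py is only partially shown above):

    import warnings

    out = dict()
    for key in self._get_param_names():
        # Reading a deprecated parameter may go through a property that warns;
        # suppress the warning so repr() and get_params() stay clean.
        with warnings.catch_warnings():
            warnings.simplefilter('ignore', DeprecationWarning)
            value = getattr(self, key, None)
        out[key] = value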

The categories of each feature determined during fitting
(in order corresponding with output of ``transform``).

Deprecated Attributes
Member

I'm sure Numpydoc doesn't handle this.

will be all zeros. In the inverse transform, an unknown category
will be denoted as None.

Deprecated Parameters
Member

I'm sure Numpydoc doesn't handle this. Just use .. deprecated:: 0.20, yeah?
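
For example, in numpydoc style the parameter entry could carry the directive directly (a sketch; the exact wording is not from the PR):

    n_values : 'auto', int or array of ints
        Number of values per feature.

        .. deprecated:: 0.20
            The ``n_values`` keyword is deprecated; use ``categories`` instead.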

is set to 'ignore' and an unknown category is encountered during
transform, the resulting one-hot encoded columns for this feature
will be all zeros. In the inverse transform, an unknown category
will be denoted as None.
Member

(Perhaps add a note that this can be used to handle missing values)


encoded_input=False : categorical features that still need to be
encoded.
encoded_input=True : already integer encoded data, and the categories
Member

"In the range 0...(n values - 1)"


The used categories can be found in the ``categories_`` attribute.

encoded_input : boolean
Member

I think ordinal_input is better than encoded_input, especially seeing as we introduce an OrdinalEncoder...

encoded_input : boolean
How to interpret the input data:

encoded_input=False : categorical features that still need to be
Member

Don't use : descr. The correct syntax for definition lists is:

encoded_input=False
    categorical features ...

I think the default should be 'auto', not None, with the default changing to False in two versions.

Do you think this option will be useful (for efficiency, for error handling, or for quality assurance) after the deprecation is finished? If not, should it just be called legacy_mode?

Does this setting affect the handling of n_values? Can one use encoded_input=True and categories together?

This seems to mostly affect the handling of errors, in that even if a column isn't active in input, that value is not considered unknown....?

OneHotEncoder(categories='auto', dtype=<... 'numpy.float64'>,
encoded_input=True, handle_unknown='error', sparse=True)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
Member

Didn't this used to only show active features? Shouldn't this have columns for [x0=0, x0=1, x1=0, x1=1, x1=2, x2=0, x2=1, x2=3]? Here it has 9 columns, not 8.


@jorisvandenbossche (Member Author)

Some quick answers

Do you think this option will be useful (for efficiency, for error handling, or for quality assurance) after the deprecation is finished? If not, should it just be called legacy_mode?

See my (long) comment on the issue asking about this: #10521
Short answer: I personally don't really know if it (after deprecation) is worth its own option. Would appreciate input there (as indeed the design would change a bit then)

Does this setting affect the handling of n_values? Can one use encoded_input=True and categories together?

Good question, this is one of the things I haven't really thought through yet. In principle, you can pass categories, but using the same format as for categorical input data does not make much sense (you would need to pass categories=[[0, 1, 2, 3, 4], [0, 1, 2]] for 5 and 3 categories, and we would need to check that each is a consecutive range).
But changing the way to specify categories depending on encoded_input=True (e.g. by giving the number of categories, like categories=[5, 3], similar to the current n_values) is also not the nicest API design (but maybe the better solution...).
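
For concreteness, the two formats under discussion would look something like this (both hypothetical; neither is settled in this PR):

    from sklearn.preprocessing import OneHotEncoder

    # Option 1: explicit category lists, same format as for categorical input
    # (5 categories for the first feature, 3 for the second); would require a
    # check that each list is a consecutive range starting at 0.
    OneHotEncoder(encoded_input=True, categories=[[0, 1, 2, 3, 4], [0, 1, 2]])

    # Option 2: only the number of categories per feature, mirroring n_values.
    OneHotEncoder(encoded_input=True, categories=[5, 3])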

I'm sure Numpydoc doesn't handle this.

Yes, I know, but I wanted it to be very explicit for now to see the consequences (for initial reviewing). I haven't yet fully updated the documentation either.

@jnothman (Member)

jnothman commented Jan 23, 2018 via email

@jorisvandenbossche (Member Author)

The last commit added the logic to deal with all the different cases (whether to raise a warning or not, whether to use legacy mode or not). I still have to clean up a lot, so don't review in detail, but you could already take a look at the _handle_deprecations method.

@jnothman (Member) left a comment

I've not looked at fit/transform


def _handle_deprecations(self, X):

if self.categories != 'auto':
Member

Let's make self.categories='legacy' by default, so that the user can select auto explicitly.

Member Author

Yesterday I made a commit to make the default categories=None for the same purpose, but I only pushed it now (the last commit).
I personally find that cleaner, as 'legacy' would not convey the correct meaning if you are using string data.

self.dtype = dtype
self.handle_unknown = handle_unknown

if n_values is not None:
Member

why don't you just use self.n_values = n_values? The warning is done there...

Member Author

why don't you just use self.n_values = n_values? The warning is done there...

Because I also want to deprecate access to / writing of the attribute n_values on the class object.

Member

Huh? But doing self.n_values = n_values here will call the setter and raise the warning, just as you have done.
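
A minimal sketch of the setter-based pattern being discussed here, with a simplified stand-in class (the PR's actual attribute handling is more involved):

    import warnings


    class Encoder:  # simplified stand-in, not scikit-learn's OneHotEncoder
        def __init__(self, n_values=None):
            self._deprecated_n_values = None
            if n_values is not None:
                # Assigning through the property below triggers the warning.
                self.n_values = n_values

        @property
        def n_values(self):
            warnings.warn("The 'n_values' attribute is deprecated.",
                          DeprecationWarning)
            return self._deprecated_n_values

        @n_values.setter
        def n_values(self, value):
            warnings.warn("The 'n_values' parameter is deprecated.",
                          DeprecationWarning)
            self._deprecated_n_values = value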


if self._legacy_mode:
# TODO not with _transform_selected ??
self._fit_transform_old(X)
Member

call it _legacy_fit_transform

``X[:, i]``. Each feature value should be
in ``range(n_values[i])``

categorical_features : "all" or array of indices or mask
Member

I'm not sure users will be happy that we are deprecating this! But I suppose that in the deprecation notice, we'll be able to point to ColumnTransformer?

Member Author

Regarding categorical_features: in principle I could keep this, as it is not related to the inherent behaviour of how the encoding works in OneHotEncoder (legacy or new behaviour), but it makes the implementation more complex, is not done in any other transformer, and can indeed be replaced with ColumnTransformer.
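
For reference, a sketch of the ColumnTransformer-based replacement for categorical_features (assuming the 0.20 ColumnTransformer API; the data and column indices are made up):

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # Columns 0 and 2 hold integer-coded categories, column 1 is numeric.
    X = np.array([[0, 1.5, 2],
                  [1, 2.0, 0]])

    # Roughly what OneHotEncoder(categorical_features=[0, 2]) used to do:
    # encode the selected columns and pass the remaining ones through.
    ct = ColumnTransformer([('onehot', OneHotEncoder(categories='auto'), [0, 2])],
                           remainder='passthrough')
    X_encoded = ct.fit_transform(X)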

self._legacy_mode = True

if self._deprecated_categorical_features != 'all':
self._legacy_mode = True
Member

This isn't safe to do if categories has been set

Member Author

Yes, I know, but I don't want to implement this ability for the new behaviour.
And either you were already using OneHotEncoder, in which case categories was not set and this is OK; or you already updated your usage (e.g. setting categories instead of n_values), in which case you can directly update for this deprecated keyword as well; or you are new to this class, in which case you shouldn't use it.

What I can do is detect that categories is set by the user (and not internally set), and in that case just raise a plain error here instead of a warning.

@jnothman (Member), Feb 3, 2018

Absolutely. Error if categorical_features is set and not legacy_mode (including if string input)
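
A sketch of that check, using the internal attribute names visible in the diff above (the exact message is hypothetical):

    if self._deprecated_categorical_features != 'all' and not self._legacy_mode:
        raise ValueError(
            "The 'categorical_features' keyword is deprecated and is only "
            "supported in legacy mode; use ColumnTransformer to select "
            "columns instead.")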


@jorisvandenbossche jorisvandenbossche changed the title WIP: proof of concept of CategoricalEncoder refactor WIP: Refactor CategoricalEncoder into OneHotEncoder (with deprecated kwargs) and OrdinalEncoder Feb 7, 2018
@amueller (Member)

Tests failing? Is this waiting for reviews?

@jorisvandenbossche (Member Author)

This is waiting for your opinion about the general idea in #10521 (I would say, let's keep the general discussion there for now; I would need to look back at this PR to know its actual status).

@jorisvandenbossche (Member Author)

OK, made some small updates:

  • some minor clean-up for PEP 8, doctests, and correct cross-references (hopefully Travis is happy now)
  • suppressed warnings in RandomTreesEmbedding, which uses a OneHotEncoder under the hood, by adding categories='auto' to that call (it uses integers, but the new behaviour should be identical in this case, as the tree leaf indices should be 0, 1, ..., n)
  • rebased on [MRG + 1] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034 #11042 (so that dtype is honoured) and added similar tests for the new OneHotEncoder (which already passed)

@jnothman (Member)

jnothman commented Jun 6, 2018 via email

@glemaitre glemaitre added this to the 0.20 milestone Jun 8, 2018
@jorisvandenbossche (Member Author)

@jnothman @amueller can we try to move this forward to include it in the release?

@jnothman (Member)

I'm happy with it as it is, IIRC.

@ogrisel (Member)

ogrisel commented Jun 13, 2018

I am also OK with deprecating categorical_features and delegating feature dispatching to ColumnTransformer. We can always revise that decision later.

I would also introduce a stub CategoricalEncoder class that raises a TypeError in its constructor with an explicit error message telling users to use OneHotEncoder.

Even if CategoricalEncoder was never part of a public scikit-learn release, I am afraid that it was already mentioned in several blog posts and even in a revised version of @ageron's book (we had the pleasure to chat with him today). In @ageron's book there was a warning that it was an unreleased experimental feature, but it is better to be nice to the users and introduce that stub to help them upgrade their code quickly.

@jnothman (Member)

This PR doesn't yet remove CategoricalEncoder.

I'm okay with Olivier's suggestion:

class CategoricalEncoder:
    "Removed"
    def __init__(self, *args, **kwargs):
        raise RuntimeError('CategoricalEncoder briefly existed in 0.19dev. '
                           'Its functionality has been rolled into '
                           'OneHotEncoder and OrdinalEncoder. This stub '
                           'will be removed in version 0.21.')

The only problem I see with it is that an ImportError would warn sooner.

@jorisvandenbossche (Member Author)

This PR doesn't yet remove CategoricalEncoder.

It does, but I will add the stub as proposed above.

The only problem I see with it is that an ImportError would warn sooner.

Yeah, I was also thinking about that. But I don't see a way to support both from sklearn.preprocessing import CategoricalEncoder and from sklearn import preprocessing; preprocessing.CategoricalEncoder().
For the first, I could add a CategoricalEncoder.py that raises an import error, but that doesn't work in the second case.

@jnothman (Member)

jnothman commented Jun 18, 2018 via email

@jorisvandenbossche (Member Author)

@amueller are you OK with merging this as is? (thus deprecating the categorical_features keyword)

@jnothman (Member)

jnothman commented Jun 19, 2018 via email

@ogrisel (Member) left a comment

LGTM.

@jnothman (Member) left a comment

Are we otherwise ready to give this a whirl?


class CategoricalEncoder:
"""
CategoricalEncoder briefly existed in 0.19dev. Its functionality
Member

sorry, should be 0.20dev

@jnothman jnothman changed the title [MRG+1] Refactor CategoricalEncoder into OneHotEncoder (with deprecated kwargs) and OrdinalEncoder [MRG+2] Refactor CategoricalEncoder into OneHotEncoder (with deprecated kwargs) and OrdinalEncoder Jun 21, 2018
@TomDLT (Member)

TomDLT commented Jun 21, 2018

Great work!

nitpick: You may also want to update examples/compose/column_transformer_mixed_types.py
(done in 73b7d07)

@jnothman jnothman merged commit 007aa71 into scikit-learn:master Jun 21, 2018
@jnothman (Member)

Great work, @jorisvandenbossche!

@GaelVaroquaux (Member)

GaelVaroquaux commented Jun 21, 2018 via email

@jorisvandenbossche jorisvandenbossche deleted the categorical-refactor branch June 21, 2018 14:21
@ageron (Contributor)

ageron commented Jun 22, 2018

Awesome! Thanks everyone for your work on this important change, and special thanks to @jnothman
and @ogrisel for adding the CategoricalEncoder stub, I really appreciate it. :)

@amueller (Member)

Yes, I'm OK with this for now ;) Thank you for all the work, this is great!


Successfully merging this pull request may close these issues: Rethinking the CategoricalEncoder API?