[MRG] ENH Adds warning with pandas category does not match lexicon ordering #15050

thomasjpfan · 2019-09-21T16:23:38Z

Reference Issues/PRs

Resolves #14953

What does this implement/fix? Explain your changes.

Gets the dtypes in _fit and checks only when it has "categories".
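A rough sketch of the kind of check described here, not the actual diff: the helper name `check_pandas_categories`, its signature, and the warning text are assumptions for illustration. It only inspects columns whose dtype exposes `categories`, and compares that order with the sorted unique values.

```python
import warnings

import numpy as np
import pandas as pd


def check_pandas_categories(cats, dtype, feature_name):
    """Hypothetical helper: warn when a Categorical dtype disagrees with `cats`.

    Returns True if a warning was issued, False otherwise.
    """
    # Only pandas Categorical dtypes expose a `categories` attribute,
    # so plain numeric or object columns are skipped.
    if not hasattr(dtype, "categories"):
        return False
    lexicon_order = np.asarray(cats).tolist()
    dtype_order = list(dtype.categories)
    if lexicon_order == dtype_order:
        return False
    warnings.warn(
        f"Feature {feature_name!r}: lexicon order {lexicon_order} differs "
        f"from dtype order {dtype_order}.",
        UserWarning,
    )
    return True


df = pd.DataFrame({
    "int_col": pd.Series([1, 2, 3, 1, 1],
                         dtype=pd.CategoricalDtype(categories=[3, 1, 2])),
    "plain_col": [1, 2, 3, 1, 1],
})

# `dtypes` only exists for dataframes, hence the getattr fallback.
X_dtypes = getattr(df, "dtypes", None)
with warnings.catch_warnings(record=True):
    warnings.simplefilter("always")
    results = {
        name: check_pandas_categories(np.unique(np.asarray(df[name])), dtype, name)
        for name, dtype in X_dtypes.items()
    }
```

Here only `int_col` triggers the warning, because `plain_col` has no Categorical dtype to compare against.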

Any other comments?

The fun part begins when we try to get the encoders to respect pandas categories (when they are numerical) :)

thomasjpfan · 2019-10-23T17:59:35Z

@jnothman @amueller This would be good in the next release if we want to support pd.Categorical in the encoders in 0.24.

jnothman

Otherwise lgtm

jnothman · 2019-10-23T22:33:53Z

sklearn/preprocessing/_encoders.py

+        """Checks cats is lexicographic consistent with categories in fitted X.
+        """
+        msg = ("'auto' categories is used, but the Categorical dtype provided "
+               "is not consistent with the automatic lexicographic ordering")


What should the user do?

Is it worth listing or diffing the category list?

What should the user do?

Right now...nothing good. The user would need to reorder their categorical dtype into lexicographic order to avoid this warning, which I think is bad. This should most likely be a FutureWarning suggesting that in the future we will respect the Categorical dtype. We can add support for a categories='dtype' option which will respect the Categorical dtype.

Is it worth listing or diffing the category list?

Sometimes the only difference is the order. What kind of diff would be good to have?

Just print the two orders then?

Done. Also added another sentence regarding passing a list to the categories parameter.
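For readers wondering what "passing a list to the categories parameter" looks like in practice, here is a minimal sketch. The column name and data are made up; `categories` as a list of per-column lists is the existing encoder parameter.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up data: a Categorical dtype whose order is not lexicographic.
dtype = pd.CategoricalDtype(categories=["d", "z", "u"])
df = pd.DataFrame({"str_col": pd.Series(["z", "d", "u", "d"], dtype=dtype)})

# One list of categories per column, read from each column's dtype.
categories = [list(df[col].cat.categories) for col in df.columns]

enc = OrdinalEncoder(categories=categories).fit(df)
codes = enc.transform(df).ravel()
```

With the explicit list, the encoded values follow the dtype order ("d" is 0, "z" is 1, "u" is 2) rather than the sorted order that `categories="auto"` would produce.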

jnothman · 2019-10-23T22:36:58Z

Please add what's new

jnothman · 2019-10-27T22:55:44Z

Maybe it's worth solving for LabelEncoder too (see #12086, #13351)

thomasjpfan · 2019-10-28T02:04:43Z

I agree with #12086 (comment)

For OneHotEncoder and OrdinalEncoder, we can have a categories='dtype' option to allow users to use the categorical dtype.

The LabelEncoder would be slightly harder to do, since it is used internally in some estimators. If we were to warn in those situations, there is nothing a user can do.

If we want to resolve both issues, we may need a sklearn.set_config(use_pandas_categories_order) to configure it globally.

rth · 2019-11-06T20:51:32Z

So is the overall goal of this to reduce, by one version, the number of releases needed to change the behavior of categories="auto" from the current one to that of categories="dtype" for categorical data?

If I understand correctly, the user is still getting the OHE they wanted with some categories in the OneHotEncoder estimator. Ideally these should match those of the categorical dtype. Currently they sometimes don't, but at the same time I'm not sure there was too much expectation that they would. Getting a warning one can do nothing about is not necessarily very helpful. Personally I currently have a pipeline with categorical dtypes + OHE and I would probably just ignore this warning, if it was raised. Or am I missing something?

thomasjpfan · 2019-11-06T21:11:26Z

Currently they sometimes don't, but at the same time I'm not sure there was too much expectation that they would. Getting a warning one can do nothing about is not necessarily very helpful.

Yes, this warning is not helpful without the other PR on categories="dtypes" (#15396).

jnothman · 2019-11-06T23:00:15Z

It's helpful in the sense that the user can manually set the categories in the encoder to match the dtype, and in the sense that they are now aware that the dtype is not being respected. What's wrong with that?

thomasjpfan · 2019-11-06T23:10:04Z

It's helpful in the sense that the user can manually set the categories in
the encoder to match the dtype, and in the sense that they are now aware
that the dtype is not being respected. What's wrong with that?

Should that be the error message in this case? "Please set categories as a list to set the order of your categories explicitly?"

jnothman · 2019-11-07T01:45:24Z

I suppose

jnothman · 2019-11-07T07:19:52Z

sklearn/preprocessing/_encoders.py

+            msg = ("'auto' categories is used, but the Categorical dtype "
+                   "provided is not consistent with the automatic "
+                   "lexicographic ordering, lexicon order: {}, dtype order: "
+                   "{}. Please pass a custom list of categories to the "


Maybe a bit weaker "Consider passing" rather than "please pass"

rth · 2019-11-07T08:54:35Z

Consider the case where you have a dataset split with train_test_split, with a significant number of categories. Some categories might then differ between train and test, and one would use OneHotEncoder(handle_unknown="ignore") to handle unknown categories in train. To silence this warning the user would then need not only to sort categories, but also to remove unknown categories from the categorical dtype in the train set. I'm just saying, I would not spend time doing that. I prefer having train and test use the same categories in the dtype in my present project, and seeing a repeated warning for this is annoying.

Users only need to be told once that the categorical dtype is not respected (not each time they fit a pipeline) and maybe the documentation could be a better place to document this?

jnothman · 2019-11-07T09:03:34Z

Why do they need to remove unknown categories from the dtype?

rth · 2019-11-07T09:36:48Z

Why do they need to remove unknown categories from the dtype?

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> from sklearn.model_selection import train_test_split
>>> X = pd.DataFrame({'z': pd.Categorical(['b', 'c', 'c', 'a', 'b'])})
>>> X_train, X_test = train_test_split(X, shuffle=False)
>>> X_train['z']
0    b
1    c
2    c
Name: z, dtype: category
Categories (3, object): [a, b, c]
>>> ohe = OneHotEncoder(categories="auto", handle_unknown="ignore").fit(X_train)
>>> ohe.categories_
[array(['b', 'c'], dtype=object)]
>>> X_train['z'].cat.categories
Index(['a', 'b', 'c'], dtype='object')

This warning is going to be raised, currently with this PR I believe? Sorting categories is not going to help...

jnothman · 2019-11-07T22:15:38Z

Well that's an interesting argument to not warn in fit if the categorical dtype is a supersequence of the sorted cats.
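To make the "supersequence" idea concrete, here is a small sketch of such a test. The helper name `is_subsequence` is hypothetical and nothing like it is confirmed to be in the PR; it only checks that the fitted categories appear in the dtype's categories in the same relative order.

```python
def is_subsequence(sub, full):
    """Hypothetical helper: True if `sub` appears in `full` in the same order."""
    it = iter(full)
    # `item in it` advances the iterator, so order is enforced.
    return all(item in it for item in sub)


# Fitted categories from the train split in the example above.
fitted = ["b", "c"]

# dtype lists an extra unseen category but keeps the same relative order.
same_order = is_subsequence(fitted, ["a", "b", "c"])

# dtype puts 'c' before 'b', so the orders genuinely disagree.
different_order = is_subsequence(fitted, ["c", "a", "b"])
```

Under this rule the train/test example above would not warn, while a dtype with a truly different order still would.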

NicolasHug

I don't think we should warn when the categories are unordered, which is the default https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.api.types.CategoricalDtype.html
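For context, a short sketch of the ordered/unordered distinction in pandas. The category values are taken from the tests in this PR; the rest is illustrative.

```python
import pandas as pd

# Default: the dtype carries a category list but no ordering semantics.
unordered = pd.CategoricalDtype(categories=[3, 1, 2])

# Explicitly ordered: 3 < 1 < 2 according to the dtype.
ordered = pd.CategoricalDtype(categories=[3, 1, 2], ordered=True)

is_unordered_flag = unordered.ordered
is_ordered_flag = ordered.ordered

# With an ordered dtype, min() follows the declared category order.
smallest = pd.Series([1, 2, 3], dtype=ordered).min()
```

Only the `ordered=True` dtype asserts that its category order is meaningful, which is the basis for suggesting the warning be limited to that case.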

NicolasHug · 2019-11-08T15:49:55Z

sklearn/preprocessing/_encoders.py

@@ -70,6 +71,17 @@ def _get_feature(self, X, feature_idx):
        # numpy arrays, sparse arrays
        return X[:, feature_idx]

+    def _check_pandas_categories(self, cats, pd_category):


Could be worth passing i to indicate which feature is causing this

NicolasHug · 2019-11-08T15:50:11Z

sklearn/preprocessing/_encoders.py

@@ -79,11 +91,16 @@ def _fit(self, X, handle_unknown='error'):
                                 " it has to be of shape (n_features,).")

        self.categories_ = []
+        X_dtypes = getattr(X, 'dtypes', None)


Suggested change

X_dtypes = getattr(X, 'dtypes', None)

X_dtypes = getattr(X, 'dtypes', None) # only exists for dataframes

NicolasHug · 2019-11-08T15:58:32Z

sklearn/preprocessing/tests/test_encoders.py

+
+
+@pytest.mark.parametrize('Encoder', [OneHotEncoder, OrdinalEncoder])
+def test_pandas_category_not_ordered(Encoder):


Please comment your tests @thomasjpfan ;)

It takes 5 minutes to read a test but 1 sec to read a description. It also makes the reviewer's job easier. Sometimes the test name isn't enough.

NicolasHug · 2019-11-08T15:59:29Z

sklearn/preprocessing/tests/test_encoders.py

+    num_case = ('int_col',
+                pd.api.types.CategoricalDtype(categories=[3, 1, 2]))
+    str_case = ('str_col',
+                pd.api.types.CategoricalDtype(categories=['d', 'z', 'u']))
+    float_case = ('float_col',
+                  pd.api.types.CategoricalDtype(categories=[1.0, 3.1, 2.3]))


These are all unordered categories

NicolasHug · 2019-11-08T16:00:11Z

sklearn/preprocessing/tests/test_encoders.py

+    pd = pytest.importorskip('pandas')
+    df = pd.DataFrame({'int_col': [1, 2, 3, 1, 1]})
+    # correct order does not warn
+    ordered_dtype = pd.api.types.CategoricalDtype(categories=[1, 2, 3])


This is an unordered category

jnothman · 2019-11-10T11:47:12Z

I'm uncomfortable with not warning when the categories are unordered precisely because that is the default behaviour in pandas.

thomasjpfan · 2019-11-15T15:20:15Z

I am happy with moving this to 0.23.

qinhanmin2014 · 2019-11-17T14:02:25Z

I vote +1 because I agree with Thomas's comment #14953 (comment) (i.e., users won't get unexpected warning), perhaps I'm wrong.
Let's move this to 0.23.

NicolasHug · 2019-11-17T14:14:46Z

I don't think we should rely on some arbitrary pandas internal behavior that may change tomorrow

qinhanmin2014 · 2019-11-17T14:26:57Z

I don't think we should rely on some arbitrary pandas internal behavior that may change tomorrow

Arbitrary pandas internal behavior? If so, I agree that we should reconsider this PR. I admit that I lack experience outside scikit-learn.

Actually, I came up with another question: should we raise a warning in OneHotEncoder, where the order doesn't matter?

NicolasHug · 2019-11-17T14:31:42Z

arbitrary pandas internal behavior?

I'm referring to the fact that pandas happens to use the lexicographic ordering by default. If we start relying on this, we're making pandas a de-facto dependency.

should we raise warning in OneHotEncoder where the order doesn't matter?

I have the same concern, again the discussion is tracked in #14953 (comment)

qinhanmin2014 · 2019-11-17T14:38:39Z

I'm referring to the fact that pandas happens to use the lexicographic ordering by default

I'll trust you but note that this behavior is actually documented.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html

categories : Index-like (unique), optional
The unique categories for this categorical. If not given,
the categories are assumed to be the unique values of values
(sorted, if possible, otherwise in the order in which they appear).

So will pandas change it without a deprecation cycle? I don't know.
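The documented default quoted above can be checked with a short example, using the same values as the train/test example earlier in the thread.

```python
import pandas as pd

# No explicit categories: pandas infers them from the values and sorts them.
cat = pd.Categorical(["b", "c", "c", "a", "b"])
inferred = list(cat.categories)

# Explicit categories are kept in the order given.
explicit = pd.Categorical(["b", "c", "c", "a", "b"], categories=["c", "a", "b"])
given = list(explicit.categories)
```

So a mismatch with the lexicographic order only arises when the user (or some upstream code) set the categories explicitly.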

qinhanmin2014 · 2019-11-17T14:45:42Z

So let me clarify my point: since this feature is designed for pandas, I guess it's reasonable to rely on some documented behaviors in pandas.

adrinjalali · 2020-04-22T10:44:21Z

removing from the milestone. We can put it back when we find a way forward.

jnothman

I still would rather have this than not, at least in OrdinalEncoder, and maybe in OneHotEncoder when the dtype is ordered.

thomasjpfan · 2020-04-28T15:53:33Z

Updated to only warn in OrdinalEncoder

adrinjalali · 2021-08-20T13:45:43Z

Any reason this is flagged to be in v1.0? Hasn't been active since December.

thomasjpfan · 2021-08-21T13:14:13Z

This PR feels like a "needs decision" as we need to decide what the desired behavior of OrdinalEncoder + pandas Categoricals is. REF: #14953 (comment)

I'll push this milestone till 1.1, since I do not think we can decide before the 1.0 release.

Looking at this again, I think I would like an IntegerEncoder and the order is an implementation detail:

Pass in a numpy array with something that can be ordered -> use the lexicon ordering
Pass in a pandas categorical -> use the values already encoded by the CategoricalDtype
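A minimal sketch of those two rules, assuming a single 1-D column. No `IntegerEncoder` exists in scikit-learn; the function name `integer_encode` is hypothetical and only illustrates the proposed behavior.

```python
import numpy as np
import pandas as pd


def integer_encode(values):
    """Hypothetical sketch of the two rules for a single 1-D column."""
    # pandas categorical: reuse the codes already stored by the dtype.
    if isinstance(values, pd.Series) and isinstance(values.dtype, pd.CategoricalDtype):
        return values.cat.codes.to_numpy()
    # Anything else that can be ordered: codes follow the sorted unique values.
    _, codes = np.unique(np.asarray(values), return_inverse=True)
    return codes


dtype = pd.CategoricalDtype(categories=["d", "z", "u"])
from_categorical = integer_encode(pd.Series(["z", "d", "u"], dtype=dtype))
from_array = integer_encode(np.array(["z", "d", "u"]))
```

The same values get different codes depending on the input type, which is why the order is described as an implementation detail.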

adrinjalali · 2022-04-07T13:30:58Z

Removing the milestone since there's been no activity since last release.

thomasjpfan added 6 commits September 21, 2019 12:10

ENH Adds warning with pandas category does not match lexicon ordering

44ea488

CLN Slighly less state is good

73e74da

CLN Removes asarray

6b55f08

CLN Less complicated

ddd92a4

CLN Remove unused

521c658

TST More coverage

d797a90

jnothman reviewed Oct 23, 2019

adrinjalali added this to the 0.22 milestone Oct 28, 2019

thomasjpfan mentioned this pull request Oct 29, 2019

[MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder #15396

Closed

jnothman added the Waiting for Reviewer label Nov 6, 2019

CLN Adds clearer message

d7edc2b

jnothman approved these changes Nov 7, 2019

NicolasHug reviewed Nov 8, 2019

thomasjpfan added 2 commits November 8, 2019 14:57

Merge remote-tracking branch 'upstream/master' into category_warning_…

4bfe3c1

…lexico

WIP

3f7e6b9

CLN Removes unneeded check

acefd5e

qinhanmin2014 modified the milestones: 0.22, 0.23 Nov 17, 2019

rth mentioned this pull request Dec 4, 2019

META OHE / OrdinalEncoder: NaN support, unfrequent cat. and pd.Categorical #15796

Open

github-actions bot added the module:preprocessing label Mar 2, 2020

adrinjalali removed this from the 0.23 milestone Apr 22, 2020

jnothman reviewed Apr 28, 2020

thomasjpfan added 3 commits April 27, 2020 22:08

Merge remote-tracking branch 'upstream/master' into pr/15050

4bfae05

CLN Only warn with ordinal encoder

5b131f4

Merge remote-tracking branch 'upstream/master' into pr/15050

ba374d2

cmarmo modified the milestone: 0.24 Oct 26, 2020

cmarmo added this to the 1.0 milestone Dec 18, 2020

cmarmo removed the Waiting for Reviewer label Dec 18, 2020

Base automatically changed from master to main January 22, 2021 10:51

thomasjpfan modified the milestones: 1.0, V 1.1 Aug 21, 2021

adrinjalali removed this from the 1.1 milestone Apr 7, 2022

jeremiedbb added the Needs Decision Requires decision label Apr 7, 2022
