-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] ENH Adds warning with pandas category does not match lexicon ordering #15050
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[MRG] ENH Adds warning with pandas category does not match lexicon ordering #15050
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise lgtm
sklearn/preprocessing/_encoders.py
Outdated
"""Checks cats is lexicographic consistent with categories in fitted X. | ||
""" | ||
msg = ("'auto' categories is used, but the Categorical dtype provided " | ||
"is not consistent with the automatic lexicographic ordering") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should the user do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth listing or diffing the category list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should the user do?
Right now...nothing good. The user would need order their categorical dtype to be lexicographic ordering to avoid this warning, which I think is bad. This should most likely be a FutureWarning
suggesting that in the future we will respect the Categorical dtype. We can add support for a categories='dtype'
which will respect the Categorical dtype.
Is it worth listing or diffing the category list?
Sometimes the only difference is the order. What kind of diff would be good to have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just print the two orders then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Also added another sentence regarding passing a list to the categories parameter.
Please add what's new |
I agree with #12086 (comment) For The If we want to resolve both issues, we may need a |
So is the goal of this overall to decrease number of versions needed to change the behavior for If I understand correctly, the user is still getting the OHE they wanted with some categories in the |
Yes this warning is not helpful without the other PR on categories="dtypes" (#15396) |
It's helpful in the sense that the user can manually set the categories in
the encoder to match the dtype, and in the sense that they are now aware
that the dtype is not bring respected. What's wrong with that?
|
Should that be the error message in this case? "Please set categories as a list to set the order of your categories explicitly?" |
I suppose
…On Thu., 7 Nov. 2019, 10:10 am Thomas J Fan, ***@***.***> wrote:
It's helpful in the sense that the user can manually set the categories in
the encoder to match the dtype, and in the sense that they are now aware
that the dtype is not bring respected. What's wrong with that?
Should that be the error message in this case? "Please set categories as a
list to set the order of your categories explicitly?"
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#15050?email_source=notifications&email_token=AAATH25IWSXOVWHFO4HWBTTQSNFE7A5CNFSM4IY6YYMKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDIJ6CQ#issuecomment-550543114>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAATH22RCS63AEQTBPDK3ATQSNFE7ANCNFSM4IY6YYMA>
.
|
sklearn/preprocessing/_encoders.py
Outdated
msg = ("'auto' categories is used, but the Categorical dtype " | ||
"provided is not consistent with the automatic " | ||
"lexicographic ordering, lexicon order: {}, dtype order: " | ||
"{}. Please pass a custom list of categories to the " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a bit weaker "Consider passing" rather than "please pass"
Consider the case, when you have a dataset split with Users only need to be told once that the categorical dtype is not respected (not each time they fit a pipeline) and maybe the documentation could be a better place to document this? |
Why do they need to remove unknown categories from the dtype?
|
>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> from sklearn.model_selection import train_test_split
>>> X = pd.DataFrame({'z': pd.Categorical(['b', 'c', 'c', 'a', 'b'])})
>>> X_train, X_test = train_test_split(X, shuffle=False)
>>> X_train['z']
0 b
1 c
2 c
Name: z, dtype: category
Categories (3, object): [a, b, c]
>>> ohe = OneHotEncoder(categories="auto", handle_unknown="ignore").fit(X_train)
>>> ohe.categories_
[array(['b', 'c'], dtype=object)]
>>> X_train['z'].cat.categories
Index(['a', 'b', 'c'], dtype='object') This warning is going to be raised, currently with this PR I believe? Sorting categories is not going to help... |
Well that's an interesting argument to not warn in fit if the categorical dtype is a supersequence of the sorted cats. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should warn when the categories are unordered, which is the default https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.api.types.CategoricalDtype.html
sklearn/preprocessing/_encoders.py
Outdated
@@ -70,6 +71,17 @@ def _get_feature(self, X, feature_idx): | |||
# numpy arrays, sparse arrays | |||
return X[:, feature_idx] | |||
|
|||
def _check_pandas_categories(self, cats, pd_category): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be worth passing i
to indicate which feature is causing this
sklearn/preprocessing/_encoders.py
Outdated
@@ -79,11 +91,16 @@ def _fit(self, X, handle_unknown='error'): | |||
" it has to be of shape (n_features,).") | |||
|
|||
self.categories_ = [] | |||
X_dtypes = getattr(X, 'dtypes', None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X_dtypes = getattr(X, 'dtypes', None) | |
X_dtypes = getattr(X, 'dtypes', None) # only exists for dataframes |
|
||
|
||
@pytest.mark.parametrize('Encoder', [OneHotEncoder, OrdinalEncoder]) | ||
def test_pandas_category_not_ordered(Encoder): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please comment your tests @thomasjpfan ;)
it takes 5 minutes to read a test but 1 sec to read a description. It also makes the reviewers job easier. Sometimes the test name isn't enough.
num_case = ('int_col', | ||
pd.api.types.CategoricalDtype(categories=[3, 1, 2])) | ||
str_case = ('str_col', | ||
pd.api.types.CategoricalDtype(categories=['d', 'z', 'u'])) | ||
float_case = ('float_col', | ||
pd.api.types.CategoricalDtype(categories=[1.0, 3.1, 2.3])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all unordered categories
pd = pytest.importorskip('pandas') | ||
df = pd.DataFrame({'int_col': [1, 2, 3, 1, 1]}) | ||
# correct order does not warn | ||
ordered_dtype = pd.api.types.CategoricalDtype(categories=[1, 2, 3]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an unordered category
I'm uncomfortable with not warning when the categories are unordered
precisely because that is the default behaviour in pandas.
|
I am happy with moving this to 0.23. |
I vote +1 because I agree with Thomas's comment #14953 (comment) (i.e., users won't get unexpected warning), perhaps I'm wrong. |
I don't think we should rely on some arbitrary pandas internal behavior that may change tomorrow |
arbitrary pandas internal behavior? If so, I agree that we should reconsider this PR. I admit that I lack experience outside scikit-learn. Actually I come up with another question: should we raise warning in OneHotEncoder where the order doesn't matter? |
I'm referring to the fact that pandas happens to use the lexicographic ordering by default. If we start relying on this, we're making pandas a de-facto dependency.
I have the same concern, again the discussion is tracked in #14953 (comment) |
I'll trust you but note that this behavior is actually documented.
So will pandas change it without a deprecation cycle? I don't know. |
So let my clarify my point: since this feature is designed for pandas, I guess it's reasonable to rely on some documented behaviors in pandas. |
removing from the milestone. We can put it back when we find a way forward. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still would rather have this than not, at least in OrdinalEncoder, and maybe in OneHotEncoder when the dtype is ordered.
Updated to only warn in |
Any reason this is flagged to be in v1.0? Hasn't been active since December. |
This PR feels like a "needs decision" as we need to decide what the desired behavior of I'll push this milestone till 1.1, since I do not think we can decide before the 1.0 release. Looking at this again, I think I would like an
|
Removing the milestone since there's been no activity since last release. |
Reference Issues/PRs
Resolves #14953
What does this implement/fix? Explain your changes.
Gets the
dtypes
in_fit
and checks only when it has "categories".Any other comments?
The fun part begins when we try to get the encoders to respect pandas categories (when they are numerical) :)