[MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder #15396
Conversation
build_tools/azure/install.sh
Outdated
@@ -71,6 +67,8 @@ if [[ "$DISTRIB" == "conda" ]]; then
        pip install pytest-xdist
    fi

+   python -m pip install pandas
Why this change?
I was debugging something; this change will be reverted. (I forgot dictionaries are not ordered in Python 3.5.)
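For context, plain dicts only guarantee insertion order from Python 3.7 onward (CPython 3.6 as an implementation detail); on 3.5 the order of iteration is arbitrary, which is the bug being debugged here. A minimal illustration of the portable workaround:

```python
from collections import OrderedDict

# On Python 3.5 a plain dict iterates in arbitrary order, so code that
# relies on column order (as the encoders here do) needs OrderedDict.
d = OrderedDict([('col_str', ['a', 'b']), ('col_int', [1, 2])])
assert list(d) == ['col_str', 'col_int']  # iteration order is guaranteed
```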
sklearn/preprocessing/_encoders.py
Outdated
@@ -36,8 +37,23 @@ def _check_X(self, X):
        constructed feature by feature to preserve the data types
        of pandas DataFrame columns, as otherwise information is lost
        and cannot be used, eg for the `categories_` attribute.

+       If categories == 'dtype' and the pandas column is a category,
Suggested change:
-       If categories == 'dtype' and the pandas column is a category,
+       If categories == 'dtypes' and the pandas column is a category,
sklearn/preprocessing/_encoders.py
Outdated
@@ -36,8 +37,23 @@ def _check_X(self, X):
        constructed feature by feature to preserve the data types
        of pandas DataFrame columns, as otherwise information is lost
        and cannot be used, eg for the `categories_` attribute.

+       If categories == 'dtype' and the pandas column is a category,
+       the pandas series will be return in this list.
Suggested change:
-       the pandas series will be return in this list.
+       the pandas series will be returned in this list.
sklearn/preprocessing/_encoders.py
Outdated
""" | ||
if self.categories == 'dtypes': | ||
if not hasattr(X, 'dtypes'): | ||
raise TypeError("X must be a dataframe when " |
Suggested change:
-               raise TypeError("X must be a dataframe when "
+               raise TypeError("X must be a DataFrame when "
sklearn/preprocessing/_encoders.py
Outdated
            if not hasattr(X, 'dtypes'):
                raise TypeError("X must be a dataframe when "
                                "categories='dtypes'")
            X_dtypes = getattr(X, 'dtypes')
Suggested change:
-           X_dtypes = getattr(X, 'dtypes')
+           X_dtypes = X.dtypes
sklearn/preprocessing/_encoders.py
Outdated
            if hasattr(Xi, "cat"):
                cats = Xi.cat.categories.values.copy()
            else:
                cats = _encode(Xi)
The docs imply to me that this setting only handles Categorical dtypes, not falling back to auto.
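For reference, the branch under discussion reads the category order off the dtype via the `.cat` accessor, which can differ from the lexicographic order that the `_encode` fallback would produce. A sketch, assuming pandas is available:

```python
import pandas as pd

# A categorical Series whose dtype declares 'b' before 'a'.
s = pd.Series(pd.Categorical(['b', 'a', 'b'],
                             categories=['b', 'a'], ordered=True))

# The dtype's own order is preserved: ['b', 'a'], not the sorted ['a', 'b']
# that lexicographic encoding would use.
cats = s.cat.categories.values.copy()
assert list(cats) == ['b', 'a']
```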
sklearn/preprocessing/_encoders.py
Outdated
@@ -176,6 +206,7 @@ class OneHotEncoder(_BaseEncoder):
    Categories (unique values) per feature:

    - 'auto' : Determine categories automatically from the training data.
+   - 'dtypes' : Uses pandas categorical dtype to encode categories.
Please document handling of other dtypes, and perhaps non-Dataframes
Nice! Added some comments!
sklearn/preprocessing/_encoders.py
Outdated
            if hasattr(self, "_X_fit_dtypes"):  # fitted
                if not self._check_dtypes_equal(self._X_fit_dtypes,
                                                X_dtypes):
                    raise ValueError("X.dtypes must match the dtypes used "
Why is this check needed in general? Don't we already check categories on transform?
I think we should be careful here: dtypes in pandas can change for other reasons, like the presence or absence of missing values, and the int/float mixture seems to work right now (e.g. fitting with int, transforming with float).
Or maybe only doing it for categorical dtype?
I just added this, but I am undecided on this matter. With categories='dtypes' I would like to ensure the dtypes during fit and transform are the same. If we fit on int and then transform on float, I would prefer for this to fail. (I know it works now with categories='auto'.)
For what use case do you see the dtypes changing between fit and transform? Is this for missing value support in the future?
It's not in the future, but now: if you have a dataset with integers, but for some reason you have missing values in part of your dataset, they become float. Even after dropping NaNs (because OneHotEncoder does not yet support that), they stay float. So it is not unreasonable to read your train and test data from separate files and end up with a different dtype for that reason.
Note that we still check that the categories are found, or no new ones are there, etc.
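The dtype change described above is easy to reproduce: a single NaN upcasts an integer column to float, and dropping the NaN afterwards does not restore the integer dtype. A sketch assuming pandas:

```python
import numpy as np
import pandas as pd

train = pd.Series([1, 2, 3])            # clean data stays int64
test = pd.Series([1.0, np.nan, 3.0])    # one NaN forces float64

assert train.dtype.kind == 'i'
assert test.dtype.kind == 'f'
# Dropping the NaN does not bring the int dtype back:
assert test.dropna().dtype.kind == 'f'
```

This is why a strict dtype equality check between fit and transform would reject data that the category-level checks would otherwise accept.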
> Note that we still check that the categories are found, or no new ones are there, etc.

This is what the PR does now.

> Even after dropping NaNs (because OneHotEncoder does not yet support that), they stay float.

Okay, let's keep this float/int support.
sklearn/preprocessing/_encoders.py
Outdated
@@ -57,8 +87,12 @@ def _check_X(self, X):

        for i in range(n_features):
            Xi = self._get_feature(X, feature_idx=i)
            Xi = check_array(Xi, ensure_2d=False, dtype=None,
                             force_all_finite=needs_validation)
+           if self.categories == 'dtypes' and hasattr(Xi, 'cat'):
The hasattr check could also be Xi.dtype.name == 'category'; not sure which one is better.
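The two checks agree for pandas Series: the `.cat` accessor only exists on categorical Series (accessing it on anything else raises AttributeError, so `hasattr` returns False). A quick comparison, assuming pandas:

```python
import pandas as pd

cat = pd.Series(pd.Categorical(['a', 'b']))
num = pd.Series([1, 2])

# Both spellings identify the categorical Series...
assert hasattr(cat, 'cat') and cat.dtype.name == 'category'
# ...and both reject the numeric one.
assert not hasattr(num, 'cat') and num.dtype.name != 'category'
```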
sklearn/preprocessing/_encoders.py
Outdated
            if hasattr(Xi, "cat"):
                cats = Xi.cat.categories.values.copy()
            else:
                cats = _encode(Xi)
An alternative to the if/else check here (the same for below) is to handle this inside _encode (that's the approach that was taken in #13351). The advantage is keeping some additional complexity out of this already complex code (moving it elsewhere, of course).
        'col_int': [3, 2, 1, 2]}, columns=['col_str', 'col_int'])

    str_category = pd.api.types.CategoricalDtype(
        categories=['b', 'a'], ordered=True)
You could also add a category not present in the data
BTW, I think doing it like this can be cleaner:

X_df = pd.DataFrame({
    'col_str': pd.Categorical(['a', 'b', 'b', 'a'], categories=['a', 'b'], ordered=True),
    ....
})

instead of first creating the dataframe, then the dtypes, and then astyping.
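The two construction styles give the same result; a sketch comparing them (column name borrowed from the test above):

```python
import pandas as pd

# Style 1: build the frame, then astype with a CategoricalDtype.
str_category = pd.api.types.CategoricalDtype(categories=['b', 'a'],
                                             ordered=True)
df1 = pd.DataFrame({'col_str': ['a', 'b', 'b', 'a']}).astype(
    {'col_str': str_category})

# Style 2: construct the Categorical directly in the frame.
df2 = pd.DataFrame({
    'col_str': pd.Categorical(['a', 'b', 'b', 'a'],
                              categories=['b', 'a'], ordered=True),
})

# Same values and same (ordered, categorical) dtype either way.
assert df1['col_str'].dtype == df2['col_str'].dtype
assert df1.equals(df2)
```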
@pytest.mark.parametrize("is_sparse", [True, False])
@pytest.mark.parametrize("drop", ["first", None])
def test_one_hot_encoder_pd_categories_mixed(is_sparse, drop):
You could also add one non-category column to the first test, to avoid some redundancy. The dataframe is handled column by column anyway (it's good to have a test for this of course, but I just think an additional test is not really necessary; a dataframe with only categorical columns is not a special case).
sklearn/preprocessing/_encoders.py
Outdated
            Xi = Xi.astype(self.categories_[i].dtype)
            is_category = (self.categories == 'dtypes' and
                           Xi.dtype.name == 'category')
            # categories without missing values do not have unknown values
Similarly to my comment about _encode, this could also be handled in _encode_check_unknown in principle.
@@ -179,6 +227,9 @@ class OneHotEncoder(_BaseEncoder):
    Categories (unique values) per feature:

    - 'auto' : Determine categories automatically from the training data.
+   - 'dtypes' : Uses pandas categorical dtype to encode categories. For
+     non pandas categorical data, the categories are automatically
+     determined from the training data.
maybe this could be just part of 'auto' as mentioned in the comments.
But if we keep the 'dtypes' option, we should probably raise an error if a non-df is passed in.
I see this ultimately ending up as the default behavior of "auto". I am concerned with how we get there. Currently, if we fold this feature into "auto" it will break backwards compatibility. My initial plan was to add "dtypes" to behave as the new "auto", then transition "auto" to this new behavior, and then deprecate "dtypes". It is a very long path to get to the desired new "auto" state. The other option would be to add another parameter to the encoders, such as "use_dtype_categories", to enable this new behavior with "auto". With this path, we would ultimately want to deprecate this new parameter.
Can you explain why? Another option is to introduce the parameter and deprecate 'auto'.
On master, 'auto' will use lexicographic ordering to encode anything, including a pandas series with a categorical dtype. With this update, the integer encoding given by the pandas categorical will be used, which may differ from the lexicographic ordering.
Not the most user friendly, since
Consider the following:

X_train = pd.Categorical(['b', 'b', 'b'], categories=['a', 'b'])
X_test = pd.Categorical(['a', 'b'], categories=['a', 'b'])

Both of these series have the same dtype, but the training set is all 'b's. If we only go by the dtype, this will create a column of all zeros. I believe @rth is suggesting we remove this column. With this, if we see 'a', we would treat it as unknown.
Yes, the 'dtypes' setting is more about order than the set of categories.
Another option is to make a list of such minor but annoying things to change via a deprecation cycle and make those changes in 1.0.
@thomasjpfan could you summarize the main selling points of So far I see:
(BTW, these should be in the UG)
Yes, this should be in the UG.
Is this still the current proposal @thomasjpfan? Thanks
Moving to 0.24; I have a feeling these ones need a little bit of discussion still. Happy for us to wait for it if y'all think it can get in.
Yea, this needs more discussion; this is more of a "quality of life" update when using pandas categories. The "quick" thing to do is to only do this for
Can we just warn for now: "From version 0.26 'auto' will adopt the ordering from the pandas dtype." and do so only if lexicographic and dtype order are inconsistent?
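Such a warning would only need to fire when the dtype's order differs from lexicographic order. A hypothetical check (the function name is illustrative, not from the PR):

```python
import numpy as np
import pandas as pd

def dtype_order_differs(series):
    """Return True when the categorical dtype's declared order
    differs from plain lexicographic (sorted) order."""
    cats = np.asarray(series.cat.categories)
    return not np.array_equal(cats, np.sort(cats))

s_sorted = pd.Series(pd.Categorical(['a', 'b'], categories=['a', 'b']))
s_custom = pd.Series(pd.Categorical(['a', 'b'], categories=['b', 'a']))

assert not dtype_order_differs(s_sorted)  # no warning needed
assert dtype_order_differs(s_custom)      # orders disagree: warn
```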
But then I think my latest proposal was identical to #15050. Maybe we do indeed need to workshop this set of encoding PRs again.
Is it time to revive this?
@thomasjpfan It would help me a lot if you could provide a summary of the status of this PR. Is it still the right approach, and what were the objections and open questions to decide?
After rereading all the comments in this PR, I am going to open a new PR that uses
I'm still thinking through the API design, which depends on how complex it is to handle the edge cases for using
Now that we have ways to collapse infrequent categories (as measured on the training set), I think the points of @rth in #15396 (comment) are less of a concern. I would personally be in favor of having
It will avoid having arrays with varying
If the rare categories are collapsed, then the ordering from the dtype will have to be rearranged. I am okay with this behavior.
This would work if there are no rare categories. With rare categories, different folds may end up collapsing different categories, which could give different coefficients across cross-validation folds. In any case, I am closing this PR. The
True, but at least we give the users control over the collapsing or non-collapsing behavior.
Reference Issues/PRs
Fixes #14953
Related to #15050
What does this implement/fix? Explain your changes.
Adds the categories='dtypes' option to OrdinalEncoder and OneHotEncoder. With this option enabled, the dtypes are remembered during fit and checked during transform.
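Since this option was never merged (the PR was closed), a similar effect can be obtained today by reading the encoding off the categorical dtype directly; a sketch using plain pandas:

```python
import pandas as pd

X = pd.DataFrame({
    'col_str': pd.Categorical(['a', 'b', 'b', 'a'],
                              categories=['b', 'a'], ordered=True),
})

# An ordinal encoding that respects the dtype's order ('b' -> 0, 'a' -> 1),
# rather than the lexicographic order ('a' -> 0) that categories='auto'
# would produce.
codes = X['col_str'].cat.codes
assert list(codes) == [1, 0, 0, 1]
```

The same category list (`X['col_str'].cat.categories`) can also be passed explicitly to an encoder's `categories` parameter to pin the order.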
Any other comments?
CC @NicolasHug @jnothman @jorisvandenbossche @adrinjalali