
[MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder #15396


Closed

Conversation

@thomasjpfan (Member) commented Oct 29, 2019

Reference Issues/PRs

Fixes #14953
Related to #15050

What does this implement/fix? Explain your changes.

Adds a categories='dtypes' option to OrdinalEncoder and OneHotEncoder. With this option enabled, the dtypes are remembered during fit and checked during transform.
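
A minimal usage sketch of the proposal (hedged: categories='dtypes' is the option introduced by this PR, not released scikit-learn API; OrdinalEncoder is shown, OneHotEncoder would accept the same value):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({
    'col': pd.Categorical(['b', 'a', 'b'], categories=['a', 'b', 'c'], ordered=True)
})
# Proposed behavior: take the category list (and its order) from the pandas
# categorical dtype instead of inferring sorted unique values from the data.
enc = OrdinalEncoder(categories='dtypes')  # option proposed in this PR
enc.fit(X)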

Any other comments?

  1. Uses pip to install pandas on the python3.5 conda env to get a newer version of pandas.
  2. A follow up to this PR would be to add missing value support for pandas dataframes.

CC @NicolasHug @jnothman @jorisvandenbossche @adrinjalali

@thomasjpfan thomasjpfan changed the title [MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder [WIP] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder Oct 29, 2019
@thomasjpfan thomasjpfan changed the title [WIP] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder [MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder Oct 29, 2019
@thomasjpfan thomasjpfan changed the title [MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder [WIP] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder Oct 29, 2019
@@ -71,6 +67,8 @@ if [[ "$DISTRIB" == "conda" ]]; then
pip install pytest-xdist
fi

python -m pip install pandas

Member:
Why this change?

Member Author:
I was debugging something; this change will be reverted. (I forgot dictionaries are not ordered in Python 3.5.)

@@ -36,8 +37,23 @@ def _check_X(self, X):
constructed feature by feature to preserve the data types
of pandas DataFrame columns, as otherwise information is lost
and cannot be used, eg for the `categories_` attribute.

If categories == 'dtype' and the pandas column is a category,

Member:
Suggested change
If categories == 'dtype' and the pandas column is a category,
If categories == 'dtypes' and the pandas column is a category,

@@ -36,8 +37,23 @@ def _check_X(self, X):
constructed feature by feature to preserve the data types
of pandas DataFrame columns, as otherwise information is lost
and cannot be used, eg for the `categories_` attribute.

If categories == 'dtype' and the pandas column is a category,
the pandas series will be return in this list.

Member:
Suggested change
the pandas series will be return in this list.
the pandas series will be returned in this list.

"""
if self.categories == 'dtypes':
if not hasattr(X, 'dtypes'):
raise TypeError("X must be a dataframe when "

Member:
Suggested change
raise TypeError("X must be a dataframe when "
raise TypeError("X must be a DataFrame when "

if not hasattr(X, 'dtypes'):
raise TypeError("X must be a dataframe when "
"categories='dtypes'")
X_dtypes = getattr(X, 'dtypes')

Member:
Suggested change
X_dtypes = getattr(X, 'dtypes')
X_dtypes = X.dtypes

if hasattr(Xi, "cat"):
cats = Xi.cat.categories.values.copy()
else:
cats = _encode(Xi)

Member:
The docs imply to me that this setting only handles Categorical dtypes, not falling back to 'auto'.

@@ -176,6 +206,7 @@ class OneHotEncoder(_BaseEncoder):
Categories (unique values) per feature:

- 'auto' : Determine categories automatically from the training data.
- 'dtypes' : Uses pandas categorical dtype to encode categories.

Member:
Please document handling of other dtypes, and perhaps non-DataFrames.

@jorisvandenbossche (Member) left a comment:
Nice! Added some comments!

if hasattr(self, "_X_fit_dtypes"): # fitted
if not self._check_dtypes_equal(self._X_fit_dtypes,
X_dtypes):
raise ValueError("X.dtypes must match the dtypes used "

Member:
Why is this check needed in general? Don't we already check categories on transform?
I think we should be careful here: dtypes in pandas can change for other reasons, like the presence or absence of missing values, and the int/float mixture seems to work right now (e.g. fitting with int, transforming with float).

Member:
Or maybe only doing it for categorical dtype?

Member Author:
I just added this, but I am undecided on this matter. With categories='dtypes' I would like to ensure the dtypes during fit and transform are the same.

If we fit on int and then transform on float, I would prefer for this to fail. (I know it works now with categories='auto')

For what use case do you see the dtypes changing between fit and transform? Is this for missing value support in the future?

Member:
It's not in the future, but now: if you have a dataset with integers, but for some reason you have missing values in part of your dataset, they become float. Even after dropping NaNs (because OneHotEncoder does not yet support that), they stay float. So it is not unreasonable to read your train and test data from separate files and end up with a different dtype for that reason.
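
For reference, a small demonstration of that pandas behavior:

import pandas as pd

s = pd.Series([1, 2, None])
print(s.dtype)           # float64: the missing value forces the integer column to float
print(s.dropna().dtype)  # float64: the dtype stays float even after dropping the NaN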

Note that we still check that the categories are found, or no new ones are there, etc.

Member Author:
> Note that we still check that the categories are found, or no new ones are there, etc.

This is what the PR does now.

> Even after dropping NaNs (because OneHotEncoder does not yet support that), they stay float.

Okay, let's keep this float/int support.

@@ -57,8 +87,12 @@ def _check_X(self, X):

for i in range(n_features):
Xi = self._get_feature(X, feature_idx=i)
Xi = check_array(Xi, ensure_2d=False, dtype=None,
force_all_finite=needs_validation)
if self.categories == 'dtypes' and hasattr(Xi, 'cat'):

Member:
The hasattr check could also be Xi.dtype.name == 'category', not sure which one is better
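
Both checks agree on a pandas Series, as a quick illustration shows:

import pandas as pd

Xi = pd.Series(pd.Categorical(['a', 'b']))
print(hasattr(Xi, 'cat'))           # True: the .cat accessor exists only on categorical series
print(Xi.dtype.name == 'category')  # True: equivalent check via the dtype name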

if hasattr(Xi, "cat"):
cats = Xi.cat.categories.values.copy()
else:
cats = _encode(Xi)

Member:
An alternative to the if/else check here (the same for the one below) is to handle this inside _encode (that's the approach that was taken in #13351). The advantage is to keep some additional complexity out of this already complex code (moving it elsewhere, of course...).

'col_int': [3, 2, 1, 2]}, columns=['col_str', 'col_int'])

str_category = pd.api.types.CategoricalDtype(
categories=['b', 'a'], ordered=True)

Member:
You could also add a category not present in the data

Member:
BTW, I think doing it like this can be cleaner:

X_df = pd.DataFrame({
    'col_str': pd.Categorical(['a', 'b', 'b', 'a'], categories=['a', 'b'], ordered=True),
    ...
})

instead of first creating the dataframe, then the dtypes and then astyping.


@pytest.mark.parametrize("is_sparse", [True, False])
@pytest.mark.parametrize("drop", ["first", None])
def test_one_hot_encoder_pd_categories_mixed(is_sparse, drop):

Member:
You could also add one non-category column to the first test, to avoid some redundancy.
The dataframe is handled column by column anyway (it's good to have a test for this, of course, but I just think an additional test is not really necessary; a dataframe with only categorical columns is not a special case).

Xi = Xi.astype(self.categories_[i].dtype)
is_category = (self.categories == 'dtypes' and
Xi.dtype.name == 'category')
# categories without missing values do not have unknown values

Member:
Similarly to my comment about _encode, this could also be handled in _encode_check_unknown in principle

@@ -179,6 +227,9 @@ class OneHotEncoder(_BaseEncoder):
Categories (unique values) per feature:

- 'auto' : Determine categories automatically from the training data.
- 'dtypes' : Uses pandas categorical dtype to encode categories. For
non pandas categorical data, the categories are automatically
determined from the training data.

Member:
Maybe this could just be part of 'auto', as mentioned in the comments.

But if we keep the 'dtypes' option, we should probably raise an error if a non-DataFrame is passed in.

@thomasjpfan (Member Author):

I see this ultimately ending up as the default 'auto' behavior. I am concerned with how we get there.

Currently, if we add this feature into auto it will break backwards compatibility. My initial plan was to add "dtype" to behave as the new "auto", then transition "auto" to this new behavior and then deprecate "dtype". It is a very long path to get to the desired new "auto" state.

The other option would be to add another parameter to the encoders, such as "use_dtype_categories", to enable this new behavior with "auto". With this path, we would ultimately want to deprecate this new parameter.

@NicolasHug (Member):

> if we add this feature into auto it will break backwards compatibility

can you explain why?

Another option is to introduce the parameter and deprecate 'auto'.
But then 'dtypes' might not be an ideal name

@thomasjpfan (Member Author):

> can you explain why?

On master, 'auto' will use lexicographic ordering to encode anything, including pandas series with a categorical dtype. With this update, the integer encoding given by the pandas categorical will be used, which may be different from the lexicographic ordering.
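
A small illustration of the difference (the category order declared in the dtype need not be sorted):

import pandas as pd

s = pd.Categorical(['a', 'b'], categories=['b', 'a'])
# 'auto' on master sorts lexicographically: 'a' -> 0, 'b' -> 1.
# The dtype's own codes follow the declared category order instead:
print(list(s.codes))  # [1, 0]: 'b' is category 0, 'a' is category 1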

> Another option is to introduce the parameter and deprecate 'auto'.

'better_auto'? lol

> raise an error when the pandas categories are not ordered when using OrdinalEncoder?

Not the most user friendly, since

> regarding categories seen during predict but not during fit: is this even an issue? It seems to me that we should simply just error in this case, because that means the dtype of fit and the dtype of predict are different, which should not be allowed?

Consider the following:

X_train = pd.Categorical(['b', 'b', 'b'], categories=['a', 'b'])
X_test = pd.Categorical(['a', 'b'], categories=['a', 'b'])

Both of these series have the same dtype, but the training set is all 'b's. If we only go by the dtype, this will create a column of all zeros. I believe @rth is suggesting we remove this column. With this, if we see 'a', we would treat it as unknown.

@jnothman (Member) commented Jan 25, 2020 via email

@rth (Member) commented Jan 28, 2020:

> Currently, if we add this feature into auto it will break backwards compatibility. My initial plan was to add "dtype" to behave as the new "auto", then transition "auto" to this new behavior and then deprecate "dtype". It is a very long path to get to the desired new "auto" state.

Another option is to make a list of such minor but annoying things to change via a deprecation cycle and make those changes in 1.0.

@NicolasHug (Member):

@thomasjpfan could you summarize the main selling points of categories='dtype'?

So far I see:

  • this respects pandas categorical ordering. But that's only relevant for the OE, not the OHE, where the order doesn't matter
  • you don't have to explicitly specify categories=[[a, b,...], ...] for each feature

(BTW, these should be in the UG)

@thomasjpfan (Member Author):

> could you summarize the main selling points of categories='dtype'?

  1. It is a user interface improvement to use the encoding in categoricals.
  2. It is more efficient to use the encoding provided by pandas. As @rth noted, this can be even better if we construct the encoding directly.
  3. For OHE with drop='first', the order still matters. Currently, "first" means the lowest lexicographically ordered element. If we respect the ordering, "first" will mean "the first element in the pandas categorical dtype" (see the sketch after this list).
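
For instance, the current behavior on a plain object column (drop_idx_ points at the lexicographically first category):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'col': ['b', 'a', 'c']})
enc = OneHotEncoder(drop='first').fit(X)
print(enc.categories_)  # [array(['a', 'b', 'c'], dtype=object)]: sorted order
print(enc.drop_idx_)    # array([0]): 'first' is the lowest sorted category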

Yes this should be in the UG.

@jnothman (Member):

Is this still the current proposal @thomasjpfan? Thanks

@adrinjalali (Member):

moving to 0.24, I have a feeling these ones need a little bit of discussion still. Happy for us to wait for it if y'all think it can get in.

@adrinjalali adrinjalali modified the milestones: 0.23, 0.24 Apr 22, 2020

@thomasjpfan (Member Author):

Yea, this needs more discussion; this is more of a "quality of life" update when using pandas categories.

The "quick" thing to do is to only do this for OrdinalEncoder, which in turn makes it a little nicer for unknown categories, which makes it nicer for cross validation.

@jnothman (Member):

> Currently, if we add this feature into auto it will break backwards compatibility. My initial plan was to add "dtype" to behave as the new "auto", then transition "auto" to this new behavior and then deprecate "dtype". It is a very long path to get to the desired new "auto" state.

Can we just warn for now: "From version 0.26 'auto' will adopt the ordering from the pandas dtype." and do so only if lexicographic and dtype order are inconsistent?
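
A hedged sketch of the condition such a warning could key on (the helper name is hypothetical, not PR code):

import pandas as pd

def dtype_order_is_lexicographic(s):
    # Hypothetical helper: True when the dtype's category order matches the
    # sorted (lexicographic) order, i.e. no deprecation warning would be needed.
    cats = list(s.cat.categories)
    return cats == sorted(cats)

s = pd.Series(pd.Categorical(['a'], categories=['b', 'a']))
print(dtype_order_is_lexicographic(s))  # False: the two orders are inconsistent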

@jnothman (Member):

But then I think my latest proposal was identical to #15050

Maybe we do indeed need to workshop this set of encoding PRs again.

@cmarmo cmarmo removed this from the 0.24 milestone Oct 26, 2020
Base automatically changed from master to main January 22, 2021 10:51

@amueller (Member):

is it time to revive this?

@lorentzenchr (Member):

@thomasjpfan It would help me a lot if you could provide a summary of the status of this PR. Is it still the right approach and what were the objections and open questions to decide?

@thomasjpfan (Member Author) commented Jan 31, 2023:

After rereading all the comments in this PR, I am going to open a new PR that uses cat.codes only for optimization and does not add a new option to categories. Overall, I agree with the concerns in #15396 (comment).

I'm still thinking through the API design, which depends on how complex it is to handle the edge cases for using cat.codes directly.

@ogrisel (Member) commented Jun 7, 2023:

> After rereading all the comments in this PR, I am going to open a new PR that uses cat.codes only for optimization and does not add a new option to categories. Overall, I agree with the concerns in #15396 (comment).

Now that we have ways to collapse infrequent categories (as measured on the training set), I think the points of @rth in #15396 (comment) are less of a concern.
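
For context, infrequent-category collapsing is available in OneHotEncoder since scikit-learn 1.1; a quick example:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'col': ['a'] * 5 + ['b'] * 5 + ['c']})
# Categories seen fewer than min_frequency times are grouped as infrequent.
enc = OneHotEncoder(min_frequency=2, handle_unknown='infrequent_if_exist').fit(X)
print(enc.infrequent_categories_)  # [array(['c'], dtype=object)]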

I would personally be in favor of having OneHotEncoder deterministically respect both the ordering and the list of categories represented in the dtype by default and let the user decide to collapse rare categories at the level they want.

It will avoid having arrays with varying .shape[1] in intermediate pipeline steps when doing cross-validation. I think it's less surprising to the user and can make their life simpler, e.g. to inspect the variability of the coefficients of a linear model across cross-validation folds.
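
Today this can be approximated by passing the dtype's categories explicitly, which is what the proposal would do by default:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cats = ['a', 'b', 'c']
# A CV fold that happens to miss category 'c':
fold = pd.DataFrame({'col': pd.Categorical(['a', 'b'], categories=cats)})
enc = OneHotEncoder(categories=[cats]).fit(fold)
print(enc.transform(fold).shape)  # (2, 3): the output width stays stable across folds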

@thomasjpfan (Member Author):

> I would personally be in favor of having OneHotEncoder deterministically respect both the ordering and the list of categories represented in the dtype by default and let the user decide to collapse rare categories at the level they want.

If the rare categories are collapsed, then the ordering from the dtype will have to be rearranged. I am okay with this behavior.

> It will avoid having arrays with varying .shape[1] in intermediate pipeline steps when doing cross-validation.

This would work if there are no rare categories. With rare categories, different folds may end up collapsing different categories, which could give different coefficients across cross-validation folds.

In any case, I am closing this PR. The _encoders.py code has changed quite a bit and I think it's better to start fresh.

@ogrisel (Member) commented Dec 2, 2023:

> > It will avoid having arrays with varying .shape[1] in intermediate pipeline steps when doing cross-validation.
>
> This would work if there are no rare categories. With rare categories, different folds may end up collapsing different categories, which could give different coefficients across cross-validation folds.

True but at least we give the users control over the collapsing or non-collapsing behavior.

Successfully merging this pull request may close these issues.

Handle pd.Categorical in encoders