
[MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder #15396


Closed

Conversation

@thomasjpfan (Member) commented Oct 29, 2019

Reference Issues/PRs

Fixes #14953
Related to #15050

What does this implement/fix? Explain your changes.

Adds a categories='dtypes' option to OrdinalEncoder and OneHotEncoder. With this option enabled, the dtypes are remembered during fit and checked during transform.
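
A minimal usage sketch of the proposal (hedged: categories='dtypes' is the option introduced by this PR, not released scikit-learn API; OrdinalEncoder is shown, OneHotEncoder would accept the same value):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({
    'col': pd.Categorical(['b', 'a', 'b'], categories=['a', 'b', 'c'], ordered=True)
})
# Proposed behavior: take the category list (and its order) from the pandas
# categorical dtype instead of inferring sorted unique values from the data.
enc = OrdinalEncoder(categories='dtypes')  # option proposed in this PR
enc.fit(X)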

Any other comments?

  1. Uses pip to install pandas on the python3.5 conda env to get a newer version of pandas.
  2. A follow up to this PR would be to add missing value support for pandas dataframes.

CC @NicolasHug @jnothman @jorisvandenbossche @adrinjalali

@thomasjpfan thomasjpfan changed the title [MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder [WIP] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder Oct 29, 2019
@thomasjpfan thomasjpfan changed the title [WIP] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder [MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder Oct 29, 2019
@thomasjpfan thomasjpfan changed the title [MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder [WIP] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder Oct 29, 2019
@@ -71,6 +67,8 @@ if [[ "$DISTRIB" == "conda" ]]; then
pip install pytest-xdist
fi

python -m pip install pandas

Member:
Why this change?

Member Author:
I was debugging something; this change will be reverted. (I forgot dictionaries are not ordered in Python 3.5.)

@@ -36,8 +37,23 @@ def _check_X(self, X):
constructed feature by feature to preserve the data types
of pandas DataFrame columns, as otherwise information is lost
and cannot be used, eg for the `categories_` attribute.

If categories == 'dtype' and the pandas column is a category,

Member:
Suggested change
If categories == 'dtype' and the pandas column is a category,
If categories == 'dtypes' and the pandas column is a category,

@@ -36,8 +37,23 @@ def _check_X(self, X):
constructed feature by feature to preserve the data types
of pandas DataFrame columns, as otherwise information is lost
and cannot be used, eg for the `categories_` attribute.

If categories == 'dtype' and the pandas column is a category,
the pandas series will be return in this list.

Member:
Suggested change
the pandas series will be return in this list.
the pandas series will be returned in this list.

"""
if self.categories == 'dtypes':
if not hasattr(X, 'dtypes'):
raise TypeError("X must be a dataframe when "

Member:
Suggested change
raise TypeError("X must be a dataframe when "
raise TypeError("X must be a DataFrame when "

if not hasattr(X, 'dtypes'):
raise TypeError("X must be a dataframe when "
"categories='dtypes'")
X_dtypes = getattr(X, 'dtypes')

Member:
Suggested change
X_dtypes = getattr(X, 'dtypes')
X_dtypes = X.dtypes

if hasattr(Xi, "cat"):
cats = Xi.cat.categories.values.copy()
else:
cats = _encode(Xi)

Member:
The docs imply to me that this setting only handles Categorical dtypes, not falling back to 'auto'.

@@ -176,6 +206,7 @@ class OneHotEncoder(_BaseEncoder):
Categories (unique values) per feature:

- 'auto' : Determine categories automatically from the training data.
- 'dtypes' : Uses pandas categorical dtype to encode categories.

Member:
Please document handling of other dtypes, and perhaps non-DataFrames.

@jorisvandenbossche (Member) left a comment:
Nice! Added some comments!

if hasattr(self, "_X_fit_dtypes"): # fitted
if not self._check_dtypes_equal(self._X_fit_dtypes,
X_dtypes):
raise ValueError("X.dtypes must match the dtypes used "

Member:
Why is this check needed in general? Don't we already check categories on transform?
I think we should be careful here: dtypes in pandas can change for other reasons, like the presence or absence of missing values, and the int/float mixture seems to work right now (e.g. fitting with int, transforming with float).

Member:
Or maybe only doing it for categorical dtype?

Member Author:
I just added this, but I am undecided on this matter. With categories='dtypes' I would like to ensure the dtypes during fit and transform are the same.

If we fit on int and then transform on float, I would prefer for this to fail. (I know it works now with categories='auto')

For what use case do you see the dtypes changing between fit and transform? Is this for missing value support in the future?

Member:
It's not in the future, but now: if you have a dataset with integers, but for some reason you have missing values in part of your dataset, they become float. Even after dropping NaNs (because OneHotEncoder does not yet support that), they stay float. So it is not unreasonable to read your train and test data from separate files and end up with a different dtype for that reason.
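
For reference, a small demonstration of that pandas behavior:

import pandas as pd

s = pd.Series([1, 2, None])
print(s.dtype)           # float64: the missing value forces the integer column to float
print(s.dropna().dtype)  # float64: the dtype stays float even after dropping the NaN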

Note that we still check that the categories are found, or no new ones are there, etc.

Member Author:
> Note that we still check that the categories are found, or no new ones are there, etc.

This is what the PR does now.

> Even after dropping NaNs (because OneHotEncoder does not yet support that), they stay float.

Okay, let's keep this float/int support.

@@ -57,8 +87,12 @@ def _check_X(self, X):

for i in range(n_features):
Xi = self._get_feature(X, feature_idx=i)
Xi = check_array(Xi, ensure_2d=False, dtype=None,
force_all_finite=needs_validation)
if self.categories == 'dtypes' and hasattr(Xi, 'cat'):

Member:
The hasattr check could also be Xi.dtype.name == 'category', not sure which one is better
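
Both checks agree on a pandas Series, as a quick illustration shows:

import pandas as pd

Xi = pd.Series(pd.Categorical(['a', 'b']))
print(hasattr(Xi, 'cat'))           # True: the .cat accessor exists only on categorical series
print(Xi.dtype.name == 'category')  # True: equivalent check via the dtype name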

if hasattr(Xi, "cat"):
cats = Xi.cat.categories.values.copy()
else:
cats = _encode(Xi)

Member:
An alternative to the if/else check here (the same for the one below) is to handle this inside _encode (that's the approach that was taken in #13351). The advantage is to keep some additional complexity out of this already complex code (moving it elsewhere, of course...).

'col_int': [3, 2, 1, 2]}, columns=['col_str', 'col_int'])

str_category = pd.api.types.CategoricalDtype(
categories=['b', 'a'], ordered=True)

Member:
You could also add a category not present in the data

Member:
BTW, I think doing it like this can be cleaner:

X_df = pd.DataFrame({
    'col_str': pd.Categorical(['a', 'b', 'b', 'a'], categories=['a', 'b'], ordered=True),
    ...
})

instead of first creating the dataframe, then the dtypes and then astyping.


@pytest.mark.parametrize("is_sparse", [True, False])
@pytest.mark.parametrize("drop", ["first", None])
def test_one_hot_encoder_pd_categories_mixed(is_sparse, drop):

Member:
You could also add one non-category column to the first test, to avoid some redundancy.
The dataframe is handled column by column anyway (it's good to have a test for this, of course, but I just think an additional test is not really necessary; a dataframe with only categorical columns is not a special case).

Xi = Xi.astype(self.categories_[i].dtype)
is_category = (self.categories == 'dtypes' and
Xi.dtype.name == 'category')
# categories without missing values do not have unknown values

Member:
Similarly to my comment about _encode, this could also be handled in _encode_check_unknown in principle

@@ -179,6 +227,9 @@ class OneHotEncoder(_BaseEncoder):
Categories (unique values) per feature:

- 'auto' : Determine categories automatically from the training data.
- 'dtypes' : Uses pandas categorical dtype to encode categories. For
non pandas categorical data, the categories are automatically
determined from the training data.

Member:
Maybe this could just be part of 'auto', as mentioned in the comments.

But if we keep the 'dtypes' option, we should probably raise an error if a non-DataFrame is passed in.

@thomasjpfan (Member Author):

I see this ultimately ending up as the default 'auto' behavior. I am concerned with how we get there.

Currently, if we add this feature into auto it will break backwards compatibility. My initial plan was to add "dtype" to behave as the new "auto", then transition "auto" to this new behavior and then deprecate "dtype". It is a very long path to get to the desired new "auto" state.

The other option would be to add another parameter to the encoders, such as "use_dtype_categories", to enable this new behavior with "auto". With this path, we would ultimately want to deprecate this new parameter.

@NicolasHug (Member):

> if we add this feature into auto it will break backwards compatibility

can you explain why?

Another option is to introduce the parameter and deprecate 'auto'.
But then 'dtypes' might not be an ideal name

@thomasjpfan (Member Author):

> can you explain why?

On master, 'auto' will use lexicographic ordering to encode anything, including pandas series with a categorical dtype. With this update, the integer encoding given by the pandas categorical will be used, which may be different from the lexicographic ordering.
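
A small illustration of the difference (the category order declared in the dtype need not be sorted):

import pandas as pd

s = pd.Categorical(['a', 'b'], categories=['b', 'a'])
# 'auto' on master sorts lexicographically: 'a' -> 0, 'b' -> 1.
# The dtype's own codes follow the declared category order instead:
print(list(s.codes))  # [1, 0]: 'b' is category 0, 'a' is category 1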

> Another option is to introduce the parameter and deprecate 'auto'.

'better_auto'? lol

> raise an error when the pandas categories are not ordered when using OrdinalEncoder?

Not the most user friendly, since

> regarding categories seen during predict but not during fit: is this even an issue? It seems to me that we should simply just error in this case, because that means the dtype of fit and the dtype of predict are different, which should not be allowed?

Consider the following:

X_train = pd.Categorical(['b', 'b', 'b'], categories=['a', 'b'])
X_test = pd.Categorical(['a', 'b'], categories=['a', 'b'])

Both of these series have the same dtype, but the training set is all 'b's. If we only go by the dtype, this will create a column of all zeros. I believe @rth is suggesting we remove this column. With this, if we see 'a', we would treat it as unknown.

@jnothman (Member) commented Jan 25, 2020 via email

@rth (Member) commented Jan 28, 2020:

> Currently, if we add this feature into auto it will break backwards compatibility. My initial plan was to add "dtype" to behave as the new "auto", then transition "auto" to this new behavior and then deprecate "dtype". It is a very long path to get to the desired new "auto" state.

Another option is to make a list of such minor but annoying things to change via a deprecation cycle and make those changes in 1.0.

@NicolasHug (Member):

@thomasjpfan could you summarize the main selling points of categories='dtype'?

So far I see:

  • this respects pandas categorical ordering. But that's only relevant for the OE, not the OHE, where the order doesn't matter
  • you don't have to explicitly specify categories=[[a, b,...], ...] for each feature

(BTW, these should be in the UG)

@thomasjpfan (Member Author):

> could you summarize the main selling points of categories='dtype'?

  1. It is a user interface improvement to use the encoding in categoricals.
  2. It is more efficient to use the encoding provided by pandas. As @rth noted, this can be even better if we construct the encoding directly.
  3. For OHE with drop='first', the order still matters. Currently, "first" means the lowest lexicographically ordered element. If we respect the ordering, "first" will mean "the first element in the pandas categorical dtype" (see the sketch after this list).
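
For instance, the current behavior on a plain object column (drop_idx_ points at the lexicographically first category):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'col': ['b', 'a', 'c']})
enc = OneHotEncoder(drop='first').fit(X)
print(enc.categories_)  # [array(['a', 'b', 'c'], dtype=object)]: sorted order
print(enc.drop_idx_)    # array([0]): 'first' is the lowest sorted category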

Yes this should be in the UG.

@jnothman (Member):

Is this still the current proposal @thomasjpfan? Thanks

@adrinjalali (Member):

moving to 0.24, I have a feeling these ones need a little bit of discussion still. Happy for us to wait for it if y'all think it can get in.

@adrinjalali adrinjalali modified the milestones: 0.23, 0.24 Apr 22, 2020

@thomasjpfan (Member Author):

Yea, this needs more discussion; this is more of a "quality of life" update when using pandas categories.

The "quick" thing to do is to only do this for OrdinalEncoder, which in turn makes it a little nicer for unknown categories, which makes it nicer for cross validation.

@jnothman (Member):

> Currently, if we add this feature into auto it will break backwards compatibility. My initial plan was to add "dtype" to behave as the new "auto", then transition "auto" to this new behavior and then deprecate "dtype". It is a very long path to get to the desired new "auto" state.

Can we just warn for now: "From version 0.26 'auto' will adopt the ordering from the pandas dtype." and do so only if lexicographic and dtype order are inconsistent?
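
A hedged sketch of the condition such a warning could key on (the helper name is hypothetical, not PR code):

import pandas as pd

def dtype_order_is_lexicographic(s):
    # Hypothetical helper: True when the dtype's category order matches the
    # sorted (lexicographic) order, i.e. no deprecation warning would be needed.
    cats = list(s.cat.categories)
    return cats == sorted(cats)

s = pd.Series(pd.Categorical(['a'], categories=['b', 'a']))
print(dtype_order_is_lexicographic(s))  # False: the two orders are inconsistent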

@jnothman (Member):

But then I think my latest proposal was identical to #15050

Maybe we do indeed need to workshop this set of encoding PRs again.

@cmarmo cmarmo removed this from the 0.24 milestone Oct 26, 2020
Base automatically changed from master to main January 22, 2021 10:51

@amueller (Member):

is it time to revive this?

@lorentzenchr (Member):

@thomasjpfan It would help me a lot if you could provide a summary of the status of this PR. Is it still the right approach and what were the objections and open questions to decide?

@thomasjpfan (Member Author) commented Jan 31, 2023:

After rereading all the comments in this PR, I am going to open a new PR that uses cat.codes only for optimization and does not add a new option to categories. Overall, I agree with the concerns in #15396 (comment).

I'm still thinking through the API design, which depends on how complex it is to handle the edge cases for using cat.codes directly.

@ogrisel (Member) commented Jun 7, 2023:

> After rereading all the comments in this PR, I am going to open a new PR that uses cat.codes only for optimization and does not add a new option to categories. Overall, I agree with the concerns in #15396 (comment).

Now that we have ways to collapse infrequent categories (as measured on the training set), I think the points of @rth in #15396 (comment) are less of a concern.
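
For context, infrequent-category collapsing is available in OneHotEncoder since scikit-learn 1.1; a quick example:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'col': ['a'] * 5 + ['b'] * 5 + ['c']})
# Categories seen fewer than min_frequency times are grouped as infrequent.
enc = OneHotEncoder(min_frequency=2, handle_unknown='infrequent_if_exist').fit(X)
print(enc.infrequent_categories_)  # [array(['c'], dtype=object)]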

I would personally be in favor of having OneHotEncoder deterministically respect both the ordering and the list of categories represented in the dtype by default and let the user decide to collapse rare categories at the level they want.

It will avoid having arrays with varying .shape[1] in intermediate pipeline steps when doing cross-validation. I think it's less surprising to the user and can make their life simpler, e.g. to inspect the variability of the coefficients of a linear model across cross-validation folds.
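
Today this can be approximated by passing the dtype's categories explicitly, which is what the proposal would do by default:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cats = ['a', 'b', 'c']
# A CV fold that happens to miss category 'c':
fold = pd.DataFrame({'col': pd.Categorical(['a', 'b'], categories=cats)})
enc = OneHotEncoder(categories=[cats]).fit(fold)
print(enc.transform(fold).shape)  # (2, 3): the output width stays stable across folds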

@thomasjpfan (Member Author):

> I would personally be in favor of having OneHotEncoder deterministically respect both the ordering and the list of categories represented in the dtype by default and let the user decide to collapse rare categories at the level they want.

If the rare categories are collapsed, then the ordering from the dtype will have to be rearranged. I am okay with this behavior.

> It will avoid having arrays with varying .shape[1] in intermediate pipeline steps when doing cross-validation.

This would work if there are no rare categories. With rare categories, different folds may end up collapsing different categories, which could give different coefficients across cross-validation folds.

In any case, I am closing this PR. The _encoders.py code has changed quite a bit and I think it's better to start fresh.

@ogrisel (Member) commented Dec 2, 2023:

> > It will avoid having arrays with varying .shape[1] in intermediate pipeline steps when doing cross-validation.
>
> This would work if there are no rare categories. With rare categories, different folds may end up collapsing different categories, which could give different coefficients across cross-validation folds.

True but at least we give the users control over the collapsing or non-collapsing behavior.

Successfully merging this pull request may close these issues.

Handle pd.Categorical in encoders