ENH Automatic handling of categorical columns in Hist Gradient Boosting models #24907


Closed
wants to merge 16 commits into from

Conversation

ogrisel
Member

@ogrisel ogrisel commented Nov 13, 2022

More UX improvements related to categorical features in HGBDT, following #24889.

Related to:

This PR makes HGBDT models automatically detect categorical columns of an input pandas dataframe and automatically ordinal-encode those columns internally.

See the tests and examples to see the kind of code simplification this leads to.
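The detection step this PR relies on can be sketched with plain pandas (toy data and hypothetical variable names; the actual internal code differs):

```python
import pandas as pd

X = pd.DataFrame(
    {
        "temperature": [10.0, 25.0, 30.0],
        "weather": pd.Series(["rain", "sun", "snow"], dtype="category"),
    }
)

# Boolean mask over the columns, analogous to the categorical_features
# mask the estimator would build internally to decide which columns to
# ordinal-encode before binning.
categorical_features = (X.dtypes == "category").to_numpy()
print(categorical_features.tolist())  # [False, True]
```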

Note: the _ordinal_encode_df method could logically be collapsed into a _validate_data method overriding the one inherited from BaseEstimator, since it always needs to be called first. However, at fit time it also returns an additional categorical_features array, so it does not really match the API of _validate_data.

Alternatively, I could introduce a _validate_hist_gradient_boosting_data method that internally calls both self._ordinal_encode_df and self._validate_data and returns the categorical_features array when needed. Let me know what you prefer.

Final note: it might be possible to adapt this code to make it work for any dataframe object that implements the __dataframe__ protocol instead of just pandas, but I think we should tackle the support of __dataframe__ objects in a dedicated PR instead.

I milestoned this as 1.2 but if you think this is rushing things too much we can remilestone to 1.3.

@@ -272,6 +266,5 @@ def plot_results(figure_title):
# %%
# The results for these under-fitting models confirm our previous intuition:
# the native category handling strategy performs the best when the splitting
# budget is constrained. The two other strategies (one-hot encoding and
# treating categories as ordinal values) lead to error values comparable
# to the baseline model that just dropped the categorical features altogether.
Member Author

This remark is no longer true since the example was made to run on a subset of the columns to make it run faster. Compare the last plot of:

and:

So I fixed this as part of the PR (along with a simplification of the column indexing code).


# Explicitly type the categorical columns as such.
object_dtyped_columns = X.select_dtypes(include=["object"]).columns
X[object_dtyped_columns] = X[object_dtyped_columns].astype("category")
Member Author

@ogrisel
Member Author

ogrisel commented Nov 13, 2022

Note: there is no backward compat to handle here because calling the fit method of an HGBRT model directly on a non-numerical dataframe would have raised a ValueError in scikit-learn 1.1.

@lorentzenchr
Member

@ogrisel Thanks for working on this issue. At the moment, I'm short on review time.

One thought that pops up (see discussion in the issue): this PR makes the ColumnTransformer unnecessary for many use cases. Another possibility would be to automatically apply OE in a pipeline and pass on the info about which features were OE-encoded. (Is this now possible with pandas?) What are the pros and cons of each approach?

@@ -1240,7 +1286,8 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):
or shape (n_categorical_features,), default=None
Indicates the categorical features.

- None : no feature will be considered categorical.
- None : no feature will be considered categorical unless explicitly
Member

I am wondering if we should keep None as is and have an option "auto" that could become the default in 2 versions. None could be a bit less explicit than a str option (I proposed "auto" but it could be something more linked to the "categorical" dtype from pandas).

Just a thought.

Member

We could as well avoid the delay of 2 versions, since I don't think we have a corner case where our users could be surprised.

Member Author

As I said here (#24907 (comment)), since this case would previously raise an exception, I don't really see the point of trying to enforce backward compat.

Member

@glemaitre glemaitre left a comment

I think that this is indeed an expected feature from people using LightGBM and XGBoost. I will review this PR tomorrow.

@glemaitre
Member

Another possibility would be to automatically apply OE in a pipeline and pass on the info which features were OE encoded.

It might imply that the Pipeline is no longer "step"-agnostic. I would think that it is more appropriate to have a meta-estimator such as SuperVectorizer that could be used in the pipeline.

What are the pros and cons of each approach?

Users from xgboost and lightgbm would expect not to have to use a pipeline and ColumnTransformer. It might be more intuitive to automatically encode in the GBDT.

Not a real con, but we had something similar with normalize for linear models, and we made the inverse move. We should be certain that encoding the categories does not have any side effect on other parameters of the model (e.g. order of categories - I assume that it does not matter).

@ogrisel
Member Author

ogrisel commented Nov 14, 2022

Another possibility would be to automatically apply OE in a pipeline and pass on the info which features were OE encoded. (Is this now possible with pandas?)

This is possible in the sense that you can type the categorical columns explicitly in pandas with .astype("category"), which uses the categorical dtype. But then there is no need for an OE step in a column transformer pipeline beforehand because:

  • the downstream estimator still needs specific code to detect which columns have a categorical dtype and which have regular numerical dtype;
  • a categorical dtype in pandas is already ordinally encoded (see the .codes attribute I look up in this PR).
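The second bullet can be checked with plain pandas (toy data): a categorical Series already stores an ordinal integer encoding in its .cat.codes attribute:

```python
import pandas as pd

s = pd.Series(["dog", "cat", "dog", "turtle"], dtype="category")

# Categories are sorted lexicographically by default ...
print(list(s.cat.categories))  # ['cat', 'dog', 'turtle']

# ... and each value is already stored as an integer code into that list
# (missing values would be encoded as -1).
print(s.cat.codes.tolist())  # [1, 0, 1, 2]
```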

What are the pros and cons of each approach?

I don't see the point of introducing an OE step in a pipeline that would convert a categorical column into the same categorical column. In any case, the HGBDT models drop the meaning of the categories.

The only problem I see now is that in fit I do not record the categories to check that they are the same at predict time. That would indeed be useful to raise a meaningful error message in case X_test has a categorical column with the same name as one in X_train but with different categories.

# dtype to encode the categories.
# XXX: using integers would be more efficient but would
# make the code more complex.
codes = X[col].values.codes.astype(np.float64)
Member

@thomasjpfan thomasjpfan Nov 15, 2022

Here are my thoughts on using pandas' codes:

  1. Mismatched Categories in terms of order: I think it is important to check that the categorical dtypes are the same during fit and transform, otherwise the codes encoding may represent different categories. For example, if the fit time categorical dtype is ['dog', 'cat', 'turtle'] and predict time is ['cat', 'turtle', 'dog'], then codes [0, 1, 2] will mean different things.

  2. Unknown Categories: In the following, "bear" is unknown:

X_train["pet"].cat.categories
# ["cat", "dog", "snake", ...]

X_test["pet"].cat.categories
# ["cat", "bear", "dog", "snake", ...]

During train time "dog" is 1, but during predict time "bear" is 1. We can do some preprocessing here to map the unknown categories to missing. Still, the simplest solution is to check that the categorical dtypes are the same during fit and predict.

  3. Infrequent Categories: In the following, "bear" is infrequent and happened to only be in the train set:

X_train["pet"].cat.categories
# ["cat", "bear", "dog", "snake", ...]

X_test["pet"].cat.categories
# ["cat", "dog", "snake", ...]

We can handle this with some preprocessing to correctly remap the codes during test to the ones seen in train. Again, the simplest solution is to check that the categorical dtypes are the same during fit and predict.

TLDR: I think checking that the categories match during fit and predict is the simplest way to work around all three situations.
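The first pitfall above is easy to reproduce with plain pandas (illustrative values only): identical labels get different integer codes when the declared category order differs, so comparing the dtypes between fit and predict is the simple guard discussed here:

```python
import pandas as pd

s_train = pd.Series(
    ["dog", "cat"], dtype=pd.CategoricalDtype(["dog", "cat", "turtle"])
)
s_test = pd.Series(
    ["dog", "cat"], dtype=pd.CategoricalDtype(["cat", "turtle", "dog"])
)

# Same labels, different codes: feeding raw codes to the model without a
# dtype consistency check would silently corrupt the predictions.
print(s_train.cat.codes.tolist())  # [0, 1]
print(s_test.cat.codes.tolist())   # [2, 0]

# Comparing the categories index (order-sensitive) reveals the mismatch.
print(s_train.cat.categories.equals(s_test.cat.categories))  # False
```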

Member Author

TLDR: I think checking that the categories match during fit and predict is the simplest way to work around all three situations.

I agree. I will update this PR later this week. Or feel free to push to this PR directly if you prefer, as I don't plan to work on this today or tomorrow.

Member

I am on vacation this week, so I am trying my best not to write code. I will be available next week.

Member

I added the consistency check for pandas categorical features here: 00f6a18 (#24907)

@ogrisel
Member Author

ogrisel commented Nov 15, 2022

Maybe we could also raise a ValueError if there are more than 255 categories. In this case we could point the user to explicitly use OrdinalEncoder and collapse the rarest categories together, for instance. Ideally we would also point the user to a target/impact encoder to deal with this case, but we need to finalize it first... Otherwise we could collapse the rarest categories together as part of this PR, but this is extra work and maybe it's too magical; better let the user deal with this case explicitly.
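A minimal sketch of such a guard, assuming a hypothetical helper name and the 255-bin limit mentioned above (the actual check would live in the estimator's fit/validation code):

```python
import pandas as pd

MAX_CARDINALITY = 255  # histogram bins are limited to 255 non-missing bins


def check_categorical_cardinality(X: pd.DataFrame) -> None:
    """Raise ValueError if a categorical column has too many categories.

    Hypothetical helper for illustration only.
    """
    for col in X.select_dtypes(include="category").columns:
        n_categories = len(X[col].cat.categories)
        if n_categories > MAX_CARDINALITY:
            raise ValueError(
                f"Categorical column {col!r} has {n_categories} categories, "
                f"more than the maximum of {MAX_CARDINALITY}. Consider "
                "grouping rare categories together with an explicit "
                "OrdinalEncoder before fitting."
            )


# A 300-category column triggers the error.
X = pd.DataFrame(
    {"c": pd.Series([str(i) for i in range(300)], dtype="category")}
)
try:
    check_categorical_cardinality(X)
except ValueError as exc:
    print(exc)
```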

@thomasjpfan
Member

Maybe we could also raise a ValueError if there are more than 255 categories.

I agree.

Otherwise we could collapse the rarest categories together as part of this PR but this is extra work and maybe it's too magical, better let the user deal with this case explicitly.

I agree with not doing any encoding based on rare categories in HistGradientBoosting* itself. If we want to support this use case in scikit-learn, then I would add infrequent-category support to OrdinalEncoder. (OrdinalEncoder would also need to support pandas categorical output to work with this PR.)

Member

@NicolasHug NicolasHug left a comment

Thanks for working on this @ogrisel !

I won't have time for a proper review (sorry), so I just looked at the docs.

Comment on lines 150 to 152
# Note that this is equivalent to using the ordinal encoder and then passing
# the name of the categorical features to the ``categorical_features``
# constructor parameter of :class:`~ensemble.HistGradientBoostingRegressor`.
Member

Some users may not want or be able to rely on pandas, so I wonder if we should still leave an example with the OrdinalEncoder somewhere? Perhaps just as a commented-out snippet here?

Also, in the User Guide we currently make a reference to this example, and it probably needs to be updated:

The cardinality of each categorical feature should be less than the max_bins parameter, and each categorical feature is expected to be encoded in [0, max_bins - 1]. To that end, it might be useful to pre-process the data with an OrdinalEncoder as done in Categorical Feature Support in Gradient Boosting.

Member Author

Good points, thanks.

Member Author

I pushed 75b1afe to address this point.

Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
@thomasjpfan
Member

The flake8 issue is unrelated and fixed in #25017.

@jeremiedbb
Member

We won't have time to finish the review on this one before the 1.2 release. Moving it to 1.3.

@jeremiedbb jeremiedbb modified the milestones: 1.2, 1.3 Nov 24, 2022
@betatim
Member

betatim commented Nov 28, 2022

Left field question: why not use OrdinalEncoder here instead of implementing your own and having to handle all the cases mentioned in #24907 (comment) ?

@thomasjpfan
Member

why not use OrdinalEncoder here instead of implementing your own and having to handle all the cases mentioned in #24907 (comment) ?

This PR is trying to use the encoding from cat.codes directly. If we use OrdinalEncoder it will learn its own encoding, which is slower. (I have a PR that adds using cat.codes to OrdinalEncoder: #15396)

Overall, I am okay with using an OrdinalEncoder and do not have a strong opinion on it.

@betatim
Member

betatim commented Nov 30, 2022

Thanks for explaining. I didn't realise that reusing the categories from the dataframe would save a lot of runtime.

From a quick look at the diff in #15396 it seems like it is a "non trivial amount of code", which makes me think investing effort to get that PR merged and then use the OrdinalEncoder here would be a good investment. Kind of a "two birds with one stone" move (hist boosting gets it and all the other users of encoders).

@ogrisel
Member Author

ogrisel commented Dec 1, 2022

I sneakily pushed 787501d while not testing for this slight behavioral change. It's on my todo list to add proper testing for edge cases.

Co-authored-by: Tim Head <betatim@gmail.com>
@lorentzenchr
Member

From a quick look at the diff in #15396 it seems like it is a "non trivial amount of code", which makes me think investing effort to get that PR merged and then use the OrdinalEncoder here would be a good investment. Kind of a "two birds with one stone" move (hist boosting gets it and all the other users of encoders).

As much as I would like this PR's feature as fast as possible in the next release, this sounds like a reasonable plan to me.

@ogrisel
Member Author

ogrisel commented May 24, 2023

Closing in favor of #26411 which is much cleaner.

@ogrisel ogrisel closed this May 24, 2023