
ENH Adds native pandas categorical support to gradient boosting #26411


Merged · 28 commits · Nov 17, 2023

Conversation

@thomasjpfan (Member)

Reference Issues/PRs

Closes #24907

What does this implement/fix? Explain your changes.

This PR adds categorical_features="pandas", which infers the categorical features from the dtype. Unlike #26268, the cardinality of each categorical feature is still bounded above by max_bins.

Any other comments?

Given the mixed reaction to #26268, I opened this PR because it is less magic. Essentially, this PR runs OrdinalEncoder on the categorical features.
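For context, a minimal usage sketch of the feature as merged, with toy data; note that the option was renamed to "from_dtype" during the review below:

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

X = pd.DataFrame(
    {
        "f_num": [0.1, 0.5, 0.3, 0.9] * 10,
        "f_cat": pd.Series(["a", "b", "a", "c"] * 10, dtype="category"),
    }
)
y = [0, 1, 0, 1] * 10

# Categorical columns are detected from the pandas dtype; no manual
# boolean mask or list of column names is needed.
clf = HistGradientBoostingClassifier(categorical_features="from_dtype")
clf.fit(X, y)
```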

@ogrisel (Member) commented May 24, 2023

Could you please update the existing examples to show the benefits of using this PR? I think the following 3 examples would benefit from this treatment:

  • examples/applications/plot_cyclical_feature_engineering.py
  • examples/ensemble/plot_gradient_boosting_categorical.py
  • examples/inspection/plot_partial_dependence.py

There is also examples/preprocessing/plot_target_encoder.py, which could probably be reworked a bit to use this new feature, but I am not sure that would simplify it, so maybe we can leave it as is.

@ogrisel (Member) left a comment

Thank you @thomasjpfan. I think this PR should significantly improve the usability of Hist Gradient Boosting models when dealing with mixed categorical and numerical features, which is a very common use case in practice.

A few discussion points below:

@thomasjpfan (Member, Author)

I updated the PR to use categorical_features="by_dtype" and made it the default. I do have concerns about changing the default, as noted in #26411 (comment), but I think it is worth it.

@ogrisel (Member) left a comment

I did another pass. In retrospect, let's be conservative about the default behavior and issue a warning when the input dataframe has categorical columns and categorical_features has been left at its default value of "warn".

Other than that, here is a bit more feedback to improve the test.

I think the examples would have benefited from being updated as part of this PR, to make sure we get the UX right from the users' point of view. See: #26411 (comment)

Other than that, LGTM.

@ogrisel (Member) commented May 30, 2023

@lorentzenchr any opinion on this PR in general and the discussion on the default behavior in particular?

@ogrisel (Member) left a comment

Thanks for the last batch of updates. LGTM besides the following sphinx-gallery formatting problem:

lorentzenchr self-requested a review on June 1, 2023
@lorentzenchr (Member) left a comment

Overall looking very good.

@thomasjpfan Could you add a test with pd.Categorical(["a", "b"], categories=["a", "b"]) and pd.Categorical(["a", "b"], categories=["b", "a"])?

Then a question: why not first enable pd.Categorical (#14953, #15396) in OrdinalEncoder and then use it here? Will it be possible to replace the code here once we have that?
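A minimal sketch of the kind of test being requested, assuming the merged option name "from_dtype" (the test actually added in the PR may differ):

```python
import numpy as np
import pytest

from sklearn.ensemble import HistGradientBoostingClassifier


def test_category_order_does_not_change_predictions():
    # Both declared category orders describe the same data, so the
    # fitted models should produce identical predictions.
    pd = pytest.importorskip("pandas")
    values = ["a", "b"] * 50
    y = np.array([0, 1] * 50)
    predictions = []
    for categories in (["a", "b"], ["b", "a"]):
        X = pd.DataFrame(
            {"f_cat": pd.Categorical(values, categories=categories)}
        )
        clf = HistGradientBoostingClassifier(
            categorical_features="from_dtype", random_state=0
        )
        predictions.append(clf.fit(X, y).predict(X))
    np.testing.assert_array_equal(predictions[0], predictions[1])
```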

@@ -1267,6 +1394,8 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):
features.
- str array-like: names of categorical features (assuming the training
data has feature names).
- `"by_dtype"`: Pandas categorical dtypes are considered categorical.
Member

How about "infer_by_dtype"? It is a bit longer but a bit more explicit.
Just for discussion.

Member

I like this. Or maybe "infer_from_dtype".

@ogrisel (Member) commented Jun 7, 2023

Or maybe "from_dtype". No strong opinion, in retrospect.

"infer" might be a bit misleading if the estimator just selects columns that are explicitly category dtyped and more ambiguous dtypes such as str-based or object-based are not considered as discussed below (in #26411 (comment)).

@thomasjpfan (Member, Author)

In 673c76d (#26411), I updated the keyword argument to "from_dtype".

@@ -1267,6 +1394,8 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):
features.
- str array-like: names of categorical features (assuming the training
data has feature names).
- `"by_dtype"`: Pandas categorical dtypes are considered categorical.
The input must be a pandas DataFrame to use this feature.
Member

I guess and hope that we can extend this in the future and also treat string columns of any dataframe the same way.

@ogrisel (Member) commented Jun 5, 2023

String columns might contain free text (e.g. a title, a description, people's full names, ...), which is not really categorical data. Such columns would require some kind of dedicated text preprocessing involving string tokenization, followed by some form of sparse multi-hot encoding, dense embedding, or topic modelling (e.g. NMF/LDA). I wouldn't consider such columns directly as categorical.

I have the feeling that it's better to error on those and let the user either type those columns explicitly as categorical or do some appropriate preprocessing to find a suitable numerical representation.
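For a string column that really is categorical, the explicit typing suggested here is a one-liner (column name illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
# Declare the column as categorical explicitly; with this dtype it is
# picked up by categorical_features="from_dtype".
df["color"] = df["color"].astype("category")
```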

@lorentzenchr (Member)

> any opinion on ... the discussion on the default behavior in particular?

I would skip the deprecation and warning, as it is annoying. It is, however, scikit-learn's promise and de facto standard, and has been for a long time.

So +1 for directly setting the new default, but I am happy to be overruled by other opinions.

@ogrisel (Member) commented Jun 5, 2023

> I would skip the deprecation and warning, as it is annoying. It is, however, scikit-learn's promise and de facto standard, and has been for a long time.

It would be annoying only for people who fit with a raw dataframe with at least one column with a category dtype.

People fitting GBDT models on preprocessed categorical features and/or numerical only data would not see any warning.

@thomasjpfan (Member, Author)

> Then a question: why not first enable pd.Categorical (#14953, #15396) in OrdinalEncoder and then use it here? Will it be possible to replace the code here once we have that?

Yes, if #15396 gets adopted then we can reduce the amount of code here.

@lorentzenchr (Member)

> Then a question: why not first enable pd.Categorical (#14953, #15396) in OrdinalEncoder and then use it here? Will it be possible to replace the code here once we have that?

> Yes, if #15396 gets adopted then we can reduce the amount of code here.

Could we replace this PR's code in a backwards-compatible way once #15396 is merged? I'm asking with the 1.3 release in mind. For 1.4, OrdinalEncoder should come first.

@thomasjpfan (Member, Author)

> Could we replace this PR's code in a backwards-compatible way once #15396 is merged?

#15396 adds a categories="from_dtype" option which infers the categories from the dtype. By itself, #15396 will not break this PR. This PR precomputes the categories and passes them to OrdinalEncoder(categories=...), which does not go through any of the new code from #15396.

There will, however, be an issue with OrdinalEncoder when the categories are float ndarrays. Currently, gradient boosting treats negative float values as missing values, while OrdinalEncoder would encode them as a positive integer, which ends up being a behavior change. If we want to keep backward compatibility, we'll still need to precompute the categories and pass them into OrdinalEncoder, as done in this PR.
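A small illustration of the clash, with toy values; the -1.0 here plays the role of the missing-value marker:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Gradient boosting's binning code treats negative floats as missing,
# but to OrdinalEncoder -1.0 is just another category: it receives a
# valid ordinal code, so the missing-value semantics would be lost.
X = np.array([[-1.0], [2.0], [5.0]])
enc = OrdinalEncoder().fit(X)
print(enc.categories_)           # [array([-1.,  2.,  5.])]
print(enc.transform(X).ravel())  # [0. 1. 2.]
```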

@github-actions (bot) commented Jun 22, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5fea2dd.

glemaitre self-requested a review on November 15, 2023
@glemaitre (Member)

I just pushed a commit by merging main into this branch. I will give it a review.

@glemaitre (Member) left a comment

I only have a couple of nitpicks: I would add a new test to make sure that we always sort numerical categories, making that behavior explicit via a unit test. Otherwise LGTM.

Parameters
----------
X : {array-like, pandas DataFrame} of shape (n_samples, n_features)
Input data
Member

Suggested change:
- Input data
+ Input data.

@@ -207,10 +301,42 @@ def _check_categories(self, X):
- None if the feature is not categorical
None if no feature is categorical.
"""
if self.categorical_features is None:
X_is_dataframe = hasattr(X, "dtypes") # sufficient here
Member

I am wondering if we should use _is_pandas_df here, for consistency, even if it is more involved.
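For contrast, a sketch of the two checks; the body of is_pandas_df below is a plausible reconstruction of the private scikit-learn helper, not its exact code:

```python
def looks_like_dataframe(X):
    # Duck-typed check used in the diff above: anything exposing
    # `.dtypes` is treated as a dataframe, which is sufficient here.
    return hasattr(X, "dtypes")


def is_pandas_df(X):
    # Stricter check: only a real pandas.DataFrame qualifies, and
    # pandas is only imported when the object looks like one.
    if hasattr(X, "columns") and hasattr(X, "iloc"):
        try:
            import pandas as pd
        except ImportError:
            return False
        return isinstance(X, pd.DataFrame)
    return False
```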

if X_is_dataframe and (X.dtypes == "category").any():
warnings.warn(
(
"The categorical_features parameter will change to 'by_dtype'"
Member

Suggested change:
- "The categorical_features parameter will change to 'by_dtype'"
+ "The categorical_features parameter will change to 'from_dtype'"

warnings.warn(
(
"The categorical_features parameter will change to 'by_dtype'"
" in v1.6. The 'by_dtype' option automatically treats"
Member

Suggested change:
- " in v1.6. The 'by_dtype' option automatically treats"
+ " in v1.6. The 'from_dtype' option automatically treats"

@@ -230,17 +356,18 @@ def _check_categories(self, X):
)

n_features = X.shape[1]
feature_names_in_ = getattr(X, "columns", None)
Member

I think it would be worth adding a small comment explaining why self.feature_names_in_ is not yet defined at this stage.

hist_np = Hist(categorical_features=[False, True], **hist_kwargs)
hist_np.fit(X_train, y_train)

hist_pd = Hist(categorical_features="from_dtype", **hist_kwargs)
Member

Suggested change:
- hist_pd = Hist(categorical_features="from_dtype", **hist_kwargs)
+ hist_pd = HistGradientBoosting(categorical_features="from_dtype", **hist_kwargs)



@pytest.mark.parametrize(
"Hist", [HistGradientBoostingClassifier, HistGradientBoostingRegressor]
Member

Suggested change:
- "Hist", [HistGradientBoostingClassifier, HistGradientBoostingRegressor]
+ "HistGradientBoosting", [HistGradientBoostingClassifier, HistGradientBoostingRegressor]

@pytest.mark.parametrize(
"Hist", [HistGradientBoostingClassifier, HistGradientBoostingRegressor]
)
def test_pandas_categorical_errors(Hist):
Member

Suggested change:
- def test_pandas_categorical_errors(Hist):
+ def test_pandas_categorical_errors(HistGradientBoosting):

pd = pytest.importorskip("pandas")

msg = "Categorical feature 'f_cat' is expected to have a cardinality <= 16"
hist = Hist(categorical_features="from_dtype", max_bins=16)
Member

Suggested change:
- hist = Hist(categorical_features="from_dtype", max_bins=16)
+ hist = HistGradientBoosting(categorical_features="from_dtype", max_bins=16)

# OrdinalEncoder requires categories backed by numerical values
# to be sorted
if categories.dtype.kind not in "OUS":
categories = np.sort(categories)
Member

Do you think it would be possible to add a specific test for this? I assume we test it indirectly, since codecov does not complain that this line is uncovered, but it would be good to have an explicit test.

@thomasjpfan (Member, Author)

In 5fea2dd I updated the test to directly check the categories.
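For illustration, here is the sorting rule from the quoted snippet exercised standalone on toy values (not the actual test added in 5fea2dd):

```python
import numpy as np


def sort_numerical_categories(categories):
    # Mirrors the quoted snippet: numerical categories are sorted,
    # while object/str categories (dtype kinds "O", "U", "S") keep
    # their declared order.
    if categories.dtype.kind not in "OUS":
        categories = np.sort(categories)
    return categories


assert sort_numerical_categories(np.array([3.0, 1.0, 2.0])).tolist() == [1.0, 2.0, 3.0]
assert sort_numerical_categories(np.array(["b", "a"])).tolist() == ["b", "a"]
```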


glemaitre added this to the 1.4 milestone on Nov 15, 2023
@glemaitre (Member)

Uhm, the failing test is unrelated to this PR. I will have a look (since I merged that one).

@glemaitre (Member)

The issue here is solved in #27802.

@glemaitre (Member)

LGTM. Merging this PR then. Thanks @thomasjpfan.

@amueller (Member)

This is amazing, thank you @thomasjpfan
