
ENH Support categories with cardinality higher than max_bins in HistGradientBoosting #26268


Open · wants to merge 16 commits into base: main

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Apr 24, 2023

Reference Issues/PRs

Related to #24907

What does this implement/fix? Explain your changes.

This PR adds support for categorical features whose cardinality is greater than max_bins, as well as categories that are encoded with values above max_bins. The behavior is enabled with a new on_high_cardinality_categories="bin_infrequent" parameter.
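For illustration, a minimal usage sketch of the API described above (the parameter name and value follow this PR's description and are not part of a released scikit-learn; the data is made up):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
n_samples = 1_000
X = np.column_stack([
    rng.integers(0, 1_000, size=n_samples),  # ordinally encoded categorical with 1000 categories (> max_bins)
    rng.normal(size=n_samples),              # numerical feature
])
y = rng.integers(0, 2, size=n_samples)

clf = HistGradientBoostingClassifier(
    categorical_features=[True, False],
    on_high_cardinality_categories="bin_infrequent",  # proposed in this PR; main raises an error instead
)
clf.fit(X, y)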

@thomasjpfan thomasjpfan changed the title ENH Support categories higher than max_bins in HistGradientBoosting ENH Support categories with cardinality higher than max_bins in HistGradientBoosting Apr 24, 2023
@NicolasHug
Member

NicolasHug commented Apr 24, 2023

With this PR merged, it becomes fairly straightforward to automatically handle pandas categorical dtypes with categorical_features="pandas".

Is this PR really needed for that? Can't we simply error when num_categories >= max_bins like we currently do?

An alternative way to support categories with high cardinality could be to let the BinMapper split a categorical feature into num_categories // max_bins + 1 columns (CC @lorentzenchr) and update the implementation of the Splitter and the predictors to look for categorical splits across more than one column. I believe this is what LightGBM enables via its "feature group" concept.

This is more complex to implement, but it would remove the hard constraint that num_categories < max_bins and wouldn't involve any "magic" or potentially surprising behaviour.

Note that if we were to merge this PR, we wouldn't be able to implement the above strategy without a breaking change (a deprecation wouldn't work either I believe, unless we add a new parameter)

EDIT: thinking about this more, I think we could apply the same "feature group" idea to numerical features as well. This could potentially enable an arbitrarily high max_bins for both categorical and numerical features, while still keeping the core dtype as uint8. Opened #26277
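For what it's worth, here is a rough numpy sketch of the column-splitting arithmetic behind the "feature group" idea above. It is purely illustrative: neither LightGBM's nor this PR's implementation, and the sentinel/overflow handling is an assumption.

import numpy as np

max_bins = 255
num_categories = 1_000
n_sub_columns = num_categories // max_bins + 1  # 4 sub-columns in this example

codes = np.random.default_rng(0).integers(0, num_categories, size=10_000)

# Sub-column j "owns" categories [j * max_bins, (j + 1) * max_bins); a sample whose
# category falls outside that range goes into a shared overflow bin so each
# sub-column still fits in uint8. The splitter would then have to search categorical
# splits across all sub-columns of the group.
owner = codes // max_bins                      # which sub-column owns each sample's category
local = (codes % max_bins).astype(np.uint8)    # bin index within the owning sub-column
binned = np.full((codes.shape[0], n_sub_columns), max_bins, dtype=np.uint8)
binned[np.arange(codes.shape[0]), owner] = local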

Member

@lorentzenchr lorentzenchr left a comment


LGTM overall. Main issue is whether we need a new parameter.

@@ -176,6 +180,63 @@ class weights.
"""
return sample_weight

def _check_X(self, X, *, reset):
Member


A docstring with a list of attributes that are set would not hurt.
Also, should we adjust the function name?

n_features = X.shape[1]

if not requires_encoder:
self._preprocessor = FunctionTransformer().set_output(transform="default")
Member


How about

Suggested change
self._preprocessor = FunctionTransformer().set_output(transform="default")
self._preprocessor = None

?

self._preprocessor.set_output(transform="default")
X = self._preprocessor.fit_transform(X)

# Column Transformer places the categorical features at the end.
Member


Suggested change
# Column Transformer places the categorical features at the end.
# Our ColumnTransformer places the categorical features at the end.

np.arange(min(len(c), self.max_bins), dtype=X_DTYPE) for c in categories_
]

numerical_features = n_features - n_categorical
Member


Suggested change
numerical_features = n_features - n_categorical
n_numerical = n_features - n_categorical

Comment on lines 290 to 294
raise ValueError(
f"Categorical feature {feature_name} is expected to "
f"be encoded with values < {self.max_bins} but the "
"largest value for the encoded categories is "
f"{categories.max()}."
Member


Maybe we should introduce a new parameter like group_infrequent_categories: bool, in case someone does not want this behavior: if False, error when n_categories > max_bins.
I think just describing the effect in the docs is not enough, hence the parameter. Let's discuss.

hist_native = Hist(categorical_features=categorical_features, **hist_kwargs)
hist_native.fit(X_train, y_train)

# Use a preprocessor with an ordinal encoder should that gives the same model
Member


??? Could you rephrase / complete the sentence?

@lorentzenchr
Member

@NicolasHug raises some good points in #26268 (comment). I only read them after finishing my review.

Member

@ogrisel ogrisel left a comment


Thanks for the PR. This is an important usability improvement in my opinion. However, it might be a bit too magical, and some users might miss an opportunity to use a better way of dealing with high-cardinality categories.

Maybe we should have an extra constructor parameter on_highcardinality_categories:

  • error (default): raise a ValueError that advises preprocessing the high-cardinality categorical variables with TargetEncoder or an ad-hoc transformer, possibly expanding such a feature into many low-cardinality categorical features;
  • bin_least_frequent: use what is provided in this PR.

EDIT: this review was started yesterday but I did not complete it at the time and now it is a bit redundant with the other reviews by @NicolasHug and @lorentzenchr...

thomasjpfan and others added 4 commits April 25, 2023 21:16
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@thomasjpfan
Member Author

thomasjpfan commented Apr 26, 2023

Is this PR really needed for that? Can't we simply error when num_categories >= max_bins like we currently do?

You are correct, this PR is not needed. We can keep the error with categorical_features="pandas".

I updated this PR by adding on_high_cardinality_categories, which defaults to "error" (the behavior on main); the alternative "bin_least_frequent" bins the infrequent categories.
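Roughly, the grouping that "bin_least_frequent" performs amounts to the following numpy sketch (illustrative only, not this PR's actual bin-mapping code):

import numpy as np

max_bins = 255
codes = np.random.default_rng(0).integers(0, 1_000, size=100_000)  # 1000 categories > max_bins

categories, counts = np.unique(codes, return_counts=True)
# Keep the max_bins - 1 most frequent categories as individual bins...
frequent = categories[np.argsort(counts)[::-1][: max_bins - 1]]
# ...and map every remaining (infrequent) category to one shared bin.
bin_of = np.full(categories.max() + 1, max_bins - 1, dtype=np.uint8)
bin_of[frequent] = np.arange(frequent.shape[0], dtype=np.uint8)
binned = bin_of[codes]  # per-sample bin indices, cardinality now <= max_bins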


github-actions bot commented Dec 20, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 8875e3e.

@thomasjpfan thomasjpfan force-pushed the support_all_types_of_categories branch from c8a0325 to 7f84de7 on December 20, 2023 at 17:54
@lorentzenchr
Member

@thomasjpfan FYI, I intend to play with a second, larger bitmap implementation with double the number of bits.
C++ bitmaps would be nice, but I don't know how to allocate them in numpy arrays.

@lorentzenchr
Member

I have given this a second thought and temporarily revoke my approval. I think it moves too much logic into HGBT, pointing at 2 shortcomings of scikit-learn:

  • The pipeline design seems broken as we move more and more logic into the estimators themselves, like here.
  • HGBT could be extended to allow more than 256 bins per feature. I'm currently working on a PR which needs a bit more time.

We have a bit of time before 1.5 and I'd like to use it before prematurely merging this PR (I can later revoke my revocation).

@NicolasHug
Member

@lorentzenchr just making sure you saw #26277 as a potential alternative to

HGBT could be extended to allow more than 256 bins per feature.

@thomasjpfan
Member Author

thomasjpfan commented Jan 22, 2024

HGBT could be extended to allow more than 256 bins per feature. I'm currently working on a PR which needs a bit more time.

I think this is orthogonal to this PR. Even if HGBT can support more bins, some may still want to group infrequent categories.

The pipeline design seems broken as we move more and more logic into the estimators themselves, like here.

The current pipeline design has led to a few workarounds and deserves its own discussion.

Overall, having a ColumnTransformer inside the estimator aims to improve usability for the default use cases. One can still use a ColumnTransformer + HGBT for other encodings, such as TargetEncoder.
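For example, a minimal sketch of that kind of pipeline (assuming a scikit-learn version that ships TargetEncoder, and a made-up column index for the high-cardinality feature):

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import TargetEncoder

# Encode the high-cardinality categorical column (here column 0) with TargetEncoder,
# pass the remaining columns through, and let HGBT treat everything as numerical.
preprocessor = ColumnTransformer(
    [("high_card", TargetEncoder(), [0])],
    remainder="passthrough",
)
model = make_pipeline(preprocessor, HistGradientBoostingRegressor())
# model.fit(X, y)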
