
ENH Support categories with cardinality higher than max_bins in HistGradientBoosting #26268


Open · wants to merge 16 commits into base: main

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Apr 24, 2023

Reference Issues/PRs

Related to #24907

What does this implement/fix? Explain your changes.

This PR adds support for categorical features whose cardinality is greater than max_bins, as well as categories that are encoded with values above max_bins. The behavior is enabled with a new on_high_cardinality_categories="bin_infrequent" parameter.
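For illustration, a minimal usage sketch of the API described above (the parameter name and value follow this PR's description and are not part of a released scikit-learn; the data is made up):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
n_samples = 1_000
X = np.column_stack([
    rng.integers(0, 1_000, size=n_samples),  # ordinally encoded categorical with 1000 categories (> max_bins)
    rng.normal(size=n_samples),              # numerical feature
])
y = rng.integers(0, 2, size=n_samples)

clf = HistGradientBoostingClassifier(
    categorical_features=[True, False],
    on_high_cardinality_categories="bin_infrequent",  # proposed in this PR; main raises an error instead
)
clf.fit(X, y)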

@thomasjpfan thomasjpfan changed the title ENH Support categories higher than max_bins in HistGradientBoosting ENH Support categories with cardinality higher than max_bins in HistGradientBoosting Apr 24, 2023
@NicolasHug
Member

NicolasHug commented Apr 24, 2023

With this PR merged, it becomes fairly straightforward to automatically handle pandas categorical dtypes with categorical_features="pandas".

Is this PR really needed for that? Can't we simply error when num_categories >= max_bins like we currently do?

An alternative way to support categories with high cardinality could be to let the BinMapper split a categorical feature into num_categories // max_bins + 1 columns (CC @lorentzenchr) and update the implementation of the Splitter and the predictors to look for categorical splits across more than one column. I believe this is what LightGBM enables via its "feature group" concept.

This is more complex to implement, but it would remove the hard constraint that num_categories < max_bins and wouldn't involve any "magic" or potentially surprising behaviour.

Note that if we were to merge this PR, we wouldn't be able to implement the above strategy without a breaking change (a deprecation wouldn't work either I believe, unless we add a new parameter)

EDIT: thinking about this more, I think we could apply the same "feature group" idea to numerical features as well. This could potentially enable an arbitrarily high max_bins for both categorical and numerical features, while still keeping the core dtype as uint8. Opened #26277
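For what it's worth, here is a rough numpy sketch of the column-splitting arithmetic behind the "feature group" idea above. It is purely illustrative: neither LightGBM's nor this PR's implementation, and the sentinel/overflow handling is an assumption.

import numpy as np

max_bins = 255
num_categories = 1_000
n_sub_columns = num_categories // max_bins + 1  # 4 sub-columns in this example

codes = np.random.default_rng(0).integers(0, num_categories, size=10_000)

# Sub-column j "owns" categories [j * max_bins, (j + 1) * max_bins); a sample whose
# category falls outside that range goes into a shared overflow bin so each
# sub-column still fits in uint8. The splitter would then have to search categorical
# splits across all sub-columns of the group.
owner = codes // max_bins                      # which sub-column owns each sample's category
local = (codes % max_bins).astype(np.uint8)    # bin index within the owning sub-column
binned = np.full((codes.shape[0], n_sub_columns), max_bins, dtype=np.uint8)
binned[np.arange(codes.shape[0]), owner] = local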

Member

@lorentzenchr lorentzenchr left a comment


LGTM overall. Main issue is whether we need a new parameter.

@@ -176,6 +180,63 @@ class weights.
"""
return sample_weight

def _check_X(self, X, *, reset):
Member


A docstring with a list of attributes that are set would not hurt.
Also, should we adjust the function name?

n_features = X.shape[1]

if not requires_encoder:
self._preprocessor = FunctionTransformer().set_output(transform="default")
Member


How about

Suggested change
self._preprocessor = FunctionTransformer().set_output(transform="default")
self._preprocessor = None

?

self._preprocessor.set_output(transform="default")
X = self._preprocessor.fit_transform(X)

# Column Transformer places the categorical features at the end.
Member


Suggested change
# Column Transformer places the categorical features at the end.
# Our ColumnTransformer places the categorical features at the end.

np.arange(min(len(c), self.max_bins), dtype=X_DTYPE) for c in categories_
]

numerical_features = n_features - n_categorical
Member


Suggested change
numerical_features = n_features - n_categorical
n_numerical = n_features - n_categorical

Comment on lines 290 to 294
raise ValueError(
f"Categorical feature {feature_name} is expected to "
f"be encoded with values < {self.max_bins} but the "
"largest value for the encoded categories is "
f"{categories.max()}."
Member


Maybe we should introduce a new parameter like group_infrequent_categories: bool, in case someone does not want this behavior: if False, error when n_categories > max_bins.
I think just describing the effect in the docs is not enough, hence the parameter. Let's discuss.

hist_native = Hist(categorical_features=categorical_features, **hist_kwargs)
hist_native.fit(X_train, y_train)

# Use a preprocessor with an ordinal encoder should that gives the same model
Member


??? Could you rephrase / complete the sentence?

@lorentzenchr
Member

@NicolasHug raises some good points in #26268 (comment). I only read them after finishing my review.

Member

@ogrisel ogrisel left a comment


Thanks for the PR. This is an important usability improvement in my opinion. However, it might be a bit too magical, and some users might miss an opportunity to use a better way of dealing with high-cardinality categories.

Maybe we should have an extra constructor parameter on_highcardinality_categories:

  • error (default): raise a ValueError that advises preprocessing the high-cardinality categorical variables with TargetEncoder or an ad-hoc transformer, possibly expanding such a feature into many low-cardinality categorical features;
  • bin_least_frequent: use what is provided in this PR.

EDIT: this review was started yesterday but I did not complete it at the time and now it is a bit redundant with the other reviews by @NicolasHug and @lorentzenchr...

thomasjpfan and others added 4 commits April 25, 2023 21:16
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@thomasjpfan
Member Author

thomasjpfan commented Apr 26, 2023

Is this PR really needed for that? Can't we simply error when num_categories >= max_bins like we currently do?

You are correct, this PR is not needed. We can keep the error with categorical_features="pandas".

I updated this PR by adding on_high_cardinality_categories, which defaults to "error" (the behavior on main); the alternative "bin_least_frequent" bins the infrequent categories.
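Roughly, the grouping that "bin_least_frequent" performs amounts to the following numpy sketch (illustrative only, not this PR's actual bin-mapping code):

import numpy as np

max_bins = 255
codes = np.random.default_rng(0).integers(0, 1_000, size=100_000)  # 1000 categories > max_bins

categories, counts = np.unique(codes, return_counts=True)
# Keep the max_bins - 1 most frequent categories as individual bins...
frequent = categories[np.argsort(counts)[::-1][: max_bins - 1]]
# ...and map every remaining (infrequent) category to one shared bin.
bin_of = np.full(categories.max() + 1, max_bins - 1, dtype=np.uint8)
bin_of[frequent] = np.arange(frequent.shape[0], dtype=np.uint8)
binned = bin_of[codes]  # per-sample bin indices, cardinality now <= max_bins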


github-actions bot commented Dec 20, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 8875e3e.

@thomasjpfan thomasjpfan force-pushed the support_all_types_of_categories branch from c8a0325 to 7f84de7 on December 20, 2023 at 17:54
@lorentzenchr
Member

@thomasjpfan FYI, I intend to play with a second, larger bitmap implementation with double the number of bits.
C++ bitmaps would be nice, but I don't know how to allocate them in numpy arrays.

@lorentzenchr
Member

I have given this a second thought and temporarily revoke my approval. I think it moves too much logic into HGBT, pointing at 2 shortcomings of scikit-learn:

  • The pipeline design seems broken as we move more and more logic into the estimators themselves, like here.
  • HGBT could be extended to allow more than 256 bins per feature. I'm currently working on a PR which needs a bit more time.

We have a bit of time before 1.5 and I'd like to use it before prematurely merging this PR (I can later revoke my revocation).

@NicolasHug
Member

@lorentzenchr just making sure you saw #26277 as a potential alternative to

HGBT could be extended to allow more than 256 bins per feature.

@thomasjpfan
Member Author

thomasjpfan commented Jan 22, 2024

HGBT could be extended to allow more than 256 bins per feature. I'm currently working on a PR which needs a bit more time.

I think this is orthogonal to this PR. Even if HGBT can support more bins, some may still want to group infrequent categories.

The pipeline design seems broken as we move more and more logic into the estimators themselves, like here.

The current pipeline design has led to a few workarounds and deserves its own discussion.

Overall, having a ColumnTransformer inside the estimator aims to improve usability for the default use cases. One can still use a ColumnTransformer + HGBT for other encodings, such as TargetEncoder.
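For example, a minimal sketch of that kind of pipeline (assuming a scikit-learn version that ships TargetEncoder, and a made-up column index for the high-cardinality feature):

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import TargetEncoder

# Encode the high-cardinality categorical column (here column 0) with TargetEncoder,
# pass the remaining columns through, and let HGBT treat everything as numerical.
preprocessor = ColumnTransformer(
    [("high_card", TargetEncoder(), [0])],
    remainder="passthrough",
)
model = make_pipeline(preprocessor, HistGradientBoostingRegressor())
# model.fit(X, y)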
