ENH Support categories with cardinality higher than max_bins in HistGradientBoosting #26268
base: main
Conversation
Is this PR really needed for that? Can't we simply error when […]?

An alternative way to support categories with high cardinality could be to let the […]. This is more complex to implement, but that would remove the hard constraint that […].

Note that if we were to merge this PR, we wouldn't be able to implement the above strategy without a breaking change (a deprecation wouldn't work either, I believe, unless we add a new parameter).

EDIT: thinking about this more, I think we could apply the same "feature group" idea to numerical features as well. This could potentially enable an arbitrarily high […].
LGTM overall. Main issue is whether we need a new parameter.
@@ -176,6 +180,63 @@ class weights.
        """
        return sample_weight

    def _check_X(self, X, *, reset):
A docstring with a list of attributes that are set would not hurt.
Also, should we adjust the function name?
        n_features = X.shape[1]

        if not requires_encoder:
            self._preprocessor = FunctionTransformer().set_output(transform="default")
How about

-            self._preprocessor = FunctionTransformer().set_output(transform="default")
+            self._preprocessor = None

?
        self._preprocessor.set_output(transform="default")
        X = self._preprocessor.fit_transform(X)

        # Column Transformer places the categorical features at the end.
-        # Column Transformer places the categorical features at the end.
+        # Our ColumnTransformer places the categorical features at the end.
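The comment under discussion refers to how the preprocessing `ColumnTransformer` orders its output: transformer outputs are concatenated in the order the transformers are listed, so listing the categorical encoder last places those columns at the end. A minimal standalone check (illustrative, not the PR's code):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# One categorical column (index 0) and one numeric column (index 1).
X = np.array([["a", 1.0], ["b", 2.0], ["a", 3.0]], dtype=object)

# Numerical passthrough listed first, categorical encoder second:
# output columns follow the transformer order, so the encoded
# categorical column ends up last.
ct = ColumnTransformer(
    [("num", "passthrough", [1]), ("cat", OrdinalEncoder(), [0])]
)
Xt = ct.fit_transform(X)
# column 0 holds the original numeric values, column 1 the category codes
```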
            np.arange(min(len(c), self.max_bins), dtype=X_DTYPE) for c in categories_
        ]

        numerical_features = n_features - n_categorical
-        numerical_features = n_features - n_categorical
+        n_numerical = n_features - n_categorical
            raise ValueError(
                f"Categorical feature {feature_name} is expected to "
                f"be encoded with values < {self.max_bins} but the "
                "largest value for the encoded categories is "
                f"{categories.max()}."
            )
Maybe we should introduce a new parameter like `group_infrequent_categories: bool`, in case someone does not want this behavior. If `False`, error when `n_categories > max_bins`.

I think just describing the effect in the docs is not enough, hence the parameter. Let's discuss.
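A parameter like this would collapse everything beyond the most frequent categories into a single shared bin. A hypothetical sketch of that behavior (not scikit-learn's implementation; `max_bins` is used as in the thread):

```python
import numpy as np

def bin_infrequent(codes, max_bins):
    """Remap ordinal category codes so at most `max_bins` distinct
    values remain: the most frequent categories keep their own bin,
    everything else collapses into one shared "infrequent" bin.

    Hypothetical sketch of the behavior discussed above.
    """
    codes = np.asarray(codes)
    values, counts = np.unique(codes, return_counts=True)
    if len(values) <= max_bins:
        return codes
    # Keep the (max_bins - 1) most frequent categories; reserve the
    # last bin index for all infrequent ones.
    keep = values[np.argsort(counts)[::-1][: max_bins - 1]]
    mapping = {v: i for i, v in enumerate(sorted(keep))}
    infrequent_bin = max_bins - 1
    return np.array([mapping.get(c, infrequent_bin) for c in codes])

codes = np.array([0, 0, 0, 1, 1, 2, 3, 4])  # 5 distinct categories
binned = bin_infrequent(codes, max_bins=3)
# categories 0 and 1 are most frequent; 2, 3 and 4 share bin 2
```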
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py
    hist_native = Hist(categorical_features=categorical_features, **hist_kwargs)
    hist_native.fit(X_train, y_train)

    # Use a preprocessor with an ordinal encoder should that gives the same model
??? Could you rephrase / complete the sentence?
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py
@NicolasHug raises some good points in #26268 (comment). I only read them after my review.
Thanks for the PR. This is an important usability improvement in my opinion. However it might be a bit too magical and some users might miss an opportunity to use a better way to deal with high cardinality categories.
Maybe we should have an extra constructor parameter `on_highcardinality_categories`:

- `error` (default): raise a `ValueError` that advises to either preprocess the high cardinality categorical variables with `TargetEncoder` or an ad-hoc transformer, possibly expanding such a high cardinality categorical feature into many low cardinality categorical features;
- `bin_least_frequent`: use what is provided in this PR.
EDIT: this review was started yesterday but I did not complete it at the time and now it is a bit redundant with the other reviews by @NicolasHug and @lorentzenchr...
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
You are correct, this PR is not needed. We can keep the error with […]. I updated this PR by adding […].
Force-pushed from c8a0325 to 7f84de7.
@thomasjpfan FYI, I intend to play with a second, larger bitmap implementation with double the number of bits.
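For context, categorical splits in the predictor are stored as fixed-width bitsets, and doubling the bit width is what would raise the category limit. A rough NumPy sketch of such a bitset (illustrative only; 8 x 32-bit words mirror the current 256-bit limit):

```python
import numpy as np

N_WORDS = 8  # 8 x 32 bits = 256 categories; 16 words would double this

def set_bit(bitset, category):
    # Mark `category` as going to the left child of a categorical split.
    bitset[category // 32] |= np.uint32(1) << np.uint32(category % 32)

def in_bitset(bitset, category):
    # Membership test used when routing a sample through the split.
    return bool((bitset[category // 32] >> np.uint32(category % 32)) & np.uint32(1))

bitset = np.zeros(N_WORDS, dtype=np.uint32)
for cat in (3, 64, 255):
    set_bit(bitset, cat)
```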
I have given this a second thought and temporarily revoke my approval. I think it moves too much logic into HGBT, pointing at 2 shortcomings of scikit-learn: […]

We have a bit of time before 1.5 and I'd like to use that before prematurely merging this PR (I can later revoke my revocation).
@lorentzenchr just making sure you saw #26277 as a potential alternative to […].
I think this is orthogonal to this PR. Even if HGBT can support more bins, some may still want to group infrequent categories.

The current pipeline design has led to a few workarounds and deserves its own discussion. Overall, having […]
Reference Issues/PRs
Related to #24907
What does this implement/fix? Explain your changes.
This PR enables support for categorical features that have cardinality greater than `max_bins`, and for categories that are encoded above `max_bins`. This is enabled with a new `on_high_cardinality_categories="bin_infrequent"` parameter.