Support max_bins > 255 in Hist-GBDT estimators and categorical features with high cardinality #26277
The number of bins is hardcoded in:
An alternative to allow for more than 256 bins is therefore
Updating the underlying data structure will lead to a different memory footprint, and likely different performance as well. That sounds more risky to me, but if you implement it and benchmarks indicate no regression, then why not.
In memory, the histograms look the same: a contiguous array of hist_struct. The only difference is that we currently might have quite some unused bins.
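To make the memory argument concrete, here is a minimal NumPy sketch of a contiguous histogram array. The hist_struct layout used here (two float64 sums plus a uint32 count), the field names, and the `allocate_histograms` helper are assumptions for illustration only, not copied from the scikit-learn source.

```python
import numpy as np

# Hypothetical layout of one histogram bin; the field names and dtypes
# are assumptions made for this sketch.
HIST_STRUCT = np.dtype([
    ("sum_gradients", np.float64),
    ("sum_hessians", np.float64),
    ("count", np.uint32),
])


def allocate_histograms(n_features, n_bins):
    """Return a zeroed, C-contiguous (n_features, n_bins) array of bins."""
    return np.zeros((n_features, n_bins), dtype=HIST_STRUCT)


# Same data structure in both cases; only the number of bins per feature differs.
hist_256 = allocate_histograms(n_features=10, n_bins=256)
hist_1024 = allocate_histograms(n_features=10, n_bins=1024)

print(hist_256.nbytes, hist_1024.nbytes)
```

Under these assumptions the footprint grows linearly with the number of bins, and bins that are allocated but never filled still occupy memory.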
I was thinking more of
I assume that using a larger dtype is only going to worsen that problem? (Is that an actual problem in practice?)
As originally sketched in #26268 (comment), there might be a way to enable support for arbitrarily high values of max_bins for both categorical and numerical features. This may not be critical for numerical features, but it would enable categorical features of arbitrary cardinality, which is desirable.
The rough idea is to internally map an input categorical feature into multiple binned features (probably num_categories // 255 + 1 features) and to update the Splitter and the predictors to treat that group of features as a single feature.
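The mapping idea can be sketched as follows. This is only an illustration under stated assumptions: category codes are taken to be non-negative integers, each internal feature holds at most 255 bins, and the helper names (`n_subfeatures`, `split_category_codes`, `merge_category_codes`) are hypothetical rather than part of scikit-learn.

```python
import numpy as np

# Assumed bin budget per internal binned feature, taken from the
# "num_categories // 255 + 1" estimate in the comment above.
BINS_PER_SUBFEATURE = 255


def n_subfeatures(num_categories):
    """Number of internal binned features, using the formula from the thread."""
    return num_categories // BINS_PER_SUBFEATURE + 1


def split_category_codes(codes):
    """Map integer category codes to (sub-feature index, bin index) pairs."""
    codes = np.asarray(codes, dtype=np.int64)
    sub_idx, bin_idx = np.divmod(codes, BINS_PER_SUBFEATURE)
    # Each bin index is below 255, so it fits in an unsigned 8-bit integer.
    return sub_idx, bin_idx.astype(np.uint8)


def merge_category_codes(sub_idx, bin_idx):
    """Inverse mapping: recover the original category codes."""
    sub_idx = np.asarray(sub_idx, dtype=np.int64)
    bin_idx = np.asarray(bin_idx, dtype=np.int64)
    return sub_idx * BINS_PER_SUBFEATURE + bin_idx


codes = np.array([0, 254, 255, 600])
sub_idx, bin_idx = split_category_codes(codes)
restored = merge_category_codes(sub_idx, bin_idx)
```

A splitter and predictor that treat the group as one feature would need to consult both indices together. This sketch covers only the encoding step and its inverse.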