
Support max_bins > 255 in Hist-GBDT estimators and categorical features with high cardinality #26277


Open
NicolasHug opened this issue Apr 24, 2023 · 4 comments · May be fixed by #28603

@NicolasHug
Member

As originally sketched in #26268 (comment), there might be a way to enable support for arbitrarily high values of max_bins for both categorical and numerical features. This may not be critical for numerical features, but it would enable categorical features of arbitrary cardinality, which is desirable.

The rough idea is to internally map an input categorical feature into multiple binned features (probably num_categories // 255 + 1 features) and to update the Splitter and the predictors to treat that group of features as a single feature.
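A minimal sketch of that rough idea, purely for illustration: a single high-cardinality categorical feature is spread across `num_categories // 255 + 1` uint8 sub-features, with each sub-feature holding at most 255 local bins. The function name `split_categorical` and the encoding (0 meaning "category not in this sub-feature") are assumptions for this sketch, not scikit-learn API.

```python
import numpy as np

def split_categorical(codes, max_bins=255):
    """Map integer category codes to a (n_samples, n_sub_features) uint8 array.

    Hypothetical helper: each sub-feature covers a contiguous slice of
    max_bins categories; bin 0 is reserved for "code outside this slice".
    """
    codes = np.asarray(codes, dtype=np.int64)
    n_sub = codes.max() // max_bins + 1   # ~ num_categories // 255 + 1
    binned = np.zeros((codes.shape[0], n_sub), dtype=np.uint8)
    sub_idx = codes // max_bins           # which sub-feature each code lands in
    local_bin = codes % max_bins + 1      # 1..255 within that sub-feature
    binned[np.arange(codes.shape[0]), sub_idx] = local_bin
    return binned

codes = [0, 254, 255, 700]
binned = split_categorical(codes)  # 700 // 255 + 1 == 3 sub-features
```

The splitter and predictors would then have to treat such a group of sub-features as one logical feature, which is the non-trivial part of the proposal.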

@NicolasHug NicolasHug changed the title Support max_bins > 255 in Hist-GBDT estimators Support max_bins > 255 in Hist-GBDT estimators and categorical features with high cardinality Apr 24, 2023
@lorentzenchr
Member

The number of bins is hardcoded in:

  • Histograms as a 2d array of shape (n_features, n_bins)
  • X_binned as a 2d array of dtype=uint8
  • Bitsets for categorical features as a C array[8] of type uint32 (8 * 32 = 256 bits)
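To make the last point concrete, here is a sketch of how such a fixed-size bitset works: 8 uint32 words give 8 * 32 = 256 bits, one per category, which is where the current ceiling comes from. The helper names are made up for this sketch; the real implementation is in Cython.

```python
import numpy as np

def bitset_set(bitset, cat):
    """Mark category `cat` as belonging to the left child (set its bit)."""
    bitset[cat // 32] |= np.uint32(1) << np.uint32(cat % 32)

def bitset_get(bitset, cat):
    """Return True if category `cat`'s bit is set."""
    return bool((bitset[cat // 32] >> np.uint32(cat % 32)) & np.uint32(1))

bs = np.zeros(8, dtype=np.uint32)  # mirrors the C-level uint32 bitset[8]
bitset_set(bs, 200)                # any cat >= 256 would overflow the array
```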

An alternative to allow for more than 256 bins is therefore:

  • Histograms as a 1d array, with an array of positions marking where each feature's bins start (and end). This saves a lot of memory (and maybe improves cache hits).
  • X_binned as uint8, plus a second, larger X_binned, e.g. uint16, for the features that need it, and a structure that bundles both behind a unified API.
  • A second, extended bitset, similar to the existing one but of doubled size, and a structure that bundles both behind a unified API.
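The first bullet can be sketched as a ragged layout: one flat 1d array plus an offsets array, so each feature gets exactly the number of bins it needs. The per-feature bin counts below are made-up example values, and a plain float array stands in for scikit-learn's hist_struct record array.

```python
import numpy as np

# Example per-feature bin counts; feature 2 exceeds the current 256 limit.
n_bins_per_feature = np.array([256, 12, 1000])

# offsets[f] is where feature f's bins start in the flat array.
offsets = np.concatenate(([0], np.cumsum(n_bins_per_feature)))
hist = np.zeros(offsets[-1])  # stand-in for a contiguous hist_struct array

def feature_hist(hist, offsets, f):
    """Return a view of the bins belonging to feature f."""
    return hist[offsets[f]:offsets[f + 1]]

feature_hist(hist, offsets, 2)[999] += 1.0  # feature 2 legally has 1000 bins
```

Compared with a dense (n_features, max_n_bins) layout, no memory is wasted on features with few bins, at the cost of one extra indirection per feature lookup.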

@NicolasHug
Member Author

Updating the underlying data structure will lead to a different memory footprint, and likely different performance as well. It sounds more risky to me, but if you implement it and benchmarks indicate no regression, then why not.

@lorentzenchr
Member

> Updating the underlying data structure will lead to a different memory footprint, and likely different performance as well. It sounds more risky to me, but if you implement it and benchmarks indicate no regression, then why not.

In memory, the histograms look the same: a contiguous array of hist_struct. The only difference is that we currently might have quite a few unused bins.

@NicolasHug
Member Author

I was thinking more of X_binned rather than of the histograms.

> The only difference is that we currently might have quite a few unused bins

I assume that using a larger dtype is only going to worsen that problem? (Is it an actual problem in practice?)

@lorentzenchr lorentzenchr linked a pull request Mar 10, 2024 that will close this issue