
ENH allow up to 65536 bins in HGBT #28603


Open: wants to merge 7 commits into main from hgbt_variable_bins

Conversation

@lorentzenchr (Member) commented Mar 10, 2024

Reference Issues/PRs

Fixes #26277.

What does this implement/fix? Explain your changes.

This PR extends the maximum number of bins (including the missing-values bin) from 256 (8-bit) to 65536 (16-bit) in HistGradientBoostingRegressor and HistGradientBoostingClassifier.

This PR uses the extended bins only on categorical features that need them; numerical features are still restricted to max_bins <= 255. This could easily be extended. A usage sketch follows below.
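As an illustration of what this enables (a hypothetical usage sketch, not code from this PR; the exact error behavior on main is an assumption):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
n_samples, n_categories = 10_000, 1_000  # cardinality > 255
X = rng.integers(0, n_categories, size=(n_samples, 1)).astype(np.float64)
y = X[:, 0].astype(np.int64) % 2

# With this PR's branch, the high-cardinality categorical feature can be
# binned with up to 65536 bins; on main, a cardinality above 255 is rejected.
clf = HistGradientBoostingClassifier(categorical_features=[0], random_state=0)
clf.fit(X, y)
print(clf.score(X, y))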

Any other comments?

This PR introduces three new data structures as Cython extension types (a pure-Python sketch of their layout follows this list):

  • cdef class Histograms
    This has one underlying 1-dim numpy array that holds all bins of all features. An offsets array tells where the bins of each feature start and end:
    feature      |  0  |      1      |
    histograms = |- - - - - - - - - -|   each - is a single bin
    offsets   =  |0     3            10|
    
    Compared to main, this saves memory, as there are no unused bins anymore.
    This spans all commits up to and including "ENH adapt HGBT to new histograms everywhere".
  • cdef class Bitsets
    This generalizes the fixed-size bitset arrays and works similarly to the Histograms class: a 1-dim numpy array holds all bitsets of all features, and an offsets array tells where each feature begins and ends.
    Just the commit "ENH add Bitsets extension type".
  • cdef class BinnedData
    This class is a container for mixed uint8- and uint16-typed arrays. It keeps track of which feature is uint8 and which is uint16; underneath are two 2-dim arrays, one uint8 and one uint16.
    Just the commit "ENH add BinnedData extension type for mixed uint8 uint16 data".
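To make these layouts concrete, here is a minimal pure-Python/numpy sketch (names and fields are simplified illustrations; the actual implementations are cdef classes, and HISTOGRAM_DTYPE mirrors the struct already used in common.pyx):

import numpy as np

HISTOGRAM_DTYPE = np.dtype([
    ("sum_gradients", np.float64),
    ("sum_hessians", np.float64),
    ("count", np.uint32),
])

class Histograms:
    """All bins of all features in one flat 1-dim array, indexed via offsets."""
    def __init__(self, n_bins_per_feature):
        # offsets[f]:offsets[f + 1] delimits the bins of feature f
        self.offsets = np.zeros(len(n_bins_per_feature) + 1, dtype=np.uint32)
        self.offsets[1:] = np.cumsum(n_bins_per_feature)
        self.histograms = np.zeros(self.offsets[-1], dtype=HISTOGRAM_DTYPE)

    def feature_histogram(self, f):
        return self.histograms[self.offsets[f]:self.offsets[f + 1]]

# Bitsets follows the same pattern: one flat 1-dim array of bitset words
# plus an offsets array delimiting each feature's bitset.

class BinnedData:
    """Container for mixed uint8/uint16 binned columns (F-contiguous)."""
    def __init__(self, n_samples, feature_is_uint16):
        self.feature_is_uint16 = np.asarray(feature_is_uint16, dtype=bool)
        # two 2-dim F-contiguous arrays, one per dtype; each feature maps
        # to a column of one of them
        n8 = int((~self.feature_is_uint16).sum())
        n16 = int(self.feature_is_uint16.sum())
        self.X8 = np.zeros((n_samples, n8), dtype=np.uint8, order="F")
        self.X16 = np.zeros((n_samples, n16), dtype=np.uint16, order="F")

hist = Histograms([3, 7])  # the toy example above: offsets = [0, 3, 10]
assert hist.feature_histogram(1).shape == (7,)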

Benchmarks indicate no performance regression. On this PR's commit d769848:

python benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 100
...
Fit 100 trees in 62.059 s, (3100 total leaves)
Time spent computing histograms: 36.785s
Time spent finding best splits:  0.260s
Time spent applying splits:      7.671s
Time spent predicting:           1.895s
fitted in 62.199s
predicted in 6.769s, ROC AUC: 0.8228, ACC: 0.7415

while on main (f59c16d):

Fit 100 trees in 62.321 s, (3100 total leaves)
Time spent computing histograms: 36.230s
Time spent finding best splits:  0.192s
Time spent applying splits:      8.075s
Time spent predicting:           1.935s
fitted in 62.501s
predicted in 8.073s, ROC AUC: 0.8228, ACC: 0.7415

(I can do some more if requested.)

Disadvantages:

  • In case we want to add row-wise histogram computation, this becomes more complicated: BinnedData is inherently F-contiguous, and it is harder to define dynamic mixed types for a C-contiguous layout (see the illustration after this list).
  • Code complexity increases, though I would say only a little.
  • The predictor nodes' field "bin_threshold" is now uint16 instead of uint8, adding a tiny bit more memory to fitted estimator objects.
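To illustrate the first point (a toy numpy example, not code from this PR): with an F-contiguous per-dtype layout, each feature's column is one contiguous buffer, which suits the current per-feature histogram kernels, while a sample's row is strided and split across two arrays of different dtypes:

import numpy as np

n_samples = 4
X8 = np.zeros((n_samples, 2), dtype=np.uint8, order="F")    # <= 256 bins
X16 = np.zeros((n_samples, 1), dtype=np.uint16, order="F")  # > 256 bins

# feature-wise access: one contiguous buffer per column
assert X8[:, 0].flags["C_CONTIGUOUS"]

# row-wise access: strided within each array, and a full sample row spans
# two arrays of different dtypes -- awkward for a row-wise histogram kernel
assert not X8[1, :].flags["C_CONTIGUOUS"]
row = (X8[1, :], X16[1, :])  # no single contiguous buffer for one sample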

Further notes:

  • This PR needs Cython >= 3.0.10, which fixes earlier CI failures (fixes from both Cython 3.0.9 and 3.0.10 were needed).
  • So far, commits are relatively clean. Therefore, this PR could be split into several smaller ones.


github-actions bot commented Mar 10, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b7ffd7d.

@lorentzenchr force-pushed the hgbt_variable_bins branch 3 times, most recently from 992f9a1 to d419971 on March 11, 2024 13:31
@lorentzenchr force-pushed the hgbt_variable_bins branch 5 times, most recently from ce74f94 to 4c001e9 on March 12, 2024 16:40
@lorentzenchr force-pushed the hgbt_variable_bins branch 6 times, most recently from 9df97a7 to 2bd504f on March 16, 2024 15:21
@ogrisel (Member) commented Mar 18, 2024

Disadvantages:

In case we want to add row-wise histogram computation, this becomes more complicated as BinnedData is inherently F-contiguous and it is harder to define dynamic mixed types for a C-contiguous layout.

This might be a concern since row-wise parallelism seems to be the main reason why lightgbm can be significantly faster than scikit-learn.

@lorentzenchr (Member, Author)

Disadvantages:

In case we want to add row-wise histogram computation, this becomes more complicated as BinnedData is inherently F-contiguous and it is harder to define dynamic mixed types for a C-contiguous layout.

This might be a concern since row-wise parallelism seems to be the main reason why lightgbm can be significantly faster than scikit-learn.

That's why I mentioned it. On the other hand, it is not that much more complicated. The main complexity would come from having 2 ways of parallelism. That's independent of this PR.

@jjerphan (Member)

Thank you for this contribution, @lorentzenchr!

@lorentzenchr: to ease the review now that #28640 has been merged, could you merge main into this branch, resolving the conflicts? Alternatively, would you allow me to do it?

@lorentzenchr (Member, Author) commented Mar 27, 2024

Please do not commit to this branch. I'll update it soon.
For reviewers, focus entirely on the changes in the ensemble folder, in particular common.pxd and common.pyx.

Edit: The finding of #28638 was that Cython 3.0.10 fixed the Windows MSVC build. I have rebased this PR's branch on main with #28743 merged.

@da-woods Thank you so much for the fast fixes upstream in Cython. This has really become a valuable collaboration for me.

@lorentzenchr (Member, Author) commented Apr 20, 2024

Very strange. There is now quite a performance regression. Up to and including 91608fb, everything is fine:

% python bench_hist_gradient_boosting_categorical_only.py --n-samples 100_000 --verbose

[99/100] 1 tree, 31 leaves, max depth = 11, in 0.004s
[100/100] 1 tree, 31 leaves, max depth = 9, in 0.005s
Fit 100 trees in 1.117 s, (3100 total leaves)
Time spent computing histograms: 0.180s
Time spent finding best splits:  0.063s
Time spent applying splits:      0.111s
Time spent predicting:           0.014s
fitted in 1.120s
log loss train set = 0.184607  # extra info
predicted in 0.416s

(basically the same as main), but there is a large performance regression with 0107e77:

% python bench_hist_gradient_boosting_categorical_only.py --n-samples 100_000 --verbose

[99/100] 1 tree, 31 leaves, max depth = 11, in 0.008s
[100/100] 1 tree, 31 leaves, max depth = 9, in 0.008s
Fit 100 trees in 2.038 s, (3100 total leaves)
Time spent computing histograms: 0.476s
Time spent finding best splits:  0.253s
Time spent applying splits:      0.274s
Time spent predicting:           0.020s
fitted in 2.041s
log loss train set = 0.184607  # extra info
predicted in 0.394s

Note that 0107e77 does not even touch the histograms, and yet the time spent computing histograms is more than doubled.

I guess that's the point where I really could need some help.

Edit, on second thought: I benchmarked this PR before with no regression. What has changed since is the build system, now Meson.

@jjerphan (Member) commented Jul 24, 2024

Regressions might be due to compiler optimisation levels: the tree submodule was previously compiled with -O3, and scikit-learn is now compiled with the equivalent of -O2, while tree still seems to be compiled with the third optimisation level. Is -O3 actually passed to the compiler?

@lesteve (Member) commented Jul 25, 2024

Is -O3 actually passed to the compiler?

It seems like it is. I tried one tree .so; I guess you can do a full build and check all of the tree .so files that you think are relevant.

I think the right way to figure out the compile commands being run is to pass -v (verbose) to ninja, i.e. to the compile step, using compile-args:

pip install -v --config-settings=editable-verbose=true --no-build-isolation --editable . --config-settings compile-args="-v" 

The meson setup run by meson-python uses buildtype=release by default (see https://mesonbuild.com/meson-python/explanations/default-options.html#default-build-options), so that actually overrides the default in meson.build.

So I guess the more correct meson + ninja command is:

# full meson setup command is shown when executing the pip install command above and adapted to use build/test as build folder
meson setup --reconfigure /home/lesteve/dev/scikit-learn build/test -Dbuildtype=release -Db_ndebug=if-release -Db_vscrt=md --native-file=/home/lesteve/dev/scikit-learn/build/cp312/meson-python-native-file.ini
ninja -v sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so

Note the -O3 in the cc [2/3] line; the [3/3] linking step, however, has -Wl,-O1. Not sure whether this is relevant:

[1/3] cython -M --fast-fail -3 '-X language_level=3' '-X boundscheck=False' '-X wraparound=False' '-X initializedcheck=False' '-X nonecheck=False' '-X cdivision=True' '-X profile=False' --include-dir /home/lesteve/dev/scikit-learn/build/test /home/lesteve/dev/scikit-learn/sklearn/tree/_criterion.pyx -o sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/sklearn/tree/_criterion.pyx.c
[2/3] ccache cc -Isklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p -Isklearn/tree -I../../sklearn/tree -I../../../../micromamba/envs/scikit-learn-dev/lib/python3.12/site-packages/numpy/_core/include -Isklearn -Isklearn/utils -I/home/lesteve/micromamba/envs/scikit-learn-dev/include/python3.12 -fvisibility=hidden -fdiagnostics-color=always -DNDEBUG -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -std=c11 -O3 -Wno-unused-but-set-variable -Wno-unused-function -Wno-conversion -Wno-misleading-indentation -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_9_API_VERSION -MD -MQ sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o -MF sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o.d -o sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o -c sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/sklearn/tree/_criterion.pyx.c
[3/3] cc -o sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o -Wl,--as-needed -Wl,--allow-shlib-undefined -Wl,-O1 -shared -fPIC -lm

I leave below the commands I gave in my original comment, but I think this is not as close to what you get when going through pip ...

meson setup build/test
ninja -v -C build/test sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so                                   

Note the -O3 in the cc [13/14] line; the [14/14] linking step, however, has -Wl,-O1. Not sure whether this is relevant:

ninja: Entering directory `build/test'
[1/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_typedefs.pxd sklearn/utils/_typedefs.pxd
[2/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_random.pxd sklearn/utils/_random.pxd
[3/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_vector_sentinel.pxd sklearn/utils/_vector_sentinel.pxd
[4/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_heap.pxd sklearn/utils/_heap.pxd
[5/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_sorting.pxd sklearn/utils/_sorting.pxd
[6/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/python3.12 ../../sklearn/_build_utils/tempita.py ../../sklearn/utils/_seq_dataset.pxd.tp -o sklearn/utils
[7/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/__init__.py sklearn/__init__.py
[8/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_openmp_helpers.pxd sklearn/utils/_openmp_helpers.pxd
[9/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_cython_blas.pxd sklearn/utils/_cython_blas.pxd
[10/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/python3.12 ../../sklearn/_build_utils/tempita.py ../../sklearn/utils/_weight_vector.pxd.tp -o sklearn/utils
[11/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/__init__.py sklearn/utils/__init__.py
[12/14] cython -M --fast-fail -3 '-X language_level=3' '-X boundscheck=False' '-X wraparound=False' '-X initializedcheck=False' '-X nonecheck=False' '-X cdivision=True' '-X profile=False' --include-dir /home/lesteve/dev/scikit-learn/build/test /home/lesteve/dev/scikit-learn/sklearn/tree/_criterion.pyx -o sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/sklearn/tree/_criterion.pyx.c
[13/14] ccache cc -Isklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p -Isklearn/tree -I../../sklearn/tree -I../../../../micromamba/envs/scikit-learn-dev/lib/python3.12/site-packages/numpy/_core/include -Isklearn -Isklearn/utils -I/home/lesteve/micromamba/envs/scikit-learn-dev/include/python3.12 -fvisibility=hidden -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -std=c11 -O3 -g -Wno-unused-but-set-variable -Wno-unused-function -Wno-conversion -Wno-misleading-indentation -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_9_API_VERSION -MD -MQ sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o -MF sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o.d -o sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o -c sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/sklearn/tree/_criterion.pyx.c
[14/14] cc  -o sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o -Wl,--as-needed -Wl,--allow-shlib-undefined -Wl,-O1 -shared -fPIC -lm

@lesteve (Member) commented Jul 26, 2024

I tried reproducing the performance difference on b7ffd7d and I don't see any significant difference between Meson and setuptools ... Complete wild guess: could it be OS-specific (I seem to remember Christian has a macOS machine and I have a Linux one)?

I ran the benchmark from #28603 (comment) 20 times each for Meson and setuptools:

python benchmarks/bench_hist_gradient_boosting_categorical_only.py --n-samples 100_000 --verbose

I looked at the "Fit 100 trees in ... s" line (not sure whether there was a better one to look at ...).

import io
import pandas as pd

df = pd.read_csv(io.StringIO("""meson,setuptools
3.049,2.849
2.556,3.008
2.648,2.981
1.978,3.037
3.155,2.754
2.170,2.484
2.313,2.438
2.622,2.584
1.910,2.042
2.009,2.697
2.023,2.613
1.876,2.466
1.798,2.456
2.778,1.651
2.260,2.353
2.273,2.548
2.638,1.932
2.138,2.650
2.440,1.587
2.295,3.248
"""))

df.apply(['mean', 'std'])
         meson  setuptools
mean  2.346450    2.518900
std   0.383623    0.442211
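A quick way to check whether this gap exceeds run-to-run noise (an illustrative addition, assuming scipy is available; df is the DataFrame built above):

from scipy.stats import ttest_ind

# Welch's two-sample t-test on the 20 timings per build system
result = ttest_ind(df["meson"], df["setuptools"], equal_var=False)
print(result.pvalue)  # a large p-value means no significant difference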

@lorentzenchr (Member, Author)

It would actually help me if someone could confirm the performance regression of 0107e77, i.e. run

python bench_hist_gradient_boosting_categorical_only.py --n-samples 100_000 --verbose

on

  • branch main
  • on this PR's branch commit 91608fb
  • on this PR's branch commit 0107e77

@lesteve (Member) commented Aug 19, 2024

It seems like something may be off with OpenMP and our Meson setup. One possible fix is to add openmp_dep to all extension modules that use OpenMP; see #29665 (comment) for a bit more detail.

I need to look into it a bit more to make sure this is the actual issue, but it would help if you could try it and see whether it helps in your case ...

Edit: I guess you want to look at #29694 for a more complete fix to try.

@jjerphan (Member) commented Oct 24, 2024

I confirm that I observe slight regressions between main and both 91608fb and 0107e77 (the two commits show similar performance).

The results vary from run to run; here are representative outputs of bench_hist_gradient_boosting_categorical_only.py.

main (429d67a):

Number of features: 20
Number of samples: 100000
Fitting a sklearn model...
Binning 0.016 GB of training data: 0.011 s
Fitting gradient boosted rounds:
Fit 100 trees in 1.296 s, (3100 total leaves)
Time spent computing histograms: 0.475s
Time spent finding best splits:  0.276s
Time spent applying splits:      0.301s
Time spent predicting:           0.025s
fitted in 1.297s
predicted in 0.166s

and 91608fb:

Fit 100 trees in 1.715 s, (3100 total leaves)
Time spent computing histograms: 0.623s
Time spent finding best splits:  0.348s
Time spent applying splits:      0.382s
Time spent predicting:           0.038s
fitted in 1.716s
predicted in 0.169s

and 0107e77:

Fit 100 trees in 1.792 s, (3100 total leaves)
Time spent computing histograms: 0.635s
Time spent finding best splits:  0.389s
Time spent applying splits:      0.388s
Time spent predicting:           0.037s
fitted in 1.793s
predicted in 0.139s

The addition of BinnedData might have introduced non-negligible dispatch overhead (this is just a guess; one needs to profile to understand the cause).

@jjerphan (Member)

@lorentzenchr: do you think you could split this PR into smaller ones, as you proposed in the description?

Therefore, this PR could be split into several smaller ones.

@zmbc commented Jan 2, 2025

I'd be very excited about this feature, let me know if there's anything I can do to help push it over the finish line 😃

Successfully merging this pull request may close: Support max_bins > 255 in Hist-GBDT estimators and categorical features with high cardinality (#26277)