
ENH allow up to 65536 bins in HGBT #28603


Open: wants to merge 7 commits into main from hgbt_variable_bins

Conversation

@lorentzenchr (Member) commented Mar 10, 2024

Reference Issues/PRs

Fixes #26277.

What does this implement/fix? Explain your changes.

This PR extends the maximum number of bins (including the missing-values bin) from 256 (8-bit) to 65536 (16-bit) in HistGradientBoostingRegressor and HistGradientBoostingClassifier.

This PR uses the extended bins only on categorical features that need them; numerical features are still restricted to max_bins <= 255. This could easily be extended. A usage sketch follows below.
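As an illustration of what this enables (a hypothetical usage sketch, not code from this PR; the exact error behavior on main is an assumption):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
n_samples, n_categories = 10_000, 1_000  # cardinality > 255
X = rng.integers(0, n_categories, size=(n_samples, 1)).astype(np.float64)
y = X[:, 0].astype(np.int64) % 2

# With this PR's branch, the high-cardinality categorical feature can be
# binned with up to 65536 bins; on main, a cardinality above 255 is rejected.
clf = HistGradientBoostingClassifier(categorical_features=[0], random_state=0)
clf.fit(X, y)
print(clf.score(X, y))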

Any other comments?

This PR introduces three new data structures as Cython extension types (a pure-Python sketch of their layout follows this list):

  • cdef class Histograms
    This has one underlying 1-dim numpy array that holds all bins of all features. An offsets array tells where the bins of each feature start and end:
    feature      |  0  |      1      |
    histograms = |- - - - - - - - - -|   each - is a single bin
    offsets   =  |0     3            10|
    
    Compared to main, this saves memory, as there are no unused bins anymore.
    This spans all commits up to and including "ENH adapt HGBT to new histograms everywhere".
  • cdef class Bitsets
    This generalizes the fixed-size bitset arrays and works similarly to the Histograms class: a 1-dim numpy array holds all bitsets of all features, and an offsets array tells where each feature begins and ends.
    Just the commit "ENH add Bitsets extension type".
  • cdef class BinnedData
    This class is a container for mixed uint8- and uint16-typed arrays. It keeps track of which feature is uint8 and which is uint16; underneath are two 2-dim arrays, one uint8 and one uint16.
    Just the commit "ENH add BinnedData extension type for mixed uint8 uint16 data".
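To make these layouts concrete, here is a minimal pure-Python/numpy sketch (names and fields are simplified illustrations; the actual implementations are cdef classes, and HISTOGRAM_DTYPE mirrors the struct already used in common.pyx):

import numpy as np

HISTOGRAM_DTYPE = np.dtype([
    ("sum_gradients", np.float64),
    ("sum_hessians", np.float64),
    ("count", np.uint32),
])

class Histograms:
    """All bins of all features in one flat 1-dim array, indexed via offsets."""
    def __init__(self, n_bins_per_feature):
        # offsets[f]:offsets[f + 1] delimits the bins of feature f
        self.offsets = np.zeros(len(n_bins_per_feature) + 1, dtype=np.uint32)
        self.offsets[1:] = np.cumsum(n_bins_per_feature)
        self.histograms = np.zeros(self.offsets[-1], dtype=HISTOGRAM_DTYPE)

    def feature_histogram(self, f):
        return self.histograms[self.offsets[f]:self.offsets[f + 1]]

# Bitsets follows the same pattern: one flat 1-dim array of bitset words
# plus an offsets array delimiting each feature's bitset.

class BinnedData:
    """Container for mixed uint8/uint16 binned columns (F-contiguous)."""
    def __init__(self, n_samples, feature_is_uint16):
        self.feature_is_uint16 = np.asarray(feature_is_uint16, dtype=bool)
        # two 2-dim F-contiguous arrays, one per dtype; each feature maps
        # to a column of one of them
        n8 = int((~self.feature_is_uint16).sum())
        n16 = int(self.feature_is_uint16.sum())
        self.X8 = np.zeros((n_samples, n8), dtype=np.uint8, order="F")
        self.X16 = np.zeros((n_samples, n16), dtype=np.uint16, order="F")

hist = Histograms([3, 7])  # the toy example above: offsets = [0, 3, 10]
assert hist.feature_histogram(1).shape == (7,)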

Benchmarks indicate no performance regression. On this PR's commit d769848:

python benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 100
...
Fit 100 trees in 62.059 s, (3100 total leaves)
Time spent computing histograms: 36.785s
Time spent finding best splits:  0.260s
Time spent applying splits:      7.671s
Time spent predicting:           1.895s
fitted in 62.199s
predicted in 6.769s, ROC AUC: 0.8228, ACC: 0.7415

while on main (f59c16d):

Fit 100 trees in 62.321 s, (3100 total leaves)
Time spent computing histograms: 36.230s
Time spent finding best splits:  0.192s
Time spent applying splits:      8.075s
Time spent predicting:           1.935s
fitted in 62.501s
predicted in 8.073s, ROC AUC: 0.8228, ACC: 0.7415

(I can do some more if requested.)

Disadvantages:

  • In case we want to add row-wise histogram computation, this becomes more complicated: BinnedData is inherently F-contiguous, and it is harder to define dynamic mixed types for a C-contiguous layout (see the illustration after this list).
  • Code complexity increases, though I would say only a little.
  • The predictor nodes' field "bin_threshold" is now uint16 instead of uint8, adding a tiny bit more memory to fitted estimator objects.
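To illustrate the first point (a toy numpy example, not code from this PR): with an F-contiguous per-dtype layout, each feature's column is one contiguous buffer, which suits the current per-feature histogram kernels, while a sample's row is strided and split across two arrays of different dtypes:

import numpy as np

n_samples = 4
X8 = np.zeros((n_samples, 2), dtype=np.uint8, order="F")    # <= 256 bins
X16 = np.zeros((n_samples, 1), dtype=np.uint16, order="F")  # > 256 bins

# feature-wise access: one contiguous buffer per column
assert X8[:, 0].flags["C_CONTIGUOUS"]

# row-wise access: strided within each array, and a full sample row spans
# two arrays of different dtypes -- awkward for a row-wise histogram kernel
assert not X8[1, :].flags["C_CONTIGUOUS"]
row = (X8[1, :], X16[1, :])  # no single contiguous buffer for one sample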

Further notes:

  • This PR needs Cython >= 3.0.10, which fixes earlier CI failures (fixes from both Cython 3.0.9 and 3.0.10 were needed).
  • So far, commits are relatively clean. Therefore, this PR could be split into several smaller ones.


github-actions bot commented Mar 10, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b7ffd7d.

@lorentzenchr force-pushed the hgbt_variable_bins branch 3 times, most recently from 992f9a1 to d419971 on March 11, 2024 13:31
@lorentzenchr force-pushed the hgbt_variable_bins branch 5 times, most recently from ce74f94 to 4c001e9 on March 12, 2024 16:40
@lorentzenchr force-pushed the hgbt_variable_bins branch 6 times, most recently from 9df97a7 to 2bd504f on March 16, 2024 15:21
@ogrisel (Member) commented Mar 18, 2024

Disadvantages:

In case we want to add row-wise histogram computation, this becomes more complicated as BinnedData is inherently F-contiguous and it is harder to define dynamic mixed types for a C-contiguous layout.

This might be a concern since row-wise parallelism seems to be the main reason why lightgbm can be significantly faster than scikit-learn.

@lorentzenchr (Member, Author)

Disadvantages:

In case we want to add row-wise histogram computation, this becomes more complicated as BinnedData is inherently F-contiguous and it is harder to define dynamic mixed types for a C-contiguous layout.

This might be a concern since row-wise parallelism seems to be the main reason why lightgbm can be significantly faster than scikit-learn.

That's why I mentioned it. On the other hand, it is not that much more complicated. The main complexity would come from having 2 ways of parallelism. That's independent of this PR.

@jjerphan (Member)

Thank you for this contribution, @lorentzenchr!

@lorentzenchr: to ease the review now that #28640 has been merged, could you merge main into this branch, resolving the conflicts? Alternatively, would you allow me to do it?

@lorentzenchr (Member, Author) commented Mar 27, 2024

Please do not commit to this branch. I'll update it soon.
For reviewers, focus entirely on the changes in the ensemble folder, in particular common.pxd and common.pyx.

Edit: The finding of #28638 was that Cython 3.0.10 fixed the Windows MSVC build. I have rebased this PR's branch on main with #28743 merged.

@da-woods Thank you so much for the fast fixes upstream in Cython. This has really become a valuable collaboration for me.

@lorentzenchr (Member, Author) commented Apr 20, 2024

Very strange. There is now quite a performance regression. Up to and including 91608fb, everything is fine:

% python bench_hist_gradient_boosting_categorical_only.py --n-samples 100_000 --verbose

[99/100] 1 tree, 31 leaves, max depth = 11, in 0.004s
[100/100] 1 tree, 31 leaves, max depth = 9, in 0.005s
Fit 100 trees in 1.117 s, (3100 total leaves)
Time spent computing histograms: 0.180s
Time spent finding best splits:  0.063s
Time spent applying splits:      0.111s
Time spent predicting:           0.014s
fitted in 1.120s
log loss train set = 0.184607  # extra info
predicted in 0.416s

(basically the same as main), but there is a large performance regression with 0107e77:

% python bench_hist_gradient_boosting_categorical_only.py --n-samples 100_000 --verbose

[99/100] 1 tree, 31 leaves, max depth = 11, in 0.008s
[100/100] 1 tree, 31 leaves, max depth = 9, in 0.008s
Fit 100 trees in 2.038 s, (3100 total leaves)
Time spent computing histograms: 0.476s
Time spent finding best splits:  0.253s
Time spent applying splits:      0.274s
Time spent predicting:           0.020s
fitted in 2.041s
log loss train set = 0.184607  # extra info
predicted in 0.394s

Note that 0107e77 does not even touch the histograms, and yet the time spent computing histograms is more than doubled.

I guess that's the point where I really could need some help.

Edit, on second thought: I benchmarked this PR before with no regression. What has changed since is the build system, now Meson.

@jjerphan (Member) commented Jul 24, 2024

Regressions might be due to compiler optimisation levels: the tree submodule was previously compiled with -O3, and scikit-learn is now compiled with the equivalent of -O2, while tree still seems to be compiled with the third optimisation level. Is -O3 actually passed to the compiler?

@lesteve (Member) commented Jul 25, 2024

Is -O3 actually passed to the compiler?

It seems like it is. I tried one tree .so; I guess you can do a full build and check all of the tree .so files that you think are relevant.

I think the right way to figure out the compile commands being run is to pass -v (verbose) to ninja, i.e. to the compile step, using compile-args:

pip install -v --config-settings=editable-verbose=true --no-build-isolation --editable . --config-settings compile-args="-v" 

The meson setup run by meson-python uses buildtype=release by default (see https://mesonbuild.com/meson-python/explanations/default-options.html#default-build-options), so that actually overrides the default in meson.build.

So I guess the more correct meson + ninja command is:

# full meson setup command is shown when executing the pip install command above and adapted to use build/test as build folder
meson setup --reconfigure /home/lesteve/dev/scikit-learn build/test -Dbuildtype=release -Db_ndebug=if-release -Db_vscrt=md --native-file=/home/lesteve/dev/scikit-learn/build/cp312/meson-python-native-file.ini
ninja -v sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so

Note the -O3 in the cc [2/3] line; the [3/3] linking step, however, has -Wl,-O1. Not sure whether this is relevant:

[1/3] cython -M --fast-fail -3 '-X language_level=3' '-X boundscheck=False' '-X wraparound=False' '-X initializedcheck=False' '-X nonecheck=False' '-X cdivision=True' '-X profile=False' --include-dir /home/lesteve/dev/scikit-learn/build/test /home/lesteve/dev/scikit-learn/sklearn/tree/_criterion.pyx -o sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/sklearn/tree/_criterion.pyx.c
[2/3] ccache cc -Isklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p -Isklearn/tree -I../../sklearn/tree -I../../../../micromamba/envs/scikit-learn-dev/lib/python3.12/site-packages/numpy/_core/include -Isklearn -Isklearn/utils -I/home/lesteve/micromamba/envs/scikit-learn-dev/include/python3.12 -fvisibility=hidden -fdiagnostics-color=always -DNDEBUG -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -std=c11 -O3 -Wno-unused-but-set-variable -Wno-unused-function -Wno-conversion -Wno-misleading-indentation -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_9_API_VERSION -MD -MQ sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o -MF sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o.d -o sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o -c sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/sklearn/tree/_criterion.pyx.c
[3/3] cc -o sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o -Wl,--as-needed -Wl,--allow-shlib-undefined -Wl,-O1 -shared -fPIC -lm

I leave below the commands I gave in my original comment, but I think this is not as close to what you get when going through pip ...

meson setup build/test
ninja -v -C build/test sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so                                   

Note the -O3 in the cc [13/14] line; the [14/14] linking step, however, has -Wl,-O1. Not sure whether this is relevant:

ninja: Entering directory `build/test'
[1/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_typedefs.pxd sklearn/utils/_typedefs.pxd
[2/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_random.pxd sklearn/utils/_random.pxd
[3/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_vector_sentinel.pxd sklearn/utils/_vector_sentinel.pxd
[4/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_heap.pxd sklearn/utils/_heap.pxd
[5/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_sorting.pxd sklearn/utils/_sorting.pxd
[6/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/python3.12 ../../sklearn/_build_utils/tempita.py ../../sklearn/utils/_seq_dataset.pxd.tp -o sklearn/utils
[7/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/__init__.py sklearn/__init__.py
[8/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_openmp_helpers.pxd sklearn/utils/_openmp_helpers.pxd
[9/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/_cython_blas.pxd sklearn/utils/_cython_blas.pxd
[10/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/python3.12 ../../sklearn/_build_utils/tempita.py ../../sklearn/utils/_weight_vector.pxd.tp -o sklearn/utils
[11/14] /home/lesteve/micromamba/envs/scikit-learn-dev/bin/meson --internal copy ../../sklearn/utils/__init__.py sklearn/utils/__init__.py
[12/14] cython -M --fast-fail -3 '-X language_level=3' '-X boundscheck=False' '-X wraparound=False' '-X initializedcheck=False' '-X nonecheck=False' '-X cdivision=True' '-X profile=False' --include-dir /home/lesteve/dev/scikit-learn/build/test /home/lesteve/dev/scikit-learn/sklearn/tree/_criterion.pyx -o sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/sklearn/tree/_criterion.pyx.c
[13/14] ccache cc -Isklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p -Isklearn/tree -I../../sklearn/tree -I../../../../micromamba/envs/scikit-learn-dev/lib/python3.12/site-packages/numpy/_core/include -Isklearn -Isklearn/utils -I/home/lesteve/micromamba/envs/scikit-learn-dev/include/python3.12 -fvisibility=hidden -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -std=c11 -O3 -g -Wno-unused-but-set-variable -Wno-unused-function -Wno-conversion -Wno-misleading-indentation -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_9_API_VERSION -MD -MQ sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o -MF sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o.d -o sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o -c sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/sklearn/tree/_criterion.pyx.c
[14/14] cc  -o sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so.p/meson-generated_sklearn_tree__criterion.pyx.c.o -Wl,--as-needed -Wl,--allow-shlib-undefined -Wl,-O1 -shared -fPIC -lm

@lesteve (Member) commented Jul 26, 2024

I tried reproducing the performance difference on b7ffd7d and I don't see any significant difference between Meson and setuptools ... Complete wild guess: could it be OS-specific (I seem to remember Christian has a macOS machine and I have a Linux one)?

I ran the benchmark from #28603 (comment) 20 times each for Meson and setuptools:

python benchmarks/bench_hist_gradient_boosting_categorical_only.py --n-samples 100_000 --verbose

I looked at the "Fit 100 trees in ... s" line (not sure whether there was a better one to look at ...).

import io
import pandas as pd

df = pd.read_csv(io.StringIO("""meson,setuptools
3.049,2.849
2.556,3.008
2.648,2.981
1.978,3.037
3.155,2.754
2.170,2.484
2.313,2.438
2.622,2.584
1.910,2.042
2.009,2.697
2.023,2.613
1.876,2.466
1.798,2.456
2.778,1.651
2.260,2.353
2.273,2.548
2.638,1.932
2.138,2.650
2.440,1.587
2.295,3.248
"""))

df.apply(['mean', 'std'])
         meson  setuptools
mean  2.346450    2.518900
std   0.383623    0.442211
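A quick way to check whether this gap exceeds run-to-run noise (an illustrative addition, assuming scipy is available; df is the DataFrame built above):

from scipy.stats import ttest_ind

# Welch's two-sample t-test on the 20 timings per build system
result = ttest_ind(df["meson"], df["setuptools"], equal_var=False)
print(result.pvalue)  # a large p-value means no significant difference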

@lorentzenchr (Member, Author)

It would actually help me if someone could confirm the performance regression of 0107e77, i.e. run

python bench_hist_gradient_boosting_categorical_only.py --n-samples 100_000 --verbose

on

  • branch main
  • on this PR's branch commit 91608fb
  • on this PR's branch commit 0107e77

@lesteve (Member) commented Aug 19, 2024

It seems like something may be off with OpenMP and our Meson setup. One possible fix is to add openmp_dep to all extension modules that use OpenMP; see #29665 (comment) for a bit more detail.

I need to look into it a bit more to make sure this is the actual issue, but it would help if you could try it and see whether it helps in your case ...

Edit: I guess you want to look at #29694 for a more complete fix to try.

@jjerphan (Member) commented Oct 24, 2024

I confirm that I observe slight regressions between main and both 91608fb and 0107e77 (the two commits show similar performance).

The results vary from run to run; here are representative outputs of bench_hist_gradient_boosting_categorical_only.py.

main (429d67a):

Number of features: 20
Number of samples: 100000
Fitting a sklearn model...
Binning 0.016 GB of training data: 0.011 s
Fitting gradient boosted rounds:
Fit 100 trees in 1.296 s, (3100 total leaves)
Time spent computing histograms: 0.475s
Time spent finding best splits:  0.276s
Time spent applying splits:      0.301s
Time spent predicting:           0.025s
fitted in 1.297s
predicted in 0.166s

and 91608fb:

Fit 100 trees in 1.715 s, (3100 total leaves)
Time spent computing histograms: 0.623s
Time spent finding best splits:  0.348s
Time spent applying splits:      0.382s
Time spent predicting:           0.038s
fitted in 1.716s
predicted in 0.169s

and 0107e77:

Fit 100 trees in 1.792 s, (3100 total leaves)
Time spent computing histograms: 0.635s
Time spent finding best splits:  0.389s
Time spent applying splits:      0.388s
Time spent predicting:           0.037s
fitted in 1.793s
predicted in 0.139s

The addition of BinnedData might have introduced non-negligible dispatch overhead (this is just a guess; one needs to profile to understand the cause).

@jjerphan (Member)

@lorentzenchr: do you think you could split this PR into smaller ones, as you proposed in the description?

Therefore, this PR could be split into several smaller ones.

@zmbc commented Jan 2, 2025

I'd be very excited about this feature, let me know if there's anything I can do to help push it over the finish line 😃

Successfully merging this pull request may close: Support max_bins > 255 in Hist-GBDT estimators and categorical features with high cardinality (#26277)