ENH allow up to 65536 bins in HGBT #28603
Conversation
(force-pushed: 992f9a1 to d419971, ce74f94 to 4c001e9, 9df97a7 to 2bd504f)
This might be a concern since row-wise parallelism seems to be the main reason why lightgbm can be significantly faster than scikit-learn.
That's why I mentioned it. On the other hand, it is not that much more complicated. The main complexity would come from having two ways of parallelism. That's independent of this PR.
Thank you for this contribution, @lorentzenchr! @lorentzenchr: to ease the review now that #28640 has been merged, could you merge
Please do not commit to this branch. I'll update it soon. Edit: The finding of #28638 was that Cython 3.0.10 fixed the Windows MSVC build. I have rebased this PR's branch on main with #28743 merged. @da-woods Thank you so much for the fast fixes upstream in Cython. This has really become a valuable collaboration for me.
(force-pushed: 2bd504f to 58db70a)
* no unused bins anymore
* missing_values_bin_idx is removed or replaced by n_bins_non_missing
(force-pushed: 58db70a to b7ffd7d)
Very strange. There is now quite a performance regression. While up to and including 91608fb everything is fine (basically the same as master), there is a large performance regression with 0107e77.

Note that 0107e77 does not even touch the histograms, and yet the time spent computing histograms is more than doubled. I guess that's the point where I could really use some help. Edit, on second thought: I benchmarked this PR before with no regression. What changed is the build system, to Meson.
Regressions might be due to the
It seems like it is; I tried one tree. I think the right way to figure out the compile commands being run is to pass

meson-python sets buildtype=release by default (see https://mesonbuild.com/meson-python/explanations/default-options.html#default-build-options), so that actually overrides the default in

So I guess the more correct meson + ninja command is:

```shell
# the full meson setup command is shown when executing the pip install command
# above, adapted here to use build/test as the build folder
meson setup --reconfigure /home/lesteve/dev/scikit-learn build/test \
    -Dbuildtype=release -Db_ndebug=if-release -Db_vscrt=md \
    --native-file=/home/lesteve/dev/scikit-learn/build/cp312/meson-python-native-file.ini
ninja -v sklearn/tree/_criterion.cpython-312-x86_64-linux-gnu.so
```

Note the

I leave below the commands I gave in my original comment, but I think this is not as close to what you get when going through

Note the
I tried reproducing the performance difference on b7ffd7d and I don't see any significant difference between Meson and setuptools ... complete wild guess, it could well be OS-specific (I seem to remember Christian has a macOS machine and I have a Linux machine)?

I ran the benchmark from #28603 (comment) 20 times for meson and setuptools:

```shell
python benchmarks/bench_hist_gradient_boosting_categorical_only.py --n-samples 100_000 --verbose
```

I looked at the

```python
import io

import pandas as pd

df = pd.read_csv(io.StringIO("""meson,setuptools
3.049,2.849
2.556,3.008
2.648,2.981
1.978,3.037
3.155,2.754
2.170,2.484
2.313,2.438
2.622,2.584
1.910,2.042
2.009,2.697
2.023,2.613
1.876,2.466
1.798,2.456
2.778,1.651
2.260,2.353
2.273,2.548
2.638,1.932
2.138,2.650
2.440,1.587
2.295,3.248
"""))
df.apply(['mean', 'std'])
```
It seems like something may be off with OpenMP and our meson setup; one possible fix is to add

I need to look into it a bit more to make sure this is the actual issue, but it would help if you could try and see whether that seems to help in your case ...

Edit: I guess you want to look at #29694 for a more complete fix to try.
I confirm I observe slight regressions between

There are variations in the results of the scripts; here are instances of some of

and 91608fb:

and 0107e77:

The addition of
@lorentzenchr: do you think you could split this PR in smaller ones as you proposed in the description?
I'd be very excited about this feature, let me know if there's anything I can do to help push it over the finish line 😃
Reference Issues/PRs
Fixes #26277.
What does this implement/fix? Explain your changes.
This PR extends the maximum number of bins (including the missing-values bin) from 256 (8-bit) to 65536 (16-bit) in HistGradientBoostingRegressor and HistGradientBoostingClassifier.

This PR uses the extended bins only on categorical features, and only if needed; max_bins stays restricted to 255 for numerical features. This could easily be extended.

Any other comments?
This PR introduces 3 new data structures as Cython extension types:
cdef class Histograms
This has one underlying 1-dim numpy array to hold all bins of all features. An offsets array tells where the bins of a feature start and end.
All commits up to and including ENH adapt HGBT to new histograms everywhere.
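The layout can be sketched in pure NumPy (class and method names here are illustrative, not the PR's actual Cython API; the dtype fields mirror scikit-learn's histogram dtype):

```python
import numpy as np

HISTOGRAM_DTYPE = np.dtype([
    ("sum_gradients", np.float64),
    ("sum_hessians", np.float64),
    ("count", np.uint32),
])

class FlatHistograms:
    """One flat array holds the bins of every feature; offsets delimit them."""

    def __init__(self, n_bins_per_feature):
        # offsets[f]:offsets[f + 1] is the slice belonging to feature f
        self.offsets = np.zeros(len(n_bins_per_feature) + 1, dtype=np.intp)
        np.cumsum(np.asarray(n_bins_per_feature, dtype=np.intp),
                  out=self.offsets[1:])
        self.bins = np.zeros(self.offsets[-1], dtype=HISTOGRAM_DTYPE)

    def feature_histogram(self, f):
        # a view into the flat storage, so no per-feature allocation
        return self.bins[self.offsets[f]:self.offsets[f + 1]]

# feature 0 is numerical (256 bins), feature 1 is a
# high-cardinality categorical that now may exceed 256 bins
hists = FlatHistograms([256, 1000])
```

Because each feature only occupies as many bins as it needs, there are no unused bins, unlike a fixed (n_features, 256) layout.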
cdef class Bitsets
This extends the fixed-size bitset arrays and works similarly to the Histograms class: a 1-dim numpy array holds all bitsets of all features, and an offsets array tells where each feature begins and ends.
Just commit ENH add Bitsets extension type.
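A minimal sketch of the same idea, assuming 32-bit words and illustrative names (the PR's Cython type will differ in detail):

```python
import numpy as np

class FlatBitsets:
    """Variable-size per-feature bitsets packed into one flat uint32 array."""

    def __init__(self, n_bits_per_feature):
        n_words = [(n + 31) // 32 for n in n_bits_per_feature]  # 32 bits/word
        self.offsets = np.zeros(len(n_words) + 1, dtype=np.intp)
        np.cumsum(np.asarray(n_words, dtype=np.intp), out=self.offsets[1:])
        self.words = np.zeros(self.offsets[-1], dtype=np.uint32)

    def set_bit(self, f, bit):
        # locate the word of feature f that contains `bit`, then set it
        self.words[self.offsets[f] + bit // 32] |= (
            np.uint32(1) << np.uint32(bit % 32)
        )

    def get_bit(self, f, bit):
        word = self.words[self.offsets[f] + bit // 32]
        return bool((word >> np.uint32(bit % 32)) & 1)

# a categorical feature may now need far more than 256 bits
bitsets = FlatBitsets([300, 70000])
bitsets.set_bit(1, 65535)
```

The fixed-size variant on main caps every bitset at 256 bits; here the capacity per feature follows its actual number of categories.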
cdef class BinnedData
This class is a container for mixed uint8 and uint16 typed arrays. It keeps track of which feature is uint8 and which is uint16. Underlying it are two 2-dim arrays, one of uint8 and one of uint16.
Just commit ENH add BinnedData extension type for mixed uint8 uint16 data.
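A hedged NumPy sketch of the container idea (names are illustrative; the real extension type is Cython and column-major, hence the `order="F"` below):

```python
import numpy as np

class MixedBinnedData:
    """Two F-contiguous 2-d arrays (uint8 and uint16) behind one interface."""

    def __init__(self, n_samples, feature_is_uint16):
        self.feature_is_uint16 = np.asarray(feature_is_uint16, dtype=bool)
        # column position of each original feature inside its dtype's array
        self.position = np.zeros(self.feature_is_uint16.size, dtype=np.intp)
        mask16 = self.feature_is_uint16
        self.position[mask16] = np.arange(mask16.sum())
        self.position[~mask16] = np.arange((~mask16).sum())
        self.data8 = np.zeros((n_samples, (~mask16).sum()),
                              dtype=np.uint8, order="F")
        self.data16 = np.zeros((n_samples, mask16.sum()),
                               dtype=np.uint16, order="F")

    def column(self, f):
        # return a contiguous view of feature f from the right array
        arr = self.data16 if self.feature_is_uint16[f] else self.data8
        return arr[:, self.position[f]]

# 3 features: feature 1 is a high-cardinality categorical needing uint16 bins
binned = MixedBinnedData(n_samples=100, feature_is_uint16=[False, True, False])
binned.column(1)[:] = 60000  # fits in uint16, would overflow uint8
```

Keeping the bulk of the features in uint8 avoids doubling the memory of the binned data just because a few categorical features need 16-bit bin indices.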
Benchmarks indicate no performance regression:
On this PR commit d769848
while on main f59c16d
(I can do some more if requested.)
Disadvantages:

* BinnedData is inherently F-contiguous, and it is harder to define dynamic mixed types for a C-contiguous layout.
* "bin_threshold" is now uint16 instead of uint8, adding a tiny bit more memory to fitted estimator objects.

Further notes: