ENH Support sample weights when fitting HistGradientBoosting estimator #25431


Conversation

Andrew-Wang-IB45
Contributor

@Andrew-Wang-IB45 Andrew-Wang-IB45 commented Jan 19, 2023

Reference Issues/PRs

Partially addresses #25210. See also #24872.

What does this implement/fix? Explain your changes.

As discussed in the two issues above, this pull request supports user-supplied sample weights when fitting the HistGradientBoosting estimator. Specifically, it passes the sample weights to the TreeGrower and computes and stores weighted_n_node_samples as an attribute of each node. The related predictors are also updated to use weighted_n_node_samples instead of count so that the sample weights are taken into account.
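Conceptually, the weighted count of a node is just the sum of the sample weights of the samples routed to that node; with unit weights it reduces to the plain count stored today. A minimal sketch with a hypothetical helper (not the grower's actual code):

import numpy as np

def weighted_n_node_samples(sample_weight, sample_indices):
    # Sum of the weights of the samples that ended up in this node.
    return np.sum(sample_weight[sample_indices])

sw = np.array([1.0, 2.0, 0.5, 1.0])
node_indices = np.array([0, 1, 3])
print(weighted_n_node_samples(sw, node_indices))  # 4.0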

Any other comments?

Since I have modified the interface functions of the HistGradientBoosting estimator and the TreeGrower, any comments about the applicability, documentation, and formatting would be appreciated. In particular, is min_weight_fraction_leaf an appropriate addition to the constructor of the HistGradientBoosting estimator, or should we leave it for later?

Tasklist

  • Pass sample weights into the TreeGrower.
  • Compute and store the weighted_n_node_samples for each TreeNode.
  • Replace all references to count with weighted_n_node_samples in the predictor for the HistGradientBoosting estimator.
  • Write tests for the correctness of the use of sample weights in the HistGradientBoosting estimator.
  • Benchmark performance and memory footprint of updated HistGradientBoosting estimator.
  • Update documentation of all public functions that are changed.

@Andrew-Wang-IB45
Contributor Author

Since this PR computes and keeps track of weighted_n_node_samples in each node, the updated HistGradientBoosting model is expected to be somewhat slower to fit and to consume more memory. The profiling below quantifies the extent of these changes.

Profiling

Classification
import cProfile
from pstats import Stats
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
X, y = make_classification(n_samples=50000, n_features=100, random_state=0)
with cProfile.Profile() as pr:
    hgbc = HistGradientBoostingClassifier(random_state=0).fit(X, y)
stats = Stats(pr)
stats.sort_stats("tottime").print_stats(10)
Original
         119005 function calls (113172 primitive calls) in 4.035 seconds

   Ordered by: internal time
   List reduced from 379 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2083    1.476    0.001    1.476    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3680    0.404    0.000    0.404    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2010    0.340    0.000    0.340    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
      207    0.275    0.001    0.275    0.001 {method 'sort' of 'numpy.ndarray' objects}
        1    0.204    0.204    4.035    4.035 gradient_boosting.py:335(fit)
      100    0.192    0.002    0.192    0.002 {method 'partition' of 'numpy.ndarray' objects}
     2190    0.184    0.000    0.238    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2190    0.139    0.000    2.473    0.001 grower.py:450(split_next)
  4453/73    0.107    0.000    0.107    0.001 grower.py:709(_fill_predictor_arrays)
        2    0.059    0.029    0.059    0.029 {sklearn.ensemble._hist_gradient_boosting._binning._map_to_bins}
New
         133627 function calls (127760 primitive calls) in 4.857 seconds

   Ordered by: internal time
   List reduced from 382 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2083    1.647    0.001    1.647    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     2190    0.489    0.000    3.178    0.001 grower.py:469(split_next)
     3680    0.453    0.000    0.453    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2010    0.370    0.000    0.370    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
      207    0.268    0.001    0.268    0.001 {method 'sort' of 'numpy.ndarray' objects}
        1    0.230    0.230    4.857    4.857 gradient_boosting.py:338(fit)
     2190    0.206    0.000    0.269    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
      100    0.190    0.002    0.190    0.002 {method 'partition' of 'numpy.ndarray' objects}
  4453/73    0.117    0.000    0.117    0.002 grower.py:741(_fill_predictor_arrays)
     5457    0.099    0.000    0.099    0.000 {method 'reduce' of 'numpy.ufunc' objects}
Regression
import cProfile
from pstats import Stats
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
X, y = make_regression(n_samples=50000, n_features=100, random_state=0)
with cProfile.Profile() as pr:
    hgbc = HistGradientBoostingRegressor(random_state=0).fit(X, y)
stats = Stats(pr)
stats.sort_stats("tottime").print_stats(10)
Original
         159258 function calls (151813 primitive calls) in 4.672 seconds

   Ordered by: internal time
   List reduced from 331 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2999    1.705    0.001    1.705    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     5856    0.682    0.000    0.682    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2899    0.381    0.000    0.381    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
      204    0.265    0.001    0.265    0.001 {method 'sort' of 'numpy.ndarray' objects}
     3000    0.264    0.000    0.336    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     3000    0.193    0.000    3.247    0.001 grower.py:450(split_next)
      100    0.187    0.002    0.187    0.002 {method 'partition' of 'numpy.ndarray' objects}
        1    0.158    0.158    4.672    4.672 gradient_boosting.py:335(fit)
 6100/100    0.140    0.000    0.140    0.001 grower.py:709(_fill_predictor_arrays)
      101    0.055    0.001    0.055    0.001 {built-in method numpy.ascontiguousarray}
New
         179189 function calls (171744 primitive calls) in 5.787 seconds

   Ordered by: internal time
   List reduced from 334 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2999    1.800    0.001    1.800    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3000    0.806    0.000    4.228    0.001 grower.py:469(split_next)
     5856    0.740    0.000    0.740    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2899    0.410    0.000    0.410    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3000    0.289    0.000    0.371    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
      204    0.265    0.001    0.265    0.001 {method 'sort' of 'numpy.ndarray' objects}
        1    0.188    0.188    5.787    5.787 gradient_boosting.py:338(fit)
      100    0.185    0.002    0.185    0.002 {method 'partition' of 'numpy.ndarray' objects}
     7205    0.149    0.000    0.149    0.000 {method 'reduce' of 'numpy.ufunc' objects}
 6100/100    0.146    0.000    0.146    0.001 grower.py:741(_fill_predictor_arrays)
Benchmark

Using bench_hist_gradient_boosting.py

Old
Data size: 1000 samples train, 1000 samples test.
Fitting a sklearn model...
score: 0.9730
fit duration: 0.156s,
score duration: 0.003s,
Data size: 10000 samples train, 10000 samples test.
Fitting a sklearn model...
score: 0.9840
fit duration: 0.222s,
score duration: 0.008s,
Data size: 100000 samples train, 100000 samples test.
Fitting a sklearn model...
score: 0.9860
fit duration: 0.600s,
score duration: 0.055s,
Data size: 500000 samples train, 500000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.337s,
score duration: 0.266s,
Data size: 1000000 samples train, 1000000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.879s,
score duration: 0.501s,

Benchmark Base Old

New
Data size: 1000 samples train, 1000 samples test.
Fitting a sklearn model...
score: 0.9730
fit duration: 0.278s,
score duration: 0.007s,
Data size: 10000 samples train, 10000 samples test.
Fitting a sklearn model...
score: 0.9840
fit duration: 0.531s,
score duration: 0.013s,
Data size: 100000 samples train, 100000 samples test.
Fitting a sklearn model...
score: 0.9860
fit duration: 1.133s,
score duration: 0.078s,
Data size: 500000 samples train, 500000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 2.592s,
score duration: 0.314s,
Data size: 1000000 samples train, 1000000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 3.934s,
score duration: 0.566s,

Benchmark Base New

Finally, the memory footprint increases because the predictor stores weighted_n_node_samples (np.float64) for each node instead of count (uint32).
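As a rough sanity check on that overhead (a small sketch, independent of the actual predictor record layout):

import numpy as np

# count is a uint32 (4 bytes); weighted_n_node_samples needs a float64 (8 bytes),
# so each node in the predictor arrays grows by the dtype size difference.
per_node_overhead = np.dtype(np.float64).itemsize - np.dtype(np.uint32).itemsize
print(per_node_overhead)  # 4 extra bytes per node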

@Andrew-Wang-IB45 Andrew-Wang-IB45 marked this pull request as ready for review February 10, 2023 07:40
@Andrew-Wang-IB45 Andrew-Wang-IB45 changed the title from "[WIP] ENH Support sample weights when fitting HistGradientBoosting estimator" to "ENH Support sample weights when fitting HistGradientBoosting estimator" Feb 10, 2023
@Andrew-Wang-IB45
Contributor Author

Hi, @vitaliset and @glemaitre, could you please review this pull request at your convenience? Specifically, I need some help thinking of helpful test cases for the test_grower.py file due to the addition of the parameter min_weight_leaf. Thank you!

@lorentzenchr
Member

@Andrew-Wang-IB45 This PR doubles the fit duration, which would be a huge deterioration for users. Unless this is resolved, we can't include it in scikit-learn.

@Andrew-Wang-IB45
Contributor Author

Hi @lorentzenchr. Thanks for your feedback. If the fit duration can be reduced to a level similar to the current version, would this PR still be considered for inclusion in scikit-learn?

@lorentzenchr
Member

lorentzenchr commented Mar 19, 2023

Yes. If the impact on fit and predict time as well as memory usage is very small, we can consider it, but otherwise not.

Note that HGBT is performance-critical code where we usually care about a few percent, or even 1 percent.

Would it suffice to have the sum of weights only in the final leaf nodes?

@Andrew-Wang-IB45
Contributor Author

I will look into this further to reduce the performance hit of fit and predict in the HGBT estimator. From my preliminary research, it seems that, unfortunately, it does not suffice to have the sum of weights only in the final leaf nodes, since the recursive partial dependence computation relies on the weighted proportions at every tree node.
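To make that concrete, here is a minimal sketch of the recursion described in the linked blog post, using hypothetical node objects rather than scikit-learn's internal predictor format. When a split is on a feature we are not conditioning on, the two children are averaged with their weighted fractions, so every internal node needs its weighted count:

def partial_dependence_recursion(node, x, target_features):
    # Sketch of the recursive PDP traversal (hypothetical node attributes).
    if node.is_leaf:
        return node.value
    if node.feature in target_features:
        child = node.left if x[node.feature] <= node.threshold else node.right
        return partial_dependence_recursion(child, x, target_features)
    # Split on a non-target feature: weight both children by their share of
    # the (sample-weighted) training data that reached this node.
    total = node.left.weighted_n_node_samples + node.right.weighted_n_node_samples
    left_frac = node.left.weighted_n_node_samples / total
    right_frac = node.right.weighted_n_node_samples / total
    return (left_frac * partial_dependence_recursion(node.left, x, target_features)
            + right_frac * partial_dependence_recursion(node.right, x, target_features))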

@Andrew-Wang-IB45
Contributor Author

Hi @lorentzenchr, I made some attempts to speed up the code, mainly by moving the computation of weighted_n_node_samples into the Cython code. Although this change significantly reduced the fit and predict runtimes compared to the earlier iteration of this PR, it is still slower than the main branch. It seems very difficult to maintain the performance of fit and predict while taking sample weights into account during fitting. Given that, should we put this feature on hold until a better approach is found?

Contributor

@vitaliset vitaliset left a comment


Maybe computing these stopping criteria with sample_weight is what makes it less efficient, and they are not essential for calculating the PDP with the recursion method.

For now, I would suggest adding only the essential part (lines 409-411). It is worth a try!

@Andrew-Wang-IB45
Contributor Author

Hi @vitaliset, most of the affected changes would be in test_splitting.py. In addition, since the Splitter would need the sample_weight, I plan to add sample_weight as another parameter. To minimize the impact, I think giving it a default value is prudent.
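A minimal illustration of the default-value idea (a hypothetical class, not the actual Splitter signature): sample_weight=None would fall back to unit weights so existing callers and tests keep working unchanged.

import numpy as np

class SplitterSketch:
    def __init__(self, n_samples, sample_weight=None):
        # None means "unweighted": behave exactly like the current code path.
        if sample_weight is None:
            sample_weight = np.ones(n_samples, dtype=np.float64)
        self.sample_weight = np.asarray(sample_weight, dtype=np.float64)

unweighted = SplitterSketch(n_samples=5)                       # as today
weighted = SplitterSketch(5, sample_weight=[1, 2, 1, 1, 0.5])  # new behaviour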

As for the benchmarks in the PRs with the "performance" label, especially PR 25186, they seem similar to what I did earlier, namely using cProfile on the fit and predict methods and then running the relevant benchmark files. However, since the new times are very close and @lorentzenchr mentioned that HGBT is performance-critical code, I'm not sure my methodology is sufficiently robust. For now, I think the best way forward is to implement the changes, compare the performance of the two versions, and have other reviewers verify the benchmarks.

@vitaliset
Contributor

For now, I think the best way is to implement the changes, compare the performance between the two versions, and have other reviewers verify the benchmarks.

Perfect. That's it.

@Andrew-Wang-IB45
Contributor Author

Here are the most updated performance comparisons:

Profiling

Classification
import cProfile
from pstats import Stats
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
X, y = make_classification(n_samples=50000, n_features=100, random_state=0)
with cProfile.Profile() as pr:
    hgbc = HistGradientBoostingClassifier(random_state=0).fit(X, y)
stats = Stats(pr)
stats.sort_stats("tottime").print_stats(10)
Original
         118531 function calls (112738 primitive calls) in 3.612 seconds

   Ordered by: internal time
   List reduced from 382 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2083    1.314    0.001    1.314    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3680    0.378    0.000    0.378    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2010    0.281    0.000    0.281    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
      207    0.262    0.001    0.262    0.001 {method 'sort' of 'numpy.ndarray' objects}
      100    0.185    0.002    0.185    0.002 {method 'partition' of 'numpy.ndarray' objects}
     2190    0.169    0.000    0.219    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
        1    0.156    0.156    3.612    3.612 gradient_boosting.py:334(fit)
     2190    0.127    0.000    2.197    0.001 grower.py:450(split_next)
  4453/73    0.093    0.000    0.093    0.001 grower.py:709(_fill_predictor_arrays)
      101    0.057    0.001    0.057    0.001 {built-in method numpy.ascontiguousarray}
New
         120429 function calls (114636 primitive calls) in 3.921 seconds

   Ordered by: internal time
   List reduced from 384 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2083    1.462    0.001    1.462    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3680    0.469    0.000    0.469    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
      207    0.268    0.001    0.268    0.001 {method 'sort' of 'numpy.ndarray' objects}
     2010    0.252    0.000    0.252    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     2190    0.222    0.000    0.283    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
      100    0.192    0.002    0.192    0.002 {method 'partition' of 'numpy.ndarray' objects}
     2190    0.142    0.000    2.455    0.001 grower.py:469(split_next)
  4453/73    0.100    0.000    0.100    0.001 grower.py:731(_fill_predictor_arrays)
        2    0.069    0.035    0.069    0.035 {sklearn.ensemble._hist_gradient_boosting._binning._map_to_bins}
        1    0.062    0.062    3.921    3.921 gradient_boosting.py:334(fit)
Regression
import cProfile
from pstats import Stats
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
X, y = make_regression(n_samples=50000, n_features=100, random_state=0)
with cProfile.Profile() as pr:
    hgbc = HistGradientBoostingRegressor(random_state=0).fit(X, y)
stats = Stats(pr)
stats.sort_stats("tottime").print_stats(10)
Original
         158674 function calls (151269 primitive calls) in 4.793 seconds

   Ordered by: internal time
   List reduced from 334 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2999    1.711    0.001    1.711    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     5856    0.783    0.000    0.783    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2899    0.384    0.000    0.384    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3000    0.289    0.000    0.369    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
      204    0.262    0.001    0.262    0.001 {method 'sort' of 'numpy.ndarray' objects}
     3000    0.195    0.000    3.374    0.001 grower.py:450(split_next)
      100    0.186    0.002    0.186    0.002 {method 'partition' of 'numpy.ndarray' objects}
 6100/100    0.136    0.000    0.136    0.001 grower.py:709(_fill_predictor_arrays)
        1    0.108    0.108    4.793    4.793 gradient_boosting.py:334(fit)
        2    0.077    0.038    0.077    0.038 {sklearn.ensemble._hist_gradient_boosting._binning._map_to_bins}
New
         161274 function calls (153869 primitive calls) in 5.161 seconds

   Ordered by: internal time
   List reduced from 335 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2999    1.807    0.001    1.807    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     5856    0.814    0.000    0.814    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2899    0.410    0.000    0.410    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3000    0.352    0.000    0.434    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
      204    0.278    0.001    0.278    0.001 {method 'sort' of 'numpy.ndarray' objects}
     3000    0.202    0.000    3.582    0.001 grower.py:469(split_next)
      100    0.198    0.002    0.198    0.002 {method 'partition' of 'numpy.ndarray' objects}
 6100/100    0.138    0.000    0.138    0.001 grower.py:731(_fill_predictor_arrays)
        1    0.132    0.132    5.160    5.160 gradient_boosting.py:334(fit)
        2    0.084    0.042    0.084    0.042 {sklearn.ensemble._hist_gradient_boosting._binning._map_to_bins}
Benchmark

Using bench_hist_gradient_boosting.py

Old
Data size: 1000 samples train, 1000 samples test.
Fitting a sklearn model...
score: 0.9730
fit duration: 0.558s,
score duration: 0.003s,
Data size: 10000 samples train, 10000 samples test.
Fitting a sklearn model...
score: 0.9840
fit duration: 0.183s,
score duration: 0.007s,
Data size: 100000 samples train, 100000 samples test.
Fitting a sklearn model...
score: 0.9860
fit duration: 0.535s,
score duration: 0.052s,
Data size: 500000 samples train, 500000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.280s,
score duration: 0.272s,
Data size: 1000000 samples train, 1000000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.735s,
score duration: 0.470s,

Benchmark Base Old

New
Data size: 1000 samples train, 1000 samples test.
Fitting a sklearn model...
score: 0.9730
fit duration: 0.564s,
score duration: 0.003s,
Data size: 10000 samples train, 10000 samples test.
Fitting a sklearn model...
score: 0.9840
fit duration: 0.197s,
score duration: 0.007s,
Data size: 100000 samples train, 100000 samples test.
Fitting a sklearn model...
score: 0.9860
fit duration: 1.090s,
score duration: 0.054s,
Data size: 500000 samples train, 500000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.400s,
score duration: 0.279s,
Data size: 1000000 samples train, 1000000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.927s,
score duration: 0.437s,

Benchmark Base New

Finally, running asv continuous -b HistGradientBoosting origin/main HEAD yields:

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl..
·· Installing fb234b79 <hist_gradient_boosting_fit_sample_weights> into conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl..
· Running 12 total benchmarks (2 commits * 1 environments * 6 benchmarks)
[ 0.00%] · For scikit-learn commit fb234b79 <hist_gradient_boosting_fit_sample_weights> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl
[ 8.33%] ··· Setting up ensemble:103                                                                                                                                                                                                                                                                       ok
[ 8.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit                                                                                                                                                                                                                                145M
[16.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict                                                                                                                                                                                                                            135M
[25.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit                                                                                                                                                                                                                                16.8±1s
[33.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict                                                                                                                                                                                                                          535±100ms
[41.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score                                                                                                                                                                                                             0.7230709112942986
[50.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score                                                                                                                                                                                                            0.9812160155622751
[50.00%] · For scikit-learn commit 18af5508 <main> (round 1/1):
[50.00%] ·· Building for conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl....
[50.00%] ·· Benchmarking conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl
[58.33%] ··· Setting up ensemble:103                                                                                                                                                                                                                                                                       ok
[58.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit                                                                                                                                                                                                                                145M
[66.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict                                                                                                                                                                                                                            135M
[75.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit                                                                                                                                                                                                                              14.4±0.8s
[83.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict                                                                                                                                                                                                                          406±200ms
[91.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score                                                                                                                                                                                                             0.7230709112942986
[100.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score                                                                                                                                                                                                            0.9812160155622751

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Note that the above numbers are noisy due to large run-to-run fluctuations, but in general the performance of the updated HGBT is roughly on par with that of the old one. It would be helpful if others could verify these benchmarks and provide more feedback.

@lorentzenchr
Member

Benchmarking is a hard business. For HGBT, we are usually only interested in sample sizes of 10,000 and higher. You can look into the following PRs for some inspiration on benchmarking:

@Andrew-Wang-IB45
Contributor Author

Hi, thanks for the suggestions. Do you expect the benchmarking for this PR to be as extensive as the one you did?

@lorentzenchr
Member

Do you expect the benchmarking for this PR to be as extensive as the one you did?

I hope not🤞
There are more options:

  • Find a way that has only, say, a 1-5% performance penalty.
  • Rethink the recursive PDP computation and only add what's necessary (my gut feeling is that we do not need sample weights for every node, only for the terminal nodes/leaves).
  • Make it optional.
  • Decide that the costs are too high and close the PR.

@Andrew-Wang-IB45
Contributor Author

Hi, sounds good! I will do some more benchmarking and then work from there.

@vitaliset
Contributor

Fantastic ideas, @lorentzenchr! Thanks for the input.

@lorentzenchr
Member

Note: My gut feeling might be wrong. https://nicolas-hug.com/blog/pdps explains it pretty well.

@Andrew-Wang-IB45
Contributor Author

I had considered using the sample weights only for the terminal nodes, but then the recursive PDP computation from https://nicolas-hug.com/blog/pdps would not work. Additionally, this approach would not save any time, since computing weighted_n_node_samples for just the terminal nodes already incurs a major performance penalty.

From my preliminary benchmarks, it seems that the best I can do for now is a roughly 10% performance penalty. Most of the degradation comes from adding up the sample weights while partitioning the indices and from the overhead of carrying sample_weight while growing the tree. This can be seen as follows:

Old
$ python benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 100
Training set with 8800000 records with 28 features.
Fitting a sklearn model...
Binning 1.971 GB of training data: 3.959 s
Fitting gradient boosted rounds:
...
Fit 100 trees in 114.222 s, (3100 total leaves)
Time spent computing histograms: 67.496s
Time spent finding best splits:  0.566s
Time spent applying splits:      11.339s
Time spent predicting:           4.093s
fitted in 114.515s
predicted in 15.033s, ROC AUC: 0.8228, ACC: 0.7415
New
$ python benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 100
Training set with 8800000 records with 28 features.
Fitting a sklearn model...
Binning 1.971 GB of training data: 3.947 s
Fitting gradient boosted rounds:
...
Fit 100 trees in 123.755 s, (3100 total leaves)
Time spent computing histograms: 67.050s
Time spent finding best splits:  0.565s
Time spent applying splits:      15.002s
Time spent predicting:           4.078s
fitted in 124.087s
predicted in 14.098s, ROC AUC: 0.8228, ACC: 0.7415
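For context, the extra per-split work has roughly this shape (a pure-Python sketch of the idea; the real logic lives in the Cython splitter and is laid out differently):

import numpy as np

def split_with_weight_sums(sample_indices, go_left, sample_weight):
    # While partitioning indices into left/right children, also accumulate the
    # sum of weights on each side; this extra indexing and addition in the hot
    # loop is where most of the ~10% slowdown comes from.
    left, right = [], []
    weighted_left = weighted_right = 0.0
    for i in sample_indices:
        if go_left[i]:
            left.append(i)
            weighted_left += sample_weight[i]
        else:
            right.append(i)
            weighted_right += sample_weight[i]
    return np.array(left), np.array(right), weighted_left, weighted_right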

@Andrew-Wang-IB45
Contributor Author

After repeated benchmarks with large sample sizes and attempts to speed up the fitting process for the HGBT estimator, it seems there is no way to keep track of weighted_n_node_samples with a performance degradation of less than 5%. Fundamentally, just the indexing and accumulation of the sample weights during the splitting process significantly degrades performance, and no amount of optimization breaks through this floor. I have also attempted other approaches, but they do not seem to work either. At this point, I think either making this optional or closing this PR, as suggested by @lorentzenchr, is probably the best course of action. What are your thoughts, @vitaliset?

@vitaliset
Contributor

Hello @Andrew-Wang-IB45. Unfortunately, given that performance is a critical concern in HistGradientBoosting, it may not be viable to set this calculation as the default, as you mentioned.

Could you please evaluate how the performance of sklearn.inspection.partial_dependence with the recursion method under this pull request compares to the brute-force method in the current version (using the same training set)?

You will need to modify this part of the existing code (or remove the _fitted_with_sw attribute, I guess), as the existing implementation always falls back to the method="brute" approach for a HistGradientBoosting model trained with sample weights:

if getattr(self, "_fitted_with_sw", False):
    raise NotImplementedError(
        "{} does not support partial dependence "
        "plots with the 'recursion' method when "
        "sample weights were given during fit "
        "time.".format(self.__class__.__name__)
    )

If the performance improvement is really significant, we can consider maintaining it as an optional feature; otherwise, it might not be worthwhile to include it and we can close this with your study.
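A rough sketch of such a comparison (dataset size and weights are placeholders; on unmodified main the recursion call for a weighted fit raises the NotImplementedError quoted above, so this assumes the PR branch or the guard removed):

from time import perf_counter
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=100_000, n_features=20, random_state=0)
sw = np.random.RandomState(0).uniform(0.5, 2.0, size=y.shape[0])
est = HistGradientBoostingRegressor(random_state=0).fit(X, y, sample_weight=sw)

for method in ("brute", "recursion"):
    tic = perf_counter()
    partial_dependence(est, X, features=[0], method=method)
    print(f"{method}: {perf_counter() - tic:.2f}s")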

@Andrew-Wang-IB45
Contributor Author

Hi @vitaliset, I will get to it in a few days. To clarify, I should compare the brute and recursion methods for an HGBT estimator fitted on a large dataset, both with and without sample weights, correct? It seems that there is no existing benchmark for the PDP calculations, so I will write some custom benchmarks using the same methodology. Also, should I modify the test files to ensure that they pass, or can we ignore this for now?

@lorentzenchr
Member

I can only repeat myself: Making it optional might be best. A dedicated method could add the weights after training is complete. From there, we can decide whether to add a param to the estimators.
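A sketch of what such a dedicated post-fit step might look like (a hypothetical helper, not an existing scikit-learn API): after a normal unweighted fit, route the weighted training data through each fitted tree once and attach the per-node weight sums, leaving the hot fitting loops untouched.

def add_node_sample_weights(root, X_binned, sample_weight):
    # Hypothetical post-fit pass over one fitted tree: push every (binned)
    # training sample down the tree and accumulate the weight seen at each
    # node, so fit() itself stays untouched. Assumes simple node objects with
    # is_leaf, feature_idx, bin_threshold and left/right children.
    for x, w in zip(X_binned, sample_weight):
        node = root
        while True:
            node.weighted_n_node_samples = getattr(node, "weighted_n_node_samples", 0.0) + w
            if node.is_leaf:
                break
            node = node.left if x[node.feature_idx] <= node.bin_threshold else node.right
    return root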

@Andrew-Wang-IB45
Contributor Author

I agree with your assessment. In that case, the fit method should not have its signature changed. It may be better to open a new PR for the dedicated method that adds the sample weights post-training and close this one later, since the two approaches are radically different.

@Andrew-Wang-IB45 Andrew-Wang-IB45 deleted the hist_gradient_boosting_fit_sample_weights branch June 14, 2023 03:39