ENH Support sample weights when fitting HistGradientBoosting estimator #25431


Conversation

Andrew-Wang-IB45
Contributor

@Andrew-Wang-IB45 Andrew-Wang-IB45 commented Jan 19, 2023

Reference Issues/PRs

Partially addresses #25210. See also #24872.

What does this implement/fix? Explain your changes.

As discussed in the two issues above, this pull request supports user-supplied sample weights when fitting the HistGradientBoosting estimator. Specifically, it passes the sample weights to the TreeGrower and computes and stores weighted_n_node_samples as an attribute of each node. The related predictors are also updated to use weighted_n_node_samples instead of count so that the sample weights are taken into account.
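Conceptually, the weighted count of a node is just the sum of the sample weights of the samples routed to that node; with unit weights it reduces to the plain count stored today. A minimal sketch with a hypothetical helper (not the grower's actual code):

import numpy as np

def weighted_n_node_samples(sample_weight, sample_indices):
    # Sum of the weights of the samples that ended up in this node.
    return np.sum(sample_weight[sample_indices])

sw = np.array([1.0, 2.0, 0.5, 1.0])
node_indices = np.array([0, 1, 3])
print(weighted_n_node_samples(sw, node_indices))  # 4.0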

Any other comments?

Since I have modified the interface functions of the HistGradientBoosting estimator and the TreeGrower, any comments about the applicability, documentation, and formatting would be appreciated. In particular, is min_weight_fraction_leaf an appropriate addition to the constructor of the HistGradientBoosting estimator, or should we leave it for later?

Tasklist

  • Pass sample weights into the TreeGrower.
  • Compute and store the weighted_n_node_samples for each TreeNode.
  • Replace all references to count with weighted_n_node_samples in the predictor for the HistGradientBoosting estimator.
  • Write tests for the correctness of the use of sample weights in the HistGradientBoosting estimator.
  • Benchmark performance and memory footprint of updated HistGradientBoosting estimator.
  • Update documentation of all public functions that are changed.

@Andrew-Wang-IB45
Contributor Author

Since this PR computes and keeps track of weighted_n_node_samples in each node, the updated HistGradientBoosting model is expected to be somewhat slower to fit and to consume more memory. The profiling below quantifies the extent of these changes.

Profiling

Classification
import cProfile
from pstats import Stats
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
X, y = make_classification(n_samples=50000, n_features=100, random_state=0)
with cProfile.Profile() as pr:
    hgbc = HistGradientBoostingClassifier(random_state=0).fit(X, y)
stats = Stats(pr)
stats.sort_stats("tottime").print_stats(10)
Original
         119005 function calls (113172 primitive calls) in 4.035 seconds

   Ordered by: internal time
   List reduced from 379 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2083    1.476    0.001    1.476    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3680    0.404    0.000    0.404    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2010    0.340    0.000    0.340    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
      207    0.275    0.001    0.275    0.001 {method 'sort' of 'numpy.ndarray' objects}
        1    0.204    0.204    4.035    4.035 gradient_boosting.py:335(fit)
      100    0.192    0.002    0.192    0.002 {method 'partition' of 'numpy.ndarray' objects}
     2190    0.184    0.000    0.238    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2190    0.139    0.000    2.473    0.001 grower.py:450(split_next)
  4453/73    0.107    0.000    0.107    0.001 grower.py:709(_fill_predictor_arrays)
        2    0.059    0.029    0.059    0.029 {sklearn.ensemble._hist_gradient_boosting._binning._map_to_bins}
New
         133627 function calls (127760 primitive calls) in 4.857 seconds

   Ordered by: internal time
   List reduced from 382 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2083    1.647    0.001    1.647    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     2190    0.489    0.000    3.178    0.001 grower.py:469(split_next)
     3680    0.453    0.000    0.453    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2010    0.370    0.000    0.370    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
      207    0.268    0.001    0.268    0.001 {method 'sort' of 'numpy.ndarray' objects}
        1    0.230    0.230    4.857    4.857 gradient_boosting.py:338(fit)
     2190    0.206    0.000    0.269    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
      100    0.190    0.002    0.190    0.002 {method 'partition' of 'numpy.ndarray' objects}
  4453/73    0.117    0.000    0.117    0.002 grower.py:741(_fill_predictor_arrays)
     5457    0.099    0.000    0.099    0.000 {method 'reduce' of 'numpy.ufunc' objects}
Regression
import cProfile
from pstats import Stats
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
X, y = make_regression(n_samples=50000, n_features=100, random_state=0)
with cProfile.Profile() as pr:
    hgbc = HistGradientBoostingRegressor(random_state=0).fit(X, y)
stats = Stats(pr)
stats.sort_stats("tottime").print_stats(10)
Original
         159258 function calls (151813 primitive calls) in 4.672 seconds

   Ordered by: internal time
   List reduced from 331 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2999    1.705    0.001    1.705    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     5856    0.682    0.000    0.682    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2899    0.381    0.000    0.381    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
      204    0.265    0.001    0.265    0.001 {method 'sort' of 'numpy.ndarray' objects}
     3000    0.264    0.000    0.336    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     3000    0.193    0.000    3.247    0.001 grower.py:450(split_next)
      100    0.187    0.002    0.187    0.002 {method 'partition' of 'numpy.ndarray' objects}
        1    0.158    0.158    4.672    4.672 gradient_boosting.py:335(fit)
 6100/100    0.140    0.000    0.140    0.001 grower.py:709(_fill_predictor_arrays)
      101    0.055    0.001    0.055    0.001 {built-in method numpy.ascontiguousarray}
New
         179189 function calls (171744 primitive calls) in 5.787 seconds

   Ordered by: internal time
   List reduced from 334 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2999    1.800    0.001    1.800    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3000    0.806    0.000    4.228    0.001 grower.py:469(split_next)
     5856    0.740    0.000    0.740    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2899    0.410    0.000    0.410    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3000    0.289    0.000    0.371    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
      204    0.265    0.001    0.265    0.001 {method 'sort' of 'numpy.ndarray' objects}
        1    0.188    0.188    5.787    5.787 gradient_boosting.py:338(fit)
      100    0.185    0.002    0.185    0.002 {method 'partition' of 'numpy.ndarray' objects}
     7205    0.149    0.000    0.149    0.000 {method 'reduce' of 'numpy.ufunc' objects}
 6100/100    0.146    0.000    0.146    0.001 grower.py:741(_fill_predictor_arrays)
Benchmark

Using bench_hist_gradient_boosting.py

Old
Data size: 1000 samples train, 1000 samples test.
Fitting a sklearn model...
score: 0.9730
fit duration: 0.156s,
score duration: 0.003s,
Data size: 10000 samples train, 10000 samples test.
Fitting a sklearn model...
score: 0.9840
fit duration: 0.222s,
score duration: 0.008s,
Data size: 100000 samples train, 100000 samples test.
Fitting a sklearn model...
score: 0.9860
fit duration: 0.600s,
score duration: 0.055s,
Data size: 500000 samples train, 500000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.337s,
score duration: 0.266s,
Data size: 1000000 samples train, 1000000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.879s,
score duration: 0.501s,

Benchmark Base Old

New
Data size: 1000 samples train, 1000 samples test.
Fitting a sklearn model...
score: 0.9730
fit duration: 0.278s,
score duration: 0.007s,
Data size: 10000 samples train, 10000 samples test.
Fitting a sklearn model...
score: 0.9840
fit duration: 0.531s,
score duration: 0.013s,
Data size: 100000 samples train, 100000 samples test.
Fitting a sklearn model...
score: 0.9860
fit duration: 1.133s,
score duration: 0.078s,
Data size: 500000 samples train, 500000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 2.592s,
score duration: 0.314s,
Data size: 1000000 samples train, 1000000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 3.934s,
score duration: 0.566s,

Benchmark Base New

Finally, the memory footprint increases because the predictor stores weighted_n_node_samples (np.float64) for each node instead of count (uint32).
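As a rough sanity check on that overhead (a small sketch, independent of the actual predictor record layout):

import numpy as np

# count is a uint32 (4 bytes); weighted_n_node_samples needs a float64 (8 bytes),
# so each node in the predictor arrays grows by the dtype size difference.
per_node_overhead = np.dtype(np.float64).itemsize - np.dtype(np.uint32).itemsize
print(per_node_overhead)  # 4 extra bytes per node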

@Andrew-Wang-IB45 Andrew-Wang-IB45 marked this pull request as ready for review February 10, 2023 07:40
@Andrew-Wang-IB45 Andrew-Wang-IB45 changed the title from "[WIP] ENH Support sample weights when fitting HistGradientBoosting estimator" to "ENH Support sample weights when fitting HistGradientBoosting estimator" Feb 10, 2023
@Andrew-Wang-IB45
Contributor Author

Hi, @vitaliset and @glemaitre, could you please review this pull request at your convenience? Specifically, I need some help thinking of helpful test cases for the test_grower.py file due to the addition of the parameter min_weight_leaf. Thank you!

@lorentzenchr
Member

@Andrew-Wang-IB45 This PR doubles the fit duration, which would be a huge deterioration for users. Unless this is resolved, we can't include it in scikit-learn.

@Andrew-Wang-IB45
Contributor Author

Hi @lorentzenchr. Thanks for your feedback. If the fit duration can be reduced to a level similar to the current version, would this PR still be considered for inclusion in scikit-learn?

@lorentzenchr
Member

lorentzenchr commented Mar 19, 2023

Yes. If the impact on fit and predict time as well as memory usage is very small, we can consider it, but otherwise not.

Note that HGBT is performance-critical code where we usually care about a few percent, or even 1 percent.

Would it suffice to have the sum of weights only in the final leaf nodes?

@Andrew-Wang-IB45
Contributor Author

I will look into this further to reduce the performance hit of fit and predict in the HGBT estimator. From my preliminary research, it seems that, unfortunately, it does not suffice to have the sum of weights only in the final leaf nodes, since the recursive partial dependence computation relies on the weighted proportions at every tree node.
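To make that concrete, here is a minimal sketch of the recursion described in the linked blog post, using hypothetical node objects rather than scikit-learn's internal predictor format. When a split is on a feature we are not conditioning on, the two children are averaged with their weighted fractions, so every internal node needs its weighted count:

def partial_dependence_recursion(node, x, target_features):
    # Sketch of the recursive PDP traversal (hypothetical node attributes).
    if node.is_leaf:
        return node.value
    if node.feature in target_features:
        child = node.left if x[node.feature] <= node.threshold else node.right
        return partial_dependence_recursion(child, x, target_features)
    # Split on a non-target feature: weight both children by their share of
    # the (sample-weighted) training data that reached this node.
    total = node.left.weighted_n_node_samples + node.right.weighted_n_node_samples
    left_frac = node.left.weighted_n_node_samples / total
    right_frac = node.right.weighted_n_node_samples / total
    return (left_frac * partial_dependence_recursion(node.left, x, target_features)
            + right_frac * partial_dependence_recursion(node.right, x, target_features))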

@Andrew-Wang-IB45
Contributor Author

Hi @lorentzenchr, I made some attempts to speed up the code, mainly by moving the computation of weighted_n_node_samples into the Cython code. Although this change significantly reduced the fit and predict runtimes compared to the earlier iteration of this PR, it is still slower than the main branch. It seems very difficult to maintain the performance of fit and predict while taking sample weights into account during fitting. Given that, should we put this feature on hold until a better approach is found?

Contributor

@vitaliset vitaliset left a comment


Maybe computing these stopping criteria with sample_weight is what makes it less efficient, and they are not essential for calculating the PDP with the recursion method.

For now, I would suggest adding only the essential part (lines 409-411). It is worth a try!

@Andrew-Wang-IB45
Contributor Author

Hi @vitaliset, most of the affected changes would be in test_splitting.py. In addition, since the Splitter would need the sample_weight, I plan to add sample_weight as another parameter. To minimize the impact, I think giving it a default value is prudent.
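A minimal illustration of the default-value idea (a hypothetical class, not the actual Splitter signature): sample_weight=None would fall back to unit weights so existing callers and tests keep working unchanged.

import numpy as np

class SplitterSketch:
    def __init__(self, n_samples, sample_weight=None):
        # None means "unweighted": behave exactly like the current code path.
        if sample_weight is None:
            sample_weight = np.ones(n_samples, dtype=np.float64)
        self.sample_weight = np.asarray(sample_weight, dtype=np.float64)

unweighted = SplitterSketch(n_samples=5)                       # as today
weighted = SplitterSketch(5, sample_weight=[1, 2, 1, 1, 0.5])  # new behaviour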

As for the benchmarks in the PRs with the "performance" label, especially PR 25186, they seem similar to what I did earlier, namely using cProfile on the fit and predict methods and then running the relevant benchmark files. However, since the new times are very close and @lorentzenchr mentioned that HGBT is performance-critical code, I'm not sure my methodology is sufficiently robust. For now, I think the best way forward is to implement the changes, compare the performance of the two versions, and have other reviewers verify the benchmarks.

@vitaliset
Contributor

For now, I think the best way is to implement the changes, compare the performance between the two versions, and have other reviewers verify the benchmarks.

Perfect. That's it.

@Andrew-Wang-IB45
Contributor Author

Here are the most updated performance comparisons:

Profiling

Classification
import cProfile
from pstats import Stats
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
X, y = make_classification(n_samples=50000, n_features=100, random_state=0)
with cProfile.Profile() as pr:
    hgbc = HistGradientBoostingClassifier(random_state=0).fit(X, y)
stats = Stats(pr)
stats.sort_stats("tottime").print_stats(10)
Original
         118531 function calls (112738 primitive calls) in 3.612 seconds

   Ordered by: internal time
   List reduced from 382 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2083    1.314    0.001    1.314    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3680    0.378    0.000    0.378    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2010    0.281    0.000    0.281    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
      207    0.262    0.001    0.262    0.001 {method 'sort' of 'numpy.ndarray' objects}
      100    0.185    0.002    0.185    0.002 {method 'partition' of 'numpy.ndarray' objects}
     2190    0.169    0.000    0.219    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
        1    0.156    0.156    3.612    3.612 gradient_boosting.py:334(fit)
     2190    0.127    0.000    2.197    0.001 grower.py:450(split_next)
  4453/73    0.093    0.000    0.093    0.001 grower.py:709(_fill_predictor_arrays)
      101    0.057    0.001    0.057    0.001 {built-in method numpy.ascontiguousarray}
New
         120429 function calls (114636 primitive calls) in 3.921 seconds

   Ordered by: internal time
   List reduced from 384 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2083    1.462    0.001    1.462    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3680    0.469    0.000    0.469    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
      207    0.268    0.001    0.268    0.001 {method 'sort' of 'numpy.ndarray' objects}
     2010    0.252    0.000    0.252    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     2190    0.222    0.000    0.283    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
      100    0.192    0.002    0.192    0.002 {method 'partition' of 'numpy.ndarray' objects}
     2190    0.142    0.000    2.455    0.001 grower.py:469(split_next)
  4453/73    0.100    0.000    0.100    0.001 grower.py:731(_fill_predictor_arrays)
        2    0.069    0.035    0.069    0.035 {sklearn.ensemble._hist_gradient_boosting._binning._map_to_bins}
        1    0.062    0.062    3.921    3.921 gradient_boosting.py:334(fit)
Regression
import cProfile
from pstats import Stats
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
X, y = make_regression(n_samples=50000, n_features=100, random_state=0)
with cProfile.Profile() as pr:
    hgbc = HistGradientBoostingRegressor(random_state=0).fit(X, y)
stats = Stats(pr)
stats.sort_stats("tottime").print_stats(10)
Original
         158674 function calls (151269 primitive calls) in 4.793 seconds

   Ordered by: internal time
   List reduced from 334 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2999    1.711    0.001    1.711    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     5856    0.783    0.000    0.783    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2899    0.384    0.000    0.384    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3000    0.289    0.000    0.369    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
      204    0.262    0.001    0.262    0.001 {method 'sort' of 'numpy.ndarray' objects}
     3000    0.195    0.000    3.374    0.001 grower.py:450(split_next)
      100    0.186    0.002    0.186    0.002 {method 'partition' of 'numpy.ndarray' objects}
 6100/100    0.136    0.000    0.136    0.001 grower.py:709(_fill_predictor_arrays)
        1    0.108    0.108    4.793    4.793 gradient_boosting.py:334(fit)
        2    0.077    0.038    0.077    0.038 {sklearn.ensemble._hist_gradient_boosting._binning._map_to_bins}
New
         161274 function calls (153869 primitive calls) in 5.161 seconds

   Ordered by: internal time
   List reduced from 335 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2999    1.807    0.001    1.807    0.001 {method 'compute_histograms_brute' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     5856    0.814    0.000    0.814    0.000 {method 'find_node_split' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
     2899    0.410    0.000    0.410    0.000 {method 'compute_histograms_subtraction' of 'sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder' objects}
     3000    0.352    0.000    0.434    0.000 {method 'split_indices' of 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter' objects}
      204    0.278    0.001    0.278    0.001 {method 'sort' of 'numpy.ndarray' objects}
     3000    0.202    0.000    3.582    0.001 grower.py:469(split_next)
      100    0.198    0.002    0.198    0.002 {method 'partition' of 'numpy.ndarray' objects}
 6100/100    0.138    0.000    0.138    0.001 grower.py:731(_fill_predictor_arrays)
        1    0.132    0.132    5.160    5.160 gradient_boosting.py:334(fit)
        2    0.084    0.042    0.084    0.042 {sklearn.ensemble._hist_gradient_boosting._binning._map_to_bins}
Benchmark

Using bench_hist_gradient_boosting.py

Old
Data size: 1000 samples train, 1000 samples test.
Fitting a sklearn model...
score: 0.9730
fit duration: 0.558s,
score duration: 0.003s,
Data size: 10000 samples train, 10000 samples test.
Fitting a sklearn model...
score: 0.9840
fit duration: 0.183s,
score duration: 0.007s,
Data size: 100000 samples train, 100000 samples test.
Fitting a sklearn model...
score: 0.9860
fit duration: 0.535s,
score duration: 0.052s,
Data size: 500000 samples train, 500000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.280s,
score duration: 0.272s,
Data size: 1000000 samples train, 1000000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.735s,
score duration: 0.470s,

Benchmark Base Old

New
Data size: 1000 samples train, 1000 samples test.
Fitting a sklearn model...
score: 0.9730
fit duration: 0.564s,
score duration: 0.003s,
Data size: 10000 samples train, 10000 samples test.
Fitting a sklearn model...
score: 0.9840
fit duration: 0.197s,
score duration: 0.007s,
Data size: 100000 samples train, 100000 samples test.
Fitting a sklearn model...
score: 0.9860
fit duration: 1.090s,
score duration: 0.054s,
Data size: 500000 samples train, 500000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.400s,
score duration: 0.279s,
Data size: 1000000 samples train, 1000000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.927s,
score duration: 0.437s,

Benchmark Base New

Finally, running asv continuous -b HistGradientBoosting origin/main HEAD yields:

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl..
·· Installing fb234b79 <hist_gradient_boosting_fit_sample_weights> into conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl..
· Running 12 total benchmarks (2 commits * 1 environments * 6 benchmarks)
[ 0.00%] · For scikit-learn commit fb234b79 <hist_gradient_boosting_fit_sample_weights> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl
[ 8.33%] ··· Setting up ensemble:103                                                                                                                                                                                                                                                                       ok
[ 8.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit                                                                                                                                                                                                                                145M
[16.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict                                                                                                                                                                                                                            135M
[25.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit                                                                                                                                                                                                                                16.8±1s
[33.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict                                                                                                                                                                                                                          535±100ms
[41.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score                                                                                                                                                                                                             0.7230709112942986
[50.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score                                                                                                                                                                                                            0.9812160155622751
[50.00%] · For scikit-learn commit 18af5508 <main> (round 1/1):
[50.00%] ·· Building for conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl....
[50.00%] ·· Benchmarking conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl
[58.33%] ··· Setting up ensemble:103                                                                                                                                                                                                                                                                       ok
[58.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit                                                                                                                                                                                                                                145M
[66.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict                                                                                                                                                                                                                            135M
[75.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit                                                                                                                                                                                                                              14.4±0.8s
[83.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict                                                                                                                                                                                                                          406±200ms
[91.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score                                                                                                                                                                                                             0.7230709112942986
[100.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score                                                                                                                                                                                                            0.9812160155622751

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Note that the above numbers are noisy due to large run-to-run fluctuations, but in general the performance of the updated HGBT is roughly on par with that of the old one. It would be helpful if others could verify these benchmarks and provide more feedback.

@lorentzenchr
Member

Benchmarking is a hard business. For HGBT, we are usually only interested in sample sizes of 10,000 and higher. You can look into the following PRs for some inspiration on benchmarking:

@Andrew-Wang-IB45
Contributor Author

Hi, thanks for the suggestions. Do you expect the benchmarking for this PR to be as extensive as the one you did?

@lorentzenchr
Member

Do you expect the benchmarking for this PR to be as extensive as the one you did?

I hope not🤞
There are more options:

  • Find a way that has only, say, a 1-5% performance penalty.
  • Rethink the recursive PDP computation and only add what's necessary (my gut feeling is that we do not need sample weights for every node, only for the terminal nodes/leaves).
  • Make it optional.
  • Decide that the costs are too high and close the PR.

@Andrew-Wang-IB45
Contributor Author

Hi, sounds good! I will do some more benchmarking and then work from there.

@vitaliset
Contributor

Fantastic ideas, @lorentzenchr! Thanks for the input.

@lorentzenchr
Member

Note: My gut feeling might be wrong. https://nicolas-hug.com/blog/pdps explains it pretty well.

@Andrew-Wang-IB45
Contributor Author

I had considered using the sample weights only for the terminal nodes, but then the recursive PDP computation from https://nicolas-hug.com/blog/pdps would not work. Additionally, this approach would not save any time, since computing weighted_n_node_samples for just the terminal nodes already incurs a major performance penalty.

From my preliminary benchmarks, it seems that the best I can do for now is a roughly 10% performance penalty. Most of the degradation comes from adding up the sample weights while partitioning the indices and from the overhead of carrying sample_weight while growing the tree. This can be seen as follows:

Old
$ python benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 100
Training set with 8800000 records with 28 features.
Fitting a sklearn model...
Binning 1.971 GB of training data: 3.959 s
Fitting gradient boosted rounds:
...
Fit 100 trees in 114.222 s, (3100 total leaves)
Time spent computing histograms: 67.496s
Time spent finding best splits:  0.566s
Time spent applying splits:      11.339s
Time spent predicting:           4.093s
fitted in 114.515s
predicted in 15.033s, ROC AUC: 0.8228, ACC: 0.7415
New
$ python benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 100
Training set with 8800000 records with 28 features.
Fitting a sklearn model...
Binning 1.971 GB of training data: 3.947 s
Fitting gradient boosted rounds:
...
Fit 100 trees in 123.755 s, (3100 total leaves)
Time spent computing histograms: 67.050s
Time spent finding best splits:  0.565s
Time spent applying splits:      15.002s
Time spent predicting:           4.078s
fitted in 124.087s
predicted in 14.098s, ROC AUC: 0.8228, ACC: 0.7415
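For context, the extra per-split work has roughly this shape (a pure-Python sketch of the idea; the real logic lives in the Cython splitter and is laid out differently):

import numpy as np

def split_with_weight_sums(sample_indices, go_left, sample_weight):
    # While partitioning indices into left/right children, also accumulate the
    # sum of weights on each side; this extra indexing and addition in the hot
    # loop is where most of the ~10% slowdown comes from.
    left, right = [], []
    weighted_left = weighted_right = 0.0
    for i in sample_indices:
        if go_left[i]:
            left.append(i)
            weighted_left += sample_weight[i]
        else:
            right.append(i)
            weighted_right += sample_weight[i]
    return np.array(left), np.array(right), weighted_left, weighted_right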

@Andrew-Wang-IB45
Contributor Author

After repeated benchmarks with large sample sizes and attempts to speed up the fitting process for the HGBT estimator, it seems there is no way to keep track of weighted_n_node_samples with a performance degradation of less than 5%. Fundamentally, just the indexing and accumulation of the sample weights during the splitting process significantly degrades performance, and no amount of optimization breaks through this floor. I have also attempted other approaches, but they do not seem to work either. At this point, I think either making this optional or closing this PR, as suggested by @lorentzenchr, is probably the best course of action. What are your thoughts, @vitaliset?

@vitaliset
Contributor

Hello @Andrew-Wang-IB45. Unfortunately, given that performance is a critical concern in HistGradientBoosting, it may not be viable to set this calculation as the default, as you mentioned.

Could you please evaluate how the performance of sklearn.inspection.partial_dependence with the recursion method under this pull request compares to the brute-force method in the current version (using the same training set)?

You will need to modify this part of the existing code (or remove the _fitted_with_sw attribute, I guess), as the existing implementation always falls back to the method="brute" approach for a HistGradientBoosting model trained with sample weights:

if getattr(self, "_fitted_with_sw", False):
    raise NotImplementedError(
        "{} does not support partial dependence "
        "plots with the 'recursion' method when "
        "sample weights were given during fit "
        "time.".format(self.__class__.__name__)
    )

If the performance improvement is really significant, we can consider maintaining it as an optional feature; otherwise, it might not be worthwhile to include it and we can close this with your study.
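A rough sketch of such a comparison (dataset size and weights are placeholders; on unmodified main the recursion call for a weighted fit raises the NotImplementedError quoted above, so this assumes the PR branch or the guard removed):

from time import perf_counter
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=100_000, n_features=20, random_state=0)
sw = np.random.RandomState(0).uniform(0.5, 2.0, size=y.shape[0])
est = HistGradientBoostingRegressor(random_state=0).fit(X, y, sample_weight=sw)

for method in ("brute", "recursion"):
    tic = perf_counter()
    partial_dependence(est, X, features=[0], method=method)
    print(f"{method}: {perf_counter() - tic:.2f}s")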

@Andrew-Wang-IB45
Contributor Author

Hi @vitaliset, I will get to it in a few days. To clarify, I should compare the brute and recursion methods for an HGBT estimator fitted on a large dataset, both with and without sample weights, correct? It seems that there is no existing benchmark for the PDP calculations, so I will write some custom benchmarks using the same methodology. Also, should I modify the test files to ensure that they pass, or can we ignore this for now?

@lorentzenchr
Member

I can only repeat myself: Making it optional might be best. A dedicated method could add the weights after training is complete. From there, we can decide whether to add a param to the estimators.
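A sketch of what such a dedicated post-fit step might look like (a hypothetical helper, not an existing scikit-learn API): after a normal unweighted fit, route the weighted training data through each fitted tree once and attach the per-node weight sums, leaving the hot fitting loops untouched.

def add_node_sample_weights(root, X_binned, sample_weight):
    # Hypothetical post-fit pass over one fitted tree: push every (binned)
    # training sample down the tree and accumulate the weight seen at each
    # node, so fit() itself stays untouched. Assumes simple node objects with
    # is_leaf, feature_idx, bin_threshold and left/right children.
    for x, w in zip(X_binned, sample_weight):
        node = root
        while True:
            node.weighted_n_node_samples = getattr(node, "weighted_n_node_samples", 0.0) + w
            if node.is_leaf:
                break
            node = node.left if x[node.feature_idx] <= node.bin_threshold else node.right
    return root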

@Andrew-Wang-IB45
Contributor Author

I agree with your assessment. In that case, the fit method should not have its signature changed. It may be better to open a new PR for the dedicated method that adds the sample weights post-training and close this one later, since the two approaches are radically different.

@Andrew-Wang-IB45 Andrew-Wang-IB45 deleted the hist_gradient_boosting_fit_sample_weights branch June 14, 2023 03:39