ENH Support sample weights when fitting HistGradientBoosting estimator #25431
Conversation
Since this PR computes and keeps track of the …

[Profiling results for classification and regression (original vs. new) and the accompanying benchmark (old vs. new) were posted here but are not recoverable from this page.]

Finally, the memory footprint is increased by virtue of the predictor storing the …
Hi, @vitaliset and @glemaitre, could you please review this pull request at your convenience? Specifically, I need some help thinking of helpful test cases for the …

@Andrew-Wang-IB45 This PR doubles the fit duration, which would be a huge deterioration for users. Unless this is resolved, we can't include it in scikit-learn.

Hi @lorentzenchr. Thanks for your feedback. If the fit duration can be reduced to times similar to the current version, would this PR still be considered for inclusion in scikit-learn?

Yes. If the impact on fit and predict time as well as memory usage is very small, we can consider it, but otherwise not. Note that HGBT is performance-critical code where we usually care about a few or even 1 percent. Would it suffice to have the sum of weights only in the final leaf nodes?

I will look into this further to reduce the performance hit of …

Hi @lorentzenchr, I made some attempts to speed up the code, mainly by moving the computations of the …
Maybe making these stopping criteria work with the sample_weight is making it less efficient, and they are not essential for calculating the PDP with the recursion method. For now, I would suggest adding only the essential part (lines 409-411). It is worth a try!
Hi @vitaliset, most of the affected changes would be in … As for the benchmarks in the PRs with the "performance" label, especially PR 25186, they seem similar to how I did it earlier, which is to use …

Perfect. That's it.
Here are the most up-to-date performance comparisons:

[Profiling results for classification and regression (original vs. new) and the accompanying benchmark (old vs. new) omitted; not recoverable from this page.]

Finally, running …

Note that the above numbers are imprecise due to wild fluctuations, but in general, we can see that the performance of the updated …
Benchmarking is a hard business. For HGBT, we are usually only interested in sample sizes of 10,000 and higher. You can look into the following PRs for some inspiration on benchmarking: …
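For what it's worth, a minimal fit-time comparison along these lines might look like the sketch below; the synthetic dataset, its size, the repeat count, and the best-of-n timing are illustrative assumptions, not the protocol used in the PRs referenced above.

```python
# Rough fit-time comparison: HistGradientBoostingClassifier with and without
# sample weights on a dataset well above the 10,000-sample mark.
from time import perf_counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
sample_weight = rng.uniform(0.5, 1.5, size=X.shape[0])


def best_fit_time(sw, n_repeats=5):
    """Return the fastest of n_repeats fits to dampen timing noise."""
    times = []
    for _ in range(n_repeats):
        est = HistGradientBoostingClassifier(random_state=0)
        tic = perf_counter()
        est.fit(X, y, sample_weight=sw)
        times.append(perf_counter() - tic)
    return min(times)


print(f"fit without sample_weight: {best_fit_time(None):.3f} s")
print(f"fit with sample_weight:    {best_fit_time(sample_weight):.3f} s")
```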
Hi, thanks for the suggestions. Do you expect the benchmarking for this PR to be as extensive as the one you did?

I hope not🤞

Hi, sounds good! I will do some more benchmarking and then work from there.

Fantastic ideas, @lorentzenchr! Thanks for the input.
Note: My gut feeling might be wrong. https://nicolas-hug.com/blog/pdps explains it pretty well.
I had considered only using the sample weights for the terminal nodes, but then the recursive PDP computation from https://nicolas-hug.com/blog/pdps would not work. Additionally, this approach would not save any time since computing the …

From my preliminary benchmarks, it seems that the best I can do for now is to incur a 10% performance penalty. Most of the degradation comes from adding up the sample weights while partitioning the indices and the overhead of including …

[Old vs. new benchmark numbers omitted; not recoverable from this page.]
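To make the terminal-nodes point concrete, here is a schematic of the recursion-style partial dependence traversal described in that blog post; the dict-based node layout is a made-up stand-in for illustration, not the actual predictor structure.

```python
# Recursion-method PD traversal on one tree: when the split feature is NOT one
# of the grid features, both children are visited and their contributions are
# averaged using each child's weighted sample count. This is why a weighted
# count is needed on internal nodes, not only on the leaves.
def weighted_pd_recursion(node, x_grid, grid_features):
    if node["is_leaf"]:
        return node["value"]
    if node["feature_idx"] in grid_features:
        # Feature we condition on: follow the branch the grid point falls into.
        go_left = x_grid[node["feature_idx"]] <= node["threshold"]
        child = node["left"] if go_left else node["right"]
        return weighted_pd_recursion(child, x_grid, grid_features)
    # Other features: weighted average over both children.
    left, right = node["left"], node["right"]
    w_left = left["weighted_n_node_samples"]
    w_right = right["weighted_n_node_samples"]
    pd_left = weighted_pd_recursion(left, x_grid, grid_features)
    pd_right = weighted_pd_recursion(right, x_grid, grid_features)
    return (w_left * pd_left + w_right * pd_right) / (w_left + w_right)
```

In a full implementation this is evaluated for every grid point and summed over all trees, but the weighted average at non-grid splits is the step that breaks if only the leaves carry sample-weight totals.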
After repeated benchmarks with large sample sizes and attempts to speed up the fitting process for the …
Hello @Andrew-Wang-IB45. Unfortunately, given that performance is a critical concern in HistGradientBoosting, it may not be viable to set this calculation as the default, as you mentioned. Could you please evaluate how the performance of the …

You will need to modify this part of the existing code (or remove this …):

scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py, lines 1142 to 1148 at 205f3b7

If the performance improvement is really significant, we can consider maintaining it as an optional feature; otherwise, it might not be worthwhile to include it and we can close this with your study.
Hi @vitaliset, I will get to it in a few days. To clarify, I should compare the brute and recursive methods for a …
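A minimal sketch of that brute-versus-recursion comparison might look as follows; the synthetic regression dataset and the single-feature grid are illustrative assumptions.

```python
# Time partial_dependence with method='brute' vs method='recursion'
# on a fitted HistGradientBoostingRegressor.
from time import perf_counter

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=50_000, n_features=20, random_state=0)
est = HistGradientBoostingRegressor(random_state=0).fit(X, y)

for method in ("brute", "recursion"):
    tic = perf_counter()
    partial_dependence(est, X, features=[0], method=method)
    print(f"method={method!r}: {perf_counter() - tic:.3f} s")
```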
I can only repeat myself: Making it optional might be best. A dedicated method could add the weights after training is complete. From there, we can decide whether to add a param to the estimators.

I agree with your assessment. In that case, the …
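A sketch of that "add the weights after training" idea, reusing the made-up node layout from the traversal sketch above; the helper and its name are hypothetical, not part of scikit-learn.

```python
# Hypothetical post-fit pass: route each training sample down an already fitted
# tree and add its weight to every node on its decision path.
def accumulate_node_weights(root, X, sample_weight):
    # Reset any previously stored totals.
    stack = [root]
    while stack:
        node = stack.pop()
        node["weighted_n_node_samples"] = 0.0
        if not node["is_leaf"]:
            stack.extend((node["left"], node["right"]))
    # Single pass over the training data; the fit itself is untouched.
    for x, w in zip(X, sample_weight):
        node = root
        while True:
            node["weighted_n_node_samples"] += w
            if node["is_leaf"]:
                break
            go_left = x[node["feature_idx"]] <= node["threshold"]
            node = node["left"] if go_left else node["right"]
```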
Reference Issues/PRs
Partially addresses #25210. See also #24872.
What does this implement/fix? Explain your changes.
As discussed in the above two issues, this pull request supports user-supplied sample weights when fitting the HistGradientBoosting estimator. Specifically, it passes the sample weights to the TreeGrower and computes and stores the weighted_n_node_samples as an attribute for each node. The related predictors are also updated to use the weighted_n_node_samples instead of the count so as to take the sample weights into account.
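Conceptually, the new per-node quantity is just the sum of the weights of the samples routed to that node. A toy version of the bookkeeping done when a node's sample indices are partitioned is sketched below (plain NumPy, not the actual TreeGrower code).

```python
import numpy as np


def split_node(sample_indices, X_binned, sample_weight, feature_idx, bin_threshold):
    """Toy partition step: route indices left/right and track the weighted counts."""
    go_left = X_binned[sample_indices, feature_idx] <= bin_threshold
    left_indices = sample_indices[go_left]
    right_indices = sample_indices[~go_left]
    # The extra bookkeeping, conceptually: one weighted sum per child node.
    weighted_n_left = sample_weight[left_indices].sum()
    weighted_n_right = sample_weight[right_indices].sum()
    return left_indices, right_indices, weighted_n_left, weighted_n_right
```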
Any other comments?
Since I have modified the interface functions of the HistGradientBoosting estimator and the TreeGrower, any comments about the applicability, documentation, and formatting would be appreciated. In particular, is the min_weight_fraction_leaf an appropriate addition to the instantiation of the HistGradientBoosting estimator, or should we leave this for later?
Tasklist
- Pass the sample weights to the TreeGrower.
- Compute and store weighted_n_node_samples for each TreeNode.
- Replace count with weighted_n_node_samples in the predictor for the HistGradientBoosting estimator.
- … HistGradientBoosting estimator.
- … HistGradientBoosting estimator.