ENH Greatly reduces memory usage of histogram gradient boosting #18242
Conversation
Nice work!
Wow! This is more than a 10x memory usage improvement.
Having results from the other benchmark scripts would also be nice.
@thomasjpfan I am surprised by the benchmark times: all the individual processing steps are slower (even 4x slower) in master, yet the overall execution time of the two PRs is the same?
Thanks @thomasjpfan
Could you please:
- add tests for the `HistogramsPool` class, in particular checks for the `used_indices` and `available_indices` sets (a minimal sketch of such a pool and test follows this list);
- run the higgs boson benchmark for both time and memory usage (with an appropriate number of repetitions where relevant, please)?
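A minimal sketch of what such a pool and its test might look like, following the names used in this discussion (`HistogramsPool`, `used_indices`, `available_indices`); the method names and internals are assumptions, not the PR's actual code:

```python
import numpy as np


class HistogramsPool:
    """Recycle per-node histogram arrays instead of reallocating them (sketch)."""

    def __init__(self, n_features, n_bins):
        self.shape = (n_features, n_bins)
        self.pool = []                  # every histogram array ever allocated
        self.used_indices = set()       # slots currently handed out to tree nodes
        self.available_indices = set()  # slots released and ready for reuse

    def get(self):
        """Return (index, histograms), reusing a released slot when possible."""
        if self.available_indices:
            idx = self.available_indices.pop()
            self.pool[idx][:] = 0       # reset the recycled buffer
        else:
            idx = len(self.pool)
            self.pool.append(np.zeros(self.shape, dtype=np.float64))
        self.used_indices.add(idx)
        return idx, self.pool[idx]

    def release(self, idx):
        """Give a slot back to the pool once its node is done with it."""
        self.used_indices.remove(idx)
        self.available_indices.add(idx)


def test_histograms_pool_indices():
    pool = HistogramsPool(n_features=3, n_bins=4)
    i, _ = pool.get()
    assert pool.used_indices == {i}
    assert pool.available_indices == set()
    pool.release(i)
    assert pool.used_indices == set()
    assert pool.available_indices == {i}
    j, _ = pool.get()
    assert j == i  # the released slot was recycled, not reallocated
```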
I simplified some of the logic in …
I could be wrong, but it seems to me that using weakrefs is not necessary: the goal of using a weakref would be to avoid keeping a histogram object "alive" in a node if the pool dies. But the pool only dies at the very end (when …
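To make the point concrete, here is a tiny, purely illustrative demo (class and variable names invented): a weakref held by a node only matters if the pool, which owns the strong references, can die first.

```python
import weakref


class Histograms:
    """Stand-in for a per-node histogram buffer."""


pool = [Histograms()]            # the pool holds the only strong reference
node_ref = weakref.ref(pool[0])  # a node holding only a weak reference

assert node_ref() is pool[0]     # alive for as long as the pool is alive
pool.clear()                     # the pool "dies" ...
assert node_ref() is None        # ... and the node's weakref goes dead (CPython)
```

Since the pool outlives every node during `fit`, the last two lines never happen in practice, so plain strong references suffice.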
You are correct: weak references are not needed. This PR was updated to remove them, which simplifies things. I ran the higgs benchmark 5 times with n_trees=100, without the memory profiler (it would slow down the timings):

PR:
Time spent computing histograms: 21.6124 +/- 0.2307s
Time spent finding best splits: 0.1922 +/- 0.0076s
Time spent applying splits: 5.8666 +/- 0.1388s
Time spent predicting: 2.3112 +/- 0.010s
fitted in 50.126 +/- 0.660s

master:
Time spent computing histograms: 22.277 +/- 0.085s
Time spent finding best splits: 0.2422 +/- 0.0020s
Time spent applying splits: 6.1414 +/- 0.0402s
Time spent predicting: 2.3508 +/- 0.0139s
fitted in 51.581 +/- 0.1897s

For the memory usage of …
Awesome! My PR had a bit less memory, I think? Can you confirm or deny? This is certainly the cleaner solution.
Using #18163 and running the script in the opening post I get: 1040.11, 841.03 MB
I optimized this more by not needing to fill the array with zeros every time. Now the benchmark above gives:

PR: …

master: …
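The zero-filling optimisation mentioned above presumably relies on the fact that a histogram that will be written in full (for example, one computed by the sibling-subtraction trick) never needs clearing; a sketch of that idea, with invented names and sizes:

```python
import numpy as np

N_FEATURES, N_BINS = 100, 256  # invented sizes for the example


def get_histograms(free_list, fully_overwritten):
    """Fetch a buffer, zeroing it only when the compute kernel accumulates."""
    hist = free_list.pop() if free_list else np.empty((N_FEATURES, N_BINS))
    if not fully_overwritten:
        hist[:] = 0  # only needed when the kernel does `hist[f, b] += ...`
    return hist
```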
The higgs benchmark with 100 trees:

PR: …

Master: …
With …:

Fit 100 trees in 9.217 s, (12700 total leaves)
Time spent computing histograms: 3.940s
Time spent finding best splits: 2.699s
Time spent applying splits: 0.757s
Time spent predicting: 0.020s

369.68, 156.93 MB with the snippet.
I was concurrently applying this fix + addressing the remaining comments + including the cyclic-ref cleaning, which is useful anyway. Here is the final memory usage: …
I updated this PR to work on top of the recently merged #18334, which fixed the root cause of the memory efficiency issue. However, based on @thomasjpfan's tests on macOS, it seems that explicit memory management with the …

@thomasjpfan can you please confirm that you get around a ~145 MB increment with this PR when running the reproducer snippet on macOS?
I ran the benchmark on my MacBook. …

Edit: …
@lorentzenchr that's weird: you do not reproduce the macOS / Python GC underperformance that @thomasjpfan observed on his machine. Maybe it also depends on the speed of the CPU. Your CPU with 1 core seems to be much faster than mine (with 2 physical cores)!
Actually, if I set …
With a newly-updated master and this PR, I get the following result using the snippet at the top, on my laptop (4 threads): …
I observed a similar behaviour on this benchmark, which creates even more histograms. These results are somewhat consistent with those of @lorentzenchr in #18242 (comment). While the minor time difference can be explained by the reduced number of memory allocations, I find the increase in memory usage quite surprising.
Also, I want to explain exactly what this PR does, because it does have some impact on how memory usage evolves over time: let …
This difference can be observed by printing / plotting the …

This might be detrimental in cases where later trees are smaller than the first ones, which is usually what I observed empirically: in this PR, the memory usage will never decrease, even if an iteration needs less memory than a previous one. In master, it will decrease as expected. This is another thing to take into account, on top of the upcoming benchmarks on macOS from Thomas.
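In other words, the pool's footprint follows a high-water mark. A toy illustration of the difference (the per-iteration tree sizes here are invented for the example):

```python
# Toy model: memory held by a grow-only pool vs. per-iteration allocation.
nodes_per_tree = [100, 60, 30, 80, 20]  # hypothetical tree sizes

pool_size = 0
for i, n_nodes in enumerate(nodes_per_tree):
    pool_size = max(pool_size, n_nodes)  # the pool only ever grows
    print(f"iter {i}: master holds ~{n_nodes} histogram sets, "
          f"pooled version holds ~{pool_size}")
```

After the first (largest) tree, the pooled version stays at 100 histogram sets for the rest of the fit, while master drops back down on every smaller tree.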
I am on a 2.3 GHz i9-9880H 9th-generation Intel CPU on macOS Catalina with Python 3.7, and I set …

master:
Fit 100 trees in 13.369 s, (12700 total leaves)
Time spent computing histograms: 4.536s
Time spent finding best splits: 2.749s
Time spent applying splits: 0.739s
Time spent predicting: 0.016s
842.59, 648.93 MB

This PR:
Fit 100 trees in 9.309 s, (12700 total leaves)
Time spent computing histograms: 4.166s
Time spent finding best splits: 2.707s
Time spent applying splits: 0.640s
Time spent predicting: 0.016s
365.10, 151.53 MB

I am a little lost in figuring out why @lorentzenchr's results and mine differ on macOS.
I updated my benchmark runs on macOS, now with OMP_NUM_THREADS=4. In this parallel setting, I see a similar speed improvement from this PR as @thomasjpfan does, but I get quite different results for the memory usage.
I'm really confused that OMP has such a drastic effect on total time in @lorentzenchr's benchmarks, especially regarding "Time spent computing histograms", which goes from 5s to 2.5s. None of the changes involved in this PR are OMP-related. In @thomasjpfan's benchmark just above, OMP does not have a significant effect on the "Time spent computing histograms", as I would expect.
@lorentzenchr in your latest run of #18242 (comment), was your master branch up to date? Does it have the #18341 PR (parallel init) merged?
For the record, it seems that #14392 brings the same benefits (a slight speed improvement from reducing the number of malloc / free calls and the reliance on the Python GC, which is sometimes useful on macOS) while being simpler, by implementing the recycling logic in the histogram builder class itself.
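For comparison, recycling inside the builder itself could look roughly like the sketch below. This is a hypothetical reading of that design, not the actual code of #14392; the names (`HistogramBuilder`, `compute_histograms`, `recycle`) are invented.

```python
import numpy as np


class HistogramBuilder:
    """Histogram builder that recycles its own buffers across nodes (sketch)."""

    def __init__(self, n_features, n_bins):
        self.shape = (n_features, n_bins)
        self._free = []  # buffers released by finished nodes

    def compute_histograms(self, sample_indices):
        # Reuse a released buffer when available instead of allocating anew.
        hist = self._free.pop() if self._free else np.empty(self.shape)
        hist[:] = 0
        # ... accumulate gradients/hessians of sample_indices into hist ...
        return hist

    def recycle(self, hist):
        """Take a buffer back once the node that used it is split or finalized."""
        self._free.append(hist)
```

The trade-off versus a standalone pool class is mainly about where the bookkeeping lives, not what gets recycled.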
@ogrisel I updated my run of #18242 (comment); nothing really changed, except that this PR is now even faster when single-threaded (before 18s, now 16s).
This PR might need a small merge / clean-up, but it's still "competing" with #14392 and we haven't really decided what to do (see the discussions there; basically we need more benchmarks). So I'd keep it open until then.
Reference Issues/PRs
Resolves #18152
Closes #18163
What does this implement/fix? Explain your changes.
Uses a histogram pool to improve the memory usage of histogram gradient boosting. When running this script (the script itself did not survive extraction; a hypothetical stand-in is sketched below):

I get:

This PR: …

Master: …
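Since the original reproducer is missing from this extract, here is a plausible stand-in along the lines described in #18152: random data, a histogram-GBDT fit, and a peak/increment memory measurement with memory_profiler. The data shape, estimator choice, and parameters are all invented, and the "peak, increment" reading of the MB pairs above is an assumption.

```python
import numpy as np
from memory_profiler import memory_usage  # pip install memory_profiler
# Required for the scikit-learn version of the PR era (< 1.0):
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(100_000, 100))
y = rng.normal(size=100_000)


def fit():
    HistGradientBoostingRegressor(max_iter=100).fit(X, y)


# Sample memory every 10 ms while fitting, then report peak and increment.
usage = memory_usage((fit, (), {}), interval=0.01)
print(f"{max(usage):.2f}, {max(usage) - usage[0]:.2f} MB")
```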