[MRG] GBDT: Reuse allocated memory of other histograms #14392
Conversation
I ran some benchmarks on a machine with 2x22 physical cores, that is 88 hyperthreads. I benchmarked using the Higgs boson benchmark script with 100 trees and 255 leaf nodes (and the default learning rate, which is probably sub-optimal from an accuracy point of view, but tuning it is not the scope of this benchmark). Here are the results:
So in conclusion:
Here is the command line I used:
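(The exact command is not reproduced here; as a hedged approximation, the configuration described above, 100 trees, 255 leaf nodes, default learning rate, maps onto the public estimator roughly as follows.)

```python
from sklearn.ensemble import HistGradientBoostingClassifier
# On older scikit-learn versions this estimator required
# `from sklearn.experimental import enable_hist_gradient_boosting` first.

clf = HistGradientBoostingClassifier(
    max_iter=100,         # 100 trees
    max_leaf_nodes=255,   # 255 leaf nodes
    # learning_rate is left at its default, as in the benchmark description
)
```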
I assume you ran it once? Running it multiple times would obviously be ideal ;)
Thanks a lot for the benchmark. This PR addresses this part in particular:
I removed every
I don't believe this is strongly coupled with thread scalability. I don't see how histogram initialization / allocation can be responsible for such an increase in computation time with a high number of threads. I've mentioned this before: Cython generates a lot more OpenMP variables (lastprivate, etc.) than it needs to. I would not be surprised if our scalability issue is due to Cython not producing optimal code. I think this should be our main track here.
I ran some of them several times. In general I would expect the std to be around 1 - 2 s. So the decrease in performance for a large number of threads is (slightly) significant. The increase in performance for a low number of threads is probably not significant. LightGBM is significantly faster for a large number of threads.
If moving away from
The memory is always going to be initialized at some point, @adrinjalali
I'd really rather have memory allocation and initialization in the same place. If it happens later, it's being set, not initialized.
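For context, a minimal illustration of the distinction being discussed: np.zeros allocates and zero-fills immediately, while np.empty only allocates, so the contents are undefined until they are set later.

```python
import numpy as np

a = np.zeros(5)   # allocation + initialization: contents are guaranteed to be 0.0
b = np.empty(5)   # allocation only: contents are whatever happened to be in memory
b[:] = 0.0        # the "initialization" is deferred; here the values are being set
```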
I'm with you on this in the general case, except that:
In any case, both of these points are invalidated by Olivier's benchmark showing the code is, for some reason that I really cannot explain, slower. So this PR is probably never going to get merged anyway.
It is not slower in sequential mode but it tends to scale slightly worse, maybe because of the additional OpenMP bookkeeping instructions generated by Cython, as you suggested (although I have not checked myself).
Looks like this isn't a relevant addition, closing.
Re-purposing this as a simpler alternative to #18242. This PR would not suffer from the issue detailed in #18242 (comment), as the "pool" is not cross-grower but intra-grower. I observe a similar behaviour as detailed in #18242 (comment): a tiny bit faster, but a little bit more memory. I don't understand why. @thomasjpfan, would you mind trying this version on macOS?
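As a rough illustration of the intra-grower reuse idea (hypothetical names, not the actual scikit-learn internals), a grower-local pool hands out previously allocated histogram arrays instead of allocating new ones for every node:

```python
import numpy as np

class HistogramPool:
    """Grower-local pool of histogram arrays (illustrative sketch only)."""

    def __init__(self, n_features, n_bins):
        self.shape = (n_features, n_bins)
        self._free = []  # arrays allocated earlier and no longer in use

    def get(self):
        # Reuse a released array if available, otherwise allocate a new,
        # uninitialized one (zero-filling is left to the caller if needed).
        return self._free.pop() if self._free else np.empty(self.shape)

    def release(self, hist):
        # Return the array to the pool so a later node can reuse its memory.
        self._free.append(hist)
```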
I ran the benchmark on my MacBook.
System:
Memory usage seems more or less the same (within the variance over several runs). The speed improvement in the parallel setting seems smaller compared to #18242.
This is indeed simpler than the #18242 solution. I ran the benchmarks again with this branch vs up-to-date master, and it seems that I can notice a systematic small performance improvement (5% to 20%) on several machines with various numbers of threads (2 to 44 on a 44-physical-core host).
Please add a what's new entry. I am in favor of merging this solution over #18242.
@thomasjpfan can you please confirm that this solution brings a memory efficiency advantage similar to that of #18242 on your macOS machine?
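For reference, one way to measure peak memory of a fit (a hedged sketch; not necessarily how the numbers in this thread were obtained) is memory_profiler:

```python
import numpy as np
from memory_profiler import memory_usage
from sklearn.ensemble import HistGradientBoostingClassifier

# Placeholder data with a Higgs-like shape (28 features); real benchmarks
# in this thread use the actual Higgs boson dataset.
X = np.random.rand(100_000, 28)
y = np.random.randint(0, 2, size=100_000)

clf = HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=255)
samples = memory_usage((clf.fit, (X, y)), interval=0.1)
print(f"peak memory: {max(samples):.2f} MB")
```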
I also ran the same benchmark with #18242 and I do not find it faster than this PR (maybe even the opposite, but close to noise level). The difference with #18242 is that this PR does more allocations (less reuse) because the histogram builder is not reused across growers, while the pool of #18242 is. However, the complexity of the release method of #18242 is a bit higher and this method is called quite often.
I think the results on my OSX machine are off. I ran the benchmark on azure with the OSX instance and here are the results:

This PR - azure link
Fit 100 trees in 19.672 s, (12700 total leaves)
Time spent computing histograms: 8.825s
Time spent finding best splits: 6.900s
Time spent applying splits: 0.934s
Time spent predicting: 0.016s
257.50, 115.83 MB

master - azure link
Fit 100 trees in 16.726 s, (12700 total leaves)
Time spent computing histograms: 7.558s
Time spent finding best splits: 5.755s
Time spent applying splits: 0.690s
Time spent predicting: 0.014s
253.41, 111.83 MB

PR #18242 - azure link
Fit 100 trees in 14.552 s, (12700 total leaves)
Time spent computing histograms: 5.819s
Time spent finding best splits: 5.971s
Time spent applying splits: 0.763s
Time spent predicting: 0.013s
348.48, 207.54 MB

Given this and @lorentzenchr's results with memory, I think this PR is okay in terms of memory. (I would not take the timings on azure too seriously because the CPU resource allocation may change between runs.)
I ran my own benchmarks several times each because I noticed that from time to time I could observe 20%+ slower outliers for any of the configurations.
Considering that, plus the smallish speed improvement in general, the slight increase in memory usage, and the fact that @thomasjpfan's benchmark results seem to be off (so this isn't fixing anything anyway), we might want to consider merging neither #18242 nor this PR?
On a big enough machine I found that this PR seemed to improve speed a bit over master and #18242: #14392 (review). In my opinion #18242 is too complex compared to the benefit (if any) over master. But as this PR is much simpler, I think it deserves more benchmarking with repeated measurements to get precise error bars, on various numbers of threads and maybe different datasets (e.g. different numbers of features / samples / trees).
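A hedged sketch of the repeated-measurement protocol suggested here (the repeat count and estimator parameters are placeholders):

```python
import numpy as np
from time import perf_counter
from sklearn.ensemble import HistGradientBoostingClassifier

def time_fit(X, y, n_repeats=5, **params):
    """Fit several times and report mean +/- std, to get error bars."""
    timings = []
    for _ in range(n_repeats):
        clf = HistGradientBoostingClassifier(**params)
        tic = perf_counter()
        clf.fit(X, y)
        timings.append(perf_counter() - tic)
    timings = np.asarray(timings)
    return timings.mean(), timings.std()

# Example usage: mean, std = time_fit(X, y, max_iter=100, max_leaf_nodes=255)
```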
EDIT: lots of things have changed, start at #14392 (comment)
Addresses #14306
This PR:
only initializes the histogram arrays to 0 when needed (uses np.empty instead of np.zeros)
ping @ogrisel
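As a hedged illustration of why zero-filling is only needed some of the time (illustrative names; the real builder is Cython and more involved): a histogram accumulated directly from samples must start at zero, while one obtained via the usual parent-minus-sibling subtraction overwrites every bin anyway.

```python
import numpy as np

def build_histogram_brute(n_bins, binned_feature, gradients):
    hist = np.empty(n_bins)   # allocate without zero-filling
    hist[:] = 0.0             # zero only here, because we accumulate into it
    for bin_idx, grad in zip(binned_feature, gradients):
        hist[bin_idx] += grad
    return hist

def build_histogram_subtraction(parent_hist, sibling_hist):
    hist = np.empty_like(parent_hist)                  # no zeroing needed:
    np.subtract(parent_hist, sibling_hist, out=hist)   # every bin is overwritten
    return hist
```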