[MRG] MNT Initialize histograms in parallel and don't call np.zero in Hist-GBDT #18341
Conversation
OK, let me re-run a quick benchmark with many threads on this PR.
On a machine with 36 physical cores (72 HyperThreads):
So no significant improvement (I re-ran the benchmark several times and observed roughly +/- 0.5s of variation in total fit time for both branches); I was almost alone on the machine when running the benchmark. For information, LightGBM with 36 threads yields the following on the same machine:
Without limiting the number of threads to the number of physical cores (no
So maybe this branch mitigates some of the performance degradation caused by over-subscription, when the number of threads is larger than the number of physical CPU cores. But this is probably in the noise, because I get +/- 2s variations in that regime. So in conclusion: no significant performance impact on machines with many threads.
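As an aside, here is a minimal sketch of how the OpenMP thread count can be capped at the number of physical cores, assuming threadpoolctl is installed. This is not the exact setup used for the numbers above, and the `OMP_NUM_THREADS` environment variable mentioned later in this thread achieves the same thing from the shell; the dataset and parameters below are made up.

```python
# Sketch: restrict OpenMP thread pools to the 36 physical cores mentioned
# above instead of the 72 hyperthreads (dataset and parameters are made up).
from threadpoolctl import threadpool_limits

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=50_000, n_features=50, random_state=0)
clf = HistGradientBoostingClassifier(max_iter=20, random_state=0)

with threadpool_limits(limits=36, user_api="openmp"):
    clf.fit(X, y)
```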
Thanks for the benchmarks @ogrisel. Just to make sure: did you use the snippet provided here? It's different from the other ones, as I designed it to build lots of histograms. On my desktop computer with 16 threads, the execution time drops from 28s (master) to 10s (this PR), so that's quite a significant improvement in this case. Since you did not observe a regression in your benchmarks, I'd say we still have a strong incentive to merge this as a stand-alone change.
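The exact snippet referred to above is not reproduced in this thread. The following is only an illustrative sketch of a histogram-heavy benchmark of that kind: the dataset size and estimator parameters are made-up assumptions, and it targets a recent scikit-learn where the estimator is no longer experimental.

```python
# Illustrative only: force the grower to build histograms for many nodes by
# allowing many leaf nodes per tree (values below are made up).
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=100_000, n_features=100, random_state=0)

clf = HistGradientBoostingClassifier(
    max_iter=50,
    max_leaf_nodes=256,  # many nodes per tree -> many histogram allocations
    random_state=0,
)

tic = perf_counter()
clf.fit(X, y)
print(f"fit time: {perf_counter() - tic:.1f}s")
```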
My CPU with 16 threads has only 8 cores. Using OMP_NUM_THREADS=8, across multiple runs:
So in my case I observe a significant improvement, not just in the case of over-subscription.
For the code snippet this yields a huge improvement:
Fantastic!
Let's merge this. It can either be very beneficial or the same speed, but never slower. And it's a very simple code change.
On macOS without OpenMP, there is no regression either.
I have updated #18341 (comment) to also add benchmark results for #18242's branch with parallel zero init + histogram pool. The histogram pool is performance-neutral compared to this branch.
Thanks for crediting Egor, indeed.
:class:`ensemble.HistGradientBoostingClassifier` which results in speed
improvement for problems that build a lot of nodes on multicore machines.
:pr:`18341` by `Olivier Grisel`_, `Nicolas Hug`_, `Thomas Fan`_, and
:user:`Egor Smirnov <SmirnovEgorRu>`.
Crediting @SmirnovEgorRu here, since you opened a similar PR a while ago (#14380).
I am ambivalent. But the memory usage problem observed by @thomasjpfan on his Mac might be too rare to justify the added code complexity. Let's wait for his opinion. In the meantime we can already merge this one.
I believe #14306 isn't about memory but about thread scalability. In particular, @SmirnovEgorRu had identified that
Oops, I merged without waiting for the CI to be green, but I am confident this will not break the build on master. Fingers crossed.
Indeed, I was mistaken. It's an improvement, but it is still not as good as LightGBM in terms of scalability. But I agree we should update the benchmark results in that issue to reflect the new master.
To keep the changes more self-contained, this PR extracts the histogram initialization out of #18242, which was already proposed in #14392.
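For readers skimming the thread, here is a rough sketch of the idea in plain Python/NumPy. The real implementation is Cython and the per-feature loop below is an OpenMP `prange`, so everything except the general shape of the change should be treated as illustrative rather than the actual scikit-learn internals.

```python
import numpy as np

# Field names mirror the histogram dtype used by the hist-GBDT code.
HISTOGRAM_DTYPE = np.dtype([
    ("sum_gradients", np.float64),
    ("sum_hessians", np.float64),
    ("count", np.uint32),
])

def allocate_histograms(n_features, n_bins):
    # Before: np.zeros(...) zero-initialized the whole array sequentially,
    # in the calling thread, before the parallel histogram computation.
    # After: allocate uninitialized memory up front...
    histograms = np.empty(shape=(n_features, n_bins), dtype=HISTOGRAM_DTYPE)

    # ...and zero each feature's histogram inside the per-feature loop.
    # In the Cython code this loop is a prange, so the zeroing is spread over
    # the threads that subsequently fill the histograms, which also helps
    # first-touch memory placement on NUMA machines.
    for feature_idx in range(n_features):
        histograms[feature_idx]["sum_gradients"] = 0.0
        histograms[feature_idx]["sum_hessians"] = 0.0
        histograms[feature_idx]["count"] = 0
        # ...the histogram for this feature would be built here...

    return histograms
```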
Memory usage is the same as in master, but the fit runs in 40s instead of 45s (across many runs) on the following benchmarks on my laptop (4 threads):
Other benchmarks welcome, especially one with 88 hyperthreads as in #14392 (comment).
CC @ogrisel @thomasjpfan