
[MRG] GBDT: Reuse allocated memory of other histograms #14392


Closed

Conversation

NicolasHug
Member

@NicolasHug NicolasHug commented Jul 17, 2019

EDIT: lots of things have changed; start at #14392 (comment)

Addresses #14306

This PR:

  • only initializes the histogram arrays to 0 when needed (uses np.empty instead of np.zeros)
  • re-uses the allocated memory of other histograms when possible within the same grower (see the sketch below)
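
To make the reuse concrete, here is a minimal sketch of a per-grower free list handing out np.empty buffers. All names are illustrative and the plain float dtype is an assumption (the real histograms are structured arrays with sum_gradients / sum_hessians / count fields), so this is not the PR's actual code:

```python
import numpy as np

class HistogramPool:
    """Per-grower free list of histogram buffers (illustrative only)."""

    def __init__(self, n_features, n_bins):
        self.shape = (n_features, n_bins)
        self._free = []  # buffers of nodes that have already been split

    def get(self):
        # Reuse a previously allocated buffer when possible; otherwise
        # allocate uninitialized memory (np.empty, not np.zeros).
        return self._free.pop() if self._free else np.empty(self.shape)

    def release(self, histograms):
        # Give a finalized node's buffer back to the pool for later reuse.
        self._free.append(histograms)
```

In this sketch a grower would call get() whenever it needs histograms for a new node and release() once that node has been split, so fresh allocations only happen when the free list is empty.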

ping @ogrisel

@ogrisel
Member

ogrisel commented Jul 17, 2019

I ran some benchmarks on a machine with 2x22 physical cores, that is 88 hyperthreads. I benchmarked using the Higgs boson benchmark script with 100 trees and 255 leaf nodes (and the default learning rate, which is probably sub-optimal from an accuracy point of view, but accuracy is not in the scope of this benchmark).

Here are the results:

## scikit-learn master

88 threads:  70s
44 threads:  46s
22 threads:  48s
11 threads:  62s
5  threads: 110s
1  thread:  303s


## scikit-learn histogram_better_init

88 threads:  77s
44 threads:  51s
22 threads:  49s
11 threads:  68s
5  threads: 105s
1  thread:  300s


## lightgbm

88 threads:  45s
44 threads:  35s
22 threads:  39s
11 threads:  51s
5  threads:  89s
1  thread:  302s

So in conclusion:

  • hyperthreading is always detrimental (both for sklearn and lightgbm). Limiting to the number of physical cores is best.

  • Unfortunately, this PR does not improve things with large numbers of threads, quite the opposite.

  • For small numbers of threads, this PR seems to do slightly better, although I am not 100% sure this is significant.

  • LightGBM is still the fastest, although it is far from linearly scalable beyond 10 threads.

  • I wrongly remembered that the scikit-learn scalability profile was worse than what I measured today.

  • I have not tried to mess around with NUMA and thread affinity. I let the OS balance the threads the way it wants.

  • I have not tried to measure the impact of this PR on memory usage.

@ogrisel
Member

ogrisel commented Jul 17, 2019

Here is the command line I used:

$ OMP_NUM_THREADS=22 python benchmarks/bench_hist_gradient_boosting_higgsboson.py \
    --n-leaf-nodes 255 --n-trees 100 --lightgbm

@amueller
Member

I assume you ran it once? Running it multiple times would obviously be ideal ;)

@NicolasHug
Member Author

Thanks a lot for the benchmark.

This PR addresses this part in particular:

Scikit-learn’s code spends too much time in np.zeros() functions

I removed every np.zeros call related to the histograms.


I don't believe this is strongly coupled with thread scalability. I don't see how histogram initialization / allocation can be responsible for such an increase in computation time with a high number of threads.

I've mentioned this before: Cython generates a lot more OpenMP variables (lastprivate, etc.) than it needs to. I would not be surprised if our scalability issue is due to Cython not producing optimal code. I think this should be our main lead here.

@ogrisel
Member

ogrisel commented Jul 18, 2019

I assume you ran it once? Running it multiple times would obviously be ideal ;)

I ran some of them several times. In general I would expect the std to be around 1-2 s. So the decrease in performance for large numbers of threads is (slightly) significant. The increase in performance for low numbers of threads is probably not significant. LightGBM is significantly faster for large numbers of threads.

@adrinjalali
Member

I removed every np.zeros call related to the histograms.

If moving away from np.zeros is not fixing any issue, I would rather have initialized memory than a random, uninitialized part of memory. Wouldn't you agree?

@NicolasHug
Member Author

The memory is always going to be initialized at some point @adrinjalali

@adrinjalali
Member

The memory is always going to be initialized at some point @adrinjalali

I would really rather have memory allocation and initialization in the same place. If it happens later, it's being set, not initialized.

@NicolasHug
Member Author

I'm with you on this in the general case, except that:

  • np.zeros initializes the arrays sequentially, while we can do it in parallel, which should be faster
  • the arrays need not be initialized at all when we use the subtraction trick (see the sketch below). If not initializing them is faster, why initialize?
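
A minimal numpy sketch of the second point (toy shapes and a plain float dtype, not the actual structured histograms): with the subtraction trick, every bin of the sibling histogram is overwritten, so an uninitialized np.empty buffer is enough.

```python
import numpy as np

# Toy shapes; the real histograms are structured arrays, not plain floats.
rng = np.random.default_rng(0)
n_features, n_bins = 4, 256
parent = rng.random((n_features, n_bins))
left_child = rng.random((n_features, n_bins))

right_child = np.empty((n_features, n_bins))      # no zeroing needed
np.subtract(parent, left_child, out=right_child)  # every bin is overwritten
```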

In any case, both these points are invalidated by Olivier's benchmark showing the code is, for some reason that I really cannot explain, slower.

So this PR is probably never going to get merged anyway.

@ogrisel
Member

ogrisel commented Jul 19, 2019

Olivier's benchmark showing the code is, for some reason that I really cannot explain, slower.

It is not slower in sequential mode but it tends to scale slightly worse. Maybe because of the additional OpenMP bookkeeping instructions generated by Cython, as you suggested (although I have not checked myself).

@NicolasHug
Member Author

NicolasHug commented Feb 23, 2020

Looks like this isn't a relevant addition, closing

@NicolasHug NicolasHug reopened this Sep 5, 2020
@NicolasHug NicolasHug changed the title [MRG] GBDT: Better histogram initialization routines [MRG] GBDT: Reuse allocated memory of other histograms Sep 5, 2020
@NicolasHug
Member Author

Re-purposing this as a simpler alternative to #18242. This PR would not suffer from the issue detailed in #18242 (comment), as the "pool" is not cross-grower but intra-grower.

I observe behaviour similar to what is detailed in #18242 (comment): a tiny bit faster, but a little bit more memory. I don't understand why.
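
For context, a toy sketch of what the intra-grower scope means (all names are illustrative, not the PR's actual code): each grower keeps its own free list, so reuse never crosses tree boundaries and the buffers become collectable as soon as the grower is done.

```python
import numpy as np

def grow_one_tree(n_nodes=7, n_features=4, n_bins=256):
    free = []  # intra-grower free list: lives and dies with this grower
    for _ in range(n_nodes):
        hist = free.pop() if free else np.empty((n_features, n_bins))
        hist[:] = 0.0        # or filled via the subtraction trick
        free.append(hist)    # node finalized: buffer available for reuse
    # the free list goes out of scope here, together with the grower

for _ in range(3):           # one grower per boosted tree
    grow_one_tree()
```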

@thomasjpfan, would you mind trying this version on macOS?

CC @ogrisel @lorentzenchr

@lorentzenchr
Member

I ran the benchmark on my MacBook.

  • OMP_NUM_THREADS=1
    • master with ENH Break cyclic references in Histogram GBRT #18334 merged
      Fit 100 trees in 21.487 s, (12700 total leaves)
      Time spent computing histograms: 9.742s
      Time spent finding best splits:  8.849s
      Time spent applying splits:      0.448s
      Time spent predicting:           0.010s
      266.31, 111.96 MB
      
    • this PR
      Fit 100 trees in 20.063 s, (12700 total leaves)
      Time spent computing histograms: 8.498s
      Time spent finding best splits:  8.883s
      Time spent applying splits:      0.489s
      Time spent predicting:           0.009s
      270.54, 115.82 MB
      
  • OMP_NUM_THREADS=4
    • master with ENH Break cyclic references in Histogram GBRT #18334 merged
      Fit 100 trees in 9.165 s, (12700 total leaves)
      Time spent computing histograms: 4.710s
      Time spent finding best splits:  2.327s
      Time spent applying splits:      0.400s
      Time spent predicting:           0.009s
      266.90, 112.63 MB
      
    • this PR
      Fit 100 trees in 8.180 s, (12700 total leaves)
      Time spent computing histograms: 3.860s
      Time spent finding best splits:  2.285s
      Time spent applying splits:      0.406s
      Time spent predicting:           0.009s
      260.19, 105.61 MB
      

System:

  • Python 3.7.2
  • macOS Catalina
  • Intel i7 @ 2.70GHz (8th generation)

@lorentzenchr
Member

Memory usage seems more or less the same (within the variance over several runs). The speed improvement in the parallel setting seems to be smaller than with #18242.

@ogrisel ogrisel left a comment
Member

This is indeed simpler than the #18242 solution. I ran benchmarks again with this branch vs an up-to-date master and it seems that I can notice a systematic small performance improvement (5% to 20%) on several machines with various numbers of threads (2 to 44 on a 44-physical-core host).

Please add a what's new entry. I am in favor of merging this solution over #18242.

@thomasjpfan can you please confirm that this solution brings a memory efficiency advantage similar to that of #18242 on your macOS machine?

@ogrisel
Member

ogrisel commented Sep 7, 2020

I also ran the same benchmark with #18242 and I do not find it faster than this PR (maybe even the opposite, but close to noise level). The difference with #18242 is that this PR does more allocations (less reuse) because the histogram builder is not reused across growers while the pool of #18242 is. However, the complexity of the release method of #18242 is a bit higher and this method is called quite often.

@thomasjpfan
Member

@thomasjpfan can you please confirm that this solution brings a memory efficiency advantage similar to that of #18242 on your macOS machine?

I think the results on my macOS machine are off. I ran the benchmark on Azure with the macOS instance and here are the results:

This PR - azure link

Fit 100 trees in 19.672 s, (12700 total leaves)
Time spent computing histograms: 8.825s
Time spent finding best splits:  6.900s
Time spent applying splits:      0.934s
Time spent predicting:           0.016s
257.50, 115.83 MB

master - azure link

Fit 100 trees in 16.726 s, (12700 total leaves)
Time spent computing histograms: 7.558s
Time spent finding best splits:  5.755s
Time spent applying splits:      0.690s
Time spent predicting:           0.014s
253.41, 111.83 MB

PR #18242 - azure link

Fit 100 trees in 14.552 s, (12700 total leaves)
Time spent computing histograms: 5.819s
Time spent finding best splits:  5.971s
Time spent applying splits:      0.763s
Time spent predicting:           0.013s
348.48, 207.54 MB

Given this and @lorentzenchr's results with memory, I think this PR is okay in terms of memory.

(I would not take the timings on azure too seriously because the CPU resource allocation may change between runs)

@ogrisel
Member

ogrisel commented Sep 8, 2020

I run my own benchmarks several times each because I noticed that from time to time I could observe 20%+ slower outliers for any of the configurations.

@NicolasHug
Member Author

NicolasHug commented Sep 8, 2020

I run my own benchmarks several times each because I noticed that from time to time I could observe 20%+ slower outliers for any of the configurations.

Considering that, plus the smallish speed improvement in general, the slight increase in memory usage, and @thomasjpfan's benchmark results seeming to be off (so this isn't fixing anything anyway), we might want to consider merging neither #18242 nor this PR?

@cmarmo cmarmo added the Needs Decision Requires decision label Sep 17, 2020
@ogrisel
Member

ogrisel commented Nov 4, 2020

Considering that, plus the smallish speed improvement in general, the slight increase in memory usage, and @thomasjpfan's benchmark results seeming to be off (so this isn't fixing anything anyway), we might want to consider merging neither #18242 nor this PR?

On a big enough machine I found that this PR seemed to improve speed a bit over master and #18242: #14392 (review)

In my opinion #18242 is too complex compared to the benefit (if any) over master. But as this PR is much simpler, I think it deserves more benchmarking with repeated measurements to get precise error bars, on various numbers of threads and maybe different datasets (e.g. different numbers of features / samples / trees).
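
One possible way to get those error bars, sketched under the assumption that the repetition logic lives outside the benchmark script (the script path and flags are the ones used earlier in this thread; the thread counts and repeat count are arbitrary):

```python
import os
import statistics
import subprocess
import time

cmd = ["python", "benchmarks/bench_hist_gradient_boosting_higgsboson.py",
       "--n-leaf-nodes", "255", "--n-trees", "100"]

for n_threads in (1, 4, 22, 44):
    env = {**os.environ, "OMP_NUM_THREADS": str(n_threads)}
    durations = []
    for _ in range(5):  # repeat each configuration to estimate the spread
        tic = time.perf_counter()
        subprocess.run(cmd, env=env, check=True)
        durations.append(time.perf_counter() - tic)
    print(f"{n_threads:>2} threads: {statistics.mean(durations):.1f}s "
          f"+/- {statistics.stdev(durations):.1f}s")
```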

Base automatically changed from master to main January 22, 2021 10:51
@NicolasHug NicolasHug closed this Jul 22, 2022