
[MRG] GBDT: Reuse allocated memory of other histograms #14392


Closed

Conversation

NicolasHug
Member

@NicolasHug NicolasHug commented Jul 17, 2019

EDIT: lots of things have changed; start at #14392 (comment)

Addresses #14306

This PR:

  • only initializes the histogram arrays to 0 when needed (uses np.empty instead of np.zeros)
  • re-uses the allocated memory of other histograms when possible within the same grower (see the sketch below)
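
To make the reuse concrete, here is a minimal sketch of a per-grower free list handing out np.empty buffers. All names are illustrative and the plain float dtype is an assumption (the real histograms are structured arrays with sum_gradients / sum_hessians / count fields), so this is not the PR's actual code:

```python
import numpy as np

class HistogramPool:
    """Per-grower free list of histogram buffers (illustrative only)."""

    def __init__(self, n_features, n_bins):
        self.shape = (n_features, n_bins)
        self._free = []  # buffers of nodes that have already been split

    def get(self):
        # Reuse a previously allocated buffer when possible; otherwise
        # allocate uninitialized memory (np.empty, not np.zeros).
        return self._free.pop() if self._free else np.empty(self.shape)

    def release(self, histograms):
        # Give a finalized node's buffer back to the pool for later reuse.
        self._free.append(histograms)
```

In this sketch a grower would call get() whenever it needs histograms for a new node and release() once that node has been split, so fresh allocations only happen when the free list is empty.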

ping @ogrisel

@ogrisel
Member

ogrisel commented Jul 17, 2019

I ran some benchmarks on a machine with 2x22 physical cores, that is 88 hyperthreads. I benchmarked using the Higgs boson benchmark script with 100 trees and 255 leaf nodes (and the default learning rate, which is probably sub-optimal from an accuracy point of view, but accuracy is not in the scope of this benchmark).

Here are the results:

## scikit-learn master

88 threads:  70s
44 threads:  46s
22 threads:  48s
11 threads:  62s
5  threads: 110s
1  thread:  303s


## scikit-learn histogram_better_init

88 threads:  77s
44 threads:  51s
22 threads:  49s
11 threads:  68s
5  threads: 105s
1  thread:  300s


## lightgbm

88 threads:  45s
44 threads:  35s
22 threads:  39s
11 threads:  51s
5  threads:  89s
1  thread:  302s

So in conclusion:

  • hyperthreading is always detrimental (both for sklearn and lightgbm). Limiting to the number of physical cores is best.

  • Unfortunately, this PR does not improve things with large numbers of threads, quite the opposite.

  • For small numbers of threads, this PR seems to do slightly better, although I am not 100% sure this is significant.

  • LightGBM is still the fastest, although it is far from linearly scalable beyond 10 threads.

  • I wrongly remembered that the scikit-learn scalability profile was worse than what I measured today.

  • I have not tried to mess around with NUMA and thread affinity. I let the OS balance the threads the way it wants.

  • I have not tried to measure the impact of this PR on memory usage.

@ogrisel
Member

ogrisel commented Jul 17, 2019

Here is the command line I used:

$ OMP_NUM_THREADS=22 python benchmarks/bench_hist_gradient_boosting_higgsboson.py \
    --n-leaf-nodes 255 --n-trees 100 --lightgbm

@amueller
Member

I assume you ran it once? Running it multiple times would obviously be ideal ;)

@NicolasHug
Member Author

Thanks a lot for the benchmark.

This PR addresses this part in particular:

Scikit-learn’s code spends too much time in np.zeros() functions

I removed every np.zeros call related to the histograms.


I don't believe this is strongly coupled with thread scalability. I don't see how histogram initialization / allocation can be responsible for such an increase in computation time with a high number of threads.

I've mentioned this before: Cython generates a lot more OpenMP variables (lastprivate, etc.) than it needs to. I would not be surprised if our scalability issue is due to Cython not producing optimal code. I think this should be our main lead here.

@ogrisel
Member

ogrisel commented Jul 18, 2019

I assume you ran it once? Running it multiple times would obviously be ideal ;)

I ran some of them several times. In general I would expect the std to be around 1-2 s. So the decrease in performance for large numbers of threads is (slightly) significant. The increase in performance for low numbers of threads is probably not significant. LightGBM is significantly faster for large numbers of threads.

@adrinjalali
Member

I removed every np.zeros call related to the histograms.

If moving away from np.zeros is not fixing any issue, I would rather have initialized memory than a random, uninitialized part of memory. Wouldn't you agree?

@NicolasHug
Member Author

The memory is always going to be initialized at some point @adrinjalali

@adrinjalali
Member

The memory is always going to be initialized at some point @adrinjalali

I would really rather have memory allocation and initialization in the same place. If it happens later, it's being set, not initialized.

@NicolasHug
Member Author

I'm with you on this in the general case, except that:

  • np.zeros initializes the arrays sequentially, while we can do it in parallel, which should be faster
  • the arrays need not be initialized at all when we use the subtraction trick (see the sketch below). If not initializing them is faster, why initialize?
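
A minimal numpy sketch of the second point (toy shapes and a plain float dtype, not the actual structured histograms): with the subtraction trick, every bin of the sibling histogram is overwritten, so an uninitialized np.empty buffer is enough.

```python
import numpy as np

# Toy shapes; the real histograms are structured arrays, not plain floats.
rng = np.random.default_rng(0)
n_features, n_bins = 4, 256
parent = rng.random((n_features, n_bins))
left_child = rng.random((n_features, n_bins))

right_child = np.empty((n_features, n_bins))      # no zeroing needed
np.subtract(parent, left_child, out=right_child)  # every bin is overwritten
```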

In any case, both these points are invalidated by Olivier's benchmark showing the code is, for some reason that I really cannot explain, slower.

So this PR is probably never going to get merged anyway.

@ogrisel
Member

ogrisel commented Jul 19, 2019

Olivier's benchmark showing the code is, for some reason that I really cannot explain, slower.

It is not slower in sequential mode but it tends to scale slightly worse. Maybe because of the additional OpenMP bookkeeping instructions generated by Cython, as you suggested (although I have not checked myself).

@NicolasHug
Member Author

NicolasHug commented Feb 23, 2020

Looks like this isn't a relevant addition, closing

@NicolasHug NicolasHug reopened this Sep 5, 2020
@NicolasHug NicolasHug changed the title [MRG] GBDT: Better histogram initialization routines [MRG] GBDT: Reuse allocated memory of other histograms Sep 5, 2020
@NicolasHug
Member Author

Re-purposing this as a simpler alternative to #18242. This PR would not suffer from the issue detailed in #18242 (comment), as the "pool" is not cross-grower but intra-grower.

I observe behaviour similar to what is detailed in #18242 (comment): a tiny bit faster, but a little bit more memory. I don't understand why.
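
For context, a toy sketch of what the intra-grower scope means (all names are illustrative, not the PR's actual code): each grower keeps its own free list, so reuse never crosses tree boundaries and the buffers become collectable as soon as the grower is done.

```python
import numpy as np

def grow_one_tree(n_nodes=7, n_features=4, n_bins=256):
    free = []  # intra-grower free list: lives and dies with this grower
    for _ in range(n_nodes):
        hist = free.pop() if free else np.empty((n_features, n_bins))
        hist[:] = 0.0        # or filled via the subtraction trick
        free.append(hist)    # node finalized: buffer available for reuse
    # the free list goes out of scope here, together with the grower

for _ in range(3):           # one grower per boosted tree
    grow_one_tree()
```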

@thomasjpfan, would you mind trying this version on macOS?

CC @ogrisel @lorentzenchr

@lorentzenchr
Member

I ran the benchmark on my MacBook.

  • OMP_NUM_THREADS=1
    • master with ENH Break cyclic references in Histogram GBRT #18334 merged
      Fit 100 trees in 21.487 s, (12700 total leaves)
      Time spent computing histograms: 9.742s
      Time spent finding best splits:  8.849s
      Time spent applying splits:      0.448s
      Time spent predicting:           0.010s
      266.31, 111.96 MB
      
    • this PR
      Fit 100 trees in 20.063 s, (12700 total leaves)
      Time spent computing histograms: 8.498s
      Time spent finding best splits:  8.883s
      Time spent applying splits:      0.489s
      Time spent predicting:           0.009s
      270.54, 115.82 MB
      
  • OMP_NUM_THREADS=4
    • master with ENH Break cyclic references in Histogram GBRT #18334 merged
      Fit 100 trees in 9.165 s, (12700 total leaves)
      Time spent computing histograms: 4.710s
      Time spent finding best splits:  2.327s
      Time spent applying splits:      0.400s
      Time spent predicting:           0.009s
      266.90, 112.63 MB
      
    • this PR
      Fit 100 trees in 8.180 s, (12700 total leaves)
      Time spent computing histograms: 3.860s
      Time spent finding best splits:  2.285s
      Time spent applying splits:      0.406s
      Time spent predicting:           0.009s
      260.19, 105.61 MB
      

System:

  • Python 3.7.2
  • macOS Catalina
  • Intel i7 @ 2.70GHz (8th generation)

@lorentzenchr
Member

Memory usage seems more or less the same (within the variance over several runs). The speed improvement in the parallel setting seems to be smaller than with #18242.

@ogrisel ogrisel left a comment
Member

This is indeed simpler than the #18242 solution. I ran benchmarks again with this branch vs an up-to-date master and it seems that I can notice a systematic small performance improvement (5% to 20%) on several machines with various numbers of threads (2 to 44 on a 44-physical-core host).

Please add a what's new entry. I am in favor of merging this solution over #18242.

@thomasjpfan can you please confirm that this solution brings a memory efficiency advantage similar to that of #18242 on your macOS machine?

@ogrisel
Member

ogrisel commented Sep 7, 2020

I also ran the same benchmark with #18242 and I do not find it faster than this PR (maybe even the opposite, but close to noise level). The difference with #18242 is that this PR does more allocations (less reuse) because the histogram builder is not reused across growers while the pool of #18242 is. However, the complexity of the release method of #18242 is a bit higher and this method is called quite often.

@thomasjpfan
Member

@thomasjpfan can you please confirm that this solution brings a memory efficiency advantage similar to that of #18242 on your macOS machine?

I think the results on my macOS machine are off. I ran the benchmark on Azure with the macOS instance and here are the results:

This PR - azure link

Fit 100 trees in 19.672 s, (12700 total leaves)
Time spent computing histograms: 8.825s
Time spent finding best splits:  6.900s
Time spent applying splits:      0.934s
Time spent predicting:           0.016s
257.50, 115.83 MB

master - azure link

Fit 100 trees in 16.726 s, (12700 total leaves)
Time spent computing histograms: 7.558s
Time spent finding best splits:  5.755s
Time spent applying splits:      0.690s
Time spent predicting:           0.014s
253.41, 111.83 MB

PR #18242 - azure link

Fit 100 trees in 14.552 s, (12700 total leaves)
Time spent computing histograms: 5.819s
Time spent finding best splits:  5.971s
Time spent applying splits:      0.763s
Time spent predicting:           0.013s
348.48, 207.54 MB

Given this and @lorentzenchr's results with memory, I think this PR is okay in terms of memory.

(I would not take the timings on azure too seriously because the CPU resource allocation may change between runs)

@ogrisel
Member

ogrisel commented Sep 8, 2020

I run my own benchmarks several times each because I noticed that from time to time I could observe 20%+ slower outliers for any of the configurations.

@NicolasHug
Member Author

NicolasHug commented Sep 8, 2020

I run my own benchmarks several times each because I noticed that from time to time I could observe 20%+ slower outliers for any of the configurations.

Considering that, plus the smallish speed improvement in general, the slight increase in memory usage, and @thomasjpfan's benchmark results seeming to be off (so this isn't fixing anything anyway), we might want to consider merging neither #18242 nor this PR?

@cmarmo cmarmo added the Needs Decision Requires decision label Sep 17, 2020
@ogrisel
Member

ogrisel commented Nov 4, 2020

Considering that, plus the smallish speed improvement in general, the slight increase in memory usage, and @thomasjpfan's benchmark results seeming to be off (so this isn't fixing anything anyway), we might want to consider merging neither #18242 nor this PR?

On a big enough machine I found that this PR seemed to improve speed a bit over master and #18242: #14392 (review)

In my opinion #18242 is too complex compared to the benefit (if any) over master. But as this PR is much simpler, I think it deserves more benchmarking with repeated measurements to get precise error bars, on various numbers of threads and maybe different datasets (e.g. different numbers of features / samples / trees).
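
One possible way to get those error bars, sketched under the assumption that the repetition logic lives outside the benchmark script (the script path and flags are the ones used earlier in this thread; the thread counts and repeat count are arbitrary):

```python
import os
import statistics
import subprocess
import time

cmd = ["python", "benchmarks/bench_hist_gradient_boosting_higgsboson.py",
       "--n-leaf-nodes", "255", "--n-trees", "100"]

for n_threads in (1, 4, 22, 44):
    env = {**os.environ, "OMP_NUM_THREADS": str(n_threads)}
    durations = []
    for _ in range(5):  # repeat each configuration to estimate the spread
        tic = time.perf_counter()
        subprocess.run(cmd, env=env, check=True)
        durations.append(time.perf_counter() - tic)
    print(f"{n_threads:>2} threads: {statistics.mean(durations):.1f}s "
          f"+/- {statistics.stdev(durations):.1f}s")
```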

Base automatically changed from master to main January 22, 2021 10:51
@NicolasHug NicolasHug closed this Jul 22, 2022