ENH Greatly reduces memory usage of histogram gradient boosting #18242
Conversation
Nice work!
Wow! This is more than a 10x memory usage improvement.
Having results from the other benchmark scripts would also be nice.
@thomasjpfan I am surprised by the benchmark times: all the individual processing steps are slower (even 4x slower) in master, yet the overall execution time of the two PRs is the same?
Thanks @thomasjpfan
Could you please:
- add tests for the `HistogramsPool` class, in particular checks for the `used_indices` and `available_indices` sets (a minimal sketch of such a pool and test follows this list);
- run the higgs boson benchmark for both time and memory usage (with an appropriate number of repetitions where relevant, please)?
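A minimal sketch of what such a pool and its test might look like, following the names used in this discussion (`HistogramsPool`, `used_indices`, `available_indices`); the method names and internals are assumptions, not the PR's actual code:

```python
import numpy as np


class HistogramsPool:
    """Recycle per-node histogram arrays instead of reallocating them (sketch)."""

    def __init__(self, n_features, n_bins):
        self.shape = (n_features, n_bins)
        self.pool = []                  # every histogram array ever allocated
        self.used_indices = set()       # slots currently handed out to tree nodes
        self.available_indices = set()  # slots released and ready for reuse

    def get(self):
        """Return (index, histograms), reusing a released slot when possible."""
        if self.available_indices:
            idx = self.available_indices.pop()
            self.pool[idx][:] = 0       # reset the recycled buffer
        else:
            idx = len(self.pool)
            self.pool.append(np.zeros(self.shape, dtype=np.float64))
        self.used_indices.add(idx)
        return idx, self.pool[idx]

    def release(self, idx):
        """Give a slot back to the pool once its node is done with it."""
        self.used_indices.remove(idx)
        self.available_indices.add(idx)


def test_histograms_pool_indices():
    pool = HistogramsPool(n_features=3, n_bins=4)
    i, _ = pool.get()
    assert pool.used_indices == {i}
    assert pool.available_indices == set()
    pool.release(i)
    assert pool.used_indices == set()
    assert pool.available_indices == {i}
    j, _ = pool.get()
    assert j == i  # the released slot was recycled, not reallocated
```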
I simplified some of the logic in …
I could be wrong, but it seems to me that using weakrefs is not necessary: the goal of using a weakref would be to avoid keeping a histogram object "alive" in a node if the pool dies. But the pool only dies at the very end (when …
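To make the point concrete, here is a tiny, purely illustrative demo (class and variable names invented): a weakref held by a node only matters if the pool, which owns the strong references, can die first.

```python
import weakref


class Histograms:
    """Stand-in for a per-node histogram buffer."""


pool = [Histograms()]            # the pool holds the only strong reference
node_ref = weakref.ref(pool[0])  # a node holding only a weak reference

assert node_ref() is pool[0]     # alive for as long as the pool is alive
pool.clear()                     # the pool "dies" ...
assert node_ref() is None        # ... and the node's weakref goes dead (CPython)
```

Since the pool outlives every node during `fit`, the last two lines never happen in practice, so plain strong references suffice.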
You are correct: weak references are not needed. This PR was updated to remove them, which simplifies things. I ran the higgs benchmark 5 times with n_trees=100, without the memory profiler (it would slow down the timings):

PR:
Time spent computing histograms: 21.6124 +/- 0.2307s
Time spent finding best splits: 0.1922 +/- 0.0076s
Time spent applying splits: 5.8666 +/- 0.1388s
Time spent predicting: 2.3112 +/- 0.010s
fitted in 50.126 +/- 0.660s

master:
Time spent computing histograms: 22.277 +/- 0.085s
Time spent finding best splits: 0.2422 +/- 0.0020s
Time spent applying splits: 6.1414 +/- 0.0402s
Time spent predicting: 2.3508 +/- 0.0139s
fitted in 51.581 +/- 0.1897s

For the memory usage of …
Awesome! My PR had a bit less memory, I think? Can you confirm or deny? This is certainly the cleaner solution.
Using #18163 and running the script in the opening post I get: 1040.11, 841.03 MB
I optimized this more by not needing to fill the array with zeros every time. Now the benchmark above gives:

PR: …

master: …
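The zero-filling optimisation mentioned above presumably relies on the fact that a histogram that will be written in full (for example, one computed by the sibling-subtraction trick) never needs clearing; a sketch of that idea, with invented names and sizes:

```python
import numpy as np

N_FEATURES, N_BINS = 100, 256  # invented sizes for the example


def get_histograms(free_list, fully_overwritten):
    """Fetch a buffer, zeroing it only when the compute kernel accumulates."""
    hist = free_list.pop() if free_list else np.empty((N_FEATURES, N_BINS))
    if not fully_overwritten:
        hist[:] = 0  # only needed when the kernel does `hist[f, b] += ...`
    return hist
```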
The higgs benchmark with 100 trees:

PR: …

Master: …
With …:

Fit 100 trees in 9.217 s, (12700 total leaves)
Time spent computing histograms: 3.940s
Time spent finding best splits: 2.699s
Time spent applying splits: 0.757s
Time spent predicting: 0.020s

369.68, 156.93 MB with the snippet.
I was concurrently applying this fix + addressing the remaining comments + including the cyclic-ref cleaning, which is useful anyway. Here is the final memory usage: …
I updated this PR to work on top of the recently merged #18334, which fixed the root cause of the memory efficiency issue. However, based on @thomasjpfan's tests on macOS, it seems that explicit memory management with the …

@thomasjpfan can you please confirm that you get around a ~145 MB increment with this PR when running the reproducer snippet on macOS?
I ran the benchmark on my MacBook. …

Edit: …
@lorentzenchr that's weird: you do not reproduce the macOS / Python GC underperformance that @thomasjpfan observed on his machine. Maybe it also depends on the speed of the CPU. Your CPU with 1 core seems to be much faster than mine (with 2 physical cores)!
Actually, if I set …
With a newly-updated master and this PR, I get the following result using the snippet at the top, on my laptop (4 threads): …
I observed a similar behaviour on this benchmark, which creates even more histograms. These results are somewhat consistent with those of @lorentzenchr in #18242 (comment). While the minor time difference can be explained by the reduced number of memory allocations, I find the increase in memory usage quite surprising.
Also, I want to explain exactly what this PR does, because it does have some impact on how memory usage evolves over time: let …
This difference can be observed by printing / plotting the …

This might be detrimental in cases where later trees are smaller than the first ones, which is usually what I observed empirically: in this PR, the memory usage will never decrease, even if an iteration needs less memory than a previous one. In master, it will decrease as expected. This is another thing to take into account, on top of the upcoming benchmarks on macOS from Thomas.
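In other words, the pool's footprint follows a high-water mark. A toy illustration of the difference (the per-iteration tree sizes here are invented for the example):

```python
# Toy model: memory held by a grow-only pool vs. per-iteration allocation.
nodes_per_tree = [100, 60, 30, 80, 20]  # hypothetical tree sizes

pool_size = 0
for i, n_nodes in enumerate(nodes_per_tree):
    pool_size = max(pool_size, n_nodes)  # the pool only ever grows
    print(f"iter {i}: master holds ~{n_nodes} histogram sets, "
          f"pooled version holds ~{pool_size}")
```

After the first (largest) tree, the pooled version stays at 100 histogram sets for the rest of the fit, while master drops back down on every smaller tree.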
I am on a 2.3 GHz i9-9880H 9th-generation Intel CPU on macOS Catalina with Python 3.7, and I set …

master:
Fit 100 trees in 13.369 s, (12700 total leaves)
Time spent computing histograms: 4.536s
Time spent finding best splits: 2.749s
Time spent applying splits: 0.739s
Time spent predicting: 0.016s
842.59, 648.93 MB

This PR:
Fit 100 trees in 9.309 s, (12700 total leaves)
Time spent computing histograms: 4.166s
Time spent finding best splits: 2.707s
Time spent applying splits: 0.640s
Time spent predicting: 0.016s
365.10, 151.53 MB

I am a little lost in figuring out why @lorentzenchr's results and mine differ on macOS.
I updated my benchmark runs on macOS, now with OMP_NUM_THREADS=4. In this parallel setting, I see a similar speed improvement from this PR as @thomasjpfan does, but I get quite different results for the memory usage.
I'm really confused that OMP has such a drastic effect on total time in @lorentzenchr's benchmarks, especially regarding "Time spent computing histograms", which goes from 5s to 2.5s. None of the changes involved in this PR are OMP-related. In @thomasjpfan's benchmark just above, OMP does not have a significant effect on the "Time spent computing histograms", as I would expect.
@lorentzenchr in your latest run of #18242 (comment), was your master branch up to date? Does it have the #18341 PR (parallel init) merged?
For the record, it seems that #14392 brings the same benefits (a slight speed improvement from reducing the number of malloc / free calls and the reliance on the Python GC, which is sometimes useful on macOS) while being simpler, by implementing the recycling logic in the histogram builder class itself.
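For comparison, recycling inside the builder itself could look roughly like the sketch below. This is a hypothetical reading of that design, not the actual code of #14392; the names (`HistogramBuilder`, `compute_histograms`, `recycle`) are invented.

```python
import numpy as np


class HistogramBuilder:
    """Histogram builder that recycles its own buffers across nodes (sketch)."""

    def __init__(self, n_features, n_bins):
        self.shape = (n_features, n_bins)
        self._free = []  # buffers released by finished nodes

    def compute_histograms(self, sample_indices):
        # Reuse a released buffer when available instead of allocating anew.
        hist = self._free.pop() if self._free else np.empty(self.shape)
        hist[:] = 0
        # ... accumulate gradients/hessians of sample_indices into hist ...
        return hist

    def recycle(self, hist):
        """Take a buffer back once the node that used it is split or finalized."""
        self._free.append(hist)
```

The trade-off versus a standalone pool class is mainly about where the bookkeeping lives, not what gets recycled.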
@ogrisel I updated my run of #18242 (comment); nothing really changed, except that this PR is now even faster when single-threaded (before 18s, now 16s).
This PR might need a small merge / clean-up, but it's still "competing" with #14392 and we haven't really decided what to do (see the discussions there; basically we need more benchmarks). So I'd keep it open until then.
Reference Issues/PRs
Resolves #18152
Closes #18163
What does this implement/fix? Explain your changes.
Uses a histogram pool to improve the memory usage of histogram gradient boosting. When running this script (the script itself did not survive extraction; a hypothetical stand-in is sketched below):

I get:

This PR: …

Master: …
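Since the original reproducer is missing from this extract, here is a plausible stand-in along the lines described in #18152: random data, a histogram-GBDT fit, and a peak/increment memory measurement with memory_profiler. The data shape, estimator choice, and parameters are all invented, and the "peak, increment" reading of the MB pairs above is an assumption.

```python
import numpy as np
from memory_profiler import memory_usage  # pip install memory_profiler
# Required for the scikit-learn version of the PR era (< 1.0):
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(100_000, 100))
y = rng.normal(size=100_000)


def fit():
    HistGradientBoostingRegressor(max_iter=100).fit(X, y)


# Sample memory every 10 ms while fitting, then report peak and increment.
usage = memory_usage((fit, (), {}), interval=0.01)
print(f"{max(usage):.2f}, {max(usage) - usage[0]:.2f} MB")
```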