
HistGradientBoosting memory improvement #18163


Closed
wants to merge 6 commits

Conversation

@amueller
Member

amueller commented Aug 14, 2020

Starting to investigate #18152.

This change reduces the memory usage for the example provided there by not keeping around histograms for inner nodes and leaves. That reduces the memory by a factor of about 4.
Honestly not sure how to get around storing the histograms for the splittable active nodes, or if there's another issue/trick we're overlooking.
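
For illustration, here's a minimal, self-contained toy sketch of the idea behind this change (plain NumPy, not the actual grower code or its attribute names): once a node's children have their histograms, the parent's histogram can be dropped, and finalized leaves never need to keep theirs.

import numpy as np

n_features, n_bins = 400, 256  # same shape as the histograms discussed below

# Toy stand-ins for the per-node histograms built by the grower.
rng = np.random.RandomState(0)
parent_hist = rng.rand(n_features, n_bins)
left_hist = 0.5 * parent_hist  # pretend half the samples went to the left child

# The sibling's histogram is obtained by subtraction (the usual GBDT trick).
right_hist = parent_hist - left_hist

# The parent is now an inner node: release its histogram instead of keeping it
# alive for the rest of the grower's lifetime.
parent_hist = None

# Likewise, once a child is finalized as a leaf its histogram can be dropped.
left_hist = None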

@NicolasHug
Member

Honestly not sure how to get around storing the histograms for the splittable active nodes, or if there's another issue/trick we're overlooking.

LightGBM uses an LRU cache for storing histograms.

Can you try del-ing the entire grower once it's used, instead of deleting the histograms? I'm not certain yet that the memory issue comes from the histograms. It might instead come from the arrays allocated in the Splitter or other stuff.
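
As a point of reference, here is a toy sketch of the LRU histogram cache idea mentioned above. This is not LightGBM's actual implementation (which recycles fixed-size buffers in C++); it only illustrates the concept of bounding how many node histograms stay alive at once.

from collections import OrderedDict

import numpy as np


class HistogramPool:
    """Toy LRU pool keeping at most `capacity` node histograms alive."""

    def __init__(self, capacity, n_features, n_bins):
        self.capacity = capacity
        self.shape = (n_features, n_bins)
        self._cache = OrderedDict()  # node_id -> histogram array

    def get(self, node_id):
        hist = self._cache.get(node_id)
        if hist is not None:
            self._cache.move_to_end(node_id)  # mark as most recently used
        return hist

    def put(self, node_id, hist):
        self._cache[node_id] = hist
        self._cache.move_to_end(node_id)
        while len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used


pool = HistogramPool(capacity=8, n_features=400, n_bins=256)
pool.put(0, np.zeros((400, 256)))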

@amueller
Member Author

amueller commented Aug 17, 2020

Given that this reduces the memory footprint significantly (I claimed 4x but I guess 3x makes more sense) I'm pretty sure that nearly all of the memory consumption comes from the histograms. Just deleting the histograms on the inner nodes reduced the memory consumption by half.

@NicolasHug
Member

I'm still curious about what freeing the grower will do. There might be a bigger memory issue than just histograms. 1657 / 4 = 414 while LightGBM uses 95 according to the original issue.

(Note that get_equivalent_estimator should be used for a more accurate comparison between implementations.)
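
For concreteness, this is roughly how the comparison could be set up. get_equivalent_estimator is a private helper used by the project's benchmarks, so the import path and signature below reflect its use there and may change; lightgbm needs to be installed.

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble._hist_gradient_boosting.utils import get_equivalent_estimator

est = HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=127)
# Build a LightGBM estimator with matching hyperparameters, so that memory and
# timing numbers are compared under equivalent settings.
lgbm_est = get_equivalent_estimator(est, lib='lightgbm')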

@amueller
Member Author

Not sure what you mean by freeing the grower. The grower isn't stored across loop iterations, right? I can call del in the loop but given that there's no reference, it was being collected anyway and so there is no change from what I can tell.

So I ran the n_samples=10k, n_features=400 case and memory went from ~6 GB to ~1.5 GB, so I think the histograms are a major factor.
However, the measurement fluctuates between 1 GB and 2 GB.
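
For reproducibility, here is one way such a peak-memory number can be measured, assuming memory_profiler is installed; the exact dataset parameters are a guess based on the shapes quoted in this thread.

from memory_profiler import memory_usage
from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=10_000, n_features=400, n_classes=10,
                           n_informative=10)
est = HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=127)

# Peak memory (in MiB) of the fit call; run-to-run fluctuations like the ones
# reported above are expected since the result depends on when the GC runs.
peak = memory_usage((est.fit, (X, y)), max_usage=True)
print(peak)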

@amueller
Member Author

amueller commented Aug 18, 2020

Using memory_profiler in fit, all the memory is allocated in TreeGrower.grow, which calls split_next in a loop.
Profiling split_next shows that the only significant memory consumption is on this line:

   481    166.9 MiB      0.0 MiB               largest_child.histograms = \
   482    166.9 MiB      1.9 MiB                   self.histogram_builder.compute_histograms_subtraction(
   483    165.0 MiB      0.0 MiB                       node.histograms, smallest_child.histograms)

Which is weird. Why is this histogram bigger than the other one? But I think one of them might get deleted.
The size of a histogram is (400, 256) and its nbytes is 2048000, so that checks out to about 2 MB per node.
But for 127 leaves that should result in about 254 MB, not 6 GB... hm

It seems that the memory consumption rises with the number of iterations, even if I delete the grower explicitly. The predictors are really small, and I'm not sure what else could get stored.
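
The ~2 MB-per-node figure can be double-checked directly against the histogram dtype (note the import below is from a private module, so the path may change):

import numpy as np
from sklearn.ensemble._hist_gradient_boosting.common import HISTOGRAM_DTYPE

# Each (feature, bin) entry stores sum_gradients, sum_hessians and count.
hist = np.zeros(shape=(400, 256), dtype=HISTOGRAM_DTYPE)
print(HISTOGRAM_DTYPE.itemsize)  # 20 bytes per entry
print(hist.nbytes)               # 400 * 256 * 20 = 2048000, i.e. ~2 MB per node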

@amueller
Member Author

amueller commented Aug 18, 2020

Hm, I think explicitly calling gc.collect() fixed it, down to 158.02 MB.

@NicolasHug
Member

I can call del in the loop but given that there's no reference, it was being collected anyway and so there is no change from what I can tell

Ok thanks. Indeed it should properly be garbage collected if all goes well. I just wanted to make sure, I've seen issues with Numba and GC before. Not with Cython, but the GBDT code has hit quite a few Cython edge cases already, so who knows.

the only significant memory consumption is on this line

Indeed that's weird. The histogram built in compute_histograms_subtraction is just the same size as the one built in compute_histograms_brute. The allocation is exactly the same:

            hist_struct [:, ::1] histograms = np.zeros(
                shape=(self.n_features, self.n_bins),
                dtype=HISTOGRAM_DTYPE
            )

What exactly is the "memory consumption" being reported here? Maybe it also counts the arguments and thus counts two histograms in total?

But for 127 leaves that should result in about 254 MB, not 6 GB... hm

127 leaves ~= 250 nodes, so 500 MB. Times 10 classes for MNIST and we get 5 GB, if the growers aren't properly GCed... :s ?
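
Spelling out that back-of-the-envelope estimate with the numbers quoted above (indicative only):

bytes_per_hist = 400 * 256 * 20  # ~2 MB per node, from the nbytes check above
n_nodes = 2 * 127 - 1            # 127 leaves => ~253 nodes in a binary tree
n_trees_per_iter = 10            # one tree per class for MNIST

total = bytes_per_hist * n_nodes * n_trees_per_iter
print(total / 1e9)               # ~5.2 GB if none of these histograms is freed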

@amueller
Member Author

So we need to check whether doing explicit garbage collection makes things much slower. Do you have benchmarks you can run?

@NicolasHug
Member

Yes, we'll need to benchmark. There's benchmarks/bench_hist_gradient_boosting_higgsboson.py.

I'm wondering whether we really want to start tampering with Python's GC, though. Shouldn't we let Python do its thing?

@amueller
Member Author

Well if it crashes on MNIST because of memory issues, then we probably need to change something.
We might be spending a lot of time allocating and deallocating the histograms if we do the manual collection. Though I'm not entirely sure how much impact it actually has.

@NicolasHug
Member

OK. CC @ogrisel I think you'd be interested in this.

I'll try and run some benchmarks soonish

@thomasjpfan
Member

thomasjpfan commented Aug 18, 2020

Random thought: can we place all the resources allocated in one grower in a pool and pass them along to the next grower? To be specific, the nodes and maybe the split_infos.

@amueller
Member Author

Pretty sure it's the histograms that kill it. But yes, we could probably recycle the grower?

@NicolasHug
Member

I think that's what LightGBM does with its LRU cache/histogram pool, yes

Also related is ogrisel/pygbm#88, but we didn't see a significant improvement at the time.

@amueller
Member Author

There might be a difference between Cython and Numba, though...

@amueller
Member Author

Trying to run benchmarks locally, I get differences of 2x between different runs (probably because I'm in a conference call lol), so it's hard to see any difference between gc and no gc.

@NicolasHug
Member

Actually, benchmarks/bench_hist_gradient_boosting_higgsboson.py might be a bad idea because it's a binary problem, and the issue might be more prominent in multiclass settings like MNIST. benchmarks/bench_hist_gradient_boosting.py might be more appropriate; running it now.

@NicolasHug
Member

Didn't run the whole benchmark, just this simple one:

from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=10_000, n_features=20, n_classes=10, n_informative=10)
est = HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=127)
%time est.fit(X, y)

master: 20sec
this PR: 1min
90e949a (del-ing the histograms, not the entire grower): 20sec

So calling gc.collect seems to have quite a significant impact unfortunately.

@amueller
Member Author

Hm, were these times repeatable? I can also try locally again with that snippet, or do a more proper benchmark.

@NicolasHug
Member

Yes, I ran each about 5 times. I also observed a similar drop in perf with just 2 classes.

@amueller
Member Author

It also timed out on OS X, which is not a good sign. Just doing the del should be fine in terms of runtime, right? But it'll probably not have that big an effect. We could also try to directly del the histograms in the grower instead of del-ing the whole grower.

@amueller
Member Author

amueller commented Aug 21, 2020

Can you benchmark this version? It's a bit of a band-aid, but if it gives significant memory improvements without a slowdown, we could use it until we figure out whether a pool of histograms is worth it.

@NicolasHug
Member

Not right now unfortunately (currently benchmarking stuff related to 'friedman_mse')

It'll be faster if you just run it yourself; it's just a few lines ;) #18163 (comment)

@amueller
Member Author

amueller commented Aug 21, 2020

I get
this PR:

CPU times: user 16min 42s, sys: 14 s, total: 16min 56s
Wall time: 1min 9s

master

CPU times: user 16min 42s, sys: 14 s, total: 16min 56s
Wall time: 1min 9s

I know you got a new gaming rig, but that seems like a huge discrepancy. (I triple-checked that I got the right master, but I could still have messed it up somehow.)

@NicolasHug
Member

that seems like a huge discrepancy

What discrepancy are you referring to? From my benchmark above, it seems that gc.collect() was causing the slowdown. You're not calling it anymore, so that doesn't look contradictory to me?

I know you got a new gaming rig

with an AMD CPU lol 😭

@amueller
Member Author

@NicolasHug well, the master branch is 20s on your machine and 1:10 on mine (with an 8-core i9).

@glemaitre
Member

@NicolasHug well, the master branch is 20s on your machine and 1:10 on mine (with an 8-core i9).

Is there any MKL inside the loop? :)

@NicolasHug
Member

Hm, no, there's no MKL.

@amueller I can confirm that I do observe a significant difference between master and 1177bf0, and that master runs in <20 seconds on 2 different machines of mine (an old i5 and a Ryzen 7). So I'm not sure what's going on. Did you really use the snippet in #18163 (comment)?

In any case, I think I'd be more comfortable with implementing a pool, as e.g. in #18242, rather than manually del-ing stuff.

@amueller
Member Author

amueller commented Aug 24, 2020

Hm that is quite strange... But yes, using the pool seems better.

Still, that timing seems really weird...

What was wall time vs CPU time?

@ogrisel
Member

ogrisel commented Sep 2, 2020

@amueller I can confirm that I do observe a significant difference between master and 1177bf0, and that master runs in <20 seconds on 2 different machines of mine (an old i5 and a Ryzen 7). So I'm not sure what's going on. Did you really use the snippet in #18163 (comment)?

How many cores / hyperthreads do those machines have? Maybe there is a bad interaction between memory access and threading for one variant and not the other? If those machines have many threads (e.g. 8 or 16 or more), maybe this is related to #16016, as this benchmark has a small number of samples.

Try to reduce the number of threads with OMP_NUM_THREADS=4 or 2 or even 1 to check whether threading is causing this unexpected perf degradation.
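
For example (the variable has to be set before the OpenMP runtime is initialized, i.e. before scikit-learn and its compiled extensions are imported):

import os
os.environ["OMP_NUM_THREADS"] = "2"  # try 4, 2 or even 1, as suggested above

# Imports must come after setting the variable for it to take effect.
from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=10_000, n_features=20, n_classes=10,
                           n_informative=10)
HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=127).fit(X, y)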

@ogrisel
Member

ogrisel commented Sep 2, 2020

But maybe it's not worth debugging the GC behavior too much, as the solution implemented in #18242 does not depend on the GC anymore and therefore feels more robust. I will review #18242 in more detail. I think we can close this PR.

@ogrisel
Member

ogrisel commented Sep 3, 2020

But maybe it's not worth debugging the GC behavior too much, as the solution implemented in #18242 does not depend on the GC anymore and therefore feels more robust. I will review #18242 in more detail. I think we can close this PR.

I changed my mind and debugged the original GC problem: it was caused by (useless) cyclic refs, as expected. However, the minimal fix in #18334 still does not seem to be as reliable as #18242.
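
As a minimal, self-contained illustration of that failure mode (a toy Node class, not the actual grower): objects caught in a reference cycle are not freed by reference counting when they go out of scope, only by a full GC pass, which is how large buffers could pile up across boosting iterations.

import gc
import weakref


class Node:
    """Toy stand-in for a grower node holding a large buffer."""

    def __init__(self):
        self.big = bytearray(10 * 1024 * 1024)  # ~10 MB payload


def build_cycle():
    parent, child = Node(), Node()
    parent.child, child.parent = child, parent  # parent <-> child reference cycle
    return weakref.ref(parent)


gc.disable()          # make the behaviour deterministic for the demo
ref = build_cycle()
print(ref() is None)  # False: the cycle keeps both nodes (~20 MB) alive
gc.collect()
print(ref() is None)  # True: only a collection pass reclaims cyclic garbage
gc.enable()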

@ogrisel
Member

ogrisel commented Sep 4, 2020

#18334 fixed the root cause of the issue (the reference cycles that prevented the GC).

Some of the suggested changes (early release of histograms in leaf and internal nodes) have been integrated in #18334, which was merged. Closing.

@ogrisel ogrisel closed this Sep 4, 2020