
HistGradientBoosting memory improvement #18163


Closed
wants to merge 6 commits

Conversation

@amueller
Member

amueller commented Aug 14, 2020

Starting to investigate #18152.

This change reduces the memory usage for the example provided there by not keeping around histograms for inner nodes and leaves. That reduces the memory by a factor of about 4.
Honestly not sure how to get around storing the histograms for the splittable active nodes, or if there's another issue/trick we're overlooking.
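
For illustration, here's a minimal, self-contained toy sketch of the idea behind this change (plain NumPy, not the actual grower code or its attribute names): once a node's children have their histograms, the parent's histogram can be dropped, and finalized leaves never need to keep theirs.

import numpy as np

n_features, n_bins = 400, 256  # same shape as the histograms discussed below

# Toy stand-ins for the per-node histograms built by the grower.
rng = np.random.RandomState(0)
parent_hist = rng.rand(n_features, n_bins)
left_hist = 0.5 * parent_hist  # pretend half the samples went to the left child

# The sibling's histogram is obtained by subtraction (the usual GBDT trick).
right_hist = parent_hist - left_hist

# The parent is now an inner node: release its histogram instead of keeping it
# alive for the rest of the grower's lifetime.
parent_hist = None

# Likewise, once a child is finalized as a leaf its histogram can be dropped.
left_hist = None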

@NicolasHug
Member

Honestly not sure how to get around storing the histograms for the splittable active nodes, or if there's another issue/trick we're overlooking.

LightGBM uses an LRU cache for storing histograms.

Can you try del-ing the entire grower once it's used, instead of deleting the histograms? I'm not certain yet that the memory issue comes from the histograms. It might instead come from the arrays allocated in the Splitter or other stuff.
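
As a point of reference, here is a toy sketch of the LRU histogram cache idea mentioned above. This is not LightGBM's actual implementation (which recycles fixed-size buffers in C++); it only illustrates the concept of bounding how many node histograms stay alive at once.

from collections import OrderedDict

import numpy as np


class HistogramPool:
    """Toy LRU pool keeping at most `capacity` node histograms alive."""

    def __init__(self, capacity, n_features, n_bins):
        self.capacity = capacity
        self.shape = (n_features, n_bins)
        self._cache = OrderedDict()  # node_id -> histogram array

    def get(self, node_id):
        hist = self._cache.get(node_id)
        if hist is not None:
            self._cache.move_to_end(node_id)  # mark as most recently used
        return hist

    def put(self, node_id, hist):
        self._cache[node_id] = hist
        self._cache.move_to_end(node_id)
        while len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used


pool = HistogramPool(capacity=8, n_features=400, n_bins=256)
pool.put(0, np.zeros((400, 256)))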

@amueller
Member Author

amueller commented Aug 17, 2020

Given that this reduces the memory footprint significantly (I claimed 4x but I guess 3x makes more sense) I'm pretty sure that nearly all of the memory consumption comes from the histograms. Just deleting the histograms on the inner nodes reduced the memory consumption by half.

@NicolasHug
Member

I'm still curious about what freeing the grower will do. There might be a bigger memory issue than just histograms. 1657 / 4 = 414 while LightGBM uses 95 according to the original issue.

(Note that get_equivalent_estimator should be used for a more accurate comparison between implementations.)
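
For concreteness, this is roughly how the comparison could be set up. get_equivalent_estimator is a private helper used by the project's benchmarks, so the import path and signature below reflect its use there and may change; lightgbm needs to be installed.

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble._hist_gradient_boosting.utils import get_equivalent_estimator

est = HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=127)
# Build a LightGBM estimator with matching hyperparameters, so that memory and
# timing numbers are compared under equivalent settings.
lgbm_est = get_equivalent_estimator(est, lib='lightgbm')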

@amueller
Member Author

Not sure what you mean by freeing the grower. The grower isn't stored across loop iterations, right? I can call del in the loop but given that there's no reference, it was being collected anyway and so there is no change from what I can tell.

So I ran the n_samples=10k, n_features=400 case and memory went from ~6 GB to ~1.5 GB, so I think the histograms are a major factor.
However, the measurement fluctuates between 1 GB and 2 GB.
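
For reproducibility, here is one way such a peak-memory number can be measured, assuming memory_profiler is installed; the exact dataset parameters are a guess based on the shapes quoted in this thread.

from memory_profiler import memory_usage
from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=10_000, n_features=400, n_classes=10,
                           n_informative=10)
est = HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=127)

# Peak memory (in MiB) of the fit call; run-to-run fluctuations like the ones
# reported above are expected since the result depends on when the GC runs.
peak = memory_usage((est.fit, (X, y)), max_usage=True)
print(peak)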

@amueller
Member Author

amueller commented Aug 18, 2020

Using memory_profiler in fit, all the memory is allocated in TreeGrower.grow, which calls split_next in a loop.
Profiling split_next shows that the only significant memory consumption is on this line:

   481    166.9 MiB      0.0 MiB               largest_child.histograms = \
   482    166.9 MiB      1.9 MiB                   self.histogram_builder.compute_histograms_subtraction(
   483    165.0 MiB      0.0 MiB                       node.histograms, smallest_child.histograms)

Which is weird. Why is this histogram bigger than the other one? But I think one of them might get deleted.
The size of a histogram is (400, 256) and its nbytes is 2048000, so that checks out to about 2 MB per node.
But for 127 leaves that should result in about 254 MB, not 6 GB... hm

It seems that the memory consumption rises with the number of iterations, even if I delete the grower explicitly. The predictors are really small, and I'm not sure what else could get stored.
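
The ~2 MB-per-node figure can be double-checked directly against the histogram dtype (note the import below is from a private module, so the path may change):

import numpy as np
from sklearn.ensemble._hist_gradient_boosting.common import HISTOGRAM_DTYPE

# Each (feature, bin) entry stores sum_gradients, sum_hessians and count.
hist = np.zeros(shape=(400, 256), dtype=HISTOGRAM_DTYPE)
print(HISTOGRAM_DTYPE.itemsize)  # 20 bytes per entry
print(hist.nbytes)               # 400 * 256 * 20 = 2048000, i.e. ~2 MB per node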

@amueller
Member Author

amueller commented Aug 18, 2020

Hm, I think explicitly calling gc.collect() fixed it, down to 158.02 MB.

@NicolasHug
Member

I can call del in the loop but given that there's no reference, it was being collected anyway and so there is no change from what I can tell

Ok thanks. Indeed it should properly be garbage collected if all goes well. I just wanted to make sure, I've seen issues with Numba and GC before. Not with Cython, but the GBDT code has hit quite a few Cython edge cases already, so who knows.

the only significant memory consumption is on this line

Indeed that's weird. The histogram built in compute_histograms_subtraction is just the same size as the one built in compute_histograms_brute. The allocation is exactly the same:

            hist_struct [:, ::1] histograms = np.zeros(
                shape=(self.n_features, self.n_bins),
                dtype=HISTOGRAM_DTYPE
            )

What exactly is the "memory consumption" being reported here? Maybe it also counts the arguments and thus counts two histograms in total?

But for 127 leaves that should result in about 254 MB, not 6 GB... hm

127 leaves ~= 250 nodes, so 500 MB. Times 10 classes for MNIST and we get 5 GB, if the growers aren't properly GCed... :s ?
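
Spelling out that back-of-the-envelope estimate with the numbers quoted above (indicative only):

bytes_per_hist = 400 * 256 * 20  # ~2 MB per node, from the nbytes check above
n_nodes = 2 * 127 - 1            # 127 leaves => ~253 nodes in a binary tree
n_trees_per_iter = 10            # one tree per class for MNIST

total = bytes_per_hist * n_nodes * n_trees_per_iter
print(total / 1e9)               # ~5.2 GB if none of these histograms is freed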

@amueller
Member Author

So we need to check whether doing explicit garbage collection makes things much slower. Do you have benchmarks you can run?

@NicolasHug
Member

Yes, we'll need to benchmark. There's benchmarks/bench_hist_gradient_boosting_higgsboson.py.

I'm wondering whether we really want to start tampering with Python's GC, though. Shouldn't we let Python do its thing?

@amueller
Member Author

Well if it crashes on MNIST because of memory issues, then we probably need to change something.
We might be spending a lot of time allocating and deallocating the histograms if we do the manual collection. Though I'm not entirely sure how much impact it actually has.

@NicolasHug
Member

OK. CC @ogrisel I think you'd be interested in this.

I'll try and run some benchmarks soonish

@thomasjpfan
Member

thomasjpfan commented Aug 18, 2020

Random thought: can we place all the resources allocated in one grower in a pool and pass them along to the next grower? To be specific, the nodes and maybe the split_infos.

@amueller
Member Author

Pretty sure it's the histograms that kill it. But yes, we could probably recycle the grower?

@NicolasHug
Member

I think that's what LightGBM does with its LRU cache/histogram pool, yes

Also related is ogrisel/pygbm#88, but we didn't see a significant improvement at the time.

@amueller
Member Author

There might be a difference between Cython and Numba, though...

@amueller
Member Author

Trying to run benchmarks locally, I get differences of 2x between different runs (probably because I'm in a conference call lol), so it's hard to see any difference between gc and no gc.

@NicolasHug
Member

Actually, benchmarks/bench_hist_gradient_boosting_higgsboson.py might be a bad idea because it's a binary problem, and the issue might be more prominent in multiclass settings like MNIST. benchmarks/bench_hist_gradient_boosting.py might be more appropriate; running it now.

@NicolasHug
Member

Didn't run the whole benchmark, just this simple one:

from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=10_000, n_features=20, n_classes=10, n_informative=10)
est = HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=127)
%time est.fit(X, y)

master: 20sec
this PR: 1min
90e949a (del-ing the histograms, not the entire grower): 20sec

So calling gc.collect seems to have quite a significant impact unfortunately.

@amueller
Member Author

Hm, were these times repeatable? I can also try locally again with that snippet, or do a more proper benchmark.

@NicolasHug
Member

Yes, I ran each about 5 times. I also observed a similar drop in perf with just 2 classes.

@amueller
Member Author

It also timed out on OS X, which is not a good sign. Just doing the del should be fine in terms of runtime, right? But it'll probably not have that big an effect. We could also try to directly del the histograms in the grower instead of del-ing the whole grower.

@amueller
Member Author

amueller commented Aug 21, 2020

Can you benchmark this version? It's a bit of a band-aid, but if it gives significant memory improvements without a slowdown, we could use it until we figure out whether a pool of histograms is worth it.

@NicolasHug
Member

Not right now unfortunately (currently benchmarking stuff related to 'friedman_mse')

It'll be faster if you just run it yourself; it's just a few lines ;) #18163 (comment)

@amueller
Member Author

amueller commented Aug 21, 2020

I get
this PR:

CPU times: user 16min 42s, sys: 14 s, total: 16min 56s
Wall time: 1min 9s

master

CPU times: user 16min 42s, sys: 14 s, total: 16min 56s
Wall time: 1min 9s

I know you got a new gaming rig, but that seems like a huge discrepancy. (I triple-checked that I got the right master, but I could still have messed it up somehow.)

@NicolasHug
Member

that seems like a huge discrepancy

What discrepancy are you referring to? From my benchmark above, it seems that gc.collect() was causing the slowdown. You're not calling it anymore, so that doesn't look contradictory to me?

I know you got a new gaming rig

with an AMD CPU lol 😭

@amueller
Member Author

@NicolasHug well, the master branch is 20s on your machine and 1:10 on mine (with an 8-core i9).

@glemaitre
Member

@NicolasHug well, the master branch is 20s on your machine and 1:10 on mine (with an 8-core i9).

Is there any MKL inside the loop? :)

@NicolasHug
Member

Hm, no, there's no MKL.

@amueller I can confirm that I do observe a significant difference between master and 1177bf0, and that master runs in <20 seconds on 2 different machines of mine (an old i5 and a Ryzen 7). So I'm not sure what's going on. Did you really use the snippet in #18163 (comment)?

In any case, I think I'd be more comfortable with implementing a pool, as e.g. in #18242, rather than manually del-ing stuff.

@amueller
Member Author

amueller commented Aug 24, 2020

Hm that is quite strange... But yes, using the pool seems better.

Still, that timing seems really weird...

What was wall time vs CPU time?

@ogrisel
Member

ogrisel commented Sep 2, 2020

@amueller I can confirm that I do observe a significant difference between master and 1177bf0, and that master runs in <20 seconds on 2 different machines of mine (an old i5 and a Ryzen 7). So I'm not sure what's going on. Did you really use the snippet in #18163 (comment)?

How many cores / hyperthreads do those machines have? Maybe there is a bad interaction between memory access and threading for one variant and not the other? If those machines have many threads (e.g. 8 or 16 or more), maybe this is related to #16016, as this benchmark has a small number of samples.

Try to reduce the number of threads with OMP_NUM_THREADS=4 or 2 or even 1 to check whether threading is causing this unexpected perf degradation.
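
For example (the variable has to be set before the OpenMP runtime is initialized, i.e. before scikit-learn and its compiled extensions are imported):

import os
os.environ["OMP_NUM_THREADS"] = "2"  # try 4, 2 or even 1, as suggested above

# Imports must come after setting the variable for it to take effect.
from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=10_000, n_features=20, n_classes=10,
                           n_informative=10)
HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=127).fit(X, y)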

@ogrisel
Member

ogrisel commented Sep 2, 2020

But maybe it's not worth debugging the GC behavior too much, as the solution implemented in #18242 does not depend on the GC anymore and therefore feels more robust. I will review #18242 in more detail. I think we can close this PR.

@ogrisel
Member

ogrisel commented Sep 3, 2020

But maybe it's not worth debugging the GC behavior too much, as the solution implemented in #18242 does not depend on the GC anymore and therefore feels more robust. I will review #18242 in more detail. I think we can close this PR.

I changed my mind and debugged the original GC problem: it was caused by (useless) cyclic refs, as expected. However, the minimal fix in #18334 still does not seem to be as reliable as #18242.
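
As a minimal, self-contained illustration of that failure mode (a toy Node class, not the actual grower): objects caught in a reference cycle are not freed by reference counting when they go out of scope, only by a full GC pass, which is how large buffers could pile up across boosting iterations.

import gc
import weakref


class Node:
    """Toy stand-in for a grower node holding a large buffer."""

    def __init__(self):
        self.big = bytearray(10 * 1024 * 1024)  # ~10 MB payload


def build_cycle():
    parent, child = Node(), Node()
    parent.child, child.parent = child, parent  # parent <-> child reference cycle
    return weakref.ref(parent)


gc.disable()          # make the behaviour deterministic for the demo
ref = build_cycle()
print(ref() is None)  # False: the cycle keeps both nodes (~20 MB) alive
gc.collect()
print(ref() is None)  # True: only a collection pass reclaims cyclic garbage
gc.enable()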

@ogrisel
Member

ogrisel commented Sep 4, 2020

#18334 fixed the root cause of the issue (the reference cycles that prevented the GC).

Some of the suggested changes (early release of histograms in leaf and internal nodes) have been integrated in #18334, which was merged. Closing.

@ogrisel ogrisel closed this Sep 4, 2020