HistGradientBoosting memory improvement #18163
Conversation
LightGBM uses an LRU cache for storing histograms. Can you try …
Given that this reduces the memory footprint significantly (I claimed 4x, but I guess 3x makes more sense), I'm pretty sure that nearly all of the memory consumption comes from the histograms. Just deleting the histograms on the inner nodes reduced the memory consumption by half.
I'm still curious about what freeing the grower will do. There might be a bigger memory issue than just the histograms. 1657 / 4 = 414, while LightGBM uses 95 according to the original issue. (Note that …
Not sure what you mean by freeing the grower. The grower isn't stored in the iteration loop, right? I can call … So I ran the n_samples=10k, n_features=400 case and memory went from ~6 GB to ~1.5 GB, so I think the histograms are a major factor.
Using the memory profiler in fit, all the memory is allocated in …
Which is weird. Why is this histogram bigger than the other one? But I think one of them might get deleted. It seems that the memory consumption rises with the number of iterations, even if I delete the grower explicitly. The predictors are really small, and I'm not sure what else could get stored.
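For context, here is a minimal sketch of how peak memory during fit can be measured with the memory_profiler package. This is an assumption about how such numbers might be gathered, not the exact script used in the comments above; the 10k × 400 shape just mirrors the run mentioned earlier.

```python
# Illustrative measurement sketch; requires the third-party memory_profiler package.
from memory_profiler import memory_usage
from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=10_000, n_features=400, n_classes=10,
                           n_informative=10)
est = HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=127)

# memory_usage runs fit in this process while sampling memory, and with
# max_usage=True reports the peak usage (in MiB).
peak = memory_usage((est.fit, (X, y)), max_usage=True)
print("peak memory during fit (MiB):", peak)
```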
Hm, I think explicitly calling …
OK, thanks. Indeed, it should properly be garbage collected if all goes well. I just wanted to make sure; I've seen issues with Numba and GC before. Not with Cython, but the GBDT code has hit quite a few Cython edge cases already, so who knows.
Indeed that's weird. The histogram is built with:

```cython
hist_struct [:, ::1] histograms = np.zeros(
    shape=(self.n_features, self.n_bins),
    dtype=HISTOGRAM_DTYPE
)
```

What exactly is the "memory consumption" being reported? Maybe it also counts the arguments, and thus counts 2 histograms in total?
127 leaves ~= 250 nodes, so 500 MB. Times 10 classes for MNIST and we get 5 GB, if the growers aren't properly GCed... :s ?
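To make the arithmetic explicit, here is a rough estimate under the assumption that each histogram bin stores sum_gradients and sum_hessians as float64 plus a uint32 count (about 20 bytes), with 256 bins per feature. The dtype and defaults below are assumptions for illustration, not copied from the library.

```python
import numpy as np

# Assumed per-bin layout (the real HISTOGRAM_DTYPE lives in
# sklearn.ensemble._hist_gradient_boosting.common).
HIST_DTYPE = np.dtype([
    ('sum_gradients', np.float64),
    ('sum_hessians', np.float64),
    ('count', np.uint32),
])

def histograms_memory_mb(n_features, n_bins=256, n_nodes=250, n_growers=1):
    """Rough memory taken by histograms if every node of every live grower
    keeps a full (n_features, n_bins) histogram array."""
    per_node = n_features * n_bins * HIST_DTYPE.itemsize
    return per_node * n_nodes * n_growers / 1e6

# e.g. a 400-feature dataset as in the benchmark above, one grower with ~250 nodes:
print(histograms_memory_mb(n_features=400))                 # ~500 MB
# ...and with one grower per class kept alive for a 10-class problem:
print(histograms_memory_mb(n_features=400, n_growers=10))   # ~5 GB
```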
So we need to check whether doing explicit garbage collection makes things much slower. Do you have benchmarks you can run?
Yes, we'll need to benchmark. There's … I'm wondering whether we really want to start tampering with Python's GC, though. Shouldn't we let Python do its thing?
Well, if it crashes on MNIST because of memory issues, then we probably need to change something.
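One way to quantify how much time cyclic garbage collection actually costs during fit is to hook the standard library's gc.callbacks. This is a measurement sketch, not anything scikit-learn specific.

```python
import gc
import time

from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=10_000, n_features=20, n_classes=10,
                           n_informative=10)
est = HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=127)

gc_time = 0.0
_t0 = None

def _gc_timer(phase, info):
    # The interpreter calls this at the start and end of every cyclic collection.
    global gc_time, _t0
    if phase == "start":
        _t0 = time.perf_counter()
    elif phase == "stop" and _t0 is not None:
        gc_time += time.perf_counter() - _t0

gc.callbacks.append(_gc_timer)
try:
    est.fit(X, y)
finally:
    gc.callbacks.remove(_gc_timer)
print("time spent in cyclic GC during fit: %.3f s" % gc_time)
```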
OK. CC @ogrisel, I think you'd be interested in this. I'll try and run some benchmarks soonish.
Random thought: can we place all the resources allocated in one grower in a pool and pass it along to the next grower? To be specific, the nodes and maybe the split_infos.
Pretty sure it's the histograms that kill it. But yes, we could probably recycle the grower?
I think that's what LightGBM does with its LRU cache/histogram pool, yes. Also related is ogrisel/pygbm#88, but we didn't see a significant improvement at the time.
There might be a difference between Cython and Numba, though...
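For illustration, here is a very reduced sketch of the recycling idea: a pool that hands out preallocated histogram arrays and takes them back, so that successive nodes and growers reuse the same memory. The class and attribute names are made up; the actual LightGBM histogram pool (and the one later proposed in #18242) is considerably more involved, with LRU eviction and a capped size.

```python
import numpy as np

HIST_DTYPE = np.dtype([          # assumed bin layout, as in the sketch above
    ('sum_gradients', np.float64),
    ('sum_hessians', np.float64),
    ('count', np.uint32),
])

class HistogramPool:
    """Free list of (n_features, n_bins) histogram arrays."""

    def __init__(self, n_features, n_bins=256):
        self.shape = (n_features, n_bins)
        self._free = []

    def get(self):
        if self._free:
            hist = self._free.pop()
            # Reset the recycled array in place instead of reallocating.
            hist['sum_gradients'] = 0.0
            hist['sum_hessians'] = 0.0
            hist['count'] = 0
            return hist
        return np.zeros(self.shape, dtype=HIST_DTYPE)

    def release(self, hist):
        # Called once a node has been split (or a grower is finished) and
        # its histograms are no longer needed.
        self._free.append(hist)

# Usage sketch: one pool shared across all boosting iterations.
pool = HistogramPool(n_features=400)
hist = pool.get()     # a grower requests histograms for a node
# ... fill `hist` and pick the best split ...
pool.release(hist)    # hand the memory back for the next node/grower
```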
Trying to run benchmarks locally, I get differences of 2x between different runs (probably because I'm on a conference call, lol), so it's hard to see any difference between GC and no GC.
Actually …
Didn't run the whole benchmark, just this simple one:

```python
from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=10_000, n_features=20, n_classes=10, n_informative=10)
est = HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=127)
%time est.fit(X, y)
```

master: 20 sec

So calling …
Hm, were these times repeatable? I can also try locally again with that function, or do a more proper benchmark.
Yes, I ran each about 5 times. Also, I observed a similar drop in perf with just 2 classes.
It also timed out on OS X, which is not a good sign. Just doing the …
Can you benchmark this version? This is a bit of a band-aid, but if it gives significant memory improvements without a slowdown, we could use it until we figure out whether a pool of histograms is worth it.
Not right now unfortunately (currently benchmarking stuff related to 'friedman_mse'). It'll be faster if you just do it yourself, it's just a few lines ;) #18163 (comment)
I get:
…

master:
…

I know you got a new gaming rig, but that seems like a huge discrepancy. (I triple-checked that I got the right master, but I could still mess it up somehow.)
What discrepancy are you referring to? From my benchmark above, it seems that …
with an AMD CPU lol 😭
@NicolasHug well, the master branch is 20s on your machine and 1:10 on mine (with an 8-core i9).
Is there any MKL inside the loop? :)
Hm no, there's no MKL. @amueller I can confirm that I do observe a significant difference between master and 1177bf0, and that master runs in <20 seconds on 2 different machines of mine (an old i5 and a Ryzen 7). So I'm not sure what's going on. Did you really use the snippet in #18163 (comment)? In any case, I think I'd be more comfortable with implementing a pool as e.g. in #18242, rather than manually …
Hm, that is quite strange... But yes, using the pool seems better. Still, that timing seems really weird... What was wall time vs CPU time?
How many cores / hyperthreads do those machines have? Maybe there is a bad interaction between memory access and threading for one variant and not the other? If those machines have many threads (e.g. 8 or 16 or more), maybe this is related to #16016, as this benchmark has a small number of samples. Try to reduce the number of threads with …
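The suggestion is cut off above; one common way to pin the number of OpenMP threads (my assumption of what was meant) is via the threadpoolctl package, or by setting the OMP_NUM_THREADS environment variable before launching the benchmark script.

```python
# Assumes threadpoolctl is installed; est, X, y as in the benchmark snippet above.
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=4, user_api="openmp"):
    est.fit(X, y)
```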
I changed my mind and debugged the original GC problem: it was caused by (useless) cyclic refs, as expected. However, the minimal fix in #18334 still does not seem to be as reliable as #18242.
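As a standalone illustration of the cyclic-refs point (toy classes, not the grower code): objects that reference each other are not freed by reference counting when they go out of scope, only by the cyclic collector, so tree nodes holding parent/child back-references can pile up between collections.

```python
import gc

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.payload = bytearray(10**6)   # stand-in for a node's histograms

gc.disable()                              # rule out the periodic collector
root = Node()
root.children.append(Node(parent=root))   # root <-> child reference cycle

del root
# Both nodes are now unreachable but still alive: only the cyclic collector
# can reclaim them, which gc.collect() confirms by reporting unreachable objects.
print("unreachable objects found:", gc.collect())   # > 0
gc.enable()
```

Breaking the cycle explicitly once a grower is done (i.e. clearing the back-references) lets plain reference counting free the memory immediately, without waiting for a collection, which is presumably what the minimal fix relies on.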
Starting to investigate #18152.
This change reduces the memory usage for the example provided there by not keeping around histograms for inner nodes and leaves. That reduces the memory by a factor of about 4.
Honestly not sure how to get around storing the histograms for the splittable active nodes, or if there's another issue/trick we're overlooking.
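For readers skimming the diff, the idea boils down to something like the following toy sketch (made-up names and plain float arrays instead of the real structured histograms): once a node has been split and its children's histograms exist, the parent's histograms are dropped, so only splittable nodes hold on to memory.

```python
import numpy as np

class TreeNode:
    """Minimal stand-in for the grower's node objects."""
    def __init__(self, histograms=None):
        self.histograms = histograms

def split(parent, n_features=400, n_bins=256):
    # One child's histograms are built from scratch, the other's by
    # subtraction from the parent's, as the grower already does.
    left = TreeNode(np.zeros((n_features, n_bins)))
    right = TreeNode(parent.histograms - left.histograms)

    # The change, in spirit: the parent is now an inner node and will never
    # be split again, so release its histograms instead of keeping them
    # around for the lifetime of the tree.
    parent.histograms = None
    return left, right

root = TreeNode(np.zeros((400, 256)))
left, right = split(root)
assert root.histograms is None      # the parent's memory can now be reclaimed
```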