Skip to content

ENH Break cyclic references in Histogram GBRT #18334

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Sep 3, 2020

Conversation

ogrisel
Copy link
Member

@ogrisel ogrisel commented Sep 3, 2020

Resolves #18152
This is an alternative to #18163 and #18242

This is a minimal fix to:

For the reproducer snippet of #18242:

from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from memory_profiler import memory_usage

X, y = make_classification(n_classes=2,
                           n_samples=10_000,
                           n_features=400,
                           random_state=0)

hgb = HistGradientBoostingClassifier(
    max_iter=100,
    max_leaf_nodes=127,
    learning_rate=.1,
    random_state=0,
    verbose=1,
)

mems = memory_usage((hgb.fit, (X, y)))
print(f"{max(mems):.2f}, {max(mems) - min(mems):.2f} MB")

I get:

Fit 100 trees in 36.711 s, (12700 total leaves)
Time spent computing histograms: 28.150s
Time spent finding best splits:  5.183s
Time spent applying splits:      1.046s
Time spent predicting:           0.016s
282.35, 144.09 MB

(instead of 6GB+ on master).

@thomasjpfan
Copy link
Member

thomasjpfan commented Sep 3, 2020

Hm I am getting 983.82, 790.27 MB with this PR.

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2020

Hm I am getting 983.82, 790.27 MB with this PR.

That's weird... maybe my Python GC is smarter than yours? I am running Python 3.8.3.

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2020

Anyways I am fine with closing this in favor of the manual memory management in #18242 if it ensures a more reliable memory usage. I also updated #18242 to remove the useless attributes that introduced the cyclic refs as clean-up.

@NicolasHug
Copy link
Member

Why do you say this PR is less reliable @ogrisel?

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2020

I re-ran the snippet several times (restarting Python each time) and I consistently get ~140 MB increments instead of the 790.27 MB observed by @thomasjpfan on his machine.

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2020

Why do you say this PR is less reliable @ogrisel?

@thomasjpfan still observes a 790.27 MB increment with this PR on his machine so the GC does not seem to free memory as fast as needed. Therefore, explicit memory management as done in #18242 might still be useful for some users.

@NicolasHug
Copy link
Member

NicolasHug commented Sep 3, 2020

I too consistently observe about 140MB on this PR with Python 3.8.2, on 2 different machines. (1, 4, and 16 threads. Not that it should matter anyway.)

@thomasjpfan can you double check the benchmark you ran make sure you used the same snippet above?

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ogrisel

I have a preference for this solution considering its simplicity and it because of its "unitaryness" (brain is frozen, can't think of a word)

We can still use np.empty and initialize histograms in parallel. However I'd rather leave that for a different PR. I feel like #18242 does too many things at once and it's hard to determine the effect of each change.

My only concern here is: how do we prevent cyclic refs from sneaking in in the future? I guess we should write a memory benchmark in asv?

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2020

We can still use np.empty and initialize histograms in parallel. However I'd rather leave that for a different PR.

I agree. It was in the back of my head but I wanted to wait a bit before adding it.

@thomasjpfan what's your feeling w.r.t. the need to introduce the HistogramPool?

@NicolasHug
Copy link
Member

I think we should:

  1. merge this. It's a net improvement, irrespective of benchmark results: it removes useless attributes and can be considered pure maintenance.
  2. use np.empty and initialize histograms in parallel in another PR, and evaluate the speed improvement
  3. Consider adding the pool then, in yet another PR.

If 1 does its job then I doubt 3 would add a significant memory improvement but it might add a small speed improvement, since it will remove the need from allocating memory each time.

@thomasjpfan
Copy link
Member

thomasjpfan commented Sep 3, 2020

@thomasjpfan what's your feeling w.r.t. the need to introduce the HistogramPool?

Let's merge this one first.

  • On linux with 3.8, I get 245.16, 131.90 MB.
  • On linux with 3.7, I get 256.35, 140.72 MB.
  • On OSX with 3.8, memory profiler does not work with 3.8 yet on osx. edit: When I look at the process with htop I see the memory go to ~1GB. which is consistent with OSX with 3.7.
  • On OSX with 3.7, I get 983.82, 790.27 MB

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2020

Thanks @thomasjpfan for the investigation. Let me add a what's new entry to this PR if you want to merge it as it is. However the results on macOS makes me think that we will need to add the HistogramPool anyway.

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2020

Done. Feel free to merge and then rework #18242 on top of this...

Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan thomasjpfan changed the title Break cyclick references in Histogram GBRT ENH Break cyclic references in Histogram GBRT Sep 3, 2020
@thomasjpfan thomasjpfan merged commit e325f16 into scikit-learn:master Sep 3, 2020
@ogrisel ogrisel deleted the hgbrt-break-cyclick-refs branch September 3, 2020 22:35
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

High memory usage in HistGradientBoostingClassifier
3 participants