High memory usage in HistGradientBoostingClassifier #18152


Closed
Shihab-Shahriar opened this issue Aug 13, 2020 · 3 comments · Fixed by #18334
Comments
Shihab-Shahriar commented Aug 13, 2020

Related to #16395

On a dataset with 10,000 samples, memory usage is 1657 MB with 100 features, 3400 MB with 200, and 6627 MB with 400. In comparison, LightGBM uses 95 MB, 181 MB and 356 MB respectively.

I noticed this while trying to train on MNIST; the program was killed by the OS.

Here is the code I'm using:

import lightgbm as lgb
from memory_profiler import memory_usage
from sklearn.datasets import make_classification
# HistGradientBoostingClassifier was still experimental in this version:
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_classes=2, n_samples=10_000, n_features=400)

hgb = HistGradientBoostingClassifier(
    max_iter=500,
    max_leaf_nodes=127,
    learning_rate=0.1,
)

lg = lgb.LGBMClassifier(
    n_estimators=500,
    num_leaves=127,
    learning_rate=0.1,
    n_jobs=16,
)

mems = memory_usage((hgb.fit, (X, y)))
print(f"{max(mems):.2f}, {max(mems) - min(mems):.2f} MB")  # 2nd value is reported above.

Both were running at 100% CPU on an 8-core/16-thread machine. I had similar results with version 0.23.1.

System Info:

System:
    python: 3.7.7 (default, Mar 26 2020, 15:48:22)  [GCC 7.3.0]
executable: /home/shihab/anaconda3/bin/python
   machine: Linux-5.8.0-050800-generic-x86_64-with-debian-bullseye-sid

Python dependencies:
          pip: 20.2.1
   setuptools: 49.2.0
      sklearn: 0.24.dev0
        numpy: 1.19.1
        scipy: 1.5.0
       Cython: 0.29.21
       pandas: 1.1.0
   matplotlib: 3.2.2
       joblib: 0.16.0
threadpoolctl: 2.1.0

Built with OpenMP: True
amueller (Member) commented Aug 14, 2020

Can reproduce on windows as well. cc @NicolasHug

NicolasHug (Member) commented

Could someone confirm that calling memory_usage on the LightGBM instance reports the whole process's memory, and not just what's allocated in Python? Since most of LightGBM is written in C++, I want to make sure we're comparing the same things.

Shihab-Shahriar (Author) commented

memory_usage by default uses psutil to get memory info by process id. Although I'm not sure about psutil's internals, I think it queries the OS directly for whole-process memory, i.e. it doesn't track only Python-allocated objects.

memory_usage also has another backend that uses the Linux ps command; both showed similar values.
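As a quick sanity check (a sketch, not from the thread): psutil's `memory_info().rss` reports the resident set size of the whole process, so memory allocated outside the Python heap, e.g. by C or C++ code, is counted too. Allocating a large NumPy buffer, whose storage lives in C, makes RSS grow accordingly:

```python
import numpy as np
import psutil

proc = psutil.Process()  # handle to the current process

before = proc.memory_info().rss  # resident set size, in bytes
buf = np.zeros(100 * 1024 * 1024, dtype=np.uint8)  # ~100 MB, allocated in C
buf[::4096] = 1  # touch each page so it actually becomes resident
after = proc.memory_info().rss

print(f"RSS grew by {(after - before) / 1024**2:.0f} MB")  # roughly 100 MB on Linux
```

Since RSS includes the C++-side allocations of both libraries, the comparison above should indeed be apples to apples.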
