Cannot recover DBSCAN from memory-overuse #31407
Comments
I would suggest trying HDBSCAN. According to #26726 (comment), it may use a lot less memory.
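For readers following along, here is a minimal sketch of the suggested alternative, assuming scikit-learn >= 1.3 (where `sklearn.cluster.HDBSCAN` was added) and the same synthetic data used in the snippet later in this thread:

```python
# Sketch: swap DBSCAN for HDBSCAN on the same data. HDBSCAN needs no eps radius
# and, per the linked comment, may use far less memory on dense datasets.
from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=3, n_features=10, random_state=42)
labels = HDBSCAN(min_cluster_size=5).fit_predict(X)
print(f"Found {labels.max() + 1} clusters (noise points are labeled -1)")
```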
For this specific use case, I could also downsample the dataset, but I'd like to make that decision automatically. For the general use case, it would be great to be able to recover from this memory error, or even predict it, so that the user can adapt the algorithm.
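A rough sketch of what such an automatic decision could look like, assuming the dominant memory cost is the radius-neighborhood graph that DBSCAN builds; the byte estimate below is a crude heuristic of my own, not anything scikit-learn exposes:

```python
import numpy as np
import psutil
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def fit_dbscan_or_downsample(X, eps=0.5, min_samples=5, safety_factor=4, seed=0):
    """Fit DBSCAN, downsampling first if a crude memory estimate looks too big."""
    rng = np.random.default_rng(seed)
    n = len(X)

    # Estimate the average eps-neighborhood size on a small subsample.
    m = min(2_000, n)
    sample = X[rng.choice(n, size=m, replace=False)]
    nn = NearestNeighbors(radius=eps).fit(sample)
    neighborhoods = nn.radius_neighbors(sample, return_distance=False)
    avg_neighbors_sample = np.mean([len(ind) for ind in neighborhoods])

    # Neighborhood sizes grow roughly with the number of points, so extrapolate
    # to the full dataset; assume ~16 bytes per stored (index, distance) pair.
    est_bytes = n * avg_neighbors_sample * (n / m) * 16 * safety_factor
    available = psutil.virtual_memory().available

    if est_bytes > available:
        # Keep a fraction of the points so the extrapolated graph should fit.
        keep = max(min_samples, int(n * available / est_bytes))
        X = X[rng.choice(n, size=keep, replace=False)]

    return DBSCAN(eps=eps, min_samples=min_samples).fit(X)
```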
So the problem is likely a low-level one: somewhere in our Cython code the memory usage grows, and at one point the OS OOM killer kills the Python process. I am not sure there is a straightforward way to surface the error in a user-friendly manner, but maybe I am wrong, and if someone finds a way to improve the situation, that would be more than welcome!
Just for reference, I did some analysis of the memory usage, and what I observe is a step-like increase. So I guess there could be an opportunity to raise a clean exception. For now, the hacky workaround for me is to start DBSCAN in a separate process.

And thanks @lesteve for pointing out HDBSCAN. It works quite well and is good for some use cases, but for my dataset DBSCAN is often faster. I just saw that this is closely related to the known issues of high memory usage in #17650.
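For reference, a minimal sketch of that "separate process" workaround; the wrapper below is a hypothetical helper (not part of scikit-learn) that fits in a worker process so an OOM kill breaks the pool instead of the main program:

```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs


def _fit_dbscan(X, eps, min_samples):
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)


def fit_dbscan_in_subprocess(X, eps=0.5, min_samples=5):
    # If the OS OOM killer terminates the worker, the pool is marked broken
    # and we can surface a MemoryError instead of the whole program dying.
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(_fit_dbscan, X, eps, min_samples)
        try:
            return future.result()
        except BrokenProcessPool as exc:
            raise MemoryError(
                "DBSCAN worker died, likely killed by the OOM killer"
            ) from exc


if __name__ == "__main__":
    X, _ = make_blobs(n_samples=100_000, centers=3, n_features=10, random_state=42)
    labels = fit_dbscan_in_subprocess(X)
    print(f"Got {len(labels)} labels")
```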
As requested, here is a minimal snippet that runs DBSCAN in a loop and tracks the process memory:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import psutil, os, time


def get_mem():
    # Resident set size of the current process in MiB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024**2


X, _ = make_blobs(n_samples=100_000, centers=3, n_features=10, random_state=42)
print("Initial Memory: %.2f MiB" % get_mem())

for i in range(12):
    model = DBSCAN(eps=0.5, min_samples=5, n_jobs=1)
    model.fit(X)
    del model
    time.sleep(0.1)
    print(f"Iteration {i+1}: Memory = {get_mem():.2f} MiB")
```
@Tahseen23 can you edit your previous comment and add the output of your snippet when you run it locally 🙏. Do you see memory usage growing? What is your conclusion?
I have stated my conclusion in #31526 (comment).
Describe the bug
I also just ran into this issue where the program gets killed when running DBSCAN, similar to:
#22531
The documentation update already helps, and I think it is OK for the algorithm to fail. But currently there is no way for me to recover, and a more informative error message would be useful, since right now DBSCAN just reports `killed` and it requires a bit of searching to see what failed; e.g., something like how `numpy` does it.

Additionally, I noted that memory accumulates with consecutive calls to DBSCAN, which can lead to a killed program even though there is enough memory when running a single fit. I was able to resolve this by explicitly calling `import gc; gc.collect()` after each run. Maybe this could be invoked at the end of each DBSCAN fit?
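A minimal sketch of that workaround, reusing the loop from the reproduction snippet earlier in this thread:

```python
import gc
import time

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=3, n_features=10, random_state=42)

for i in range(12):
    model = DBSCAN(eps=0.5, min_samples=5, n_jobs=1)
    model.fit(X)
    del model
    gc.collect()  # free cyclic garbage right away instead of waiting for the collector
    time.sleep(0.1)
```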
Steps/Code to Reproduce
Expected Results
Actual Results
Killed
Versions