-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[FIX] Free memory in DBSCAN.fit() after clustering (fixes #31407) #31526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to fix the linting issues. You probably want to enable pre-commit
hooks on your clone.
del neighborhoods | ||
del neighbors_model | ||
gc.collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this actually help? gc
should be collecting these variables as soon as the process blows up in memory, which it seems to be doing nicely as shown here: #31407 (comment)
I don't see how this can help with the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! You're right that gc
eventually collects these objects. However, in practice (especially in notebook and long-running environments), Python doesn't immediately reclaim large memory blocks tied to deeply nested objects like neighborhoods
, even after they go out of scope.
The issue being discussed in #31407 appears when .fit()
is run on large datasets (e.g., 100K+ samples), and memory doesn't drop until the entire Python session is restarted — which is not expected behavior for a transient fit operation.
I observed a ~12 MiB drop immediately after .fit() completes.
- Without del, memory holds steady around 150.6 MiB.
- With del, it drops to 138.6 MiB right away — which is especially important when fitting multiple times in a loop.
So while it's not essential for correctness, this cleanup helps ensure memory is released deterministically — which aligns with the spirit of #31407 and improves memory behavior in production-like settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user can always either call gc.collect()
in a loop, or do a gc.set_thresholds
to make gc.collect
run more frequently.
Another issue with this is when running DBSCAN in a multithreaded environment, and the fact that the behavior or running gc.collect
when it's already running, is undefined (https://docs.python.org/3/library/gc.html#gc.collect)
Maybe @jeremiedbb or @lesteve have more information here, but I'd be in favor of closing this PR.
As a solution to the original issue, we need to fix the low level memory allocation in DBSCAN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed feedback and clarification — really appreciate you taking the time.
You're absolutely right that this doesn't fix the root cause of memory pressure in DBSCAN and may not be ideal in multi-threaded environments.
My intention with this PR was to offer a lightweight, short-term mitigation that helps in notebook or loop-heavy usage patterns — where repeated .fit()
calls on large datasets (e.g., 100k+ samples) caused memory growth unless the process was restarted.
I understand now that the long-term fix needs to address lower-level memory behavior (possibly in how neighborhoods or radius neighbors are stored). I’ll read more on that and would love to help explore it if I can.
Happy to close this PR based on your recommendation — I learned a lot in the process and look forward to contributing more!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Tahseen23 if you can add a stand-alone snippet to reproduce the growing memory usage with multiple fits in #31407 this would already be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran two clean A/B tests comparing the original scikit-learn DBSCAN
and a modified version with del neighborhoods
, del neighbors_model
, and gc.collect()
at the end of .fit()
.
Tested with 100K samples and measured process memory using psutil
.
Results:
Original DBSCAN:
- Initial Memory: ~125.1 MiB
- Peak Memory: ~128.6 MiB
- Memory stable across iterations
Modified DBSCAN:
- Initial Memory: ~131.6 MiB
- Peak Memory: ~134.7 MiB
- Manual cleanup (
del
,gc.collect()
) did not reduce memory further
✅ This suggests that Python's GC is already releasing neighborhoods
after .fit()
ends, or the extra memory cost in the modified version offsets the cleanup. Will hold off on further changes until deeper structural fixes are discussed.
What does this PR do?
Fixes a memory overuse issue in
DBSCAN.fit()
by explicitly deleting large temporary variables (neighborhoods
,neighbors_model
) and invokinggc.collect()
after clustering is done.These variables are not used after fitting, but they consume significant memory in large datasets. Releasing them manually helps prevent memory buildup in long-running environments (e.g., notebooks, servers).
--
Related Issue
Closes: #31407
Why is this important?
.fit()
is called.Memory Profile Insights
I've run a memory profile on the
DBSCAN
implementation to analyze its memory footprint.Memory Profiler Output (100,000 samples)