[FIX] Free memory in DBSCAN.fit() after clustering (fixes #31407) #31526


Closed

Conversation

Tahseen23

What does this PR do?

Fixes a memory overuse issue in DBSCAN.fit() by explicitly deleting large temporary variables (neighborhoods, neighbors_model) and invoking gc.collect() after clustering is done.

These variables are not used after fitting, but they consume significant memory in large datasets. Releasing them manually helps prevent memory buildup in long-running environments (e.g., notebooks, servers).
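
As a self-contained illustration of the pattern (not the library code itself; the function name, data, and threshold below are made up for the example), the idea is to keep only the small derived results and drop the large intermediate before returning:

import gc

import numpy as np


def fit_like(X, min_samples=5):
    # Stand-in for the per-sample neighborhood arrays DBSCAN builds during
    # fit(); in the real estimator this intermediate has worst-case O(n^2)
    # memory.
    neighborhoods = [np.arange(i % 50) for i in range(len(X))]
    n_neighbors = np.array([len(n) for n in neighborhoods])
    labels = (n_neighbors >= min_samples).astype(np.intp)
    # Release the large temporary explicitly and trigger a collection now,
    # instead of waiting for the next automatic GC pass.
    del neighborhoods
    gc.collect()
    return labels


labels = fit_like(np.zeros((1_000, 2)))

The actual change only adds del neighborhoods, del neighbors_model, and gc.collect() at the end of DBSCAN.fit(), as visible in the profile below.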

--

Related Issue

Closes: #31407


Why is this important?

  • DBSCAN can retain large temporary arrays in memory, especially on datasets with millions of rows.
  • This fix reclaims roughly 10–15 MiB of memory immediately after .fit() returns (about 12 MiB in the 100,000-sample profile below).
  • No change to clustering logic or public API.
  • Helps in memory-critical environments, or when clustering is repeated in a loop.

Memory Profile Insights

I've run a memory profile on the DBSCAN implementation to analyze its memory footprint.
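
To reproduce a similar line-by-line profile, a sketch along these lines using the memory_profiler package should work (it profiles a small wrapper rather than decorating DBSCAN.fit() inside the library, which is how the output below was produced; the dataset and DBSCAN parameters are placeholders):

import numpy as np
from memory_profiler import profile  # third-party package: memory_profiler
from sklearn.cluster import DBSCAN


@profile
def run_fit(n_samples=100_000):
    # Uniform 2-D points; 100,000 samples matches the profile below.
    rng = np.random.RandomState(0)
    X = rng.uniform(size=(n_samples, 2))
    DBSCAN(eps=0.01, min_samples=5).fit(X)


if __name__ == "__main__":
    run_fit()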

Memory Profiler Output (100,000 samples)
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   366    132.4 MiB    132.4 MiB           1       @_fit_context(
   367                                                 # DBSCAN.metric is not validated yet
   368                                                 prefer_skip_nested_validation=False
   369                                             )
   370                                             @profile()
   371                                             def fit(self, X, y=None, sample_weight=None):
   372                                                 """Perform DBSCAN clustering from features, or distance matrix.
   373
   374                                                 Parameters
   375                                                 ----------
   376                                                 X : {array-like, sparse matrix} of shape (n_samples, n_features), or \
   377                                                     (n_samples, n_samples)
   378                                                     Training instances to cluster, or distances between instances if
   379                                                     `metric='precomputed'. If a sparse matrix is provided, it will
   380                                                     be converted into a sparse `csr_matrix.
   381
   382                                                 y : Ignored
   383                                                     Not used, present here for API consistency by convention.
   384
   385                                                 sample_weight : array-like of shape (n_samples,), default=None
   386                                                     Weight of each sample, such that a sample with a weight of at least
   387                                                     `min_samples is by itself a core sample; a sample with a
   388                                                     negative weight may inhibit its eps-neighbor from being core.
   389                                                     Note that weights are absolute, and default to 1.
   390
   391                                                 Returns
   392                                                 -------
   393                                                 self : object
   394                                                     Returns a fitted instance of self.
   395                                                 """
   396    132.4 MiB      0.0 MiB           1           X = validate_data(self, X, accept_sparse="csr")
   397
   398    132.4 MiB      0.0 MiB           1           if sample_weight is not None:
   399                                                     sample_weight = _check_sample_weight(sample_weight, X)
   400
   401                                                 # Calculate neighborhood for all samples. This leaves the original
   402                                                 # point in, which needs to be considered later (i.e. point i is in the
   403                                                 # neighborhood of point i. While True, its useless information)
   404    132.4 MiB      0.0 MiB           1           if self.metric == "precomputed" and sparse.issparse(X):
   405                                                     # set the diagonal to explicit values, as a point is its own
   406                                                     # neighbor
   407                                                     X = X.copy()  # copy to avoid in-place modification
   408                                                     with warnings.catch_warnings():
   409                                                         warnings.simplefilter("ignore", sparse.SparseEfficiencyWarning)
   410                                                         X.setdiag(X.diagonal())
   411
   412    132.4 MiB      0.0 MiB           2           neighbors_model = NearestNeighbors(
   413    132.4 MiB      0.0 MiB           1               radius=self.eps,
   414    132.4 MiB      0.0 MiB           1               algorithm=self.algorithm,
   415    132.4 MiB      0.0 MiB           1               leaf_size=self.leaf_size,
   416    132.4 MiB      0.0 MiB           1               metric=self.metric,
   417    132.4 MiB      0.0 MiB           1               metric_params=self.metric_params,
   418    132.4 MiB      0.0 MiB           1               p=self.p,
   419    132.4 MiB      0.0 MiB           1               n_jobs=self.n_jobs,
   420                                                 )
   421    132.6 MiB      0.2 MiB           1           neighbors_model.fit(X)
   422                                                 # This has worst case O(n^2) memory complexity
   423    148.8 MiB     16.2 MiB           1           neighborhoods = neighbors_model.radius_neighbors(X, return_distance=False)
   424    148.8 MiB      0.0 MiB           1           if sample_weight is None:
   425    150.6 MiB      1.8 MiB      100001               n_neighbors = np.array([len(neighbors) for neighbors in neighborhoods])
   426                                                 else:
   427                                                     n_neighbors = np.array(
   428                                                         [np.sum(sample_weight[neighbors]) for neighbors in neighborhoods]
   429                                                     )
   430
   431                                                 # Initially, all samples are noise.
   432    150.6 MiB      0.0 MiB           1           labels = np.full(X.shape[0], -1, dtype=np.intp)
   433
   434                                                 # A list of all core samples found.
   435    150.6 MiB      0.0 MiB           1           core_samples = np.asarray(n_neighbors >= self.min_samples, dtype=np.uint8)
   436    150.6 MiB      0.0 MiB           1           dbscan_inner(core_samples, neighborhoods, labels)
   437
   438    150.6 MiB      0.0 MiB           1           self.core_sample_indices_ = np.where(core_samples)[0]
   439    150.6 MiB      0.0 MiB           1           self.labels_ = labels
   440
   441    150.6 MiB      0.0 MiB           1           if len(self.core_sample_indices_):
   442                                                     # fix for scipy sparse indexing issue
   443                                                     self.components_ = X[self.core_sample_indices_].copy()
   444                                                 else:
   445                                                     # no core samples
   446    150.6 MiB      0.0 MiB           1               self.components_ = np.empty((0, X.shape[1]))
   447    141.8 MiB     -8.8 MiB           1           del neighborhoods
   448    138.6 MiB     -3.2 MiB           1           del neighbors_model
   449    138.6 MiB      0.0 MiB           1           gc.collect()
   450    138.6 MiB      0.0 MiB           1           return self


github-actions bot commented Jun 11, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 52905fd.

Member

@adrinjalali left a comment

You need to fix the linting issues. You probably want to enable pre-commit hooks on your clone.

Comment on lines +448 to +450
del neighborhoods
del neighbors_model
gc.collect()
Member

Does this actually help? gc should be collecting these variables as soon as the process blows up in memory, which it seems to be doing nicely as shown here: #31407 (comment)

I don't see how this can help with the issue.

Author

Thanks for the review! You're right that gc eventually collects these objects. However, in practice (especially in notebook and long-running environments), Python doesn't immediately reclaim large memory blocks tied to deeply nested objects like neighborhoods, even after they go out of scope.

The issue being discussed in #31407 appears when .fit() is run on large datasets (e.g., 100K+ samples), and memory doesn't drop until the entire Python session is restarted — which is not expected behavior for a transient fit operation.

I observed a ~12 MiB drop immediately after .fit() completes.

  • Without del, memory holds steady around 150.6 MiB.
  • With del, it drops to 138.6 MiB right away — which is especially important when fitting multiple times in a loop.

So while it's not essential for correctness, this cleanup helps ensure memory is released deterministically — which aligns with the spirit of #31407 and improves memory behavior in production-like settings.

Member

The user can always either call gc.collect() in a loop, or call gc.set_threshold() to make automatic collections run more frequently.
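
Roughly, either option looks like this in user code (a sketch; the data and DBSCAN parameters are placeholders, and the threshold values are only illustrative):

import gc

import numpy as np
from sklearn.cluster import DBSCAN

# Option 1: lower the generation-0 threshold (default 700) so automatic
# collections run more frequently.
gc.set_threshold(100, 10, 10)

X = np.random.RandomState(0).uniform(size=(50_000, 2))

# Option 2: collect explicitly after each fit when looping in user code.
for _ in range(3):
    DBSCAN(eps=0.02, min_samples=5).fit(X)
    gc.collect()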

Another issue arises when running DBSCAN in a multithreaded environment: the behavior of calling gc.collect() while a collection is already running is undefined (https://docs.python.org/3/library/gc.html#gc.collect).

Maybe @jeremiedbb or @lesteve have more information here, but I'd be in favor of closing this PR.

As a solution to the original issue, we need to fix the low-level memory allocation in DBSCAN.

Author

Thanks for the detailed feedback and clarification — really appreciate you taking the time.

You're absolutely right that this doesn't fix the root cause of memory pressure in DBSCAN and may not be ideal in multi-threaded environments.

My intention with this PR was to offer a lightweight, short-term mitigation that helps in notebook or loop-heavy usage patterns — where repeated .fit() calls on large datasets (e.g., 100k+ samples) caused memory growth unless the process was restarted.

I understand now that the long-term fix needs to address lower-level memory behavior (possibly in how neighborhoods or radius neighbors are stored). I’ll read more on that and would love to help explore it if I can.

Happy to close this PR based on your recommendation — I learned a lot in the process and look forward to contributing more!

Member

@Tahseen23 if you can add a stand-alone snippet to reproduce the growing memory usage with multiple fits in #31407, that would already be useful.
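
Something along these lines could serve as a starting point (a sketch; it measures process RSS with psutil, and the dataset size and DBSCAN parameters are placeholders):

import gc
import os

import numpy as np
import psutil
from sklearn.cluster import DBSCAN

proc = psutil.Process(os.getpid())


def rss_mib():
    # Resident set size of the current process, in MiB.
    return proc.memory_info().rss / 2**20


X = np.random.RandomState(0).uniform(size=(100_000, 2))

print(f"baseline: {rss_mib():.1f} MiB")
for i in range(5):
    DBSCAN(eps=0.01, min_samples=5).fit(X)
    gc.collect()
    print(f"after fit {i + 1}: {rss_mib():.1f} MiB")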

Author

I ran two clean A/B tests comparing the original scikit-learn DBSCAN and a modified version with del neighborhoods, del neighbors_model, and gc.collect() at the end of .fit().

Tested with 100K samples and measured process memory using psutil.

Results:

Original DBSCAN:

  • Initial Memory: ~125.1 MiB
  • Peak Memory: ~128.6 MiB
  • Memory stable across iterations

Modified DBSCAN:

  • Initial Memory: ~131.6 MiB
  • Peak Memory: ~134.7 MiB
  • Manual cleanup (del, gc.collect()) did not reduce memory further

✅ This suggests that Python's GC is already releasing neighborhoods after .fit() ends, or that the extra memory cost in the modified version offsets the cleanup. I'll hold off on further changes until deeper structural fixes are discussed.
